Raghavendra K T [Wed, 5 Oct 2011 00:43:49 +0000 (11:43 +1100)]
memcg: rename mem variable to memcg
The memcg code sometimes uses "struct mem_cgroup *mem" and sometimes uses
"struct mem_cgroup *memcg". Rename all mem variables to memcg in source
file.
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a new subsystem to limit the number of running tasks, similar to the
NR_PROC rlimit but in the scope of a cgroup.
The user can set an upper bound limit that is checked every time a task
forks in a cgroup or is moved into a cgroup with that subsystem bound.
The primary goal is to protect against forkbombs that explode inside a
container. The traditional NR_PROC rlimit is not efficient in that case
because if we run containers in parallel under the same user, one of these
could starve all the others by spawning a high number of tasks close to
the user wide limit.
This is a preventive measure against forkbombs: it is not meant to cure
the effects of a forkbomb once the system has become unresponsive. It
aims to keep the system from ever reaching that state by stopping the
spread of tasks early. When defining the limit on the allowed number of
tasks, it's up to the user to find the right balance between the
resources their containers may need and what they can afford to provide.
As it's totally dissociated from the rlimit NR_PROC, both can be
complementary: the cgroup task counter can set an upper bound per
container and the rlimit can be an upper bound on the overall set of
containers.
This subsystem can also be used to kill all the tasks in a cgroup without
racing against concurrent forks: setting the task limit to 0 causes any
further fork to be rejected. This is a good way to kill a forkbomb in a
container, or simply kill any container without the need to retry an
unbounded number of times.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
Let the subsystems' fork callbacks return an error value so that they can
cancel a fork. This is going to be used by the task counter subsystem to
implement the limit.
Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
cgroups: pull up res counter charge failure interpretation to caller
res_counter_charge() always returns -ENOMEM when the limit is reached and
the charge thus can't happen.
However it's up to the caller to interpret this failure and return the
appropriate error value. The task counter subsystem will need to report
to the user that a fork() has been cancelled because some limit was
reached, not because we are too short on memory.
Fix this by returning -1 when res_counter_charge() fails.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
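The split between the counter and its callers can be sketched in user-space C. This is only an illustration: struct counter and the three functions are hypothetical stand-ins, not the kernel's res_counter API, and the choice of -EAGAIN for the task-counting caller is an assumption modelled on what fork() reports for RLIMIT_NPROC.

```c
#include <errno.h>

/* Hypothetical stand-in for a resource counter (not struct res_counter). */
struct counter {
    unsigned long usage;
    unsigned long limit;
};

/* Generic charge: 0 on success, -1 when the limit would be exceeded.
 * Assumes usage <= limit on entry. */
static int counter_charge(struct counter *c, unsigned long val)
{
    if (val > c->limit - c->usage)
        return -1;
    c->usage += val;
    return 0;
}

/* A memory-style caller interprets the failure as -ENOMEM ... */
static int charge_memory(struct counter *c, unsigned long bytes)
{
    return counter_charge(c, bytes) ? -ENOMEM : 0;
}

/* ... while a task-counting caller can pick another error
 * (-EAGAIN here is an assumption, not taken from the patch). */
static int charge_task(struct counter *c)
{
    return counter_charge(c, 1) ? -EAGAIN : 0;
}
```

The counter itself stays neutral about what the limit means; each caller decides how the failure should look to user space.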
cgroups: ability to stop res charge propagation on bounded ancestor
Moving a task from one cgroup to another may require subtracting its
resource charge from the old cgroup and adding it to the new one.
For this to happen, the uncharge/charge propagation can just stop when we
reach the common ancestor of the two cgroups. Beyond the performance
reasons, we also want to avoid temporarily overloading the common
ancestors with an inaccurate resource counter usage if we charge the new
cgroup first and uncharge the old one afterwards. This is going to be a
requirement for the coming max number of tasks subsystem.
To solve this, provide a pair of new APIs that can charge/uncharge a
resource counter until we reach a given ancestor.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
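A minimal user-space sketch of bounded propagation follows. The names struct counter, charge_until() and uncharge_until() are hypothetical and only model the idea described above; they are not the functions added by the patch.

```c
#include <stddef.h>

/* Hypothetical hierarchical counter; parent is NULL at the root. */
struct counter {
    unsigned long usage;
    unsigned long limit;
    struct counter *parent;
};

/* Uncharge val from c and its ancestors, stopping before 'top'. */
static void uncharge_until(struct counter *c, struct counter *top,
                           unsigned long val)
{
    for (; c != NULL && c != top; c = c->parent)
        c->usage -= val;
}

/* Charge val to c and its ancestors, stopping before 'top'.
 * On failure, roll back what was already charged and return -1. */
static int charge_until(struct counter *c, struct counter *top,
                        unsigned long val)
{
    struct counter *it;

    for (it = c; it != NULL && it != top; it = it->parent) {
        if (val > it->limit - it->usage) {
            uncharge_until(c, it, val);
            return -1;
        }
        it->usage += val;
    }
    return 0;
}
```

Moving a task between two siblings then becomes charge_until(new, common, 1) followed by uncharge_until(old, common, 1), and the common ancestor's usage is never touched.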
cgroups: new cancel_attach_task() subsystem callback
To cancel a process attachment on a subsystem, we only call the
cancel_attach() callback once on the leader but we have no way to cancel
the attachment individually for each member of the process group.
This is going to be needed for the max number of tasks subsystem that is
coming.
To prepare for this integration, call a new cancel_attach_task() callback
on each task of the group until we reach the member that failed to attach.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
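The rollback order can be sketched as follows. This is a user-space model with hypothetical types and callbacks (a shared "budget" stands in for whatever a subsystem charges per task); it only shows cancelling every member handled before the one that failed.

```c
#include <stddef.h>

/* Hypothetical per-task state; not a kernel structure. */
struct task {
    int charged;
};

/* Hypothetical per-task attach step: take one slot from a budget. */
static int can_attach_task(struct task *t, int *budget)
{
    if (*budget == 0)
        return -1;
    (*budget)--;
    t->charged = 1;
    return 0;
}

/* Hypothetical per-task rollback: give the slot back. */
static void cancel_attach_task(struct task *t, int *budget)
{
    t->charged = 0;
    (*budget)++;
}

/* Attach every member; on failure, cancel each member that was
 * already handled, up to (not including) the one that failed. */
static int attach_group(struct task *tasks, size_t n, int *budget)
{
    for (size_t i = 0; i < n; i++) {
        if (can_attach_task(&tasks[i], budget) == 0)
            continue;
        for (size_t j = 0; j < i; j++)
            cancel_attach_task(&tasks[j], budget);
        return -1;
    }
    return 0;
}
```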
Provide an API to inherit a counter value from a parent. This can be
useful to implement cgroup.clone_children on a resource counter.
Still the resources of the children are limited by those of the parent, so
this is only to provide a default setting behaviour when clone_children is
set.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
Extend the resource counter API with a mirror of res_counter_read_u64() to
make it handy to update a resource counter value from a cgroup subsystem
u64 value file.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
Steven Rostedt [Wed, 5 Oct 2011 00:43:45 +0000 (11:43 +1100)]
cgroup/kmemleak: Annotate alloc_page() for cgroup allocations
When the cgroup base was allocated with kmalloc, it was necessary to
annotate the variable with kmemleak_not_leak(). But because it has
recently been changed to be allocated with alloc_page() (which skips
kmemleak checks), the annotation now causes a warning on boot up.
I was triggering this output:
allocated 8388608 bytes of page_cgroup
please try 'cgroup_disable=memory' option if you don't want memory cgroups
kmemleak: Trying to color unknown object at 0xf5840000 as Grey
Pid: 0, comm: swapper Not tainted 3.0.0-test #12
Call Trace:
[<c17e34e6>] ? printk+0x1d/0x1f
[<c10e2941>] paint_ptr+0x4f/0x78
[<c178ab57>] kmemleak_not_leak+0x58/0x7d
[<c108ae9f>] ? __rcu_read_unlock+0x9/0x7d
[<c1cdb462>] kmemleak_init+0x19d/0x1e9
[<c1cbf771>] start_kernel+0x346/0x3ec
[<c1cbf1b4>] ? loglevel+0x18/0x18
[<c1cbf0aa>] i386_start_kernel+0xaa/0xb0
After a bit of debugging I tracked the object 0xf5840000 (and others) down
to the cgroup code. With the change from allocating base with kmalloc to
alloc_page(), the base no longer goes through kmemleak_alloc(), which adds
the pointer to the object_tree_root, but kmemleak_not_leak() still adds it
to the early_log[] table. On kmemleak_init(), the entry is found in
early_log[] but not in the object_tree_root, and this error message is
displayed.
If alloc_page() fails, the code falls back to vmalloc(), which still calls
kmemleak_alloc(), so the kmemleak_not_leak() call is still needed. The
solution is to call kmemleak_alloc() directly if alloc_page() succeeds.
Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Andrew Morton <akpm@google.com>
Ben Blum [Wed, 5 Oct 2011 00:43:44 +0000 (11:43 +1100)]
cgroups: don't attach task to subsystem if migration failed
If a task has exited to the point it has called cgroup_exit() already,
then we can't migrate it to another cgroup anymore.
This can happen when we are attaching a task to a new cgroup between the
call to ->can_attach_task() on subsystems and the migration that is
eventually tried in cgroup_task_migrate().
In this case cgroup_task_migrate() returns -ESRCH and we don't want to
attach the task to the subsystems because the attachment to the new cgroup
itself failed.
Fix this by only calling ->attach_task() on the subsystems if the cgroup
migration succeeded.
Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ben Blum [Wed, 5 Oct 2011 00:43:44 +0000 (11:43 +1100)]
cgroups: more safe tasklist locking in cgroup_attach_proc
Fix unstable tasklist locking in cgroup_attach_proc.
According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is
not sufficient to guarantee the tasklist is stable w.r.t. de_thread and
exit. Taking tasklist_lock for reading, instead of rcu_read_lock, ensures
proper exclusion.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Anders [Wed, 5 Oct 2011 00:43:43 +0000 (11:43 +1100)]
rtc: add initial support for mcp7941x parts
Add initial support for the Microchip mcp7941x series of real time clocks.
The mcp7941x series is generally compatible with the ds1307 and ds1337 rtc
devices from Dallas Semiconductor. Minor differences include a backup
battery enable bit, and the polarity of the oscillator enable bit.
Signed-off-by: David Anders <danders.dev@gmail.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Reviewed-by: Wolfram Sang <w.sang@pengutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Waychison [Wed, 5 Oct 2011 00:43:42 +0000 (11:43 +1100)]
oprofilefs: handle zero-length writes
Currently in oprofilefs, files that use ulong_fops mis-handle writes of
zero length. A count of 0 causes oprofilefs_ulong_from_user to return 0
(success), which then leads to oprofile_set_ulong being called to stuff
"value" into file->private_data without it being initialized.
Fix this by moving the check for a zero-length write up into
ulong_write_file.
Signed-off-by: Mike Waychison <mikew@google.com> Cc: Robert Richter <robert.richter@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
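The shape of the fix can be modelled in user space. In this sketch ulong_from_buf() and ulong_write() are hypothetical stand-ins for oprofilefs_ulong_from_user() and ulong_write_file(); the buffer handling is simplified and is not the kernel code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the parsing helper: 0 on success, -1 on bad input. */
static int ulong_from_buf(unsigned long *val, const char *buf, size_t count)
{
    char tmp[32];

    if (count >= sizeof(tmp))
        return -1;
    memcpy(tmp, buf, count);
    tmp[count] = '\0';
    *val = strtoul(tmp, NULL, 0);
    return 0;
}

/* Stand-in for the write handler. The zero-length check happens here,
 * before 'value' could be used uninitialized. */
static long ulong_write(unsigned long *target, const char *buf, size_t count)
{
    unsigned long value;

    if (!count)
        return 0;           /* nothing written, *target left untouched */
    if (ulong_from_buf(&value, buf, count))
        return -1;
    *target = value;
    return (long)count;
}
```

With the check in the write handler, a zero-length write never reaches the point where value is stored.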
Neil Armstrong [Wed, 5 Oct 2011 00:43:42 +0000 (11:43 +1100)]
init/do_mounts_rd.c: fix ramdisk identification for padded cramfs
When a cramfs ramdisk padded with 512 bytes is given to the kernel, the
current identify_ramdisk_image function fails to identify it.
Tested with a padded cramfs image on an ARM based board.
Signed-off-by: Neil Armstrong <narmstrong@neotion.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Davidlohr Bueso <dave@gnu.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiri Kosina [Wed, 5 Oct 2011 00:43:41 +0000 (11:43 +1100)]
binfmt_elf: fix PIE execution with randomization disabled
The case of address space randomization being disabled at runtime through
the randomize_va_space sysctl is not handled properly in load_elf_binary(),
resulting in SIGKILL at exec() time for certain PIE-linked binaries.
Handle the randomize_va_space == 0 case the same way as if we were not
supporting .text randomization at all.
Based on original patch by H.J. Lu and Josh Boyer.
Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russell King <rmk@arm.linux.org.uk> Cc: H.J. Lu <hongjiu.lu@intel.com> Cc: <stable@kernel.org> Tested-by: Josh Boyer <jwboyer@redhat.com> Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org> Signed-off-by: Andrew Morton <akpm@google.com>
Jason Baron [Wed, 5 Oct 2011 00:43:41 +0000 (11:43 +1100)]
epoll: limit paths
The current epoll code can be tickled to run basically indefinitely in
both the loop detection path check (on ep_insert()) and the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely. A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.
To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have been already visited. Thus, the loop detection
becomes proportional to the number of epoll file descriptors and links.
This dramatically decreases the run-time of the loop check algorithm. In
one diabolical case I tried it reduced the run-time from 15 minutes (all
in kernel time) to .3 seconds.
Fixing the wakeup paths could be done at wakeup time in a similar manner
by keeping track of nodes that have already been visited, but this is
more complex, since there can be multiple wakeups on different CPUs.
Thus, I've opted to limit the number of possible wakeup paths when the
paths are created.
This is accomplished by noting that the end file descriptor points that
are found during the loop detection pass (from the newly added link) are
actually the sources for wakeup events. I keep a list of these file
descriptors and limit the number and length of the paths that emanate
from these 'source file descriptors'. In the current implementation I
allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
length 4 and 10 of length 5. Note that it is sufficient to check the
'source file descriptors' reachable from the newly added link, since no
other 'source file descriptors' will have newly added links. This allows
us to check only the wakeup paths that may have gotten too long, and not
re-check all possible wakeup paths on the system.
In terms of the path limit selection, I think it's first worth noting that
the most common case for epoll is probably the model where you have 1
epoll file descriptor that is monitoring n 'source file descriptors'. In
this case, each 'source file descriptor' has 1 path of length 1. Thus, I
believe that the limits I'm proposing are quite reasonable and in fact may
be too generous. I'm hoping that the proposed limits will not cause any
workloads that currently work to fail.
In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations. Currently it's only used in a subset
of the add paths. I need to hold the epmutex so that we can correctly
traverse a coherent graph to check the number of paths. I believe that
this additional locking is probably ok, since it's in the setup/teardown
paths and doesn't affect the running paths, but it certainly is going to
add some extra overhead. Also worth noting is that the epmutex was
recently added to the ep_ctl add operations in the initial path loop
detection code using the argument that it was not on a critical path.
Another thing to note here, is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:
/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4
This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1). However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order. Thus, this limit is currently easily bypassed. The
newly added check for wakeup paths strictly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.
Thus far, I've tested this using the sample programs previously
mentioned, which now either return quickly or return -EINVAL. I've also
tested using the piptest.c epoll tester, which showed no difference in
performance. I've also created a number of different epoll networks and
tested that they behave as expected.
I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.
Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Nelson Elhage <nelhage@ksplice.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
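The per-length limits quoted above can be sketched as a small table plus a counter. The names here (path_limits, struct path_counts, path_count_inc) are hypothetical for this illustration, and returning -EINVAL is an assumption based on what the test programs are said to observe.

```c
#include <errno.h>

#define MAX_PATH_LEN 5

/* Allowed number of wakeup paths, indexed by (path length - 1). */
static const int path_limits[MAX_PATH_LEN] = { 1000, 500, 100, 50, 10 };

/* Paths seen so far from one 'source file descriptor'. */
struct path_counts {
    int count[MAX_PATH_LEN];
};

/* Record one path of length 'len'. Returns 0 while within the limits,
 * -EINVAL if the path is too long or its length bucket is full. */
static int path_count_inc(struct path_counts *pc, int len)
{
    if (len < 1 || len > MAX_PATH_LEN)
        return -EINVAL;
    if (++pc->count[len - 1] > path_limits[len - 1])
        return -EINVAL;
    return 0;
}
```

In the common case of one epoll fd watching n sources, each source records a single path of length 1, far below the first limit.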
Nelson Elhage [Wed, 5 Oct 2011 00:43:41 +0000 (11:43 +1100)]
epoll: fix spurious lockdep warnings
epoll can recursively acquire ep->mtx on multiple "struct
eventpoll"s at once in the case where one epoll fd is monitoring another
epoll fd. This is perfectly OK, since we're careful about the lock
ordering, but it causes spurious lockdep warnings. Annotate the recursion
using mutex_lock_nested, and add a comment explaining the nesting rules
for good measure.
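Recent versions of systemd are triggering this, and it can also be
demonstrated with the following trivial test program:
--------------------8<--------------------
#include <sys/epoll.h>
int main(void) {
    int e1, e2;
    struct epoll_event evt = {
        .events = EPOLLIN
    };
    e1 = epoll_create1(0);
    e2 = epoll_create1(0);
    epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
}
--------------------8<--------------------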
Reported-by: Paul Bolle <pebolle@tiscali.nl> Tested-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Nelson Elhage <nelhage@nelhage.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Wed, 5 Oct 2011 00:43:40 +0000 (11:43 +1100)]
lib-crc-add-slice-by-8-algorithm-to-crc32c-fix
don't include asm/msr.h
Cc: Bob Pearson <rpearson@systemfabricworks.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Roland Dreier <roland@kernel.org> Cc: frank zago <fzago@systemfabricworks.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
frank zago [Wed, 5 Oct 2011 00:43:40 +0000 (11:43 +1100)]
lib/crc: add slice by 8 algorithm to crc32.c
Add support for slice by 8 to existing crc32 algorithm. Also modify
gen_crc32table.c to only produce table entries that are actually used.
The parameters CRC_LE_BITS and CRC_BE_BITS determine the number of bits in
the input array that are processed during each step. Generally the more
bits the faster the algorithm is but the more table data required.
Using an x86_64 Opteron machine running at 2100MHz the following table was
collected with a pre-warmed cache by computing the crc 1000 times on a
buffer of 4096 bytes.
BITS is the value of CRC_LE_BITS or CRC_BE_BITS. The old
default was 8 which actually selected the 32 bit algorithm. In
this version the value 8 is used to select the standard
8 bit algorithm and two new values: 32 and 64 are introduced
to select the slice by 4 and slice by 8 algorithms respectively.
Where Size is the size of crc32.o's text segment which includes
code and table data when both LE and BE versions are set to BITS.
The current version of crc32.c by default uses the slice by 4 algorithm
which requires about 2.8 cycles per byte. The slice by 8 algorithm is
roughly 2X faster and enables packet processing at over 1GB/sec on a
typical 2-3GHz system.
Signed-off-by: Bob Pearson <rpearson@systemfabricworks.com> Cc: Roland Dreier <roland@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
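For readers unfamiliar with slicing, here is a compact user-space sketch of slice by 8 for the little-endian (reflected) CRC-32 polynomial 0xEDB88320. It is not the code from crc32.c: the function names are hypothetical, the tables are built at run time rather than by gen_crc32table.c, and the pre/post inversion follows the common CRC-32 convention, which may differ from how crc32_le() callers handle the seed.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_tab[8][256];

/* Build the 8 lookup tables; table 0 is the classic byte-wise table. */
static void crc32_init_tables(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_tab[0][i] = c;
    }
    for (uint32_t i = 0; i < 256; i++)
        for (int t = 1; t < 8; t++)
            crc_tab[t][i] = (crc_tab[t - 1][i] >> 8) ^
                            crc_tab[0][crc_tab[t - 1][i] & 0xff];
}

/* Reference: one table lookup per input byte. */
static uint32_t crc32_bytewise(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len--)
        crc = (crc >> 8) ^ crc_tab[0][(crc ^ *p++) & 0xff];
    return ~crc;
}

/* Slice by 8: consume 8 input bytes per loop iteration. */
static uint32_t crc32_slice8(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len >= 8) {
        /* Assemble two little-endian words byte by byte so the
         * result does not depend on host endianness or alignment. */
        uint32_t lo = crc ^ ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
                             (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24);
        uint32_t hi = (uint32_t)p[4] | (uint32_t)p[5] << 8 |
                      (uint32_t)p[6] << 16 | (uint32_t)p[7] << 24;

        crc = crc_tab[7][lo & 0xff] ^ crc_tab[6][(lo >> 8) & 0xff] ^
              crc_tab[5][(lo >> 16) & 0xff] ^ crc_tab[4][lo >> 24] ^
              crc_tab[3][hi & 0xff] ^ crc_tab[2][(hi >> 8) & 0xff] ^
              crc_tab[1][(hi >> 16) & 0xff] ^ crc_tab[0][hi >> 24];
        p += 8;
        len -= 8;
    }
    while (len--)   /* leftover tail, byte at a time */
        crc = (crc >> 8) ^ crc_tab[0][(crc ^ *p++) & 0xff];
    return ~crc;
}
```

After crc32_init_tables(), both functions should return the same value for any buffer; the slice by 8 variant trades 8KB of table data for fewer dependent lookups per byte.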
Glauber Costa [Wed, 5 Oct 2011 00:43:36 +0000 (11:43 +1100)]
lib/percpu_counter.c: enclose hotplug only variables in hotplug ifdef
These variables are only used when CONFIG_HOTPLUG_CPU is enabled; every
use of them is already under that ifdef. So don't define them when
CONFIG_HOTPLUG_CPU is not enabled.
Signed-off-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: H Hartley Sweeten <hartleys@visionengravers.com> Cc: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Len Brown <len.brown@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
lib/bitmap.c: quiet sparse noise about address space
__bitmap_parse() and __bitmap_parselist() both take a pointer to a kernel
buffer as a parameter and then cast it to a pointer to user buffer for use
in cases when the parameter is_user indicates that the buffer is actually
located in user space. This casting, and the casts in the callers,
results in sparse noise like the following:
warning: incorrect type in initializer (different address spaces)
expected char const [noderef] <asn:1>*ubuf
got char const *buf
warning: cast removes address space of expression
Since these casts are intentional, use __force to quiet the noise.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Len Brown <len.brown@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Wed, 5 Oct 2011 00:43:35 +0000 (11:43 +1100)]
lib/spinlock_debug.c: print owner on spinlock lockup
When SPIN_BUG_ON is triggered, the lock owner information is reported.
But it is omitted when spinlock lockup is detected.
This information is useful especially on the architectures which don't
implement trigger_all_cpu_backtrace() that is called just after detecting
lockup. So report it and also avoid message format duplication.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alexey Dobriyan [Wed, 5 Oct 2011 00:43:35 +0000 (11:43 +1100)]
lib/kstrtox: common code between kstrto*() and simple_strto*() functions
Currently termination logic (\0 or \n\0) is hardcoded in _kstrtoull(),
avoid that for code reuse between kstrto*() and simple_strtoull().
Essentially, make them different only in termination logic.
simple_strtoull() (and scanf(), BTW) ignores integer overflow, that's a
bug we currently don't have guts to fix, making KSTRTOX_OVERFLOW hack
necessary.
Almost forgot: patch shrinks code size by about 80 bytes on x86_64.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
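The idea of one parsing core with two termination policies can be sketched as below. This is a simplified decimal-only model with hypothetical names (parse_digits, strict_parse, lenient_parse); the real helpers also handle bases and prefixes, and the overflow flag here only plays the role the message assigns to KSTRTOX_OVERFLOW.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Shared core: consume decimal digits, report how many were read
 * and whether the value overflowed (the result wraps in that case). */
static size_t parse_digits(const char *s, unsigned long long *res,
                           int *overflow)
{
    unsigned long long acc = 0;
    size_t n = 0;

    *overflow = 0;
    for (; s[n] >= '0' && s[n] <= '9'; n++) {
        unsigned int d = (unsigned int)(s[n] - '0');

        if (acc > (ULLONG_MAX - d) / 10)
            *overflow = 1;
        acc = acc * 10 + d;
    }
    *res = acc;
    return n;
}

/* Strict policy: the string must end with "\0" or "\n\0", and
 * overflow is an error. */
static int strict_parse(const char *s, unsigned long long *res)
{
    unsigned long long v;
    int overflow;
    size_t n = parse_digits(s, &v, &overflow);

    if (n == 0)
        return -EINVAL;
    if (overflow)
        return -ERANGE;
    s += n;
    if (*s == '\n')
        s++;
    if (*s)
        return -EINVAL;
    *res = v;
    return 0;
}

/* Lenient policy: stop at the first non-digit, ignore overflow. */
static unsigned long long lenient_parse(const char *s, const char **end)
{
    unsigned long long v;
    int overflow;
    size_t n = parse_digits(s, &v, &overflow);

    if (end)
        *end = s + n;
    return v;
}
```

The two wrappers differ only in what they do after the digits end, which is the split the message describes.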
My GPIOs are on an I2C port expander, so we must use the *_cansleep()
variant of the GPIO functions. This was not being done in
create_gpio_led().
We can change gpio_get_value() to gpio_get_value_cansleep() because it is
only called from the platform_driver probe function, which is a context
where we can sleep.
Only tested on my gpio_cansleep() system, but it seems safe for all
systems.
Signed-off-by: David Daney <david.daney@cavium.com> Cc: Richard Purdie <rpurdie@rpsys.net> Acked-by: Trent Piepho <tpiepho@gmail.com> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Magnus Damm [Wed, 5 Oct 2011 00:43:34 +0000 (11:43 +1100)]
drivers/leds/leds-renesas-tpu.c: move Renesas TPU LED driver platform data
Use the platform_data include directory for the TPU LED driver, as
suggested by Paul Mundt.
Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Magnus Damm [Wed, 5 Oct 2011 00:43:34 +0000 (11:43 +1100)]
drivers/leds/leds-renesas-tpu.c: update driver to use workqueue
Use a workqueue in the Renesas TPU LED driver to allow the Runtime PM code
to sleep.
Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wolfram Sang [Wed, 5 Oct 2011 00:43:33 +0000 (11:43 +1100)]
drivers/leds/leds-lm3530.c: remove obsolete cleanup for clientdata
A few new i2c-drivers came into the kernel which clear the
clientdata-pointer on exit or error. This is now obsolete, since the core
will do it.
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Cc: Richard Purdie <rpurdie@rpsys.net> Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Wed, 5 Oct 2011 00:43:33 +0000 (11:43 +1100)]
leds-renesas-tpu-led-driver-v2-fix
include linux/module.h
Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Magnus Damm [Wed, 5 Oct 2011 00:43:32 +0000 (11:43 +1100)]
leds: Renesas TPU LED driver
Add V2 of the LED driver for a single timer channel for the TPU hardware
block commonly found in Renesas SoCs.
The driver has been written with optimal Power Management in mind: to
save power, the LED is driven as a regular GPIO pin at maximum brightness
and when powered off. This allows the TPU hardware to be idle, which in
turn allows the clocks to be stopped and the power domain to be turned
off transparently.
Any other brightness level requires use of the TPU hardware in PWM mode.
TPU hardware device clocks and power are managed through Runtime PM.
System suspend and resume is known to be working - during suspend the LED
is set to off by the generic LED code.
The TPU hardware timer is equipped with a 16-bit counter together with an
up-to-divide-by-64 prescaler which makes the hardware suitable for
brightness control. Hardware blink is unsupported.
The LED PWM waveform has been verified with a Fluke 123 Scope meter on a
sh7372 Mackerel board. Tested with experimental sh7372 A3SP power domain
patches. Platform device bind/unbind tested ok.
V2 has been tested on the DS2 LED of the sh73a0-based AG5EVM.
Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Brown [Wed, 5 Oct 2011 00:43:31 +0000 (11:43 +1100)]
backlight: fix broken regulator API usage in l4f00242t03
The regulator support in the l4f00242t03 is very non-idiomatic. Rather
than requesting the regulators based on the device name and the supply
names used by the device the driver requires boards to pass system
specific supply names around through platform data. The driver also
conditionally requests the regulators based on this platform data, adding
unneeded conditional code to the driver.
Fix this by removing the platform data and converting to the standard
idiom, also updating all in tree users of the driver. As no datasheet
appears to be available for the LCD I'm guessing the names for the
supplies based on the existing users and I've no ability to do anything
more than compile test.
The use of regulator_set_voltage() in the driver is also questionable:
since fixed voltages are required, the expectation would be that the
voltages are fixed in the constraints set by the machines rather than
manually configured by the driver. This is less problematic, though.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Tested-by: Fabio Estevam <fabio.estevam@freescale.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wolfram Sang [Wed, 5 Oct 2011 00:43:31 +0000 (11:43 +1100)]
video/backlight: remove obsolete cleanup for clientdata
A few new i2c-drivers came into the kernel which clear the
clientdata-pointer on exit or error. This is now obsolete, since the core
will do it.
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hans Verkuil [Wed, 5 Oct 2011 00:43:30 +0000 (11:43 +1100)]
poll: add poll_requested_events() function
In some cases the poll() implementation in a driver has to do different
things depending on the events the caller wants to poll for. An example
is when a driver needs to start a DMA engine if the caller polls for
POLLIN, but doesn't want to do that if POLLIN is not requested but instead
only POLLOUT or POLLPRI is requested. This is something that can happen
in the video4linux subsystem.
Unfortunately, the current epoll/poll/select implementation doesn't
provide that information reliably. The poll_table_struct does have it: it
has a key field with the event mask. But once a poll() call matches one
or more bits of that mask any following poll() calls are passed a NULL
poll_table_struct pointer.
The solution is to set the qproc field to NULL in poll_table_struct once
poll() matches the events, not the poll_table_struct pointer itself. That
way drivers can obtain the mask through a new poll_requested_events
inline.
The poll_table_struct can still be NULL since some kernel code calls it
internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h). In
that case poll_requested_events() returns ~0 (i.e. all events).
Since eventpoll always leaves the key field at ~0 instead of using the
requested events mask, that source was changed as well to properly fill in
the key field.
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Reviewed-by: Jonathan Corbet <corbet@lwn.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
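A user-space model of the helper and of a driver using it might look like this. struct poll_table below is a simplified stand-in for poll_table_struct (only the two fields the message mentions), and demo_poll() with its dma_started flag is a hypothetical driver, not video4linux code.

```c
#include <poll.h>
#include <stddef.h>

typedef void (*queue_proc_t)(void *file, void *wait_queue);

/* Simplified stand-in for poll_table_struct. */
struct poll_table {
    queue_proc_t qproc;   /* set to NULL once events have matched */
    unsigned long key;    /* event mask the caller asked for */
};

/* Return the requested events, or all events for a NULL table. */
static inline unsigned long poll_requested_events(const struct poll_table *p)
{
    return p ? p->key : ~0UL;
}

static int dma_started;   /* hypothetical driver state */

/* Hypothetical driver poll(): only start DMA if the caller wants POLLIN. */
static unsigned int demo_poll(const struct poll_table *pt)
{
    if (poll_requested_events(pt) & POLLIN)
        dma_started = 1;
    return 0;
}
```

Because only qproc is cleared after a match, the mask in key stays readable on later poll() calls, which is what lets the driver make this decision reliably.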
WARNING: externs should be avoided in .c files
#99: FILE: arch/alpha/boot/misc.c:28:
+extern __printf(1, 2) long srm_printk(const char *, ...);
ERROR: space required after that ';' (ctx:VxV)
#178: FILE: arch/powerpc/boot/ps3.c:39:
+static inline __printf(1, 2) int DBG(const char *fmt, ...) {return 0;}
^
ERROR: "foo* bar" should be "foo *bar"
#225: FILE: arch/s390/include/asm/debug.h:175:
+debug_sprintf_event(debug_info_t* id, int level, char *string, ...);
ERROR: space required after that ',' (ctx:VxV)
#237: FILE: arch/s390/include/asm/debug.h:216:
+debug_sprintf_exception(debug_info_t *id, int level, char *string,...);
^
WARNING: space prohibited between function name and open parenthesis '('
#494: FILE: fs/ext2/ext2.h:139:
+void ext2_error (struct super_block *, const char *, const char *, ...);
WARNING: printk() should include KERN_ facility level
#719: FILE: fs/partitions/ldm.c:63:
+ printk("%s%s(): %pV\n", level, function, &vaf);
WARNING: space prohibited between function name and open parenthesis '('
#721: FILE: fs/partitions/ldm.c:65:
+ va_end (args);
WARNING: space prohibited between function name and open parenthesis '('
#750: FILE: fs/ufs/ufs.h:121:
+void ufs_warning (struct super_block *, const char *, const char *, ...);
WARNING: space prohibited between function name and open parenthesis '('
#752: FILE: fs/ufs/ufs.h:123:
+void ufs_error (struct super_block *, const char *, const char *, ...);
WARNING: space prohibited between function name and open parenthesis '('
#754: FILE: fs/ufs/ufs.h:125:
+void ufs_panic (struct super_block *, const char *, const char *, ...);
WARNING: space prohibited between function name and open parenthesis '('
#1074: FILE: include/linux/ext3_fs.h:941:
+void ext3_error (struct super_block *, const char *, const char *, ...);
WARNING: space prohibited between function name and open parenthesis '('
#1083: FILE: include/linux/ext3_fs.h:944:
+void ext3_abort (struct super_block *, const char *, const char *, ...);
WARNING: space prohibited between function name and open parenthesis '('
#1085: FILE: include/linux/ext3_fs.h:946:
+void ext3_warning (struct super_block *, const char *, const char *, ...);
WARNING: do not add new typedefs
#1178: FILE: include/linux/kdb.h:119:
+typedef __printf(1, 2) int (*kdb_printf_t)(const char *, ...);
ERROR: "foo * bar" should be "foo *bar"
#1203: FILE: include/linux/kernel.h:299:
+extern __printf(2, 3) int sprintf(char * buf, const char * fmt, ...);
After merging the akpm tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:
In file included from arch/powerpc/boot/stdio.c:12:0:
arch/powerpc/boot/stdio.h:10:17: error: expected declaration specifiers or '...' before numeric constant
arch/powerpc/boot/stdio.h:10:20: error: expected declaration specifiers or '...' before numeric constant
arch/powerpc/boot/stdio.h:10:8: warning: return type defaults to 'int'
arch/powerpc/boot/stdio.h:10:8: warning: function declaration isn't a prototype
arch/powerpc/boot/stdio.h: In function '__printf':
arch/powerpc/boot/stdio.h:14:17: error: expected declaration specifiers or '...' before numeric constant
arch/powerpc/boot/stdio.h:14:20: error: expected declaration specifiers or '...' before numeric constant
arch/powerpc/boot/stdio.h:14:23: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'int'
arch/powerpc/boot/stdio.h:16:12: error: storage class specified for parameter 'vsprintf'
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
William Douglas [Wed, 5 Oct 2011 00:43:28 +0000 (11:43 +1100)]
printk: remove bounds checking for log_prefix
Currently log_prefix is testing that the first character of the log level
and facility is less than '0' and greater than '9' (which is always
false).
Since the code being updated works because strtoul bombs out (endp isn't
updated) and 0 is returned anyway, just remove the check; this does not
change the behavior of the function.
Signed-off-by: William Douglas <william.douglas@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
William Douglas [Wed, 5 Oct 2011 00:43:28 +0000 (11:43 +1100)]
printk: fix bounds checking for log_prefix
Currently log_prefix is testing that the first character of the log level
and facility is less than '0' and greater than '9' (which is always
false). It should be testing to see if the character is less than '0' or
greater than '9' instead. This patch makes that change.
The code being changed worked because strtoul bombs out (endp isn't
updated) and 0 is returned anyway.
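The difference between the two conditions can be sketched in isolation. The function names below are hypothetical and this is not the actual log_prefix code, only the two comparisons it describes.

```c
#include <stdbool.h>

/* The condition as described: no character is both below '0' and above '9'. */
static inline bool bad_digit_check(char c)
{
	return c < '0' && c > '9';	/* always false */
}

/* The intended condition: true for anything outside '0'..'9'. */
static inline bool not_a_digit(char c)
{
	return c < '0' || c > '9';
}
```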
Signed-off-by: William Douglas <william.douglas@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jason Baron [Wed, 5 Oct 2011 00:43:26 +0000 (11:43 +1100)]
dynamic_debug: fix undefined reference to `__netdev_printk'
Dynamic debug recently added support for netdev_printk. It uses
__netdev_printk() to support this functionality. However, when CONFIG_NET
is not set, we get the following error:
lib/built-in.o: In function `__dynamic_netdev_dbg':
(.text+0x9fda): undefined reference to `__netdev_printk'
Fix this by making the call to netdev_printk() contingent upon CONFIG_NET.
We could have fixed this by defining netdev_printk() to a 'no-op' in the
!CONFIG_NET case. However, this is not consistent with how the networking
layer uses netdev_printk. For example, when CONFIG_NET is not set,
netdev_printk() does not have a 'no-op' definition.
Signed-off-by: Jason Baron <jbaron@redhat.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Greg KH <greg@kroah.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@google.com>
Jason Baron [Wed, 5 Oct 2011 00:43:26 +0000 (11:43 +1100)]
dynamic_debug: use a single printk() to emit messages
We were using KERN_CONT to combine messages with their prefix. However,
KERN_CONT is not smp safe, in the sense that it can interleave messages.
This interleaving can result in printks coming out at the wrong loglevel.
With the high frequency of printks that dynamic debug can produce this is
not desirable.
So make dynamic_emit_prefix() fill a char buf[64] instead of doing a
printk directly. If we enable printing out of function, module, line, or
pid info, they are placed in this 64 byte buffer. In my testing 64 bytes
was large enough for all requests. Even if it's not, we can match
up the printk itself to see where it's from, so to me this is no big deal.
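A minimal userspace sketch of the approach, assuming hypothetical flag names and helper functions (prefix_append(), emit_debug()): the prefix is assembled in a 64-byte buffer, silently truncated if it does not fit, and then emitted together with the message in one formatted call instead of several continuation calls.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

#define PREFIX_SIZE 64

/* Hypothetical flags selecting which fields go into the prefix. */
#define SHOW_MODULE 0x1
#define SHOW_FUNC   0x2
#define SHOW_LINE   0x4

/* Append formatted text at pos; on truncation, clamp pos to the last byte. */
static inline size_t prefix_append(char *buf, size_t size, size_t pos,
				   const char *fmt, ...)
{
	va_list args;
	int n;

	if (pos >= size - 1)
		return pos;		/* buffer already full, drop the rest */
	va_start(args, fmt);
	n = vsnprintf(buf + pos, size - pos, fmt, args);
	va_end(args);
	if (n < 0)
		return pos;
	pos += (size_t)n;
	return pos < size ? pos : size - 1;
}

/* Build the prefix first, then emit prefix and message in a single call. */
static inline int emit_debug(char *out, size_t out_size, unsigned int flags,
			     const char *mod, const char *func, int line,
			     const char *msg)
{
	char prefix[PREFIX_SIZE];
	size_t pos = 0;

	prefix[0] = '\0';
	if (flags & SHOW_MODULE)
		pos = prefix_append(prefix, sizeof(prefix), pos, "%s:", mod);
	if (flags & SHOW_FUNC)
		pos = prefix_append(prefix, sizeof(prefix), pos, "%s:", func);
	if (flags & SHOW_LINE)
		pos = prefix_append(prefix, sizeof(prefix), pos, "%d ", line);

	/* One call stands in for the single printk(). */
	return snprintf(out, out_size, "%s%s", prefix, msg);
}
```

Here the output goes to a string so the result is easy to inspect; in the kernel the final call would be the printk itself.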
[akpm@linux-foundation.org: convert dangerous macro to C] Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Greg KH <greg@kroah.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@google.com>
Vasily Averin [Wed, 5 Oct 2011 00:43:25 +0000 (11:43 +1100)]
watchdog: move watchdog_*_all_cpus under CONFIG_SYSCTL
Fix compilation warnings when CONFIG_SYSCTL is disabled:
kernel/watchdog.c:483:13: warning: `watchdog_enable_all_cpus' defined but not used
kernel/watchdog.c:500:13: warning: `watchdog_disable_all_cpus' defined but not used
These functions are static and are used only in the sysctl handler, so
move them inside #ifdef CONFIG_SYSCTL too.
Signed-off-by: Vasily Averin <vvs@sw.ru> Signed-off-by: Andrew Morton <akpm@google.com>
stop_machine: make stop_machine safe and efficient to call early
Make stop_machine() safe to call early in boot, before SMP has been set
up, by simply calling the callback function directly if there's only one
CPU online.
[ Fixes from AKPM:
- add comment
- local_irq_flags, not save_flags
- also call hard_irq_disable() for systems which need it
Tejun suggested using an explicit flag rather than just looking at
the online cpu count. ]
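The early-boot shortcut can be modelled in userspace roughly as follows. All names here (stopper_ready, model_irq_save(), model_stop_machine(), bump()) are hypothetical stand-ins, and the slow path is only a placeholder; the sketch shows the explicit flag and the direct call with interrupts disabled, nothing more.

```c
#include <stdbool.h>

typedef int (*stop_fn_t)(void *data);

/* Stand-ins for disabling and restoring interrupts. */
static int irq_disable_depth;
static inline void model_irq_save(void)    { irq_disable_depth++; }
static inline void model_irq_restore(void) { irq_disable_depth--; }

/* Explicit flag: set once the cross-CPU stopper machinery is available. */
static bool stopper_ready;
static int slow_path_calls;

static inline int model_stop_machine(stop_fn_t fn, void *data)
{
	int ret;

	if (!stopper_ready) {
		/* Early boot, one CPU: run the callback directly, IRQs off. */
		model_irq_save();
		ret = fn(data);
		model_irq_restore();
		return ret;
	}

	slow_path_calls++;	/* placeholder for the real multi-CPU path */
	return fn(data);
}

/* Example callback: counts calls and reports the IRQ-disable depth it saw. */
static inline int bump(void *data)
{
	++*(int *)data;
	return irq_disable_depth;
}
```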
Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Andrew Morton <akpm@google.com>
Mark Brown [Wed, 5 Oct 2011 00:43:23 +0000 (11:43 +1100)]
lis3lv02d: make regulator API usage unconditional
The regulator API contains a range of features for stubbing itself out
when not in use and for transparently restricting the actual effect of
regulator API calls where they can't be supported on a particular system
so that drivers don't need to individually implement this. Simplify the
driver slightly by making use of this idiom.
The only in tree user is ecovec24 which does not use the regulator API.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Éric Piel <eric.piel@tremplin-utc.net> Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com> Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
From my POV, it looks like the hardware is not working as expected
and returns a bogus data rate. The driver doesn't check the result
and directly uses it as some sort of divisor in some places:
msleep(lis3->pwron_delay / lis3lv02d_get_odr());
Under these circumstances, this could very well cause the
"divide by zero" exception from above.
For now, I fixed it the easiest and most obvious way:
Check if the result is sane and if it isn't use a sane default
instead. I went for "100" in the latter case, simply because
/sys/devices/platform/lis3lv02d/rate returns it on a successful
boot.
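The sanity check can be sketched like this. The helper names are hypothetical and the real driver code may be structured differently; only the fallback value of 100 and the division by the data rate come from the description above.

```c
/* Fallback data rate, taken from the value mentioned in the commit message. */
#define DEFAULT_ODR 100

/* Replace a bogus (zero or negative) data rate with the default. */
static inline int sane_odr(int reported)
{
	return reported > 0 ? reported : DEFAULT_ODR;
}

/* Divide by the sanitized rate so a bogus reading cannot trap. */
static inline int pwron_sleep_ms(int pwron_delay, int reported_odr)
{
	return pwron_delay / sane_odr(reported_odr);
}
```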
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com> Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net> Cc: Matthew Garrett <mjg@redhat.com> Cc: Witold Pilat <witold.pilat@gmail.com> Cc: Lyall Pearce <lyall.pearce@hp.com> Cc: Malte Starostik <m-starostik@versanet.de> Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com> Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com> Cc: Christian Lamparter <chunkeey@googlemail.com> Signed-off-by: Andrew Morton <akpm@google.com>
drivers/hwmon/hwmon.c: convert idr to ida and use ida_simple_get()
A straightforward looking use of idr for a device id.
Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Cc: Guenter Roeck <guenter.roeck@ericsson.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Acked-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pavel Emelyanov [Wed, 5 Oct 2011 00:43:18 +0000 (11:43 +1100)]
fs/pipe.c: add ->statfs callback for pipefs
Currently a statfs on a pipe's /proc/<pid>/fd/ link returns -ENOSYS. Wire
pipefs up so that the statfs succeeds.
This is required by checkpoint-restart in the userspace to make it
possible to distinguish pipes from fifos.
When we dump information about task's open files we use the /proc/pid/fd
directory's symlinks and the fact that opening any of them gives us exactly
the same dentry->inode pair as the original process has. Now if a task
we're dumping has opened a pipe and a fifo we need to detect this and act
accordingly. Knowing that an fd with type S_ISFIFO resides on a pipefs is
accordingly. Knowing that an fd with type S_ISFIFO resides on a pipefs is
the most precise way.
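From userspace, the check could look roughly like the sketch below. It calls fstatfs() on the descriptor itself rather than statfs() on the /proc link, which is an assumption made to keep the example short, and fd_is_pipefs_pipe() is a hypothetical name. It relies on a kernel where pipefs provides statfs.

```c
#include <sys/stat.h>
#include <sys/vfs.h>

#ifndef PIPEFS_MAGIC
#define PIPEFS_MAGIC 0x50495045	/* same value as in <linux/magic.h> */
#endif

/*
 * Returns 1 if fd is a FIFO-type file living on pipefs (an anonymous pipe),
 * 0 if it is anything else (for example a named fifo on a regular
 * filesystem), and -1 on error.
 */
static inline int fd_is_pipefs_pipe(int fd)
{
	struct stat st;
	struct statfs sfs;

	if (fstat(fd, &st) < 0)
		return -1;
	if (!S_ISFIFO(st.st_mode))
		return 0;
	if (fstatfs(fd, &sfs) < 0)
		return -1;
	return sfs.f_type == PIPEFS_MAGIC;
}
```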
Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Wed, 5 Oct 2011 00:43:17 +0000 (11:43 +1100)]
intel_idle: fix API misuse
smp_call_function() only makes all other CPUs execute a specific function,
while intel_idle expects every CPU, including the calling one, to execute
it. Without the fix, we could have one CPU which has auto_demotion enabled
or has no broadcast timer set up. Usually we don't see any impact because
auto demotion just harms power and the intel_idle init is called on CPU 0,
where the broadcast timer delivers interrupts, but this still could be a
problem.
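The semantic difference can be shown with a toy userspace model. The function names are hypothetical, there is no real cross-CPU signalling here, and the model does not claim to show which kernel helper the fix switches to; it only demonstrates that an "all other CPUs" call leaves the calling CPU untouched.

```c
#define NR_CPUS_MODEL 4

typedef void (*cpu_fn_t)(int cpu, void *info);

/* Models "run on all other CPUs": the calling CPU is skipped. */
static inline void model_call_others(int self, cpu_fn_t fn, void *info)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		if (cpu != self)
			fn(cpu, info);
}

/* Models "run on every CPU": the others plus the calling CPU. */
static inline void model_call_all(int self, cpu_fn_t fn, void *info)
{
	model_call_others(self, fn, info);
	fn(self, info);
}

/* Example per-CPU action: clear that CPU's auto-demotion flag. */
static inline void disable_auto_demotion(int cpu, void *info)
{
	((int *)info)[cpu] = 0;
}
```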
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Len Brown <lenb@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michael Cree [Wed, 5 Oct 2011 00:43:17 +0000 (11:43 +1100)]
alpha: wire up sendmmsg syscall
Signed-off-by: Michael Cree <mcree@orcon.net.nz> Reviewed-by: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michael Cree [Wed, 5 Oct 2011 00:43:17 +0000 (11:43 +1100)]
alpha: wire up accept4 syscall
Somehow wiring up the accept4 syscall on Alpha was missed long ago.
This commit rectifies that oversight.
Signed-off-by: Michael Cree <mcree@orcon.net.nz> Reviewed-by: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Magnus Lynch [Wed, 5 Oct 2011 00:43:16 +0000 (11:43 +1100)]
hpet: factor timer allocate from open
The current implementation of the /dev/hpet driver couples opening the
device with allocating one of the (scarce) timers (aka comparators). This
is a limitation in that the main counter may be valuable to applications
seeking a high-resolution timer who have no use for the interrupt
generating functionality of the comparators.
This patch alters the open semantics so that when the device is opened, no
timer is allocated. Operations that depend on a timer being in context
implicitly attempt allocating a timer, to maintain backward compatibility.
There is also an IOCTL (HPET_ALLOC_TIMER _IO) added so that the
allocation may be done explicitly. (I prefer the explicit open then
allocate pattern but don't know how practical it would be to require all
existing code to be changed.)
/dev/hpet is accessed via mmap(). This is the only interface of /dev/hpet
that is actually used in practice.
[akpm@linux-foundation.org: coding-style tweaks]
[arnd@arndb.de: fix build] Signed-off-by: Magnus Lynch <maglyx@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Acked-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Shevchenko [Wed, 5 Oct 2011 00:43:15 +0000 (11:43 +1100)]
selinuxfs: remove custom hex_to_bin()
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Eric Paris <eparis@parisplace.org> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Wed, 5 Oct 2011 00:43:15 +0000 (11:43 +1100)]
vmscan: add barrier to prevent evictable page in unevictable list
When a race between putback_lru_page() and shmem_lock with lock=0 happens,
program execution order is as follows, but clear_bit in processor #1 could
be reordered right before spin_unlock of processor #1. Then, the page
would be stranded on the unevictable list.
processor #0                    processor #1
(putback_lru_page)              (shmem_lock)
spin_lock
SetPageLRU
spin_unlock
                                clear_bit(AS_UNEVICTABLE)
                                spin_lock
                                if PageLRU()
                                    if !test_bit(AS_UNEVICTABLE)
                                        move evictable list
smp_mb
if !test_bit(AS_UNEVICTABLE)
    move evictable list
                                spin_unlock
However, pagevec_lookup() in scan_mapping_unevictable_pages() has
rcu_read_[un]lock(), so it could prevent the reordering before
test_bit(AS_UNEVICTABLE) is reached on processor #1, and this problem
never happens. But that is an unexpected side effect and we should solve
this problem properly.
This patch adds a barrier after mapping_clear_unevictable.
I have not hit this problem in practice; I found it during review.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@google.com>
warning: symbol 'default_policy' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Stephen Wilson <wilsons@start.ca> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
warning: symbol 'memblock_overlaps_region' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Tomi Valkeinen <tomi.valkeinen@nokia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 5 Oct 2011 00:43:13 +0000 (11:43 +1100)]
mm: disable user interface to manually rescue unevictable pages
At one point, anonymous pages were supposed to go on the unevictable list
when no swap space was configured, and the idea was to manually rescue
those pages after adding swap and making them evictable again. But
nowadays, swap-backed pages on the anon LRU list are not scanned without
available swap space anyway, so there is no point in moving them to a
separate list anymore.
The manual rescue could also be used in case pages were stranded on the
unevictable list due to race conditions. But the code has been around for
a while now and newly discovered bugs should be properly reported and
dealt with instead of relying on such a manual fixup.
In addition to the lack of a usecase, the sysfs interface to rescue pages
from a specific NUMA node has been broken since its introduction, so it's
unlikely that anybody ever relied on that.
This patch removes the functionality behind the sysctl and the
node-interface and emits a one-time warning when somebody tries to access
either of them.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reported-by: Kautuk Consul <consul.kautuk@gmail.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kautuk Consul [Wed, 5 Oct 2011 00:43:13 +0000 (11:43 +1100)]
vmscan.c: fix invalid strict_strtoul() check in write_scan_unevictable_node()
write_scan_unevictable_node() checks the value req returned by
strict_strtoul() and returns 1 if req is 0.
However, when strict_strtoul() returns 0, it means successful conversion
of buf to unsigned long.
Due to this, the function was not proceeding to scan the zones for
unevictable pages even though we write a valid value to the
scan_unevictable_pages sys file.
Change this check slightly to check for invalid value in buf as well as 0
value stored in res after successful conversion via strict_strtoul. In
both cases, we do not perform the scanning of this node's zones.
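The inverted check can be reproduced with a small userspace model. model_strict_strtoul() is a hypothetical stand-in that mimics the "0 on success, negative errno on failure" convention, and the should_scan_*() names are invented for this sketch; they return 1 when the node's zones would be scanned.

```c
#include <errno.h>
#include <stdlib.h>

/* Userspace stand-in: 0 on successful conversion, -EINVAL otherwise. */
static inline int model_strict_strtoul(const char *s, unsigned int base,
				       unsigned long *res)
{
	char *end;
	unsigned long val;

	if (*s == '\0')
		return -EINVAL;
	errno = 0;
	val = strtoul(s, &end, (int)base);
	if (errno)
		return -EINVAL;
	if (*end == '\n')	/* allow one trailing newline */
		end++;
	if (*end != '\0')
		return -EINVAL;
	*res = val;
	return 0;
}

/* Old logic: treats the return value 0 (success) as "nothing to do". */
static inline int should_scan_buggy(const char *buf)
{
	unsigned long res;
	unsigned long req = (unsigned long)model_strict_strtoul(buf, 10, &res);

	if (!req)
		return 0;	/* bails out on every successful conversion */
	return 1;
}

/* Fixed logic: skip the scan on a conversion error or on a value of 0. */
static inline int should_scan_fixed(const char *buf)
{
	unsigned long res;
	int err = model_strict_strtoul(buf, 10, &res);

	if (err || !res)
		return 0;
	return 1;
}
```

With the old logic a valid "1" never triggers a scan, while the fixed version scans for "1" and skips both "0" and unparsable input.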
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kyungmin Park [Wed, 5 Oct 2011 00:43:12 +0000 (11:43 +1100)]
mm: compaction: make compact_zone_order() static
There's no compact_zone_order() user outside file scope, so make it static.
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dean Nelson [Wed, 5 Oct 2011 00:43:12 +0000 (11:43 +1100)]
HWPOISON: convert pr_debug()s to pr_info()s
Commit fb46e73520940b ("HWPOISON: Convert pr_debugs to pr_info") authored
by Andi Kleen converted a number of pr_debug()s to pr_info()s.
About the same time additional code with pr_debug()s was added by two
other commits 8c6c2ecb4466 ("HWPOSION, hugetlb: recover from free hugepage
error when !MF_COUNT_INCREASED") and d950b95882f3d ("HWPOISON, hugetlb:
soft offlining for hugepage"). And these pr_debug()s failed to get
converted to pr_info()s.
This patch converts them as well and does some minor related whitespace
cleanup.
Signed-off-by: Dean Nelson <dnelson@redhat.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tao Ma [Wed, 5 Oct 2011 00:43:11 +0000 (11:43 +1100)]
fs/buffer.c: add device information for error output in __find_get_block_slow()
On the ext4 mailing list[1], we got some reports about errors in
__find_get_block_slow(), but the information is very limited.
If the device information is given, we can know the name of the sick
volume. Furthermore, we can get the corresponding status of that
block (group, inode block, etc.) by analyzing the disk layout.
Signed-off-by: Tao Ma <boyu.mt@taobao.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kautuk Consul [Wed, 5 Oct 2011 00:43:11 +0000 (11:43 +1100)]
mm/mmap.c: eliminate the ret variable from mm_take_all_locks()
The ret variable is really not needed in mm_take_all_locks().
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>