Kees Cook [Wed, 16 Nov 2011 23:41:51 +0000 (10:41 +1100)]
ramoops: update parameters only after successful init
If a platform device exists on the system, but ramoops fails to attach to
it, the module parameters are overridden before ramoops can fall back and
try to use passed module parameters. Move update to end of init routine.
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Marco Stornelli <marco.stornelli@gmail.com> Cc: Sergiu Iordache <sergiu@chromium.org> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andi Kleen [Wed, 16 Nov 2011 23:41:50 +0000 (10:41 +1100)]
dio: optimize cache misses in the submission path
Some investigation of a transaction processing workload showed that a
major consumer of cycles in __blockdev_direct_IO is the cache miss while
accessing the block size. This is because it has to walk the chain from
block_dev to gendisk to queue.
The block size is needed early on to check alignment and sizes. It's only
done if the check for the inode block size fails. But the costly block
device state is unconditionally fetched.
- Reorganize the code to only fetch block dev state when actually
needed.
Then do a prefetch on the block dev early on in the direct IO path. This
is worth it, because there is substantial code run before we actually
touch the block dev now.
- I also added some unlikelies to make it clear to the compiler that the
block device fetch code is not normally executed.
This gave a small, but measurable improvement on a large database
benchmark (about 0.3%)
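The reorganization can be pictured with a small userspace model. This is only a sketch under stated assumptions: the types fake_bdev, fake_disk and fake_queue and the function dio_blkbits() are hypothetical stand-ins, not the kernel's structures, and the GCC/Clang builtins __builtin_prefetch() and __builtin_expect() stand in for the kernel's prefetch() and unlikely() helpers.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the block_dev -> gendisk -> queue chain. */
struct fake_queue { unsigned int logical_block_size; };
struct fake_disk  { struct fake_queue *queue; };
struct fake_bdev  { struct fake_disk *disk; };

#define unlikely(x) __builtin_expect(!!(x), 0)

/* Convert a power-of-two block size into a shift count. */
static unsigned int size_to_bits(unsigned int size)
{
    unsigned int bits = 0;

    while (size > 1) {
        size >>= 1;
        bits++;
    }
    return bits;
}

/*
 * Return the block-size shift the request is aligned to, or -1 if it is
 * not even aligned to the device block size.
 */
static int dio_blkbits(struct fake_bdev *bdev, unsigned int inode_blkbits,
                       uint64_t offset, uint64_t len)
{
    uint64_t mask = (1ULL << inode_blkbits) - 1;

    /* Start pulling the first link of the chain into the cache early. */
    __builtin_prefetch(bdev);

    if (unlikely((offset | len) & mask)) {
        /* Slow path: only now walk bdev -> disk -> queue. */
        unsigned int bits =
            size_to_bits(bdev->disk->queue->logical_block_size);

        mask = (1ULL << bits) - 1;
        if ((offset | len) & mask)
            return -1;
        return (int)bits;
    }
    return (int)inode_blkbits;
}
```

A request aligned to the inode block size never dereferences the chain in this model, which is the behaviour the patch aims for in the common case.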
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tao Ma [Wed, 16 Nov 2011 23:41:49 +0000 (10:41 +1100)]
fs/direct-io.c: calculate fs_count correctly in get_more_blocks()
In get_more_blocks(), we use dio_count to calculate fs_count and do some
tricky things to increase fs_count if dio_count isn't aligned. But it
still has some corner cases that aren't covered. See the
following example:
dio_write foo -s 1024 -w 4096
(direct write 4096 bytes at offset 1024). The same goes if the offset
isn't aligned to fs_blocksize.
In this case, the old calculation computes fs_count as 1, but we will
actually write into 2 different blocks (if fs_blocksize=4096). The old code
still works, because it calls get_block twice (and may have to allocate
and create extents twice for filesystems like ext4). So we'd better call
get_block just once with the proper fs_count.
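The difference between the two calculations can be checked with a short userspace sketch. The function names are hypothetical, and the 512-byte device block size used in the example (so that blkfactor = 3 for a 4096-byte filesystem block) is an assumption; the old variant follows the description above rather than quoting the kernel source.

```c
#include <stdint.h>

/*
 * Old scheme as described above: derive fs_count from the length of the
 * request only, rounding up when the length is not a multiple of the
 * filesystem block size. The start offset is ignored.
 */
static uint64_t fs_count_old(uint64_t start_blk, uint64_t end_blk,
                             unsigned int blkfactor)
{
    uint64_t dio_count = end_blk - start_blk;
    uint64_t fs_count = dio_count >> blkfactor;
    uint64_t mask = (1ULL << blkfactor) - 1;

    if (dio_count & mask)
        fs_count++;
    return fs_count;
}

/*
 * Position-aware scheme: count the filesystem blocks between the one
 * holding the first device block and the one holding the last device
 * block of the request (end_blk is exclusive).
 */
static uint64_t fs_count_new(uint64_t start_blk, uint64_t end_blk,
                             unsigned int blkfactor)
{
    uint64_t fs_start = start_blk >> blkfactor;
    uint64_t fs_end = (end_blk - 1) >> blkfactor;

    return fs_end - fs_start + 1;
}
```

For the example above (4096 bytes at offset 1024), the request covers 512-byte blocks 2 up to but not including 10. The length-only calculation yields 1 filesystem block, while the position-aware one yields 2.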
Signed-off-by: Tao Ma <boyu.mt@taobao.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Manfred Spraul [Wed, 16 Nov 2011 23:41:49 +0000 (10:41 +1100)]
ipc/sem.c: alternatives to preempt_disable()
ipc/sem.c uses a custom wakeup scheme that relies on preempt_disable().
On -RT, this causes increased latencies and debug warnings.
The patch adds two additional schemes:
- one built around a completion - could be better for -RT kernels
- one built around a spinlock - unfortunately it's broken
- and the current one
My preferred solution would be the spinlock implementation: RT would use
preemptible spinlocks, mainline normal spinlocks. Thus both get the
optimal implementation without any special code in ipc/sem.c.
Unfortunately, I don't see how it could be fixed.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When kdump is loaded, kexec detects the current memory configuration and
stores it in the pre-allocated ELF core header. Therefore, for kdump it
is necessary to reload the kdump kernel with kexec when the memory
configuration changes (e.g. for online/offline hotplug memory).
In order to do this automatically, udev rules should be used. This kernel
patch adds udev events for "online" and "offline". Together with this
kernel patch, the following udev rules for online/offline have to be added
to "/etc/udev/rules.d/98-kexec.rules":
Michael Holzheu [Wed, 16 Nov 2011 23:41:48 +0000 (10:41 +1100)]
kdump: fix crash_kexec()/smp_send_stop() race in panic
When two CPUs call panic at the same time there is a possible race
condition that can stop kdump. The first CPU calls crash_kexec() and the
second CPU calls smp_send_stop() in panic() before crash_kexec() finished
on the first CPU. So the second CPU stops the first CPU and therefore
kdump fails:
1st CPU:
panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump
2nd CPU:
panic()->crash_kexec()->kexec_mutex already held by 1st CPU
->smp_send_stop()-> stop 1st CPU (stop kdump)
This patch fixes the problem by introducing a spinlock in panic that
allows only one CPU to process crash_kexec() and the subsequent panic
code.
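The idea of letting only one CPU through can be modelled in userspace with a C11 atomic flag acting as a trylock. This is a sketch, not the kernel code: panic_enter() is a hypothetical name, and what the losing CPUs do afterwards (for example, parking themselves) is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Never released: a panic is a one-way street. */
static atomic_flag panic_lock = ATOMIC_FLAG_INIT;

/*
 * Returns true for the single caller that wins the lock and may go on to
 * crash_kexec() and the rest of the panic path. Every later caller gets
 * false and must not reach smp_send_stop().
 */
static bool panic_enter(void)
{
    return !atomic_flag_test_and_set(&panic_lock);
}
```

Because the second CPU fails the trylock before it can stop other CPUs, the first CPU is no longer interrupted in the middle of kdump.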
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
One result of this bug is that the memory chunk can never be set offline
using memory hotplug. With this patch I insert a new "System RAM"
resource for the released memory. Then the upper example looks like the
following:
node_to_cpumask() has been replaced by cpumask_of_node(), and wholly
removed since commit 29c337a0 ("cpumask: remove obsolete node_to_cpumask
now everyone uses cpumask_of_node").
So update the comments for setup_node_to_cpumask_map().
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tejun Heo [Wed, 16 Nov 2011 23:41:46 +0000 (10:41 +1100)]
workqueue: make alloc_workqueue() take printf fmt and args for name
alloc_workqueue() currently expects the passed in @name pointer to remain
accessible. This is inconvenient and a bit silly given that the whole wq
is being dynamically allocated. This patch updates alloc_workqueue() and
friends to take a printf format string instead of an opaque string, with
matching varargs at the end. The name is allocated together with the wq
and formatted.
alloc_ordered_workqueue() is converted to a macro to unify varargs
handling with alloc_workqueue(), and, while at it, a comment is added to
alloc_workqueue().
None of the current in-kernel users pass in a string containing '%' as a
constant name, so this change shouldn't cause any problems.
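Allocating the formatted name together with the object can be sketched in userspace C with a flexible array member and two vsnprintf() passes. The struct fake_wq and the function fake_alloc_wq() are hypothetical, and their argument order is not meant to match the kernel's alloc_workqueue().

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_wq {
    unsigned int flags;
    char name[];    /* formatted name lives in the same allocation */
};

static struct fake_wq *fake_alloc_wq(unsigned int flags, const char *fmt, ...)
{
    va_list args, copy;
    struct fake_wq *wq;
    int len;

    va_start(args, fmt);

    /* First pass: measure how long the formatted name will be. */
    va_copy(copy, args);
    len = vsnprintf(NULL, 0, fmt, copy);
    va_end(copy);
    if (len < 0) {
        va_end(args);
        return NULL;
    }

    /* One allocation for the object and its name. */
    wq = malloc(sizeof(*wq) + (size_t)len + 1);
    if (wq) {
        wq->flags = flags;
        /* Second pass: format the name in place. */
        vsnprintf(wq->name, (size_t)len + 1, fmt, args);
    }
    va_end(args);
    return wq;
}
```

With this shape the caller's buffer does not have to outlive the call: fake_alloc_wq(0, "worker-%d", 3) produces an object named "worker-3" that owns its name. It also shows why a constant name containing '%' would have been a problem.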
Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
WARNING: line over 80 characters
#286: FILE: fs/proc/base.c:2433:
+static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t filldir)
WARNING: line over 80 characters
#351: FILE: fs/proc/base.c:2498:
+ fa = flex_array_alloc(sizeof(info), nr_files, GFP_KERNEL);
WARNING: line over 80 characters
#352: FILE: fs/proc/base.c:2499:
+ if (!fa || flex_array_prealloc(fa, 0, nr_files, GFP_KERNEL)) {
WARNING: line over 80 characters
#360: FILE: fs/proc/base.c:2507:
+ for (i = 0, vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
WARNING: line over 80 characters
#368: FILE: fs/proc/base.c:2515:
+ info.len = snprintf(info.name, sizeof(info.name),
WARNING: line over 80 characters
#424: FILE: fs/proc/base.c:3179:
+ DIR("map_files", S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
WARNING: line over 80 characters
#437: FILE: include/linux/mm.h:1497:
+find_exact_vma(struct mm_struct *mm, unsigned long vm_start, unsigned long vm_end)
total: 0 errors, 7 warnings, 387 lines checked
./patches/procfs-introduce-the-proc-pid-map_files-directory.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pavel Emelyanov [Wed, 16 Nov 2011 23:41:46 +0000 (10:41 +1100)]
procfs: introduce the /proc/<pid>/map_files/ directory
This one behaves similarly to the /proc/<pid>/fd/ one - it contains one
symlink for each mapping with a file; the name of a symlink is
"vma->vm_start-vma->vm_end" and the target is the file. Opening a symlink
results in a file that points to exactly the same inode as the vma's one.
For example the ls -l of some arbitrary /proc/<pid>/map_files/
1. When dumping a task's mappings we know the exact file that is mapped
by a particular region. We get it by opening the
/proc/$pid/map_files/$address symlink, the way we do with file
descriptors.
2. This also helps in determining which anonymous shared mappings are
shared with each other, by comparing their inodes.
3. When restoring a set of processes where two of them share a
mapping, we map the memory in the 1st one, then open its
/proc/$pid/map_files/$address file and map it in the 2nd task.
Using /proc/$pid/maps for this is quite inconvenient, since it requires
repeatedly re-reading and re-parsing this text file, which slows down the
restore procedure significantly. Also, as pointed out in (3), it is much
easier to use a top-level shared mapping in children as
/proc/$pid/map_files/$address when needed.
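A checkpoint tool could build the symlink path and compare inodes roughly as follows. This is a userspace sketch: map_files_path() and same_file() are hypothetical helper names, and the hexadecimal "start-end" naming without a 0x prefix is how the entries are spelled as far as this sketch assumes.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Build /proc/<pid>/map_files/<vm_start>-<vm_end> for one mapping. */
static int map_files_path(char *buf, size_t size, int pid,
                          unsigned long vm_start, unsigned long vm_end)
{
    return snprintf(buf, size, "/proc/%d/map_files/%lx-%lx",
                    pid, vm_start, vm_end);
}

/*
 * Return 1 if both paths resolve to the same inode on the same device,
 * 0 if they differ, and -1 if either path cannot be examined.
 */
static int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) < 0 || stat(b, &sb) < 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Two mappings in different tasks are shared with each other when same_file() reports 1 for their map_files entries, which avoids re-parsing /proc/$pid/maps.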
Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Vasiliy Kulikov <segoon@openwall.com> Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Tejun Heo <tj@kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 16 Nov 2011 23:41:45 +0000 (10:41 +1100)]
mm: memcg: remove unused node/section info from pc->flags fix
Fix non-CONFIG_SPARSEMEM build, which failed with
mm/page_cgroup.c: In function `alloc_node_page_cgroup':
mm/page_cgroup.c:44: error: `start_pfn' undeclared (first use in this function)
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 16 Nov 2011 23:41:45 +0000 (10:41 +1100)]
mm: memcg: remove unused node/section info from pc->flags
To find the page corresponding to a certain page_cgroup, the pc->flags
encoded the node or section ID with the base array to compare the pc
pointer to.
Now that the per-memory cgroup LRU lists link page descriptors directly,
there is no longer any code that knows the struct page_cgroup of a PFN but
not the struct page.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 16 Nov 2011 23:41:44 +0000 (10:41 +1100)]
mm: make per-memcg LRU lists exclusive
Now that all code that operated on global per-zone LRU lists is converted
to operate on per-memory cgroup LRU lists instead, there is no reason to
keep the double-LRU scheme around any longer.
The pc->lru member is removed and page->lru is linked directly to the
per-memory cgroup LRU lists, which removes two pointers from a descriptor
that exists for every page frame in the system.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Ying Han <yinghan@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 16 Nov 2011 23:41:44 +0000 (10:41 +1100)]
mm: collect LRU list heads into struct lruvec
Having a unified structure with an LRU list set for both global zones and
per-memcg zones keeps the code simple that deals with LRU lists and does
not care about the container itself.
Once the per-memcg LRU lists directly link struct pages, the isolation
function and all other list manipulations are shared between the memcg
case and the global LRU case.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 16 Nov 2011 23:41:43 +0000 (10:41 +1100)]
mm: vmscan: convert global reclaim to per-memcg LRU lists
The global per-zone LRU lists are about to go away on memcg-enabled
kernels, so global reclaim must be able to find its pages on the per-memcg
LRU lists.
Since the LRU pages of a zone are distributed over all existing memory
cgroups, a scan target for a zone is complete when all memory cgroups are
scanned for their proportional share of a zone's memory.
The forced scanning of small scan targets from kswapd is limited to zones
marked unreclaimable, otherwise kswapd can quickly overreclaim by
force-scanning the LRU lists of multiple memory cgroups.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 16 Nov 2011 23:41:43 +0000 (10:41 +1100)]
mm: memcg: remove optimization of keeping the root_mem_cgroup LRU lists empty
root_mem_cgroup, lacking a configurable limit, was never subject to limit
reclaim, so the pages charged to it could be kept off its LRU lists. They
would be found on the global per-zone LRU lists upon physical memory
pressure and it made sense to avoid uselessly linking them to both lists.
The global per-zone LRU lists are about to go away on memcg-enabled
kernels, with all pages being exclusively linked to their respective
per-memcg LRU lists. As a result, pages of the root_mem_cgroup must also
be linked to its LRU lists again. This is purely about the LRU list;
root_mem_cgroup is still not charged.
The overhead is temporary until the double-LRU scheme goes away
completely.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 16 Nov 2011 23:41:43 +0000 (10:41 +1100)]
mm: move memcg hierarchy reclaim to generic reclaim code
Memory cgroup limit reclaim and traditional global pressure reclaim will
soon share the same code to reclaim from a hierarchical tree of memory
cgroups.
In preparation of this, move the two right next to each other in
shrink_zone().
The mem_cgroup_hierarchical_reclaim() polymath is split into a soft limit
reclaim function, which still does hierarchy walking on its own, and a
limit (shrinking) reclaim function, which relies on generic reclaim code
to walk the hierarchy.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Memory cgroup limit reclaim currently picks one memory cgroup out of the
target hierarchy, remembers it as the last scanned child, and reclaims all
zones in it with decreasing priority levels.
Because the new hierarchy reclaim code will pick memory cgroups from the
same hierarchy concurrently from different zones and priority levels, it
becomes necessary that hierarchy roots not only remember the last scanned
child, but do so for each zone and priority level.
Until now, we reclaimed memcgs like this:
    mem = mem_cgroup_iter(root)
    for each priority level:
      for each zone in zonelist:
        reclaim(mem, zone)
But subsequent patches will move the memcg iteration inside the loop over
the zones:
    for each priority level:
      for each zone in zonelist:
        mem = mem_cgroup_iter(root)
        reclaim(mem, zone)
And to keep with the original scan order - memcg -> priority -> zone - the
last scanned memcg has to be remembered per zone and per priority level.
Furthermore, global reclaim will be switched to the hierarchy walk as
well. Different from limit reclaim, which can just recheck the limit
after some reclaim progress, its target is to scan all memcgs for the
desired zone pages, proportional to the memcg size, and so reliably
detecting a full hierarchy round-trip will become crucial.
Currently, the code relies on one reclaimer encountering the same memcg
twice, but that is error-prone with concurrent reclaimers. Instead, use a
generation counter that is increased every time the child with the highest
ID has been visited, so that reclaimers can stop when the generation
changes.
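The generation counter can be illustrated with a single-threaded userspace model of one (zone, priority) slot. Everything here is a sketch under stated assumptions: the names reclaim_iter, reclaim_cookie and iter_next() are hypothetical, memcgs are reduced to the IDs 0..2, and the locking a real shared iterator needs is left out.

```c
#define NR_MEMCGS 3    /* hypothetical: memcg IDs 0..2 under one root */

/* Shared iterator state kept in the hierarchy root for one slot. */
struct reclaim_iter {
    int position;               /* next memcg ID to hand out */
    unsigned int generation;    /* bumped after the highest ID was visited */
};

/* Private state of one reclaimer. */
struct reclaim_cookie {
    int started;
    unsigned int generation;    /* generation seen on the first call */
};

/* Return the next memcg ID to scan, or -1 once the round-trip is complete. */
static int iter_next(struct reclaim_iter *it, struct reclaim_cookie *c)
{
    int id;

    if (!c->started) {
        c->started = 1;
        c->generation = it->generation;
    } else if (c->generation != it->generation) {
        return -1;    /* the generation changed: stop */
    }

    id = it->position++;
    if (it->position == NR_MEMCGS) {
        /* The child with the highest ID has been visited. */
        it->position = 0;
        it->generation++;
    }
    return id;
}
```

Two reclaimers that interleave their calls share the work: each memcg is handed out once, and both stop as soon as the generation moves on, without either having to see the same memcg twice.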
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 16 Nov 2011 23:41:42 +0000 (10:41 +1100)]
mm: vmscan: distinguish between memcg triggering reclaim and memcg being scanned
Memory cgroup hierarchies are currently handled completely outside of the
traditional reclaim code, which is invoked with a single memory cgroup as
an argument for the whole call stack.
Subsequent patches will switch this code to do hierarchical reclaim, so
there needs to be a distinction between a) the memory cgroup that is
triggering reclaim due to hitting its limit and b) the memory cgroup that
is being scanned as a child of a).
This patch introduces a struct mem_cgroup_zone that contains the
combination of the memory cgroup and the zone being scanned, which is then
passed down the stack instead of the zone argument.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 16 Nov 2011 23:41:41 +0000 (10:41 +1100)]
mm: vmscan: distinguish global reclaim from global LRU scanning
The traditional zone reclaim code is scanning the per-zone LRU lists
during direct reclaim and kswapd, and the per-zone per-memory cgroup LRU
lists when reclaiming on behalf of a memory cgroup limit.
Subsequent patches will convert the traditional reclaim code to reclaim
exclusively from the per-memory cgroup LRU lists. As a result, using the
predicate for which LRU list is scanned will no longer be appropriate to
tell global reclaim from limit reclaim.
This patch adds a global_reclaim() predicate to tell direct/kswapd reclaim
from memory cgroup limit reclaim and substitutes it in all places where
currently scanning_global_lru() is used for that.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Memory control groups are currently bolted onto the side of
traditional memory management in places where better integration would
be preferable. To reclaim memory, for example, memory control groups
maintain their own LRU list and reclaim strategy aside from the global
per-zone LRU list reclaim. But an extra list head for each existing
page frame is expensive and maintaining it requires additional code.
This patchset disables the global per-zone LRU lists on memory cgroup
configurations and converts all its users to operate on the per-memory
cgroup lists instead. As LRU pages are then exclusively on one list,
this saves two list pointers for each page frame in the system:
page_cgroup array size with 4G physical memory
vanilla: [ 0.000000] allocated 31457280 bytes of page_cgroup
patched: [ 0.000000] allocated 15728640 bytes of page_cgroup
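The two figures are consistent with dropping exactly two pointers per page frame. The following sketch redoes the arithmetic; the 8-byte pointer size and the 4096-byte page size are assumptions about the test machine, and the helper names are made up for illustration.

```c
#include <stdint.h>

/* Bytes saved per page frame when two list pointers are removed. */
static uint64_t saved_per_page(uint64_t pointer_size)
{
    return 2 * pointer_size;
}

/* Page frames implied by the total saving and the per-page saving. */
static uint64_t implied_pages(uint64_t before, uint64_t after,
                              uint64_t per_page)
{
    return (before - after) / per_page;
}

/* Memory covered by that many page frames, in MiB. */
static uint64_t pages_to_mib(uint64_t pages, uint64_t page_size)
{
    return (pages * page_size) >> 20;
}
```

Under these assumptions the saving of 15728640 bytes at 16 bytes per page corresponds to 983040 page frames, or 3840 MiB, which is in line with a machine that has 4G of physical memory.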
At the same time, system performance for various workloads is
unaffected:
100G sparse file cat, 4G physical memory, 10 runs, to test for code
bloat in the traditional LRU handling and kswapd & direct reclaim
paths, without/with the memory controller configured in
4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
swap on SSD, 10 runs, to test for regressions in kswapd & direct
reclaim using per-memcg LRU lists with multiple memcgs and multiple
allocators within each memcg
The task counter subsystem has been written assuming that
can_attach_task/attach_task/cancel_attach_task calls are serialized per
task. This is true when we attach only one task but not when we attach a
whole thread group, in which case the sequence is:
    for each thread
        if (can_attach_task() < 0)
            goto rollback
    for each thread
        attach_task()
    rollback:
        for each thread
            cancel_attach_task()
The common ancestor, searched on task_counter_attach_task(), can thus
change between each of these calls for a given task. This breaks if some
tasks in the thread group do not come from the same original cgroup. The
uncharge made in attach_task() or the rollback in cancel_attach_task()
would then propagate erroneously.
This can even break seriously in some scenarios. For example, with $PID
being the pid of a multithreaded process:
On the last move, attach_task() is called on the thread leader with
the wrong common_ancestor, leading to a crash because we uncharge
a res_counter that doesn't exist:
To solve this, keep the original cgroup of each thread in the thread
group cached in the flex array and pass it to can_attach_task()/attach_task()
and cancel_attach_task() so that the correct common ancestor between the old
and new cgroup can be safely retrieved for each task.
This is inspired by a previous patch from Li Zefan:
"[PATCH] cgroups: don't cache common ancestor in task counter subsys".
Reported-by: Ben Blum <bblum@andrew.cmu.edu> Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a new subsystem to limit the number of running tasks, similar to the
NR_PROC rlimit but in the scope of a cgroup.
The user can set an upper bound limit that is checked every time a task
forks in a cgroup or is moved into a cgroup with that subsystem bound.
The primary goal is to protect against forkbombs that explode inside a
container. The traditional NR_PROC rlimit is not efficient in that case
because if we run containers in parallel under the same user, one of these
could starve all the others by spawning a high number of tasks close to
the user wide limit.
This is a prevention against forkbombs, so it's not meant to cure the
effects of a forkbomb when the system is already in a state where it's not
responsive. It's aimed at preventing the system from ever reaching that
state and at stopping the spreading of tasks early. While defining the
limit on the allowed number of tasks, it's up to the user to find the right
balance between the resources its containers may need and what it can
afford to provide.
As it's totally dissociated from the rlimit NR_PROC, both can be
complementary: the cgroup task counter can set an upper bound per
container and the rlimit can be an upper bound on the overall set of
containers.
Also this subsystem can be used to kill all the tasks in a cgroup without
races against concurrent forks: by setting the limit of tasks to 0, any
further forks are rejected. This is a good way to kill a forkbomb in a
container, or simply kill any container without the need to retry an
unbounded number of times.
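The limit check on fork can be sketched as a hierarchical counter in userspace C. The struct and function names are hypothetical, there is no locking, and returning -EAGAIN is an assumption that mirrors what fork() reports when the rlimit is hit.

```c
#include <errno.h>
#include <stddef.h>

struct task_counter {
    unsigned long usage;            /* tasks currently accounted here */
    unsigned long limit;            /* upper bound set by the user */
    struct task_counter *parent;    /* NULL for the root */
};

/* Account one new task in cg and all its ancestors, or reject the fork. */
static int task_counter_fork(struct task_counter *cg)
{
    struct task_counter *c, *undo;

    for (c = cg; c; c = c->parent) {
        if (c->usage + 1 > c->limit)
            goto fail;
        c->usage++;
    }
    return 0;

fail:
    /* Roll back the levels that were already charged. */
    for (undo = cg; undo != c; undo = undo->parent)
        undo->usage--;
    return -EAGAIN;
}

/* Drop the accounting of one task that exits. */
static void task_counter_exit(struct task_counter *cg)
{
    for (; cg; cg = cg->parent)
        cg->usage--;
}
```

Setting limit to 0 makes every further task_counter_fork() fail regardless of the current usage, which is the property that allows a whole cgroup to be killed without racing against new forks.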
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Menage <paul@paulmenage.org> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let the subsystems' fork callbacks return an error value so that they can
cancel a fork. This is going to be used by the task counter subsystem to
implement the limit.
Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cgroups: pull up res counter charge failure interpretation to caller
res_counter_charge() always returns -ENOMEM when the limit is reached and
the charge thus can't happen.
However, it's up to the caller to interpret this failure and return the
appropriate error value. The task counter subsystem will need to report
to the user that a fork() has been cancelled because some limit was
reached, not because we are too short on memory.
Fix this by returning -1 when res_counter_charge() fails.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cgroups: ability to stop res charge propagation on bounded ancestor
Moving a task from one cgroup to another may require subtracting its
resource charge from the old cgroup and adding it to the new one.
For this to happen, the uncharge/charge propagation can just stop when we
reach the common ancestor of the two cgroups. Beyond the performance
reasons, we also want to avoid temporarily overloading the common
ancestors with an inaccurate resource counter usage if we charge the new
cgroup first and uncharge the old one thereafter. This is going to be a
requirement for the coming max number of tasks subsystem.
To solve this, provide a pair of new APIs that can charge/uncharge a
resource counter until we reach a given ancestor.
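Bounded propagation can be sketched with a parent-linked counter. This is a userspace model under stated assumptions: fake_counter, charge_until(), uncharge_until(), common_ancestor() and move_charge() are hypothetical names, and limit checks and locking are left out to keep the propagation itself visible.

```c
#include <stddef.h>

struct fake_counter {
    unsigned long usage;
    struct fake_counter *parent;    /* NULL for the root */
};

/* Charge c and its ancestors, stopping before top (top is not touched). */
static void charge_until(struct fake_counter *c, struct fake_counter *top,
                         unsigned long val)
{
    for (; c && c != top; c = c->parent)
        c->usage += val;
}

/* Uncharge c and its ancestors, stopping before top. */
static void uncharge_until(struct fake_counter *c, struct fake_counter *top,
                           unsigned long val)
{
    for (; c && c != top; c = c->parent)
        c->usage -= val;
}

/* First counter found on both parent chains, or NULL if there is none. */
static struct fake_counter *common_ancestor(struct fake_counter *a,
                                            struct fake_counter *b)
{
    struct fake_counter *x, *y;

    for (x = a; x; x = x->parent)
        for (y = b; y; y = y->parent)
            if (x == y)
                return x;
    return NULL;
}

/* Move a charge between two counters without touching shared ancestors. */
static void move_charge(struct fake_counter *from, struct fake_counter *to,
                        unsigned long val)
{
    struct fake_counter *top = common_ancestor(from, to);

    uncharge_until(from, top, val);
    charge_until(to, top, val);
}
```

Since the common ancestor and everything above it are never modified, their usage stays accurate during the move, whichever of the two steps runs first.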
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cgroups: new cancel_attach_task() subsystem callback
To cancel a process attachment on a subsystem, we only call the
cancel_attach() callback once on the leader but we have no way to cancel
the attachment individually for each member of the process group.
This is going to be needed for the max number of tasks subsystem that is
coming.
To prepare for this integration, call a new cancel_attach_task() callback
on each task of the group until we reach the member that failed to attach.
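The per-task rollback can be modelled as follows. This is a userspace sketch: attach_group(), the fake_task type and the demo_* callbacks with their slots_free resource are all hypothetical and only serve to make the control flow testable.

```c
#include <stddef.h>

struct fake_task {
    int reserved;    /* set by a successful can_attach_task() */
    int attached;    /* set by attach_task() */
};

/* Hypothetical subsystem resource: how many tasks may still be attached. */
static int slots_free = 2;

static int demo_can_attach_task(struct fake_task *t)
{
    if (slots_free == 0)
        return -1;
    slots_free--;
    t->reserved = 1;
    return 0;
}

static void demo_cancel_attach_task(struct fake_task *t)
{
    slots_free++;
    t->reserved = 0;
}

static void demo_attach_task(struct fake_task *t)
{
    t->attached = 1;
}

/* Attach a whole thread group, undoing per-task state on failure. */
static int attach_group(struct fake_task *tasks, size_t n)
{
    size_t i, failed = 0;

    for (i = 0; i < n; i++) {
        if (demo_can_attach_task(&tasks[i]) < 0) {
            failed = i;
            goto rollback;
        }
    }
    for (i = 0; i < n; i++)
        demo_attach_task(&tasks[i]);
    return 0;

rollback:
    /* Cancel only the members that came before the one that failed. */
    for (i = 0; i < failed; i++)
        demo_cancel_attach_task(&tasks[i]);
    return -1;
}
```

A single cancel_attach() call on the leader could not return the two reserved slots in this model, which is why the callback has to run once per already-processed member.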
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Provide an API to inherit a counter value from a parent. This can be
useful to implement cgroup.clone_children on a resource counter.
Still the resources of the children are limited by those of the parent, so
this is only to provide a default setting behaviour when clone_children is
set.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Extend the resource counter API with a mirror of res_counter_read_u64() to
make it handy to update a resource counter value from a cgroup subsystem
u64 value file.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jason Baron [Wed, 16 Nov 2011 23:41:36 +0000 (10:41 +1100)]
epoll: limit paths
The current epoll code can be tickled to run basically indefinitely in
both loop detection path check (on ep_insert()), and in the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely. A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.
To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have been already visited. Thus, the loop detection
becomes proportional to the number of epoll file descriptors and links.
This dramatically decreases the run-time of the loop check algorithm. In
one diabolical case I tried it reduced the run-time from 15 minutes (all
in kernel time) to .3 seconds.
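The effect of remembering visited nodes can be sketched in userspace as below. This is only a model of the technique, not the eventpoll.c code: struct node, reaches(), would_create_loop() and the 'visits' counter are names invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LINKS 8

/* Hypothetical model of an epoll fd and the epoll fds it monitors. */
struct node {
	struct node *links[MAX_LINKS];
	size_t nlinks;
	bool visited;
};

/* Number of nodes expanded by the last check, for illustration only. */
static unsigned long visits;

/* Depth-first search that expands each node at most once. */
static bool reaches(struct node *n, struct node *target)
{
	size_t i;

	if (n == target)
		return true;
	if (n->visited)
		return false;
	n->visited = true;
	visits++;
	for (i = 0; i < n->nlinks; i++)
		if (reaches(n->links[i], target))
			return true;
	return false;
}

/*
 * Adding a link "from monitors to" closes a loop exactly when 'from'
 * is already reachable from 'to'.
 */
static bool would_create_loop(struct node *nodes, size_t count,
			      struct node *from, struct node *to)
{
	size_t i;

	for (i = 0; i < count; i++)
		nodes[i].visited = false;
	visits = 0;
	return reaches(to, from);
}
```

Without the visited flag, a layered graph in which every node links to every node of the next layer is walked once per path, which grows exponentially with the depth. With the flag, the work is bounded by the number of nodes and links.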
Fixing the wakeup paths could be done at wakeup time in a similar manner
by keeping track of nodes that have already been visited, but that is
harder to get right, since there can be multiple wakeups on different
cpus. Thus, I've opted to limit the number of possible wakeup paths when
the paths are created.
This is accomplished by noting that the end file descriptor points that
are found during the loop detection pass (from the newly added link) are
actually the sources for wakeup events. I keep a list of these file
descriptors and limit the number and length of the paths that emanate
from these 'source file descriptors'. In the current implementation I
allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
length 4 and 10 of length 5. Note that it is sufficient to check the
'source file descriptors' reachable from the newly added link, since no
other 'source file descriptors' will have newly added links. This allows
us to check only the wakeup paths that may have gotten too long, and not
re-check all possible wakeup paths on the system.
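The per-length limits can be pictured as a small table of counters. The following is a minimal sketch using the numbers quoted above. The identifiers path_limits, path_count, path_count_init() and path_count_inc() are chosen for this illustration and should not be read as the exact names or layout used by the patch.

```c
#include <assert.h>
#include <string.h>

#define PATH_ARR_SIZE 5

/* Allowed number of wakeup paths of length 1, 2, 3, 4 and 5. */
static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
static int path_count[PATH_ARR_SIZE];

/* Reset the counters before walking the paths of one source fd. */
static void path_count_init(void)
{
	memset(path_count, 0, sizeof(path_count));
}

/*
 * Account for one more path of the given length (1-based).
 * Returns 0 if the path is within the limits, -1 if the path is too
 * long or there are already too many paths of that length.
 */
static int path_count_inc(int length)
{
	if (length < 1 || length > PATH_ARR_SIZE)
		return -1;
	if (++path_count[length - 1] > path_limits[length - 1])
		return -1;
	return 0;
}
```

A caller would reset the counters, walk every path that emanates from one 'source file descriptor', and reject the new link as soon as path_count_inc() fails.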
In terms of the path limit selection, I think it's first worth noting that
the most common case for epoll is probably the model where you have 1
epoll file descriptor that is monitoring n 'source file descriptors'. In
this case, each 'source file descriptor' has 1 path of length 1. Thus, I
believe that the limits I'm proposing are quite reasonable and in fact may
be too generous. I'm hoping that the proposed limits will not cause any
workloads that currently work to fail.
In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations. Currently it's only used in a subset
of the add paths. I need to hold the epmutex so that we can correctly
traverse a coherent graph to check the number of paths. I believe that
this additional locking is probably ok, since it's in the setup/teardown
paths and doesn't affect the running paths, but it certainly is going to
add some extra overhead. Also worth noting is that the epmutex was
recently added to the ep_ctl add operations in the initial path loop
detection code using the argument that it was not on a critical path.
Another thing to note here, is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:
/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4
This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1). However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order. Thus, this limit is currently easily bypassed. The
newly added check for wakeup paths strictly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.
Thus far, I've tested this using the sample programs previously
mentioned, which now either return quickly or return -EINVAL. I've also
tested using the piptest.c epoll tester, which showed no difference in
performance. I've also created a number of different epoll networks and
tested that they behave as expected.
I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.
Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Nelson Elhage <nelhage@ksplice.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joakim Tjernlund [Wed, 16 Nov 2011 23:41:36 +0000 (10:41 +1100)]
crc32: optimize inner loop
Taking a pointer reference to each row in the crc table matrix, one can
reduce the inner loop by a few instructions.
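To show what "a pointer to each row" means, here is a self-contained slice-by-4 CRC-32 in userspace C. It is a sketch of the general technique under the usual reflected polynomial 0xEDB88320, not the lib/crc32.c code. The names crc_tab, crc32_init() and crc32_ieee() are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Row k holds the CRC of a byte followed by k zero bytes. */
static uint32_t crc_tab[4][256];

static void crc32_init(void)
{
	uint32_t i, c;
	int k;

	for (i = 0; i < 256; i++) {
		c = i;
		for (k = 0; k < 8; k++)
			c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
		crc_tab[0][i] = c;
	}
	for (i = 0; i < 256; i++) {
		for (k = 1; k < 4; k++) {
			c = crc_tab[k - 1][i];
			crc_tab[k][i] = (c >> 8) ^ crc_tab[0][c & 0xff];
		}
	}
}

static uint32_t crc32_ieee(uint32_t crc, const unsigned char *p, size_t len)
{
	/* Take one pointer per table row once, outside the inner loop. */
	const uint32_t *t0 = crc_tab[0], *t1 = crc_tab[1];
	const uint32_t *t2 = crc_tab[2], *t3 = crc_tab[3];

	crc = ~crc;
	while (len >= 4) {
		/* Fold in four input bytes, assembled little-endian. */
		crc ^= (uint32_t)p[0] | (uint32_t)p[1] << 8 |
		       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
		crc = t3[crc & 0xff] ^ t2[(crc >> 8) & 0xff] ^
		      t1[(crc >> 16) & 0xff] ^ t0[crc >> 24];
		p += 4;
		len -= 4;
	}
	/* Handle the remaining 0..3 bytes one at a time. */
	while (len--)
		crc = t0[(crc ^ *p++) & 0xff] ^ (crc >> 8);
	return ~crc;
}
```

Indexing through t0..t3 lets the compiler use plain one-dimensional lookups in the loop instead of recomputing a two-dimensional address for every table access, which is plausibly where the saved instructions come from.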
Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Cc: Bob Pearson <rpearson@systemfabricworks.com> Cc: Frank Zago <fzago@systemfabricworks.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joe Perches [Wed, 16 Nov 2011 23:41:35 +0000 (10:41 +1100)]
checkpatch: update signature "might be better as" warning
email header lines can look like signature tags. It's valid to have
multiple email recipients on a single line but not valid to have multiple
signatures on a single line.
Validate signatures only when not in the email headers.
Clear the $in_commit_log flag when the patch filename appears.
Add '-' to the valid chars in a message header for headers
like "Message-Id:" and "In-Reply-To:".
Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ian Campbell [Wed, 16 Nov 2011 23:41:35 +0000 (10:41 +1100)]
get_maintainers.pl: follow renames when looking up commit signers
I happen to have had a commit to various network drivers since the big
renaming/reorg which happened to drivers/net recently. This means that I
now appear to be in the top few commit signers (by %age) for many of them
so am getting sent all sorts of stuff and people who are involved with the
driver are not. e.g. (to pick one at random):
$ ./scripts/get_maintainer.pl -f drivers/net/ethernet/nvidia/forcedeth.c
"David S. Miller" <davem@davemloft.net> (commit_signer:5/7=71%)
Ian Campbell <ian.campbell@citrix.com> (commit_signer:2/7=29%)
Eric Dumazet <eric.dumazet@gmail.com> (commit_signer:1/7=14%)
Jeff Kirsher <jeffrey.t.kirsher@intel.com> (commit_signer:1/7=14%)
Jiri Pirko <jpirko@redhat.com> (commit_signer:1/7=14%)
netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
linux-kernel@vger.kernel.org (open list)
With the following patch the renames are followed and the result appears
much more sensible:
Andi Kleen [Wed, 16 Nov 2011 23:41:34 +0000 (10:41 +1100)]
brlocks/lglocks: clean up code
lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason I can see to not just use common
utility functions that get pointers to the lglock.
Since there are at least two users it makes sense to share this code in a
library.
This will also make it later possible to dynamically allocate lglocks.
In general the users now look more like normal function calls with
pointers, not magic macros.
The patch is rather large because I move over all users in one go to keep
it bisectable. This impacts the VFS somewhat in terms of lines changed.
But no actual behaviour change.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jesper Juhl [Wed, 16 Nov 2011 23:41:34 +0000 (10:41 +1100)]
audit: always follow va_copy() with va_end()
A call to va_copy() should always be followed by a call to va_end() in the
same function. In kernel/audit.c::audit_log_vformat() this is not always
done. This patch makes sure va_end() is always called.
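The rule can be illustrated with a small userspace function that needs to walk its arguments twice. This is a generic sketch, not audit_log_vformat() itself. The helper name format_alloc() is made up for the example.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Format into a freshly allocated buffer. The argument list is walked
 * twice (once to measure, once to print), so a copy is needed. Every
 * path through the function reaches the matching va_end() calls.
 */
static char *format_alloc(const char *fmt, ...)
{
	va_list args, args2;
	char *buf = NULL;
	int len;

	va_start(args, fmt);
	va_copy(args2, args);

	len = vsnprintf(NULL, 0, fmt, args);
	if (len >= 0) {
		buf = malloc((size_t)len + 1);
		if (buf)
			vsnprintf(buf, (size_t)len + 1, fmt, args2);
	}

	/* No early return above, so both lists are always released. */
	va_end(args2);
	va_end(args);
	return buf;
}
```

The point is structural: error handling falls through to the va_end() calls instead of jumping past them.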
Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Heiko Carstens [Wed, 16 Nov 2011 23:41:32 +0000 (10:41 +1100)]
mm,slub,x86: decouple size of struct page from CONFIG_CMPXCHG_LOCAL
While implementing cmpxchg_double() on s390 I realized that we don't set
CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.
However setting that option will increase the size of struct page by eight
bytes on 64 bit, which we certainly do not want. Also, it doesn't make
sense that a present cpu feature should increase the size of struct page.
Besides that, it looks like the dependency on CMPXCHG_LOCAL is wrong and
that it should depend on CMPXCHG_DOUBLE instead.
This patch:
If an architecture supports CMPXCHG_LOCAL this shouldn't result
automatically in larger struct pages if the SLUB allocator is used.
Instead introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which can
be selected if a double word aligned struct page is required. Also update
x86 Kconfig so that it should work as before.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
WARNING: please, no spaces at the start of a line
#57: FILE: arch/m68k/amiga/config.c:515:
+ __noreturn;$
total: 0 errors, 1 warnings, 106 lines checked
./patches/treewide-convert-uses-of-attrib_noreturn-to-__noreturn.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Wed, 16 Nov 2011 23:41:29 +0000 (10:41 +1100)]
intel_idle: fix API misuse
smp_call_function() only runs a specific function on all other CPUs,
while intel_idle expects it to run on all CPUs, including the calling one.
Without the fix, we could have one cpu which has auto_demotion enabled or
has no broadcast timer setup. Usually we don't see any impact because auto
demotion just harms power and the intel_idle init is called on CPU 0,
where the broadcast timer delivers interrupts, but this still could be a
problem.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Len Brown <lenb@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Magnus Lynch [Wed, 16 Nov 2011 23:41:29 +0000 (10:41 +1100)]
hpet: factor timer allocate from open
The current implementation of the /dev/hpet driver couples opening the
device with allocating one of the (scarce) timers (aka comparators). This
is a limitation in that the main counter may be valuable to applications
seeking a high-resolution timer who have no use for the interrupt
generating functionality of the comparators.
This patch alters the open semantics so that when the device is opened, no
timer is allocated. Operations that depend on a timer being in context
implicitly attempt allocating a timer, to maintain backward compatibility.
There is also an IOCTL (HPET_ALLOC_TIMER _IO) added so that the
allocation may be done explicitly. (I prefer the explicit open then
allocate pattern but don't know how practical it would be to require all
existing code to be changed.)
/dev/hpet is accessed via mmap(). This is the only interface of /dev/hpet
that is actually used in practice.
[akpm@linux-foundation.org: coding-style tweaks]
[arnd@arndb.de: fix build] Signed-off-by: Magnus Lynch <maglyx@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Acked-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Wed, 16 Nov 2011 23:41:28 +0000 (10:41 +1100)]
thp: reduce khugepaged freezing latency
Use wait_event_freezable_timeout() instead of
schedule_timeout_interruptible() to avoid missing freezer wakeups. A
try_to_freeze() would have been needed in the khugepaged_alloc_hugepage
tight loop too in case of the allocation failing repeatedly, and
wait_event_freezable_timeout will provide it too.
khugepaged would still freeze just fine by trying again the next minute
but it's better if it freezes immediately.
Reported-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Jiri Slaby <jslaby@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
WARNING: line over 80 characters
#42: FILE: mm/page_alloc.c:3464:
+ /* Blocks with reserved pages will never free, skip them. */
WARNING: line over 80 characters
#61: FILE: mm/page_alloc.c:3477:
+ set_pageblock_migratetype(page, MIGRATE_RESERVE);
WARNING: line over 80 characters
#62: FILE: mm/page_alloc.c:3478:
+ move_freepages_block(zone, page, MIGRATE_RESERVE);
total: 0 errors, 3 warnings, 44 lines checked
./patches/mm-reduce-the-amount-of-work-done-when-updating-min_free_kbytes.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 16 Nov 2011 23:41:27 +0000 (10:41 +1100)]
mm: reduce the amount of work done when updating min_free_kbytes
When min_free_kbytes is updated, some pageblocks are marked
MIGRATE_RESERVE. Ordinarily, this work is unnoticeable as it happens early
in boot but on large machines with 1TB of memory, this has been reported
to delay boot times, probably due to the NUMA distances involved.
The bulk of the work is due to calling pageblock_is_reserved() an
unnecessary number of times and accessing far more struct page metadata
than is necessary. This patch significantly reduces the amount of work
done by setup_zone_migrate_reserve(), improving boot times on 1TB machines.
Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 16 Nov 2011 23:41:26 +0000 (10:41 +1100)]
mm: avoid livelock on !__GFP_FS allocations
This patch seems to have gotten lost in the cracks and the discussion on
alternatives that started here https://lkml.org/lkml/2011/10/25/24 petered
out without any alternative patches being posted. Lacking a viable
alternative patch, I'm reposting this patch because AFAIK, this bug still
exists.
Colin Cross reported;
Under the following conditions, __alloc_pages_slowpath can loop forever:
gfp_mask & __GFP_WAIT is true
gfp_mask & __GFP_FS is false
reclaim and compaction make no progress
order <= PAGE_ALLOC_COSTLY_ORDER
These conditions happen very often during suspend and resume,
when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
allocations into __GFP_WAIT.
The oom killer is not run because gfp_mask & __GFP_FS is false,
but should_alloc_retry will always return true when order is less
than PAGE_ALLOC_COSTLY_ORDER.
In his fix, he avoided retrying the allocation if reclaim made no progress
and __GFP_FS was not set. The problem is that this would result in
GFP_NOIO allocations failing that previously succeeded which would be very
unfortunate.
The big difference between GFP_NOIO and suspend converting GFP_KERNEL to
behave like GFP_NOIO is that normally flushers will be cleaning pages and
kswapd reclaims pages allowing GFP_NOIO to succeed after a short delay.
The same does not necessarily apply during suspend as the storage device
may be suspended. Hence, this patch special cases the suspend case to
fail the page allocation if reclaim cannot make progress. This might
cause suspend to abort but that is better than a livelock.
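The retry decision described above can be modelled as a small predicate. This is a deliberately simplified sketch and not the mm/page_alloc.c logic: struct alloc_ctx, its fields and should_retry() are hypothetical names, and the real allocator considers more state than is shown here.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

/* Hypothetical summary of one pass through the allocator slow path. */
struct alloc_ctx {
	bool can_wait;          /* gfp_mask & __GFP_WAIT */
	bool can_fs;            /* gfp_mask & __GFP_FS */
	bool made_progress;     /* reclaim or compaction freed something */
	bool storage_suspended; /* suspend has quiesced the storage devices */
	unsigned int order;
};

/* Decide whether the slow path should loop and try again. */
static bool should_retry(const struct alloc_ctx *c)
{
	if (!c->can_wait)
		return false;
	/*
	 * Suspend special case: nothing can clean or free pages while
	 * storage is suspended, so give up instead of looping forever.
	 */
	if (!c->made_progress && !c->can_fs && c->storage_suspended)
		return false;
	/* Small orders are otherwise retried, as before the patch. */
	return c->order <= PAGE_ALLOC_COSTLY_ORDER;
}
```

A GFP_NOIO-style request outside of suspend still retries in this model, which matches the stated goal of not failing allocations that previously succeeded.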
[mgorman@suse.de: Rework fix to be suspend specific] Reported-by: Colin Cross <ccross@android.com> Tested-by: Colin Cross <ccross@android.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Isaacson <adi@hexapodia.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 16 Nov 2011 23:41:25 +0000 (10:41 +1100)]
mm: do not stall in synchronous compaction for THP allocations
Occasionally during large file copies to slow storage, there are still
reports of user-visible stalls when THP is enabled. Reports on this have
been intermittent and not reliable to reproduce locally but;
Andy Isaacson reported a problem copying to VFAT on SD Card
https://lkml.org/lkml/2011/11/7/2
In this case, it was stuck in munmap for between 20 and 60
seconds in compaction. It is also possible that khugepaged
was holding mmap_sem on this process if CONFIG_NUMA was set.
Johannes Weiner reported stalls on USB
https://lkml.org/lkml/2011/7/25/378
In this case, there is no stack trace but it looks like the
same problem. The USB stick may have been using NTFS as a
filesystem based on other work done related to writing back
to USB around the same time.
Internally in SUSE, I received a bug report related to stalls in firefox
when using Java and Flash heavily while copying from NFS
to VFAT on USB. It has not been confirmed to be the same problem
but if it looks like a duck and quacks like a duck.....
In the past, commit [11bc82d6: mm: compaction: Use async migration for
__GFP_NO_KSWAPD and enforce no writeback] forced that sync compaction
would never be used for THP allocations. This was reverted in commit
[c6a140bf: mm/compaction: reverse the change that forbade sync migraton
with __GFP_NO_KSWAPD] on the grounds that it was uncertain it was
beneficial.
While user-visible stalls do not happen for me when writing to USB, I
set up a test running postmark while short-lived processes created
anonymous mappings. The objective was to exercise the paths that allocate
transparent huge pages. I then logged when processes were stalled for
more than 1 second, recorded a stack trace and did some analysis to
aggregate unique "stall events", which revealed:
Time stalled in this event: 47369 ms
Event count: 20
usemem sleep_on_page 3690 ms
usemem sleep_on_page 2148 ms
usemem sleep_on_page 1534 ms
usemem sleep_on_page 1518 ms
usemem sleep_on_page 1225 ms
usemem sleep_on_page 2205 ms
usemem sleep_on_page 2399 ms
usemem sleep_on_page 2398 ms
usemem sleep_on_page 3760 ms
usemem sleep_on_page 1861 ms
usemem sleep_on_page 2948 ms
usemem sleep_on_page 1515 ms
usemem sleep_on_page 1386 ms
usemem sleep_on_page 1882 ms
usemem sleep_on_page 1850 ms
usemem sleep_on_page 3715 ms
usemem sleep_on_page 3716 ms
usemem sleep_on_page 4846 ms
usemem sleep_on_page 1306 ms
usemem sleep_on_page 1467 ms
[<ffffffff810ef30c>] wait_on_page_bit+0x6c/0x80
[<ffffffff8113de9f>] unmap_and_move+0x1bf/0x360
[<ffffffff8113e0e2>] migrate_pages+0xa2/0x1b0
[<ffffffff81134273>] compact_zone+0x1f3/0x2f0
[<ffffffff811345d8>] compact_zone_order+0xa8/0xf0
[<ffffffff811346ff>] try_to_compact_pages+0xdf/0x110
[<ffffffff810f773a>] __alloc_pages_direct_compact+0xda/0x1a0
[<ffffffff810f7d5d>] __alloc_pages_slowpath+0x55d/0x7a0
[<ffffffff810f8151>] __alloc_pages_nodemask+0x1b1/0x1c0
[<ffffffff811331db>] alloc_pages_vma+0x9b/0x160
[<ffffffff81142bb0>] do_huge_pmd_anonymous_page+0x160/0x270
[<ffffffff814410a7>] do_page_fault+0x207/0x4c0
[<ffffffff8143dde5>] page_fault+0x25/0x30
The stall times are approximate at best but the estimates represent 25% of
the worst stalls and even if the estimates are off by a factor of 10, it's
severe.
This patch once again prevents sync migration for transparent hugepage
allocations as it is preferable to fail a THP allocation than stall.
It was suggested that __GFP_NORETRY be used instead of __GFP_NO_KSWAPD to
look less like a special case. This would prevent THP allocation using
sync compaction but it would have other side-effects. There are existing
users of __GFP_NORETRY that are doing high-order allocations and while
they can handle allocation failure, it seems reasonable that they continue
to use sync compaction unless there is a deliberate reason to change that.
To help clarify this for the future, this patch updates the comment for
__GFP_NO_KSWAPD.
If accepted, this is a -stable candidate.
Reported-by: Andy Isaacson <adi@hexapodia.org> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Acked-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jacobo Giralt [Wed, 16 Nov 2011 23:41:25 +0000 (10:41 +1100)]
mm: migrate: one less atomic operation
migrate_page_move_mapping() drops a reference from the old page after
unfreezing its counter. Both operations can be merged into a single
atomic operation by directly unfreezing to one less reference.
The same applies to migrate_huge_page_move_mapping().
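The saving can be modelled with C11 atomics. The sketch below is a userspace analogy and not the kernel's page reference code: struct page here holds only a counter, and freeze_refs(), unfreeze_refs(), put_ref() and the two finish_*() helpers are names invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy page with nothing but a reference count. */
struct page {
	atomic_int refcount;
};

/* Freeze: atomically drop the count to 0 if it equals 'expected'. */
static int freeze_refs(struct page *p, int expected)
{
	return atomic_compare_exchange_strong(&p->refcount, &expected, 0);
}

static void unfreeze_refs(struct page *p, int count)
{
	atomic_store(&p->refcount, count);
}

static void put_ref(struct page *p)
{
	atomic_fetch_sub(&p->refcount, 1);
}

/* Before: unfreeze to the old count, then drop one reference. */
static void finish_two_ops(struct page *p, int expected)
{
	unfreeze_refs(p, expected);
	put_ref(p);
}

/* After: unfreeze directly to one less reference. */
static void finish_one_op(struct page *p, int expected)
{
	unfreeze_refs(p, expected - 1);
}
```

Both variants leave the same final count. The second one simply skips the separate atomic decrement.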
Signed-off-by: Jacobo Giralt <jacobo.giralt@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Wed, 16 Nov 2011 23:41:24 +0000 (10:41 +1100)]
mm-add-extra-free-kbytes-tunable-update
All the fixes suggested by Andrew Morton. Not much of a changelog
since the patch should probably be folded into
mm-add-extra-free-kbytes-tunable.patch
Thank you for pointing these out, Andrew.
Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Wed, 16 Nov 2011 23:41:23 +0000 (10:41 +1100)]
mm: add extra free kbytes tunable
Add a userspace visible knob to tell the VM to keep an extra amount of
memory free, by increasing the gap between each zone's min and low
watermarks.
This is useful for realtime applications that call system calls and have a
bound on the number of allocations that happen in any short time period.
In this application, extra_free_kbytes would be left at an amount equal to
or larger than the maximum number of allocations that happen in any
burst.
It may also be useful to reduce the memory use of virtual machines
(temporarily?), in a way that does not cause memory fragmentation like
ballooning does.
Testing results from Satoru Moriya:
: I ran some sample workloads and measure memory allocation latency
: (latency of __alloc_page_nodemask()).
: The test is like following:
:
: - CPU: 1 socket, 4 core
: - Memory: 4GB
:
: - Background load:
: $ dd if=/dev/zero of=/tmp/tmp1
: $ dd if=/dev/zero of=/tmp/tmp2
: $ dd if=/dev/zero of=/tmp/tmp3
:
: - Main load:
: $ mapped-file-stream 1 $((1024 * 1024 * 640)) --(*)
:
: (*) This is made by Johannes Weiner
: https://lkml.org/lkml/2010/8/30/226
:
: It allocates/access 640MByte memory at a burst.
:
: The result is as follows:
:
: | | extra |
: | default | kbytes |
: --------------------------------------------------------------
: min_free_kbytes | 8113 | 8113 |
: extra_free_kbytes | 0 | 640*1024 | (KB)
: --------------------------------------------------------------
: worst latency | 517.762 | 20.775 | (usec)
: --------------------------------------------------------------
: vmstat result | | |
: nr_vmscan_write | 0 | 0 |
: pgsteal_dma | 0 | 0 |
: pgsteal_dma32 | 143667 | 144882 |
: pgsteal_normal | 31486 | 27001 |
: pgsteal_movable | 0 | 0 |
: pgscan_kswapd_dma | 0 | 0 |
: pgscan_kswapd_dma32 | 138617 | 156351 |
: pgscan_kswapd_normal | 30593 | 27955 |
: pgscan_kswapd_movable | 0 | 0 |
: pgscan_direct_dma | 0 | 0 |
: pgscan_direct_dma32 | 5050 | 0 |
: pgscan_direct_normal | 896 | 0 |
: pgscan_direct_movable | 0 | 0 |
: kswapd_steal | 169207 | 171883 |
: kswapd_inodesteal | 0 | 0 |
: kswapd_low_wmark_hit_quickly | 43 | 45 |
: kswapd_high_wmark_hit_quickly | 1 | 0 |
: allocstall | 32 | 0 |
:
:
: As you can see, in the default case there were 32 direct reclaims
: (allocstall) and the worst latency was 517.762 usecs. This value may be
: larger if a process would sleep or issue I/O in the direct reclaim path.
: OTOH, in the other case where I add extra free bytes, there was no direct
: reclaim and the worst latency was 20.775 usecs.
:
: In this test case, we can avoid direct reclaim and keep a latency low.
Signed-off-by: Rik van Riel<riel@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Tested-by: Satoru Moriya <satoru.moriya@hds.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After commit v2.6.36-5896-gd065bd8 "mm: retry page fault when blocking on
disk transfer" we usually wait in page-faults without mmap_sem held, so
all swap-token logic was broken, because it was based on using
rwsem_is_locked(&mm->mmap_sem) as a sign of in-progress page-faults.
Add an atomic counter of in-progress page-faults for mm to the mm_struct
with swap-token.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rename mm_page_free_direct into mm_page_free and mm_pagevec_free into
mm_page_free_batched
Since v2.6.33-5426-gc475dab the kernel triggers mm_page_free_direct for
all freed pages, not only for directly freed ones. So, let's name it
properly. For pages freed via a page list we also trigger the
mm_page_free_batched event.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch adds a helper, free_hot_cold_page_list(), to free a list of
order-0 pages. It frees pages directly from the list without a temporary
page-vector. It also calls trace_mm_pagevec_free() to simulate
pagevec_free() behaviour.
vmscan: activate executable pages after first usage
Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages
the first class citizen") was noticeably weakened in commit
645747462435d84 ("vmscan: detect mapped file pages used only once").
Currently these pages can become "first class citizens" only after the
second usage. After this patch page_check_references() will activate them
after the first usage, and executable code gets an even better chance to
stay in memory.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 645747462435 ("vmscan: detect mapped file pages used only once")
greatly decreases the lifetime of single-used mapped file pages.
Unfortunately it also decreases the lifetime of all shared mapped file
pages, because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed
in fault path") the page-fault handler does not mark a page active or even
referenced.
Thus page_check_references() activates a file page only if it was used
twice while it stays on the inactive list, whereas it activates anon pages
after the first access. The inactive list can be small enough that the
reclaimer accidentally throws away a widely used page if it wasn't used
twice in a short period.
After this patch page_check_references() also activates a file mapped page
at the first inactive list scan if this page is already used multiple
times via several ptes.
I found this while trying to fix a degradation in rhel6 (~2.6.32) from
rhel5 (~2.6.18). There is a complete mess with >100 web/mail/spam/ftp
containers: they share all their files but there are a lot of anonymous
pages: ~500mb of shared file mapped memory and 15-20Gb of non-shared
anonymous memory. In this situation major page faults are very costly,
because all containers share the same page. Under my load the kernel
created disproportionate pressure on the file memory compared with the
anonymous memory; they equaled only if I raised swappiness up to 150 =)
These patches didn't actually help much with my problem, but I saw a
noticeable (10-20 times) reduction in the count and average time of
major page faults in file-mapped areas.
Actually both patches are fixes for commit v2.6.33-5448-g6457474, because
it was aimed at one scenario (singly used pages), but it breaks the logic
in other scenarios (shared and/or executable pages).
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stephen Boyd [Wed, 16 Nov 2011 23:41:14 +0000 (10:41 +1100)]
drivers/scsi/sg.c: convert to kstrtoul_from_user()
Instead of open coding this function use kstrtoul_from_user() directly.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Doug Gilbert <dgilbert@interlog.com> Cc: Douglas Gilbert <dougg@torque.net> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jesper Juhl [Wed, 16 Nov 2011 23:41:13 +0000 (10:41 +1100)]
drivers/scsi/aacraid/commctrl.c: fix mem leak in aac_send_raw_srb()
We leak in drivers/scsi/aacraid/commctrl.c::aac_send_raw_srb() :
We allocate memory:
...
struct user_sgmap* usg;
usg = kmalloc(actual_fibsize - sizeof(struct aac_srb)
+ sizeof(struct sgmap), GFP_KERNEL);
and then neglect to free it:
...
for (i = 0; i < usg->count; i++) {
u64 addr;
void* p;
if (usg->sg[i].count >
((dev->adapter_info.options &
AAC_OPT_NEW_COMM) ?
(dev->scsi_host_ptr->max_sectors << 9) :
65536)) {
rcode = -EINVAL;
goto cleanup;
... this 'goto' makes 'usg' go out of scope and leak the memory we
allocated.
Other exits properly kfree(usg), it's just here it is neglected.
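The shape of the bug and of the fix can be reproduced in a few lines of userspace C. This is an analogy rather than the aacraid code: send_request(), struct sg_entry and the allocation-tracking helpers are invented for the example, with malloc()/free() standing in for kmalloc()/kfree().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Count live allocations so that a leak becomes observable. */
static int live_allocs;

static void *tracked_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		live_allocs++;
	return p;
}

static void tracked_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

struct sg_entry {
	unsigned int count;
};

/* Copy the caller's entries and validate each one against max_len. */
static int send_request(const struct sg_entry *in, unsigned int n,
			unsigned int max_len)
{
	struct sg_entry *copy;
	unsigned int i;
	int rcode = 0;

	copy = tracked_alloc(n * sizeof(*copy));
	if (!copy)
		return -ENOMEM;

	for (i = 0; i < n; i++) {
		copy[i] = in[i];
		if (copy[i].count > max_len) {
			/* The fix: free before leaving the scope of 'copy'. */
			tracked_free(copy);
			rcode = -EINVAL;
			goto cleanup;
		}
	}
	tracked_free(copy);
cleanup:
	return rcode;
}
```

Removing the tracked_free() call inside the loop reproduces the leak: the error path returns -EINVAL with live_allocs still at 1.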
Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Randy Dunlap [Wed, 16 Nov 2011 23:41:13 +0000 (10:41 +1100)]
drivers/scsi/megaraid.c: fix sparse warnings
Fix sparse warnings of right shift bigger than source value size:
drivers/scsi/megaraid.c:311:65: warning: right shift by bigger than source value
drivers/scsi/megaraid.c:313:65: warning: right shift by bigger than source value
drivers/scsi/megaraid.c:317:67: warning: right shift by bigger than source value
drivers/scsi/megaraid.c:319:67: warning: right shift by bigger than source value
Patch suggestion from email by Al Viro:
"Since both are claimed to be strings, I really suspect that this >> 8 is
misspelled >> 4 and they have a character followed by pair of two-digit
packed decimals in there..."
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Neela Syam Kolli <megaraidlinux@lsi.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
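To see why sparse complains, consider an 8-bit source value: shifting it right by 8 discards every bit, whereas shifting by 4 extracts the high nibble. The sketch below assumes, following Al Viro's guess quoted above, that the bytes hold two-digit packed decimals; the actual layout of the megaraid fields is not shown in this log, and all function names are hypothetical.

```c
#include <stdint.h>

/* High decimal digit of a packed-decimal byte: shift by 4, not 8. */
static unsigned int bcd_hi(uint8_t b)
{
	return b >> 4;
}

/* Low decimal digit of a packed-decimal byte. */
static unsigned int bcd_lo(uint8_t b)
{
	return b & 0x0f;
}

/* Combine both digits into the two-digit value. */
static unsigned int bcd_to_uint(uint8_t b)
{
	return bcd_hi(b) * 10 + bcd_lo(b);
}

/* What the warning points at: all eight bits are shifted out. */
static unsigned int shifted_out(uint8_t b)
{
	return b >> 8;
}
```

For a byte such as 0x42, the high and low digits are 4 and 2, while the shift by 8 yields 0 for every possible input.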
For headers that get exported to userland and make use of u32 style
type names, it is advised to include linux/types.h.
This fixes a headers_check warning.
Signed-off-by: Alexander Shishkin <virtuoso@slind.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Wed, 16 Nov 2011 23:41:12 +0000 (10:41 +1100)]
ocfs2: avoid unaligned access to dqc_bitmap
The dqc_bitmap field of struct ocfs2_local_disk_chunk is 32-bit aligned,
but not 64-bit aligned. The dqc_bitmap is accessed by ocfs2_set_bit(),
ocfs2_clear_bit(), ocfs2_test_bit(), or ocfs2_find_next_zero_bit(). These
are wrapper macros for ext2_*_bit(), which need to take an unsigned long
aligned address (though some architectures are able to handle unaligned
addresses correctly).
So some 64-bit architectures may not be able to access the dqc_bitmap
correctly.
This avoids such unaligned access by using another set of wrapper functions
for ext2_*_bit(). The code is taken from fs/ext4/mballoc.c, which also needs
to handle unaligned bitmap access.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
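One way such a wrapper can avoid unaligned access is to round the bitmap address down to an unsigned long boundary and fold the skipped bytes into the bit index. The sketch below is a user-space illustration of that idea under stated assumptions, not the ocfs2 or ext4 code: align_addr_and_bit(), demo_set_bit_le() and demo_set_bit_unaligned() are hypothetical names, and the byte-wise bit set merely stands in for ext2_set_bit().

```c
#include <stdint.h>

/*
 * Round 'addr' down to an unsigned long boundary and move the byte
 * offset into the bit index, so that (aligned address, new bit)
 * names the same bit as (addr, old bit) in little-endian bit order.
 */
static void *align_addr_and_bit(int *bit, void *addr)
{
	uintptr_t a = (uintptr_t)addr;
	uintptr_t mask = sizeof(unsigned long) - 1;

	*bit += (int)((a & mask) << 3);
	return (void *)(a & ~mask);
}

/* Little-endian bit set, byte by byte; stand-in for ext2_set_bit(). */
static void demo_set_bit_le(int bit, void *addr)
{
	((unsigned char *)addr)[bit >> 3] |= (unsigned char)(1u << (bit & 7));
}

/* Wrapper: correct the address first, then call the aligned helper. */
static void demo_set_bit_unaligned(int bit, void *addr)
{
	addr = align_addr_and_bit(&bit, addr);
	demo_set_bit_le(bit, addr);
}
```

Because the bit order is little-endian, adding eight bits per skipped byte keeps the target bit unchanged while the helper only ever sees an aligned base address.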
Akinobu Mita [Wed, 16 Nov 2011 23:41:12 +0000 (10:41 +1100)]
ext4: use proper little-endian bitops
ext4_{set,clear}_bit() is defined as __test_and_{set,clear}_bit_le() for
ext4. Only two ext4_{set,clear}_bit() calls check the return value. The
rest of calls ignore the return value and they can be replaced with
__{set,clear}_bit_le().
This changes ext4_{set,clear}_bit() from __test_and_{set,clear}_bit_le()
to __{set,clear}_bit_le() and introduces ext4_test_and_{set,clear}_bit()
for the two places where old bit needs to be returned.
This ext4_{set,clear}_bit() change is considered safe, because if someone
uses these macros without noticing the change, the new
ext4_{set,clear}_bit() have no return value and cause compiler errors
wherever the return value is used.
This also removes unused ext4_find_first_zero_bit().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
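The difference between the two flavours of bit operation can be sketched with plain C functions. These are hypothetical user-space stand-ins (demo_set_bit(), demo_test_and_set_bit()), not the kernel's __set_bit_le() or __test_and_set_bit_le(); they only illustrate why a void-returning setter turns accidental uses of the old return value into compile errors.

```c
/* Plain set: no return value, like the new ext4_set_bit(). */
static void demo_set_bit(int nr, unsigned char *map)
{
	map[nr >> 3] |= (unsigned char)(1u << (nr & 7));
}

/* Test-and-set: sets the bit and returns its previous value (0 or 1). */
static int demo_test_and_set_bit(int nr, unsigned char *map)
{
	unsigned char mask = (unsigned char)(1u << (nr & 7));
	int old = (map[nr >> 3] & mask) != 0;

	map[nr >> 3] |= mask;
	return old;
}
```

A caller that writes "if (demo_set_bit(5, map))" no longer compiles, because a void expression cannot be used as a condition; callers that need the old bit must switch to the test-and-set variant explicitly.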
Christine Chan [Wed, 16 Nov 2011 23:41:11 +0000 (10:41 +1100)]
kernel/timer.c: use debugobjects to catch deletion of uninitialized timers
del_timer_sync() calls debug_object_assert_init() to assert that a timer
has been initialized before calling lock_timer_base(). lock_timer_base()
would spin forever on a NULL(uninit-ed) base. The check is added to
del_timer() to prevent silent failure, even though it would not get stuck
in an infinite loop.
[sboyd@codeaurora.org: remove WARN, initialize timer function] Signed-off-by: Christine Chan <cschan@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christine Chan [Wed, 16 Nov 2011 23:41:11 +0000 (10:41 +1100)]
debugobjects: extend to assert that an object is initialized
Calling del_timer_sync() on an uninitialized timer leads to a never ending
loop in lock_timer_base() that spins checking for a non-NULL timer base.
Add an assertion to debugobjects to catch usage of uninitialized objects
so that we can initialize timers in the del_timer_sync() path before it
calls lock_timer_base().
[sboyd@codeaurora.org: clarify commit message] Signed-off-by: Christine Chan <cschan@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stephen Boyd [Wed, 16 Nov 2011 23:41:11 +0000 (10:41 +1100)]
debugobjects: be smarter about static objects
Remove the WARN_ON() in timer_fixup_activate() and actually use the return
code from fixup to tell the debugobjects code to print a warning. This
provides better diagnostic information via a nice debugobjects warning
instead of a simple WARN_ON(1) in the timer code with no information as to
what is wrong. We also assign a dummy timer callback so that if the timer
is actually set to fire we don't oops.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Christine Chan <cschan@codeaurora.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Amerigo Wang <amwang@redhat.com>
ERROR: Macros with complex values should be enclosed in parenthesis
#87: FILE: include/linux/ipc_namespace.h:126:
+#define DFLT_MSGSIZEMAX 1024*1024
ERROR: Macros with complex values should be enclosed in parenthesis
#88: FILE: include/linux/ipc_namespace.h:127:
+#define HARD_MSGSIZEMAX 16*1024*1024
total: 2 errors, 0 warnings, 75 lines checked
./patches/ipc-mqueue-update-maximums-for-the-mqueue-subsystem.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Doug Ledford <dledford@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
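The reason checkpatch insists on parentheses is operator precedence at the point of use. In the sketch below, the macro and function names are hypothetical; only the 1024*1024 value comes from the report above. Dividing by the unparenthesized macro expands to bytes / 1024 * 1024, which is evaluated left to right.

```c
#define BAD_MSGSIZEMAX  1024*1024	/* the form flagged by checkpatch */
#define GOOD_MSGSIZEMAX (1024*1024)	/* parenthesized form */

/* Expands to bytes / 1024 * 1024, i.e. (bytes / 1024) * 1024. */
static unsigned long chunks_bad(unsigned long bytes)
{
	return bytes / BAD_MSGSIZEMAX;
}

/* Expands to bytes / (1024*1024), the intended division. */
static unsigned long chunks_good(unsigned long bytes)
{
	return bytes / GOOD_MSGSIZEMAX;
}
```

For 16 MiB (16777216 bytes), chunks_good() yields 16, while chunks_bad() yields 16777216 because the division and multiplication cancel out.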
ipc/mqueue.c: In function 'mqueue_get_inode':
ipc/mqueue.c:154:4: error: implicit declaration of function 'vmalloc'
ipc/mqueue.c:154:19: warning: assignment makes pointer from integer without a cast
ipc/mqueue.c: In function 'mqueue_evict_inode':
ipc/mqueue.c:278:3: error: implicit declaration of function 'vfree'
Caused by commit 8a53f9442429 ("ipc/mqueue: update maximums for the
mqueue subsystem"). See Rule 1 in Documentation/SubmitChecklist.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Doug Ledford <dledford@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>