Michal Hocko [Tue, 26 Mar 2013 23:24:35 +0000 (10:24 +1100)]
memcg: further simplify mem_cgroup_iter
mem_cgroup_iter basically does two things currently. It takes care of the
housekeeping (reference counting, reclaim cookie) and it iterates through
a hierarchy tree (by using cgroup generic tree walk). The code would be
much easier to follow if we move the iteration outside of the
function (to __mem_cgroup_iter_next) so the distinction is clearer.
This patch doesn't introduce any functional changes.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Ying Han <yinghan@google.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Tue, 26 Mar 2013 23:24:35 +0000 (10:24 +1100)]
memcg: simplify mem_cgroup_iter
Current implementation of mem_cgroup_iter has to consider both css and
memcg to find out whether no group has been found (css==NULL - aka the
loop is completed) and that no memcg is associated with the found node
(!memcg - aka css_tryget failed because the group is no longer alive).
This leads to awkward tweaks like tests for css && !memcg to skip the
current node.
It will be much easier if we get rid of the css variable altogether and only
rely on memcg. In order to do that the iteration part has to skip dead
nodes. This sounds natural to me and as a nice side effect we will get a
simple invariant that memcg is always alive when non-NULL and all nodes
have been visited otherwise.
We could get rid of the surrounding while loop but keep it in for now to
make review easier. It will go away in the following patch.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Ying Han <yinghan@google.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Tue, 26 Mar 2013 23:24:35 +0000 (10:24 +1100)]
memcg: relax memcg iter caching
Now that the per-node-zone-priority iterator caches memory cgroups rather than
their css ids, we have to be careful and remove them from the iterator when
they are on the way out. Otherwise they might live for an unbounded amount of
time even though their group is already gone (until the global/targeted
reclaim triggers the zone under priority to find out the group is dead and
lets it find its final rest).
We can fix this issue by relaxing rules for the last_visited memcg.
Instead of taking a reference to the css before it is stored into
iter->last_visited we can just store its pointer and track the number of
removed groups from each memcg's subhierarchy.
This number would be stored into the iterator every time a memcg is
cached. If the iter count doesn't match the current walker root's one we
will start from the root again. The group counter is incremented up
the hierarchy every time a group is removed.
The iter_lock can be dropped because racing iterators cannot leak the
reference anymore as the reference count is not elevated for last_visited
when it is cached.
Locking rules got a bit complicated by this change though. The iterator
primarily relies on rcu read lock which makes sure that once we see a
valid last_visited pointer then it will be valid for the whole RCU walk.
smp_rmb makes sure that dead_count is read before last_visited and
last_dead_count while smp_wmb makes sure that last_visited is updated
before last_dead_count so the up-to-date last_dead_count cannot point to
an outdated last_visited. css_tryget then makes sure that the
last_visited is still alive in case the iteration races with the cached
group removal (css is invalidated before mem_cgroup_css_offline increments
dead_count).
In short:
mem_cgroup_iter
 rcu_read_lock()
 dead_count = atomic_read(parent->dead_count)
 smp_rmb()
 if (dead_count != iter->last_dead_count)
        last_visited POSSIBLY INVALID -> last_visited = NULL
 if (!css_tryget(iter->last_visited))
        last_visited DEAD -> last_visited = NULL
 next = find_next(last_visited)
 css_tryget(next)
 css_put(last_visited)  // css would be invalidated and parent->dead_count
                        // incremented if this was the last reference
 iter->last_visited = next
 smp_wmb()
 iter->last_dead_count = dead_count
 rcu_read_unlock()

cgroup_rmdir
 cgroup_destroy_locked
  atomic_add(CSS_DEACT_BIAS, &css->refcnt) // subsequent css_tryget fail
   mem_cgroup_css_offline
    mem_cgroup_invalidate_reclaim_iterators
     while(parent = parent_mem_cgroup)
        atomic_inc(parent->dead_count)
  css_put(css) // last reference held by cgroup core
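The validity check on the cached position can be illustrated with a small
single-threaded user-space model. This is only a sketch: struct group,
struct reclaim_iter and the three helpers are hypothetical names, the alive
flag stands in for css_tryget, and RCU plus the smp_rmb/smp_wmb pairing are
deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; not the kernel structure. */
struct group {
	struct group *parent;
	int dead_count;		/* removed groups below this one */
	bool alive;		/* models whether css_tryget() would succeed */
};

/* Hypothetical stand-in for the cached per-zone, per-priority position. */
struct reclaim_iter {
	struct group *last_visited;
	int last_dead_count;
};

/* Removal path: mark the group dead, then bump dead_count in every ancestor. */
static void group_offline(struct group *g)
{
	g->alive = false;
	for (struct group *p = g->parent; p; p = p->parent)
		p->dead_count++;
}

/*
 * Return the cached position if it can still be trusted, or NULL to restart
 * from the root. The dead_count sampled here is handed back to the caller so
 * that the same value is stored when the new position is cached.
 */
static struct group *iter_load(struct reclaim_iter *it, struct group *root,
			       int *dead_count)
{
	assert(root != NULL);
	*dead_count = root->dead_count;
	if (*dead_count != it->last_dead_count)
		return NULL;	/* something below root was removed */
	if (it->last_visited && !it->last_visited->alive)
		return NULL;	/* the cached group itself is dead */
	return it->last_visited;
}

/* Cache the new position together with the dead_count sampled at load time. */
static void iter_store(struct reclaim_iter *it, struct group *next,
		       int dead_count)
{
	it->last_visited = next;
	it->last_dead_count = dead_count;
}
```

In this model, removing any group under the walker's root makes the next
iter_load return NULL, even if the cached group itself is still alive, which
is the conservative restart described above.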
Spotted by Ying Han.
Original idea from Johannes Weiner.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Tue, 26 Mar 2013 23:24:34 +0000 (10:24 +1100)]
memcg: rework mem_cgroup_iter to use cgroup iterators
mem_cgroup_iter currently relies on css->id when walking down a group
hierarchy tree. This is really awkward because the tree walk depends on
the group creation ordering. The only guarantee is that a parent node is
visited before its children.
Example:
1) mkdir -p a a/d a/b/c
2) mkdir -p a/b/c a/d
Will create the same trees but the tree walks will be different:
1) a, d, b, c
2) a, b, c, d
574bd9f7 ("cgroup: implement generic child / descendant walk macros") has
introduced generic cgroup tree walkers which provide either pre-order or
post-order tree walk. This patch converts css->id based iteration to
pre-order tree walk to keep the semantics of the original iterator where
parent is always visited before its subtree.
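As an illustration of the pre-order property (a parent is always visited
before its subtree), here is a minimal user-space sketch. It is not the
cgroup walker: struct node, add_child, next_pre and walk_names are
hypothetical names, and the choice to append new children at the tail of the
sibling list is an assumption of this model.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tree node using first-child / next-sibling links. */
struct node {
	const char *name;
	struct node *parent, *first_child, *next_sibling;
};

/* Append child at the tail of parent's child list (assumption of the model). */
static void add_child(struct node *parent, struct node *child)
{
	struct node **link = &parent->first_child;

	child->parent = parent;
	child->next_sibling = NULL;
	while (*link)
		link = &(*link)->next_sibling;
	*link = child;
}

/*
 * Return the node following pos in a pre-order walk of root's subtree, or
 * NULL when the walk is complete. Passing pos == NULL starts at root.
 */
static struct node *next_pre(struct node *pos, struct node *root)
{
	if (!pos)
		return root;
	if (pos->first_child)
		return pos->first_child;	/* descend first */
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;		/* climb until a sibling exists */
	}
	return NULL;
}

/* Record the visit order as a ", " separated list of names in buf. */
static void walk_names(struct node *root, char *buf, size_t len)
{
	assert(len > 0);
	buf[0] = '\0';
	for (struct node *n = next_pre(NULL, root); n; n = next_pre(n, root)) {
		if (buf[0])
			strncat(buf, ", ", len - strlen(buf) - 1);
		strncat(buf, n->name, len - strlen(buf) - 1);
	}
}
```

Whatever order the siblings were linked in, next_pre never returns a node
before its parent, which is the only ordering guarantee the iterator needs.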
cgroup_for_each_descendant_pre suggests using post_create and pre_destroy
for proper synchronization with group addition and removal, respectively.
This implementation doesn't use those because a new memory cgroup is
initialized sufficiently for iteration in mem_cgroup_css_alloc already and
css reference counting ensures that both the last seen cgroup and the found
one are alive, or else signals that the group is dead and should be
skipped.
If the reclaim cookie is used we need to store the last visited group into
the iterator so we have to be careful that it doesn't disappear in the
meantime. Elevated reference count on the css keeps it alive even though
the group has been removed (parked waiting for the last dput so that it
can be freed).
Per node-zone-prio iter_lock has been introduced to ensure that css_tryget
and the update of iter->last_visited happen atomically. Otherwise two
racing walkers could both take a reference and only one release it, leading
to a css leak (which pins cgroup dentry).
Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Ying Han <yinghan@google.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Tue, 26 Mar 2013 23:24:34 +0000 (10:24 +1100)]
memcg: keep prev's css alive for the whole mem_cgroup_iter
The patchset tries to make mem_cgroup_iter saner in how it walks
hierarchies. css->id based traversal is far from ideal as it is not
deterministic because it depends on the creation ordering. In addition,
css_id is considered a burden for cgroup maintainers because it is
quite some code and memcg is the last user of it. After this series only
the swap accounting uses css_id but that one will follow up later.
The first patch is just preparatory and it changes when we release css of
the previously returned memcg. Nothing controversial.
The second patch is the core of the patchset and it replaces css_get_next
based on css_id by the generic cgroup pre-order. This brings some
challenges for the last visited group caching during the reclaim
(mem_cgroup_per_zone::reclaim_iter). We have to use memcg pointers
directly now which means that we have to keep a reference to those groups'
css to keep them alive.
I also folded iter_lock introduced by https://lkml.org/lkml/2013/1/3/295
in the previous version into this patch. Johannes felt the race I was
describing should be mostly harmless and I haven't been able to trigger it
so the lock doesn't deserve its own patch. It is still needed
temporarily, though, because the reference counting on iter->last_visited
depends on it. It will go away with the next patch.
The next patch fixes up an unbounded cgroup removal holdoff caused by the
elevated css refcount. The issue has been observed by Ying Han. Johannes
wasn't impressed by the previous version of the fix
(https://lkml.org/lkml/2013/2/8/379) which cleaned up pending references
during mem_cgroup_css_offline when a group is removed. He has suggested a
different approach in which the iterator checks whether a cached memcg is
still valid. More on that in the patch but the basic idea is that every
memcg tracks the number of removed subgroups and the iterator records this
number when a group is cached. These numbers are checked before
iter->last_visited is about to be used and the iteration is restarted if
it is invalid.
The fourth and fifth patches are an attempt to simplify
mem_cgroup_iter. css juggling is removed and the iteration logic is moved
to a helper so that the reference counting and iteration are separated.
The last patch just removes css_get_next as there is no user for it any
longer.
My testing looked as follows:
        A (use_hierarchy=1, limit_in_bytes=150M)
       /|\
      1 2 3
Child groups were created so that their number is never higher than 3 and
their limits were random between 50-100M. Each group hosts a kernel build
(starting with tar -xf so the tree is not shared, then make -jNUM_CPUs/3)
that is terminated after a random time (up to 5 minutes), and then the
group is removed.
This should exercise both leaf and hierarchical reclaim as well as races
with cgroup removals and debugging messages I added on top proved that.
100 groups were created during the test.
This patch:
css reference counting keeps the cgroup alive even though it has been
already removed. mem_cgroup_iter relies on this fact and takes a
reference to the returned group. The reference is then released on the
next iteration or mem_cgroup_iter_break. mem_cgroup_iter currently
releases the reference right after it gets the last css_id.
This is correct because neither prev's memcg nor cgroup are accessed after
that. This will change in the next patch so we need to keep the group
alive a bit longer, so let's move the css_put to the end of the function.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Ying Han <yinghan@google.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Tue, 26 Mar 2013 23:24:33 +0000 (10:24 +1100)]
mm/um: use free_highmem_page() to free highmem pages into buddy system
Use helper function free_highmem_page() to free highmem pages into
the buddy system.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Tue, 26 Mar 2013 23:24:33 +0000 (10:24 +1100)]
mm/SPARC: use free_highmem_page() to free highmem pages into buddy system
Use helper function free_highmem_page() to free highmem pages into
the buddy system.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: "David S. Miller" <davem@davemloft.net> Acked-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Tue, 26 Mar 2013 23:24:33 +0000 (10:24 +1100)]
mm/PPC: use free_highmem_page() to free highmem pages into buddy system
Use helper function free_highmem_page() to free highmem pages into
the buddy system.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Alexander Graf <agraf@suse.de> Cc: "Suzuki K. Poulose" <suzuki@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Tue, 26 Mar 2013 23:24:32 +0000 (10:24 +1100)]
mm/ARM: use free_highmem_page() to free highmem pages into buddy system
Use helper function free_highmem_page() to free highmem pages into
the buddy system.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Tue, 26 Mar 2013 23:24:31 +0000 (10:24 +1100)]
mm: introduce free_highmem_page() helper to free highmem pages into buddy system
The original goal of this patchset is to fix the bug reported by
https://bugzilla.kernel.org/show_bug.cgi?id=53501
Now it has also been expanded to reduce common code used by memory
initialization.
This is the second part, which applies on top of the previous part at:
http://marc.info/?l=linux-mm&m=136289696323825&w=2
It introduces a helper function free_highmem_page() to free highmem
pages into the buddy system when initializing mm subsystem.
Introduction of free_highmem_page() is one step forward to clean up
accesses and modifications of totalhigh_pages, totalram_pages and
zone->managed_pages etc. I hope we could remove all references to
totalhigh_pages from the arch/ subdirectory.
We have only tested this patchset on x86 platforms, and have done basic
compilation tests using cross-compilers from ftp.kernel.org. That means
some code may not pass compilation on some architectures. So any help
to test this patchset is welcome!
There are several other parts still under development:
Part3: refine code to manage totalram_pages, totalhigh_pages and
zone->managed_pages
Part4: introduce helper functions to simplify mem_init() and remove the
global variable num_physpages.
This patch:
Introduce helper function free_highmem_page(), which will be used by
architectures with HIGHMEM enabled to free highmem pages into the buddy
system.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Suzuki K. Poulose" <suzuki@in.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Attilio Rao <attilio.rao@citrix.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Cong Wang <amwang@redhat.com> Cc: David Daney <david.daney@cavium.com> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Michel Lespinasse <walken@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rik van Riel <riel@redhat.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yinghai Lu <yinghai@kernel.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Tue, 26 Mar 2013 23:24:30 +0000 (10:24 +1100)]
mm/xtensa: use common help functions to free reserved pages
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Tue, 26 Mar 2013 23:24:29 +0000 (10:24 +1100)]
mm/SPARC: use common help functions to free reserved pages
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Tue, 26 Mar 2013 23:24:28 +0000 (10:24 +1100)]
mm/ppc: use common help functions to free reserved pages
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Anatolij Gustschin <agust@denx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Tue, 26 Mar 2013 23:24:25 +0000 (10:24 +1100)]
mm/h8300: use common help functions to free reserved pages
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Tue, 26 Mar 2013 23:24:23 +0000 (10:24 +1100)]
mm/ARM: use common help functions to free reserved pages
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Tue, 26 Mar 2013 23:24:23 +0000 (10:24 +1100)]
mm/alpha: use common help functions to free reserved pages
Use common help functions to free reserved pages. Also include
<asm/sections.h> to avoid local declarations.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Tue, 26 Mar 2013 23:24:22 +0000 (10:24 +1100)]
mm: introduce common help functions to deal with reserved/managed pages
The original goal of this patchset is to fix the bug reported by
https://bugzilla.kernel.org/show_bug.cgi?id=53501
Now it has also been expanded to reduce common code used by memory
initialization.
This is the first part, which applies to v3.9-rc1.
It introduces the following common helper functions to simplify
free_initmem() and free_initrd_mem() on different architectures:
adjust_managed_page_count():
        will be used to adjust totalram_pages, totalhigh_pages,
        zone->managed_pages when reserving/unreserving a page.
__free_reserved_page():
        free a reserved page into the buddy system without adjusting
        page statistics info
free_reserved_page():
        free a reserved page into the buddy system and adjust page
        statistics info
mark_page_reserved():
        mark a page as reserved and adjust page statistics info
free_reserved_area():
        free a contiguous range of pages by calling free_reserved_page()
free_initmem_default():
        default method to free __init pages.
We have only tested this patchset on x86 platforms, and have done basic
compilation tests using cross-compilers from ftp.kernel.org. That means
some code may not pass compilation on some architectures. So any help to
test this patchset is welcome!
There are several other parts still under development:
Part2: introduce free_highmem_page() to simplify freeing highmem pages
Part3: refine code to manage totalram_pages, totalhigh_pages and
zone->managed_pages
Part4: introduce helper functions to simplify mem_init() and remove the
global variable num_physpages.
This patch:
Code to deal with reserved/managed pages is duplicated by many
architectures, so introduce common helper functions to reduce duplicated
code. These common helper functions will also be used to concentrate code
that modifies totalram_pages and zone->managed_pages, which makes the code
much clearer.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Anatolij Gustschin <agust@denx.de> Cc: Aurelien Jacquiot <a-jacquiot@ti.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Liqin <liqin.chen@sunplusct.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Howells <dhowells@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Hogan <james.hogan@imgtec.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> Cc: Lennox Wu <lennox.wu@gmail.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Mikael Starvik <starvik@axis.com> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Russell King <linux@arm.linux.org.uk> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jan Kara [Tue, 26 Mar 2013 23:24:22 +0000 (10:24 +1100)]
direct-io: Fix boundary block handling
When we read/write a file sequentially, we will read/write not only the
data blocks but also the indirect blocks that may not be physically
adjacent to the data blocks. So filesystems set the BH_Boundary flag to
submit the previous I/O before reading/writing an indirect block.
However, the generic direct IO code mishandles buffer_boundary(), setting
sdio->boundary before each submit_page_section() call, which results in
sending only one-page bios as the underlying code thinks this page is the
last in the contiguous extent. So fix the problem by setting sdio->boundary
only if the current page is really the last one in the mapped extent.
Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vineet Gupta [Tue, 26 Mar 2013 23:24:21 +0000 (10:24 +1100)]
memblock: add assertion for zero allocation alignment
This came to light when calling the memblock allocator from the ARC port
(for copying the flattened DT). If a "0" alignment is passed, the
allocator's round_up() call incorrectly rounds up the size to 0.
While the obvious allocation failure causes the kernel to panic, it is
better to warn the caller to fix the code.
Tejun suggested that instead of BUG_ON(!align) - which might be
ineffective due to pending console init and such, it is better to WARN_ON,
and continue the boot with a reasonable default align.
Caller passing @size need not be handled similarly as the subsequent
panic will indicate that anyhow.
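The arithmetic behind the zero-alignment failure can be reproduced in user
space. The sketch below assumes the usual power-of-two rounding form,
((x - 1) | (align - 1)) + 1, which is how I understand the kernel's
round_up() to behave; round_up_pow2 and sane_align are hypothetical names,
and the fallback value is a parameter rather than whatever default the patch
actually picks.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Power-of-two rounding. With align == 0 the mask (align - 1) wraps to
 * all ones, so the OR yields SIZE_MAX and the final +1 wraps to 0.
 */
static size_t round_up_pow2(size_t x, size_t align)
{
	return ((x - 1) | (align - 1)) + 1;
}

/*
 * Hypothetical "warn and continue" helper: flag a zero alignment and
 * substitute a caller-supplied fallback instead of failing outright.
 */
static size_t sane_align(size_t align, size_t fallback, int *warned)
{
	assert(warned != NULL);
	if (align == 0) {
		*warned = 1;	/* stands in for WARN_ON(!align) */
		return fallback;
	}
	return align;
}
```

Routing the alignment through a check like sane_align before rounding keeps
the requested size intact, so the boot can continue while the warning points
at the offending caller.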
Signed-off-by: Vineet Gupta <vgupta@synopsys.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hillf Danton [Tue, 26 Mar 2013 23:24:21 +0000 (10:24 +1100)]
rmap: recompute pgoff for unmapping huge page
We have to recompute pgoff if the given page is huge, since the result based
on HPAGE_SIZE is not appropriate for scanning the vma interval tree, as
shown by commit 36e4f20af833 ("hugetlb: do not use vma_hugecache_offset()
for vma_prio_tree_foreach").
Signed-off-by: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Magenheimer [Tue, 26 Mar 2013 23:24:21 +0000 (10:24 +1100)]
staging: zcache: enable zcache to be built/loaded as a module
Allow zcache to be built/loaded as a module. Note runtime dependency
disallows loading if cleancache/frontswap lazy initialization patches are
not present. Zsmalloc support has not yet been merged into zcache but,
once merged, could now easily be selected via a module_param.
If built-in (not built as a module), the original mechanism of enabling via
a kernel boot parameter is retained, but this should be considered deprecated.
Note that module unload is explicitly not yet supported.
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v1: Rebased with different order of patches]
[v2: Removed [CLEANCACHE|FRONTSWAP]_HAS_LAZY_INIT ifdef]
[v3: Rebased on top of ramster->zcache move]
[v4: Redid the Makefile]
[v5: s/ZCACHE2/ZCACHE/] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Magenheimer [Tue, 26 Mar 2013 23:24:20 +0000 (10:24 +1100)]
staging: zcache: enable ramster to be built/loaded as a module
Enable module support for ramster. Note runtime dependency disallows
loading if cleancache/frontswap lazy initialization patches are not
present.
If built-in (not built as a module), the original mechanism of enabling via
a kernel boot parameter is retained, but this should be considered deprecated.
Note that module unload is explicitly not yet supported.
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v1: Fixed compile issues since ramster_init now has four arguments]
[v2: Fixed rebase on ramster->zcache move] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zcache/tmem: Better error checking on frontswap_register_ops return value.
In the past it either used to be NULL or the "older" backend. Now we
also return -Exx error codes.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andor Daam <andor.daam@googlemail.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Magenheimer [Tue, 26 Mar 2013 23:24:20 +0000 (10:24 +1100)]
xen: tmem: enable Xen tmem shim to be built/loaded as a module
Allow Xen tmem shim to be built/loaded as a module. Xen self-ballooning
and frontswap-selfshrinking are now also "lazily" initialized when the Xen
tmem shim is loaded as a module, unless explicitly disabled by module
parameters.
Note runtime dependency disallows loading if cleancache/frontswap lazy
initialization patches are not present.
If built-in (not built as a module), the original mechanism of enabling via
a kernel boot parameter is retained, but this should be considered deprecated.
Note that module unload is explicitly not yet supported.
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v1: Removed the [CLEANCACHE|FRONTSWAP]_HAS_LAZY_INIT ifdef]
[v2: Squashed the xen/tmem: Remove the subsys call patch in] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bob Liu [Tue, 26 Mar 2013 23:24:19 +0000 (10:24 +1100)]
mm: cleancache: clean up cleancache_enabled
cleancache_ops is used to decide whether a backend is registered.
So now cleancache_enabled is always true if CONFIG_CLEANCACHE is defined.
Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cleancache: Make cleancache_init use a pointer for the ops
Use a pointer for the ops instead of a backend_registered flag to determine
whether a backend is enabled. This allows us to remove the
backend_registered check and just do 'if (cleancache_ops)'
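The idea of letting the ops pointer double as the "registered" flag can be
sketched as follows. This is a user-space illustration only: struct
cache_ops, register_ops, cache_get_page and demo_get_page are hypothetical
names, not the cleancache API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend operations table. */
struct cache_ops {
	int (*get_page)(int pool, long index);
};

/* NULL until a backend registers; no separate boolean flag is kept. */
static struct cache_ops *registered_ops;

/* Install a backend and return the previously installed one, if any. */
static struct cache_ops *register_ops(struct cache_ops *ops)
{
	struct cache_ops *old = registered_ops;

	assert(ops != NULL && ops->get_page != NULL);
	registered_ops = ops;
	return old;
}

/* Frontend entry point: fail fast when no backend is present. */
static int cache_get_page(int pool, long index)
{
	if (!registered_ops)
		return -1;
	return registered_ops->get_page(pool, index);
}

/* Trivial demo backend that always reports success. */
static int demo_get_page(int pool, long index)
{
	(void)pool;
	(void)index;
	return 0;
}
```

With a single pointer there is no window in which a flag says "registered"
while the ops are still unset, which is one plausible reason to prefer it.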
[v1: Rebase on top of b97c4b430b0a (ramster->zcache move] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Magenheimer [Tue, 26 Mar 2013 23:24:19 +0000 (10:24 +1100)]
mm: cleancache: lazy initialization to allow tmem backends to build/run as modules
With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to be
built/loaded as modules rather than built-in and enabled by a boot
parameter, this patch provides "lazy initialization", allowing backends to
register to cleancache even after filesystems were mounted. Calls to
init_fs and init_shared_fs are remembered as fake poolids but no real
tmem_pools are created. On backend registration the fake poolids are mapped
to real poolids and respective tmem_pools.
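The fake-to-real poolid mapping can be modelled in a few lines. The sketch
below is an assumption-laden illustration, not the cleancache code: MAX_FS,
FAKE_BASE, the slot markers and all function names are hypothetical, real
pool ids are assumed to be non-negative, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FS		4
#define FAKE_BASE	1000	/* hypothetical offset that marks a fake id */
#define SLOT_FREE	(-1)
#define SLOT_NO_BACKEND	(-2)	/* mounted, but no real pool exists yet */

/* Hypothetical backend: creates a pool and returns its real id. */
struct backend_ops {
	int (*init_fs)(void);
};

static struct backend_ops *ops;
static int poolid_map[MAX_FS] = { SLOT_FREE, SLOT_FREE, SLOT_FREE, SLOT_FREE };

/* Mount path: always hands out a fake id, creating a real pool if possible. */
static int fs_mount(void)
{
	for (int i = 0; i < MAX_FS; i++) {
		if (poolid_map[i] != SLOT_FREE)
			continue;
		poolid_map[i] = ops ? ops->init_fs() : SLOT_NO_BACKEND;
		return FAKE_BASE + i;
	}
	return -1;	/* no free slot */
}

/* Late registration: create real pools for every remembered mount. */
static void backend_register(struct backend_ops *new_ops)
{
	assert(new_ops != NULL && new_ops->init_fs != NULL);
	ops = new_ops;
	for (int i = 0; i < MAX_FS; i++)
		if (poolid_map[i] == SLOT_NO_BACKEND)
			poolid_map[i] = ops->init_fs();
}

/* Translate a fake id to the real pool id; negative if none exists yet. */
static int real_poolid(int fake_id)
{
	int i = fake_id - FAKE_BASE;

	if (i < 0 || i >= MAX_FS)
		return -1;
	return poolid_map[i];
}

/* Demo backend handing out consecutive real ids starting at 7. */
static int next_real_id = 7;
static int demo_init_fs(void)
{
	return next_real_id++;
}
```

Filesystems only ever hold the fake id, so a backend that shows up after
mount time can be wired in without touching the superblocks.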
Signed-off-by: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Florian Schmaus <fschmaus@gmail.com> Signed-off-by: Andor Daam <andor.daam@googlemail.com> Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v1: Minor fixes: used #define for some values and bools]
[v2: Removed CLEANCACHE_HAS_LAZY_INIT]
[v3: Added more comments, added a lock for [shared_|]fs_poolid_map] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Tue, 26 Mar 2013 23:24:18 +0000 (10:24 +1100)]
frontswap: get rid of swap_lock dependency
The frontswap initialization routine depends on swap_lock in order to be
atomic with respect to frontswap's first appearance. IOW, either frontswap
is not present and will fail all calls, or frontswap is fully functional.
But if a new swap_info_struct isn't registered by enable_swap_info, the swap
subsystem doesn't start I/O, so there is no race between the init procedure
and page I/O working on frontswap.
So let's remove the unnecessary swap_lock dependency.
Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Minchan Kim <minchan@kernel.org>
[v1: Rebased on my branch, reworked to work with backends loading late]
[v2: Added a check for !map]
[v3: Made the invalidate path follow the init path]
[v4: Address comments by Wanpeng Li <liwanp@linux.vnet.ibm.com>] Signed-off-by: Konrad Rzeszutek Wilk <konrad@darnok.org> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bob Liu [Tue, 26 Mar 2013 23:24:18 +0000 (10:24 +1100)]
mm: frontswap: cleanup code
After allowing tmem backends to build/run as modules, frontswap_enabled is
always true if CONFIG_FRONTSWAP is defined. But frontswap_test() depends on
whether a backend is registered, so move it into frontswap.c and use
frontswap_ops to make the decision.
frontswap_set/clear are not used outside frontswap, so don't export them.
Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
frontswap: make frontswap_init use a pointer for the ops
This simplifies the code in the frontswap - we can get rid of the
'backend_registered' test and instead check against frontswap_ops.
[v1: Rebase on top of 703ba7fe5e0 (ramster->zcache move)] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Magenheimer [Tue, 26 Mar 2013 23:24:18 +0000 (10:24 +1100)]
mm: frontswap: lazy initialization to allow tmem backends to build/run as modules
With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to be
built/loaded as modules rather than built-in and enabled by a boot
parameter, this patch provides "lazy initialization", allowing backends to
register to frontswap even after swapon was run. Before a backend
registers all calls to init are recorded and the creation of tmem_pools
delayed until a backend registers or until a frontswap store is attempted.
Signed-off-by: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Florian Schmaus <fschmaus@gmail.com> Signed-off-by: Andor Daam <andor.daam@googlemail.com> Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v1: Fixes per Seth Jennings suggestions]
[v2: Removed FRONTSWAP_HAS_.. ]
[v3: Fix up per Bob Liu <lliubbo@gmail.com> recommendations]
[v4: Fix up per Andrew's comments] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Paul E. McKenney [Tue, 26 Mar 2013 23:24:17 +0000 (10:24 +1100)]
vm: adjust ifdef for TINY_RCU
There is an ifdef in page_cache_get_speculative() that checks for !SMP and
TREE_RCU, which has been an impossible combination since the advent of
TINY_RCU. The ifdef enables a fastpath that is valid when preemption is
disabled by rcu_read_lock() in UP systems, which is the case when TINY_RCU
is enabled. This commit therefore adjusts the ifdef to generate the
fastpath when TINY_RCU is enabled.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Tue, 26 Mar 2013 23:24:17 +0000 (10:24 +1100)]
mm, show_mem: suppress page counts in non-blockable contexts
On large systems with a lot of memory, walking all RAM to determine page
types may take a half second or even more.
In non-blockable contexts, the page allocator will emit a page allocation
failure warning unless __GFP_NOWARN is specified. In such contexts, irqs
are typically disabled and such a lengthy delay may even result in NMI
watchdog timeouts.
To fix this, suppress the page walk in such contexts when printing the
page allocation failure warning.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
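The decision described above (skip the expensive page walk when the allocation context cannot block) can be sketched as a small filter computation. The flag names and values below are hypothetical stand-ins chosen for the example, not the kernel's definitions.

```c
/* Hypothetical flag values, for illustration only. */
#define DEMO_GFP_WAIT          0x10u  /* caller is allowed to block */
#define DEMO_FILTER_PAGE_COUNT 0x02u  /* suppress the per-page walk */

/* Build the show_mem-style filter for a failed allocation. */
static unsigned int demo_show_mem_filter(unsigned int gfp_mask)
{
    unsigned int filter = 0;

    /* Non-blockable context: irqs may be off, so avoid walking all RAM. */
    if (!(gfp_mask & DEMO_GFP_WAIT))
        filter |= DEMO_FILTER_PAGE_COUNT;
    return filter;
}

/* The reporting code would consult the filter before the slow walk. */
static int demo_should_walk_pages(unsigned int filter)
{
    return !(filter & DEMO_FILTER_PAGE_COUNT);
}
```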
Robert Jarzmik [Tue, 26 Mar 2013 23:24:16 +0000 (10:24 +1100)]
mm: trace filemap add and del
Use the events API to trace filemap loading and unloading of file pieces
into the page cache.
This patch aims at tracing the eviction reload cycle of executable and
shared libraries pages in a memory constrained environment.
The typical usage is to spot a specific device and inode (for example
/lib/libc.so) to see the eviction cycles, and find out if frequently used
code is rather spread across many pages (bad) or coalesced (good).
Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr> Cc: Dave Chinner <david@fromorbit.com> Cc: Hugh Dickins <hughd@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Tue, 26 Mar 2013 23:24:16 +0000 (10:24 +1100)]
HWPOISON: check dirty flag to match against clean page
Currently page_action() does not check dirty flag to determine whether the
error page is "clean mlocked/unevictable LRU" page. This doesn't cause
any misjudgement because we do matching against "dirty mlocked/unevictable
LRU" just before the check. But in order to make code consistent and/or
to avoid potential regression, we had better check dirty flag explicitly.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Suggested-by: Chen Gong <gong.chen@linux.intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sasha Levin [Tue, 26 Mar 2013 23:24:15 +0000 (10:24 +1100)]
watchdog: trigger all-cpu backtrace when locked up and going to panic
Send an NMI to all CPUs when a lockup is detected and the lockup watchdog
code is configured to panic. This gives us a fairly up-to-date snapshot of
all CPUs in the system.
This lets us get a stack trace of all CPUs, which makes life easier when
trying to debug a deadlock, and the NMI doesn't change anything since the
next step is a kernel panic.
Jan Kara [Tue, 26 Mar 2013 23:24:15 +0000 (10:24 +1100)]
fs: fix hang with BSD accounting on frozen filesystem
When BSD process accounting is enabled and logs information to a
filesystem which gets frozen, the system easily becomes unusable because
each attempt to account process information blocks. Thus, for example,
every task gets blocked in exit.
It seems better to drop accounting information (which can already happen
when filesystem is running out of space) instead of locking system up. So
we open the accounting file with O_NONBLOCK.
Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz> Tested-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Marco Stornelli <marco.stornelli@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Josh Hunt [Tue, 26 Mar 2013 23:24:14 +0000 (10:24 +1100)]
block: restore /proc/partitions to not display non-partitionable removable devices
We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.
Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.
Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Tue, 26 Mar 2013 23:24:14 +0000 (10:24 +1100)]
drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
Add CONFIG_PM_SLEEP to suspend/resume functions to fix the following build
warning when CONFIG_PM_SLEEP is not selected. This is because sleep PM
callbacks defined by SIMPLE_DEV_PM_OPS are only used when the
CONFIG_PM_SLEEP is enabled.
drivers/block/mg_disk.c:783:12: warning: 'mg_suspend' defined but not used [-Wunused-function]
drivers/block/mg_disk.c:807:12: warning: 'mg_resume' defined but not used [-Wunused-function]
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
lglock: update lockdep annotations to report recursive local locks
Oleg Nesterov recently noticed that the lockdep annotations in lglock.c
are not sufficient to detect some obvious deadlocks, such as
lg_local_lock(LOCK) + lg_local_lock(LOCK) or spin_lock(X) +
lg_local_lock(Y) vs lg_local_lock(Y) + spin_lock(X).
Both issues are easily fixed by indicating to lockdep that lglock's local
locks are not recursive. We shouldn't use the rwlock acquire/release
functions here, as lglock doesn't share the same semantics. Instead we
can base our lockdep annotations on the lock_acquire_shared (for local
lglock) and lock_acquire_exclusive (for global lglock) helpers.
I am not proposing new lglock specific helpers as I don't see the point of
the existing second level of helpers :)
Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@linux.intel.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In lockdep.h, the spinlock/mutex/rwsem/rwlock/lock_map acquire macros have
different definitions based on the value of CONFIG_PROVE_LOCKING. We have
separate ifdefs for each of these definitions, which seems redundant.
Introduce lock_acquire_{exclusive,shared,shared_recursive} helpers which
will have different definitions based on CONFIG_PROVE_LOCKING. Then all
other helper macros can be defined based on the above ones, which reduces
the amount of ifdefined code.
Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@linux.intel.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
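The layering described above, where only the base helpers depend on the config option and everything else is built on top of them, might be sketched like this. The demo_* names and the reduced two-argument signature are assumptions made for the example; the real lock_acquire() takes more parameters.

```c
/* Records the arguments of the most recent acquisition, for illustration. */
static int demo_last_read;   /* 0 = exclusive, 1 = shared, 2 = shared recursive */
static int demo_last_check;  /* 1 = basic checks, 2 = full validation */

static void demo_lock_acquire(int read, int check)
{
    demo_last_read = read;
    demo_last_check = check;
}

/* The only place that depends on the (hypothetical) config option. */
#ifdef DEMO_PROVE_LOCKING
#define demo_acquire_exclusive()        demo_lock_acquire(0, 2)
#define demo_acquire_shared()           demo_lock_acquire(1, 2)
#define demo_acquire_shared_recursive() demo_lock_acquire(2, 2)
#else
#define demo_acquire_exclusive()        demo_lock_acquire(0, 1)
#define demo_acquire_shared()           demo_lock_acquire(1, 1)
#define demo_acquire_shared_recursive() demo_lock_acquire(2, 1)
#endif

/* Per-lock-type helpers no longer need their own ifdefs. */
#define demo_spin_acquire()        demo_acquire_exclusive()
#define demo_mutex_acquire()       demo_acquire_exclusive()
#define demo_rwlock_acquire()      demo_acquire_exclusive()
#define demo_rwlock_acquire_read() demo_acquire_shared_recursive()
#define demo_rwsem_acquire_read()  demo_acquire_shared()
```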
James Hogan [Tue, 26 Mar 2013 23:24:13 +0000 (10:24 +1100)]
debug_locks.h: make warning more verbose
The WARN_ON(1) in DEBUG_LOCKS_WARN_ON is surprisingly awkward to track
down when it's hit, as it's usually buried in macros, causing multiple
instances to land on the same line number.
This patch makes it more useful by switching to:
WARN(1, "DEBUG_LOCKS_WARN_ON(%s)", #c);
so that the particular DEBUG_LOCKS_WARN_ON is more easily identified and
grep'd for. For example:
WARNING: at kernel/mutex.c:198 _mutex_lock_nested+0x31c/0x380()
DEBUG_LOCKS_WARN_ON(l->magic != l)
Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: David Howells <dhowells@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
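The effect of the `#c` stringification can be tried outside the kernel with a small stand-in. In this sketch DEMO_WARN writes into a buffer instead of logging, and all DEMO_/demo_ names are hypothetical; only the use of the preprocessor's # operator mirrors the change described above.

```c
#include <stdio.h>
#include <string.h>

static char demo_last_warning[128];

/* Stand-in for WARN(): formats the message into a buffer instead of logging. */
#define DEMO_WARN(fmt, ...) \
    snprintf(demo_last_warning, sizeof(demo_last_warning), fmt, __VA_ARGS__)

/* #c turns the source text of the condition into a string literal. */
#define DEMO_LOCKS_WARN_ON(c) \
    ((c) ? (DEMO_WARN("DEBUG_LOCKS_WARN_ON(%s)", #c), 1) : 0)

/* Helper for checking what the last warning said. */
static int demo_warning_matches(const char *expected)
{
    return strcmp(demo_last_warning, expected) == 0;
}
```

Because the message carries the text of the condition, two call sites that expand on the same source line can still be told apart.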
Jeff Liu [Tue, 26 Mar 2013 23:24:12 +0000 (10:24 +1100)]
ocfs2: delay inode update transactions after verifying the input flags
There is no need to start the inode update transactions before/while
verifying the input flags. As a refinement, this patch delays the
transactions until the pre-check has passed.
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhang Yanfei [Tue, 26 Mar 2013 23:24:12 +0000 (10:24 +1100)]
ipvs: change type of netns_ipvs->sysctl_sync_qlen_max
This member of struct netns_ipvs is calculated from nr_free_buffer_pages
so change its type to unsigned long in case of overflow. Also, type of
its related proc var sync_qlen_max and the return type of function
sysctl_sync_qlen_max() should be changed to unsigned long, too.
Besides, the type of ipvs_master_sync_state->sync_queue_len should be
changed to unsigned long accordingly.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Borislav Petkov [Tue, 26 Mar 2013 23:24:12 +0000 (10:24 +1100)]
scripts/decodecode: make faulting insn ptr more robust
It can accidentally happen that the faulting insn (the exact instruction
bytes) is repeated a little further on in the trace. This causes that
same instruction to be tagged twice, see example below.
What we want to do, however, is to track back from the end of the whole
disassembly so many lines as the slice which starts with the faulting
instruction is long. This leads us to the actual faulting instruction and
*then* we tag it.
While we're at it, we can drop the sed "g" flag because we address only
this one line.
Also, if we point to an instruction which changes decoding depending on
the slice being objdumped, like a Jcc insn, for example, we do not even
tag it as a faulting instruction because the instruction decode changes in
the second slice but we use that second format as a regex on the first
disassembled buffer and more often than not that instruction doesn't
match.
Again, simply tag the line which is deduced from the original "<>" marking
we've received from the kernel.
This also solves the pathologic issue of multiple tagging like this:
Rob Landley [Tue, 26 Mar 2013 23:24:11 +0000 (10:24 +1100)]
headers_install.pl: convert to headers_install.sh
Remove perl from make headers_install by replacing a perl script (doing a
simple regex search and replace) with a smaller, faster, simpler,
POSIX-2008 shell script implementation. The new shell script is a single
for loop calling sed and piping its output through unifdef to produce the
target file.
Same as last time except for minor tweak to deal with code review from
here: http://lkml.indiana.edu/hypermail/linux/kernel/1302.3/00078.html
(Note that this drops the "arch" argument, which isn't used. Kbuild
already points to the right input files on the command line.)
Signed-off-by: Rob Landley <rob@landley.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Boyer <jwboyer@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowell@redhat.com> Cc: Michal Marek <mmarek@suse.cz> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rob Landley [Tue, 26 Mar 2013 23:24:11 +0000 (10:24 +1100)]
mkcapflags.pl: convert to mkcapflags.sh
Generate asm-x86/cpufeature.h with posix-2008 commands instead of perl.
Signed-off-by: Rob Landley <rob@landley.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Boyer <jwboyer@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowell@redhat.com> Cc: Michal Marek <mmarek@suse.cz> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Engraf [Tue, 26 Mar 2013 23:24:11 +0000 (10:24 +1100)]
ktime_add_ns() may overflow on 32bit architectures
I've triggered an overflow when using ktime_add_ns() on a 32bit
architecture not supporting CONFIG_KTIME_SCALAR.
When passing a very high value for u64 nsec, e.g. 7881299347898368000 the
do_div() function converts this value to seconds (7881299347) which is
still to high to pass to the ktime_set() function as long. The result in
my case is a negative value.
The problem on my system occurs in the tick-sched.c,
tick_nohz_stop_sched_tick() when time_delta is set to
timekeeping_max_deferment(). The check for time_delta < KTIME_MAX is
valid, thus ktime_add_ns() is called with a too large value resulting in a
negative expire value. This leads to an endless loop in the ticker code:
time_delta: 7881299347898368000
expires = ktime_add_ns(last_update, time_delta)
expires: negative value
This error doesn't occur on 64-bit architectures or on architectures
supporting CONFIG_KTIME_SCALAR (e.g. ARM, x86-32). 64-bit arches don't run
into this problem because ktime_add_ns() can directly calculate the result
without calling do_div() and ktime_set().
Signed-off-by: David Engraf <david.engraf@sysgo.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
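The arithmetic behind the overflow can be checked with a short userspace sketch. The demo_* helpers are hypothetical and only model the numbers quoted above; they assume that "long" is 32 bits wide on the affected architecture and that truncation keeps the low 32 bits.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/* Whole seconds contained in a nanosecond count (the do_div() quotient). */
static uint64_t demo_ns_to_secs(uint64_t nsec)
{
    return nsec / DEMO_NSEC_PER_SEC;
}

/* Does the seconds value fit in a signed 32-bit long? */
static int demo_fits_s32(uint64_t secs)
{
    return secs <= (uint64_t)INT32_MAX;
}

/* Low 32 bits that survive truncation to a 32-bit long. */
static uint32_t demo_truncate_u32(uint64_t secs)
{
    return (uint32_t)secs;
}
```

For the reported value, 7881299347898368000 ns gives 7881299347 seconds, which exceeds 2147483647. The truncated low word is 3586332051, whose top bit is set, so a two's complement 32-bit long would read it as negative.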
Add some initial basic tests on a few posix timers interface such as
setitimer() and timer_settime().
These simply check that expiration happens in a reasonable timeframe after
expected elapsed clock time (user time, user + system time, real time,
...).
This is helpful for finding basic breakages while hacking
on this subsystem.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The posix cpu timer expiry time is stored in a union of two types: a 64
bits field if we rely on scheduler precise accounting, or a cputime_t if
we rely on jiffies.
This results in quite some duplicate code and special cases to handle the
two types.
Just unify this into a single 64 bits field. cputime_t can always fit
into it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Jones <davej@redhat.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
v3: Corrected the case where max_cpus != nr_cpu_ids by exiting early.
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Reported-by: Dave Jones <davej@redhat.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nathan Zimmer [Tue, 26 Mar 2013 23:24:09 +0000 (10:24 +1100)]
timer_list: convert timer list to be a proper seq_file
When running with 4096 cores, attempting to read /proc/timer_list will fail
with an ENOMEM condition. On sufficiently large systems the total amount
of data is more than 4MB, so it won't fit into a single buffer. The
failure can also occur on smaller systems when memory fragmentation is
high as reported by Dave Jones.
Convert /proc/timer_list to a proper seq_file with its own iterator. This
is a little more complex given that we have to make two passes with two
separate headers.
[akpm@linux-foundation.org: whitespace fixlet]
[akpm@linux-foundation.org: fix up comment]
[akpm@linux-foundation.org: fix gcc warnings]
[akpm@linux-foundation.org: fix typo in comment] Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Reported-by: Dave Jones <davej@redhat.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nathan Zimmer [Tue, 26 Mar 2013 23:24:08 +0000 (10:24 +1100)]
timer_list-split-timer_list_show_tickdevices-v4
v4: correct extra whitespace
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Dave Jones <davej@redhat.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nathan Zimmer [Tue, 26 Mar 2013 23:24:08 +0000 (10:24 +1100)]
timer_list: split timer_list_show_tickdevices()
Split the header out of timer_list_show_tickdevices() and just pull the rest
up to timer_list_show. Also tweak the location of the whitespace. This is all
to prep for the fix.
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Reported-by: Dave Jones <davej@redhat.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ondrej Zary [Tue, 26 Mar 2013 23:24:08 +0000 (10:24 +1100)]
cyber2000fb: avoid palette corruption at higher clocks
When 1280x1024@75Hz mode is set, console palette is not set properly -
sometimes the background is white, sometimes yellow and text colors are
also messed up. This does not happen at 1280x1024@60Hz and below.
It seems that the HW needs some time before setting the palette - maybe
the PLL needs more time to lock at higher speeds. This patch fixes the
problem but without knowing what register to check for PLL lock(?), the
delay might be excessive.
On Fri, 28 Jan 2011 18:15:37 +0000
Russell King <rmk@arm.linux.org.uk> wrote:
> On Tue, Jan 18, 2011 at 01:14:24PM -0800, Andrew Morton wrote:
> > Russell, I have an (old) note here that this is awaiting an ack from
> > yourself?
>
> Well, I can reproduce this problem on the Netwinders here. I'm not sure
> that we should delay all mode switches by one second - and any attempt
> to reduce this value does result in the palette not being set correctly.
>
> For 1280x1024-75, the dotclock is 135MHz, which gives a PLL values of
> 0x41 and 0x06. That's: M=0x41+1, N=0x06+1, P=0x00 (top 2 bits of 0x06)
> -> Q=1
>
> Fpll = 14.31818MHz * M / N
> Fout = Fpll / Q
>
> The PLL itself is formed by dividing the 14-ish MHz frequency by N and
> phase comparing the output of the VCO, divided by M, and adjusting the
> VCO until the two correlate. As VCOs typically tend to have a limited
> range, it's normal to divide the output frequency to produce a greater
> range - and in this case that's done by Q.
>
> For the 800x600-100 copied from /etc/fb.modes, this has a dotclock of
> 67.5MHz, which is exactly half this rate. The PLL values for this are:
> M=0x41+1, N=0x06+1, P=0x01, giving PLL values of 0x41 and 0x46.
>
> Booting with 800x600-100 does not suffer the problem. So it's not
> related to PLL lock time. There's something else going on.
>
> Another experiment I tried was forcing the PLL values to produce 108MHz
> instead of 135MHz. 108MHz is the dotclock for 1280x1024-60. This too
> doesn't suffer the problem.
>
> I've also tried choosing other delay values. 100ms is too short and
> produces the problem, but 1s works. 1s for a PLL to lock is a hell of
> a time, especially for a PLL operating in the MHz range.
>
> I've tried setting the PLL to a known good frequency, and then switching
> to 135MHz - the problem persists. It's not like 135MHz is reaching the
> limits - it'll go up to 206MHz.
>
> So, I don't think this has anything to do with PLL locking. I think
> there's something else going on which isn't immediately obvious - maybe
> bandwidth starvation preventing us from writing properly to the palette?
> As it's a horrible VGA, where you write the same register multiple times
> I wouldn't be surprised if some writes were going missing.
>
> I'll see if I can play around with it some more this evening, but I've
> spent an awful long time on just this issue already this afternoon...
>
> I think further investigation needs to happen on this patch before it's
> acceptable. Or maybe we should prevent the cyberpro coming up in
Signed-off-by: Ondrej Zary <linux@rainbow-software.org> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
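The PLL arithmetic in the quoted mail can be reproduced with a few lines of C. This is a sketch under stated assumptions: the register layout (M in the first byte, N in the low six bits and P in the top two bits of the second byte) is taken from the mail, while Q = 2^P is an assumption that merely fits the two data points given there. The demo_* names are hypothetical.

```c
#define DEMO_REF_MHZ 14.31818  /* reference frequency from the quoted mail */

/* Compute the output dotclock in MHz from the two PLL register bytes. */
static double demo_pll_out_mhz(unsigned int reg_m, unsigned int reg_n)
{
    unsigned int m = reg_m + 1;            /* M = register value + 1 */
    unsigned int n = (reg_n & 0x3f) + 1;   /* N = low six bits + 1 */
    unsigned int p = (reg_n >> 6) & 0x3;   /* P = top two bits */
    unsigned int q = 1u << p;              /* assumption: Q = 2^P */
    double fpll = DEMO_REF_MHZ * m / n;    /* Fpll = Fref * M / N */

    return fpll / q;                       /* Fout = Fpll / Q */
}
```

With register values 0x41 and 0x06 this gives roughly 135 MHz (M = 66, N = 7, Q = 1), and with 0x41 and 0x46 roughly 67.5 MHz, matching the two modes discussed in the mail.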
Zhou Zhu [Tue, 26 Mar 2013 23:24:07 +0000 (10:24 +1100)]
drivers/video/mmp: remove legacy hw definitions
Removed legacy hw definitions in hw/mmp_ctrl.h. These definitions are for
earlier soc versions and are not supported in this driver.
Signed-off-by: Zhou Zhu <zzhu3@marvell.com> Cc: Paul Bolle <pebolle@tiscali.nl> Cc: Lisa Du <cldu@marvell.com> Cc: Guoqing Li <ligq@marvell.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Haiyang Zhang [Tue, 26 Mar 2013 23:24:07 +0000 (10:24 +1100)]
video: fix a type warning in hyperv_fb.c
Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Haiyang Zhang [Tue, 26 Mar 2013 23:24:06 +0000 (10:24 +1100)]
drivers/video: add Hyper-V Synthetic Video Frame Buffer Driver
This is the driver for the Hyper-V Synthetic Video, which supports screen
resolution up to Full HD 1920x1080 on Windows Server 2012 host, and
1600x1200 on Windows Server 2008 R2 or earlier. It also solves the double
mouse cursor issue of the emulated video mode.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: K. Y. Srinivasan <kys@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Olaf Hering <olaf@aepfle.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Devendra Naga [Tue, 26 Mar 2013 23:24:06 +0000 (10:24 +1100)]
drivers/video/console/fbcon_cw.c: fix compiler warning in cw_update_attr
Building with make W=1 shows the following warning:
drivers/video/console/fbcon_cw.c: In function `cw_update_attr':
drivers/video/console/fbcon_cw.c:30:8: warning: variable `t' set but not used [-Wunused-but-set-variable]
matroxfb: convert struct i2c_msg initialization to C99 format
Convert the struct i2c_msg initialization to C99 format. This makes
maintaining and editing the code simpler. It also helps once other fields,
such as transferred, are added in the future.
Thanks to Julia Lawall for automating the conversion.
Signed-off-by: Shubhrajyoti D <shubhrajyoti@ti.com> Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: Julia Lawall <julia@diku.dk> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
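The difference between positional and C99 designated initialization can be illustrated with a local stand-in structure. The struct below only mirrors the field names of struct i2c_msg; the address 0x48, the buffers, and all demo_/DEMO_ names are made up for the example.

```c
#include <stdint.h>

/* Local stand-in that mirrors the field names of struct i2c_msg. */
struct demo_i2c_msg {
    uint16_t addr;
    uint16_t flags;
    uint16_t len;
    uint8_t *buf;
};

#define DEMO_I2C_M_RD 0x0001  /* read flag, hypothetical local copy */

static uint8_t demo_reg = 0x10;
static uint8_t demo_val;

/* Positional form: correct only while the field order never changes. */
static struct demo_i2c_msg demo_old_style[] = {
    { 0x48, 0, 1, &demo_reg },
    { 0x48, DEMO_I2C_M_RD, 1, &demo_val },
};

/* Designated form: each value names its field, so a new field is harmless. */
static struct demo_i2c_msg demo_c99_style[] = {
    { .addr = 0x48, .flags = 0, .len = 1, .buf = &demo_reg },
    { .addr = 0x48, .flags = DEMO_I2C_M_RD, .len = 1, .buf = &demo_val },
};
```

Fields not named in a designated initializer are zero-initialized, which is why adding a member later does not require touching every initializer.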
Daniel Vetter [Tue, 26 Mar 2013 23:24:05 +0000 (10:24 +1100)]
drm/fb-helper: don't sleep for screen unblank when an oops is in progress
Otherwise the system will burn even brighter and worse, leave the user
wondering what's going on exactly.
Since we already have a panic handler which will (try) to restore the
entire fbdev console mode, we can just bail out. Inspired by a patch from
Konstantin Khlebnikov. The callchain leading to this, cut&pasted from
Konstantin's original patch:
Note that the entire locking in the fb helper around panic/sysrq and kdbg
is ... non-existent. So we have a decent chance of blowing up
everything. But since reworking this ties in with funny concepts like the
fbdev notifier chain or the impressive things which happen around
console_lock while oopsing, I'll leave that as an exercise for braver
souls than me.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Dave Airlie <airlied@gmail.com> Reviewed-by: Rob Clark <robdclark@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>