Minchan Kim [Sat, 21 Jul 2012 00:54:19 +0000 (10:54 +1000)]
mm: factor out memory isolate functions
mm/page_alloc.c has some memory isolation functions but they are used only
when we enable CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. So let's make
it configurable by new CONFIG_MEMORY_ISOLATION so that it can reduce
binary size and we can check it simple by CONFIG_MEMORY_ISOLATION, not if
defined CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}.
Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Sat, 21 Jul 2012 00:54:18 +0000 (10:54 +1000)]
mm, memcg: move all oom handling to memcontrol.c
By globally defining check_panic_on_oom(), the memcg oom handler can be
moved entirely to mm/memcontrol.c. This removes the ugly #ifdef in the
oom killer and cleans up the code.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Slab poisoning gave me a General Protection Fault on the
atomic_dec(&__task_cred(p)->user->processes);
line of release_task() called from wait_task_zombie(),
every time my dd to USB testing generated a memcg OOM.
oom_kill_process() now does the put_task_struct(),
mem_cgroup_out_of_memory() should not repeat it.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Sat, 21 Jul 2012 00:54:18 +0000 (10:54 +1000)]
mm, oom: reduce dependency on tasklist_lock
Since exiting tasks require write_lock_irq(&tasklist_lock) several times,
try to reduce the amount of time the readside is held for oom kills. This
makes the interface with the memcg oom handler more consistent since it
now never needs to take tasklist_lock unnecessarily.
The only time the oom killer now takes tasklist_lock is when iterating the
children of the selected task, everything else is protected by
rcu_read_lock().
This requires that a reference to the selected process, p, is grabbed
before calling oom_kill_process(). It may release it and grab a reference
on another one of p's threads if !p->mm, but it also guarantees that it
will release the reference before returning.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Sat, 21 Jul 2012 00:54:17 +0000 (10:54 +1000)]
mm, memcg: introduce own oom handler to iterate only over its own threads
The global oom killer is serialized by the per-zonelist
try_set_zonelist_oom() which is used in the page allocator. Concurrent
oom kills are thus a rare event and only occur in systems using
mempolicies and with a large number of nodes.
Memory controller oom kills, however, can frequently be concurrent since
there is no serialization once the oom killer is called for oom conditions
in several different memcgs in parallel.
This creates a massive contention on tasklist_lock since the oom killer
requires the readside for the tasklist iteration. If several memcgs are
calling the oom killer, this lock can be held for a substantial amount of
time, especially if threads continue to enter it as other threads are
exiting.
Since the exit path grabs the writeside of the lock with irqs disabled in
a few different places, this can cause a soft lockup on cpus as a result
of tasklist_lock starvation.
The kernel lacks unfair writelocks, and successful calls to the oom killer
usually result in at least one thread entering the exit path, so an
alternative solution is needed.
This patch introduces a seperate oom handler for memcgs so that they do
not require tasklist_lock for as much time. Instead, it iterates only
over the threads attached to the oom memcg and grabs a reference to the
selected thread before calling oom_kill_process() to ensure it doesn't
prematurely exit.
This still requires tasklist_lock for the tasklist dump, iterating
children of the selected process, and killing all other threads on the
system sharing the same memory as the selected victim. So while this
isn't a complete solution to tasklist_lock starvation, it significantly
reduces the amount of time that it is held.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Sat, 21 Jul 2012 00:54:17 +0000 (10:54 +1000)]
mm, oom: introduce helper function to process threads during scan
This patch introduces a helper function to process each thread during the
iteration over the tasklist. A new return type, enum oom_scan_t, is
defined to determine the future behavior of the iteration:
- OOM_SCAN_OK: continue scanning the thread and find its badness,
- OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's
ineligible,
- OOM_SCAN_ABORT: abort the iteration and return, or
- OOM_SCAN_SELECT: always select this thread with the highest badness
possible.
There is no functional change with this patch. This new helper function
will be used in the next patch in the memory controller.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Sat, 21 Jul 2012 00:54:16 +0000 (10:54 +1000)]
mm/hotplug: free zone->pageset when a zone becomes empty
When a zone becomes empty after memory offlining, free zone->pageset.
Otherwise it will cause memory leak when adding memory to the empty zone
again because build_all_zonelists() will allocate zone->pageset for an
empty zone.
Signed-off-by: Jiang Liu <liuj97@gmail.com> Signed-off-by: Wei Wang <Bessel.Wang@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Keping Chen <chenkeping@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Sat, 21 Jul 2012 00:54:16 +0000 (10:54 +1000)]
mm/hotplug: correctly add new zone to all other nodes' zone lists
When online_pages() is called to add new memory to an empty zone, it
rebuilds all zone lists by calling build_all_zonelists(). But there's a
bug which prevents the new zone to be added to other nodes' zone lists.
Here the node of the zone is put into N_HIGH_MEMORY state after calling
build_all_zonelists(), but build_all_zonelists() only adds zones from
nodes in N_HIGH_MEMORY state to the fallback zone lists.
build_all_zonelists()
So memory in the new zone will never be used by other nodes, and it may
cause strange behavor when system is under memory pressure. So put node
into N_HIGH_MEMORY state before calling build_all_zonelists().
Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Jiang Liu <liuj97@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Keping Chen <chenkeping@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Sat, 21 Jul 2012 00:54:16 +0000 (10:54 +1000)]
mm/hotplug: correctly setup fallback zonelists when creating new pgdat
When hotadd_new_pgdat() is called to create new pgdat for a new node, a
fallback zonelist should be created for the new node. There's code to try
to achieve that in hotadd_new_pgdat() as below:
/*
* The node we allocated has no zone fallback lists. For avoiding
* to access not-initialized zonelist, build here.
*/
mutex_lock(&zonelists_mutex);
build_all_zonelists(pgdat, NULL);
mutex_unlock(&zonelists_mutex);
But it doesn't work as expected. When hotadd_new_pgdat() is called, the
new node is still in offline state because node_set_online(nid) hasn't
been called yet. And build_all_zonelists() only builds zonelists for
online nodes as:
Though we hope to create zonelist for the new pgdat, but it doesn't. So
add a new parameter "pgdat" the build_all_zonelists() to build pgdat for
the new pgdat too.
Signed-off-by: Jiang Liu <liuj97@gmail.com> Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Keping Chen <chenkeping@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: setup pageblock_order before it's used by sparsemem
On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as
Itanium, pageblock_order is a variable with default value of 0. It's set
to the right value by set_pageblock_order() in function
free_area_init_core().
But pageblock_order may be used by sparse_init() before free_area_init_core()
is called along path:
sparse_init()
->sparse_early_usemaps_alloc_node()
->usemap_size()
->SECTION_BLOCKFLAGS_BITS
->((1UL << (PFN_SECTION_SHIFT - pageblock_order)) *
NR_PAGEBLOCK_BITS)
The uninitialized pageblock_size will cause memory wasting because
usemap_size() returns a much bigger value then it's really needed.
For example, on an Itanium platform,
sparse_init() pageblock_order=0 usemap_size=24576
free_area_init_core() before pageblock_order=0, usemap_size=24576
free_area_init_core() after pageblock_order=12, usemap_size=8
That means 24K memory has been wasted for each section, so fix it by calling
set_pageblock_order() from sparse_init().
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Jiang Liu <liuj97@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Keping Chen <chenkeping@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Sat, 21 Jul 2012 00:54:14 +0000 (10:54 +1000)]
vmscan: remove obsolete shrink_control comment
09f363c7 ("vmscan: fix shrinker callback bug in fs/super.c") fixed a
shrinker callback which was returning -1 when nr_to_scan is zero, which
caused excessive slab scanning. But 635697c6 ("vmscan: fix initial
shrinker size handling") fixed the problem, again so we can freely return
-1 although nr_to_scan is zero. So let's revert 09f363c7 because the
comment added in 09f363c7 made an unnecessary rule.
Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
0ee332c14518699 ("memblock: Kill early_node_map[]") wanted to replace
CONFIG_ARCH_POPULATES_NODE_MAP with CONFIG_HAVE_MEMBLOCK_NODE_MAP but
ended up replacing one occurence with a reference to the non-existent
symbol CONFIG_HAVE_MEMBLOCK_NODE.
The resulting omission of code would probably have been causing problems
to 32-bit machines with memory hotplug.
Signed-off-by: Rabin Vincent <rabin@rab.in> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: have order > 0 compaction start off where it left
This patch makes the comment for cc->wrapped longer, explaining what is
really going on. It also incorporates the comment fix pointed out by
Minchan.
Additionally, Minchan found that, when no pages get isolated, high_pfn
could be a value that is much lower than desired, which might potentially
cause compaction to skip a range of pages.
Only assign zone->compact_cache_free_pfn if we actually isolated free
pages for compaction.
Split out the calculation to get the start of the last page block in a
zone into its own, commented function.
Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Jim Schutt <jaschut@sandia.gov> Acked-by: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: have order > 0 compaction start off where it left
Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced. When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.
However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.
This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.
The obvious solution is to have isolate_freepages remember where it left
off last time, and continue at that point the next time it gets invoked
for an order > 0 compaction. This could cause compaction to fail if
cc->free_pfn and cc->migrate_pfn are close together initially, in that
case we restart from the end of the zone and try once more.
Forced full (order == -1) compactions are left alone.
Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Jim Schutt <jaschut@sandia.gov> Tested-by: Jim Schutt <jaschut@sandia.gov> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wanpeng Li [Sat, 21 Jul 2012 00:54:12 +0000 (10:54 +1000)]
memcg: rename mem_control_xxx to memcg_xxx
Replace memory_cgroup_xxx() with memcg_xxx()
Signed-off-by: Wanpeng Li <liwp.linux@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Sat, 21 Jul 2012 00:54:12 +0000 (10:54 +1000)]
memcg: fix bad behavior in use_hierarchy file
I have an application that does the following:
* copy the state of all controllers attached to a hierarchy
* replicate it as a child of the current level.
I would expect writes to the files to mostly succeed, since they are
inheriting sane values from parents.
But that is not the case for use_hierarchy. If it is set to 0, we succeed
ok. If we're set to 1, the value of the file is automatically set to 1 in
the children, but if userspace tries to write the very same 1, it will
fail. That same situation happens if we set use_hierarchy, create a
child, and then try to write 1 again.
Now, there is no reason whatsoever for failing to write a value that is
already there. It doesn't even match the comments, that states:
/* If parent's use_hierarchy is set, we can't make any modifications
* in the child subtrees...
since we are not changing anything.
So test the new value against the one we're storing, and automatically
return 0 if we're not proposing a change.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Dhaval Giani <dhaval.giani@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Sat, 21 Jul 2012 00:54:10 +0000 (10:54 +1000)]
mm: do not use page_count() without a page pin
d179e84ba ("mm: vmscan: do not use page_count without a page pin") fixed
this problem in vmscan.c but same problem is in __count_immobile_pages().
I copy and paste d179e84ba's contents for description.
"It is unsafe to run page_count during the physical pfn scan because
compound_head could trip on a dangling pointer when reading
page->first_page if the compound page is being freed by another CPU."
Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Sat, 21 Jul 2012 00:54:09 +0000 (10:54 +1000)]
mm, oom: replace some information in tasklist dump
The number of ptes and swap entries are used in the oom killer's badness
heuristic, so they should be shown in the tasklist dump.
This patch adds those fields and replaces cpu and oom_adj values that are
currently emitted. Cpu isn't interesting and oom_adj is deprecated and
will be removed later this year, the same information is already displayed
as oom_score_adj which is used internally.
At the same time, make the documentation a little more clear to state this
information is helpful to determine why the oom killer chose the task it
did to kill.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Sat, 21 Jul 2012 00:54:09 +0000 (10:54 +1000)]
mm, oom: fix potential killing of thread that is disabled from oom killing
/proc/sys/vm/oom_kill_allocating_task will immediately kill current when
the oom killer is called to avoid a potentially expensive tasklist scan
for large systems.
Currently, however, it is not checking current's oom_score_adj value which
may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom
killing.
This patch avoids killing current in such a condition and simply falls
back to the tasklist scan since memory still needs to be freed.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator again
commit 2ff754fa8f ("mm: clear pages_scanned only if draining a pcp adds
pages to the buddy allocator again") fixed one free_pcppages_bulk()
misuse. But two another miuse still exist.
This patch fixes it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm, fadvise: don't return -EINVAL when filesystem cannot implement fadvise()
Eric Wong reported his test suite failex when /tmp is tmpfs.
https://lkml.org/lkml/2012/2/24/479
Currentlt the input check of POSIX_FADV_WILLNEED has two problems.
- requires a_ops->readpage. But in fact, force_page_cache_readahead()
requires that the target filesystem has either ->readpage or ->readpages.
- returns -EINVAL when the filesystem doesn't have ->readpage. But
posix says that fadvise is merely a hint. Thus fadvise() should return
0 if filesystem has no means of implementing fadvise(). The userland
application should not know nor care whcih type of filesystem backs the
TMPDIR directory, as Eric pointed out. There is nothing which userspace
can do to solve this error.
So change the return value to 0 when filesytem doesn't support
readahead.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Eric Wong <normalperson@yhbt.net> Tested-by: Eric Wong <normalperson@yhbt.net> Reviewed-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When CONFIG_COMPACTION is enabled, compaction_deferred() tries to
recalculate the deferred limit again, which isn't necessary.
When CONFIG_COMPACTION is disabled, compaction_deferred() should return
"true" or "false" since it has "bool" for its return value.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
memcg: make mem_cgroup_force_empty_list() return bool
mem_cgroup_force_empty_list() just returns 0 or -EBUSY and -EBUSY
indicates 'you need to retry'. Make mem_cgroup_force_empty_list() return
a bool to simplify the logic.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
memcg: clean up force_empty_list() return value check
After bf544fdc241da8 "memcg: move charges to root cgroup if
use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()"
mem_cgroup_move_parent() returns only -EBUSY or -EINVAL. So we can remove
the -ENOMEM and -EINTR checks.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
memcg: remove check for signal_pending() during rmdir()
After bf544fdc241da8 "memcg: move charges to root cgroup if
use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()", no memory reclaim
will occur when removing a memory cgroup. If -EINTR is returned here,
cgroup will show a warning.
We don't need to handle any user interruption signal. Remove this.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Sat, 21 Jul 2012 00:54:06 +0000 (10:54 +1000)]
mm, oom: do not schedule if current has been killed
The oom killer currently schedules away from current in an uninterruptible
sleep if it does not have access to memory reserves. It's possible that
current was killed because it shares memory with the oom killed thread or
because it was killed by the user in the interim, however.
This patch only schedules away from current if it does not have a pending
kill, i.e. if it does not share memory with the oom killed thread. It's
possible that it will immediately retry its memory allocation and fail,
but it will immediately be given access to memory reserves if it calls the
oom killer again.
This prevents the delay of memory freeing when threads that share memory
with the oom killed thread get unnecessarily scheduled.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb/cgroup: remove exclude and wakeup rmdir calls from migrate
We already hold the hugetlb_lock. That should prevent a parallel cgroup
rmdir from touching page's hugetlb cgroup. So remove the exclude and
wakeup calls.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list.
A page's hugetlb cgroup assignment and movement to the active list should
occur with hugetlb_lock held. Otherwise when we remove the hugetlb cgroup
we will iterate the active list and find pages with NULL hugetlb cgroup
values.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When we fail to allocate pages from the reserve pool, hugetlb tries to
allocate huge pages using alloc_buddy_huge_page. Add these to the active
list. We also need to add the huge page we allocate when we soft offline
the oldpage to active list.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during migration
With HugeTLB pages, hugetlb cgroup is uncharged in compound page
destructor. Since we are holding a hugepage reference, we can be sure
that old page won't get uncharged till the last put_page().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wanpeng Li [Sat, 21 Jul 2012 00:54:02 +0000 (10:54 +1000)]
mm/hugetlb_cgroup: Add huge_page_order check to avoid incorrectly uncharge
alloc_huge_page() will call hugetlb_cgroup_charge_cgroup() to charge
pages, the compound page have less than 3 pages will not charge to hugetlb
cgroup. When alloc_huge_page fails it will call
hugetlb_cgroup_uncharge_cgroup to uncharge pages, however,
hugetlb_cgroup_uncharge_cgroup doesn't have huge_page_order check. That
means it will uncharge pages even if the compound page have less than 3
pages. Add huge_page_order check to avoid this incorrectly uncharge.
Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Wanpeng Li <liwp.linux@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cgroup_subsys_state can never be NULL, so don't check for that in
hugetlb_cgroup_from_css. Also current task will always be part of some
cgroup. So hugetlb_cgrop_from_task cannot return NULL.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb/cgroup: add charge/uncharge routines for hugetlb cgroup
Add the charge and uncharge routines for hugetlb cgroup. We do cgroup
charging in page alloc and uncharge in compound page destructor.
Assigning page's hugetlb cgroup is protected by hugetlb_lock.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb/cgroup: add the cgroup pointer to page lru
Add the hugetlb cgroup pointer to 3rd page lru.next. This limit the usage
to hugetlb cgroup to only hugepages with 3 or more normal pages. I guess
that is an acceptable limitation.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cgroup_subsys_state can never be NULL, so don't check for that in
hugetlb_cgroup_from_css. Also current task will always be part of some
cgroup. So hugetlb_cgrop_from_task cannot return NULL.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Implement a new controller that allows us to control HugeTLB allocations.
The extension allows to limit the HugeTLB usage per control group and
enforces the controller limit during page fault. Since HugeTLB doesn't
support page reclaim, enforcing the limit at page fault time implies that,
the application will get SIGBUS signal if it tries to access HugeTLB pages
beyond its limit. This requires the application to know beforehand how
much HugeTLB pages it would require for its use.
The charge/uncharge calls will be added to HugeTLB code in later patch.
Support for cgroup removal will be added in later patches.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb: add a list for tracking in-use HugeTLB pages
hugepage_activelist will be used to track currently used HugeTLB pages.
We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal.
On cgroup removal we update the page's HugeTLB cgroup to point to parent
cgroup.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since we migrate only one hugepage, don't use linked list for passing the
page around. Directly pass the page that need to be migrated as argument.
This also removes the usage of page->lru in the migrate path.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb: avoid taking i_mmap_mutex in unmap_single_vma() for hugetlb
i_mmap_mutex lock was added in unmap_single_vma by 502717f4e ("hugetlb:
fix linked list corruption in unmap_hugepage_range()") but we don't use
page->lru in unmap_hugepage_range any more. Also the lock was taken
higher up in the stack in some code path. That would result in deadlock.
For shared pagetable support for huge pages, since pagetable pages are ref
counted we don't need any lock during huge_pmd_unshare. We do take
i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in mapping.
(39dde65c9940c97f ("shared page table for hugetlb page")).
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the
VM_FAULT_* values from MAX_ERRNO.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patchset implements a cgroup resource controller for HugeTLB pages.
The controller allows to limit the HugeTLB usage per control group and
enforces the controller limit during page fault. Since HugeTLB doesn't
support page reclaim, enforcing the limit at page fault time implies that,
the application will get SIGBUS signal if it tries to access HugeTLB pages
beyond its limit. This requires the application to know beforehand how
much HugeTLB pages it would require for its use.
The goal is to control how many HugeTLB pages a group of task can
allocate. It can be looked at as an extension of the existing quota
interface which limits the number of HugeTLB pages per hugetlbfs
superblock. HPC job scheduler requires jobs to specify their resource
requirements in the job file. Once their requirements can be met, job
schedulers like (SLURM) will schedule the job. We need to make sure that
the jobs won't consume more resources than requested. If they do we
should either error out or kill the application.
This patch:
Rename max_hstate to hugetlb_max_hstate. We will be using this from other
subsystems like hugetlb controller in later patches.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wanpeng Li [Sat, 21 Jul 2012 00:53:57 +0000 (10:53 +1000)]
mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads
Since per-BDI flusher threads were introduced in 2.6, the pdflush
mechanism is not used any more. But the old interface exported through
/proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.
For back-compatibility, printk warning information and return 2 to notify
the users that the interface is removed.
Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, function should_fail() has "bool" for its return value, so it's
reasonable to change the return value of function should_fail_alloc_page()
into "bool" as well.
The patch does cleanup on function should_fail_alloc_page() to have "bool"
for its return value.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now.
But we can also account for total_vm in the vm_stat_account() which makes
the code tidy.
Even for mprotect_fixup(), we can get the right result in the end.
Signed-off-by: Huang Shijie <shijie8@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
documentation: update how page-cluster affects swap I/O
Fix of the documentation of /proc/sys/vm/page-cluster to match the
behavior of the code and add some comments about what the tunable will
change in that behavior.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Swap readahead works fine, but the I/O to disk is almost always done in
page size requests, despite the fact that readahead submits
1<<page-cluster pages at a time.
On older kernels the old per device plugging behavior might have captured
this and merged the requests, but currently all comes down to much more
I/Os than required.
On a single device this might not be an issue, but as soon as a server
runs on shared san resources savin I/Os not only improves swapin
throughput but also provides a lower resource utilization.
With a load running KVM in a lot of memory overcommitment (the hot memory
is 1.5 times the host memory) swapping throughput improves significantly
and the lead feels more responsive as well as achieves more throughput.
In a test setup with 16 swap disks running blocktrace on one of those disks
shows the improved merging:
Prior:
Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
IO unplugs: 149,614 Timer unplugs: 2,940
I got ~10% to ~40% more throughput in my cases and at the same time much
lower cpu consumption when broken down per transferred kilobyte (the
majority of that due to saved interrupts and better cache handling). In a
shared SAN others might get an additional benefit as well, because this
now causes less protocol overhead.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jan Kara [Sat, 21 Jul 2012 00:53:54 +0000 (10:53 +1000)]
mm: make vb_alloc() more foolproof
If someone calls vb_alloc() (or vm_map_ram() for that matter) to allocate
0 bytes (0 pages), get_order() returns BITS_PER_LONG - PAGE_CACHE_SHIFT
and interesting stuff happens. So make debugging such problems easier and
warn about 0-size allocation.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
vmalloc: walk vmap_areas by sorted list instead of rb_next()
There's a walk by repeating rb_next to find a suitable hole. Could be
simply replaced by walk on the sorted vmap_area_list. More simpler and
efficient.
Mutation of the list and tree only happens in pair within
__insert_vmap_area and __free_vmap_area, under protection of
vmap_area_lock. The patch code is also under vmap_area_lock, so the list
walk is safe, and consistent with the tree walk.
Tested on SMP by repeating batch of vmalloc anf vfree for random sizes and
rounds for hours.
Signed-off-by: Hong Zhiguo <honkiko@gmail.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
slab: do not call compound_head() in page_get_cache()
page_get_cache() does not need to call compound_head(), as its unique
caller virt_to_slab() already makes sure to return a head page.
Additionally, removing the compound_head() call makes page_get_cache()
consistent with page_get_slab().
Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
While allocating pages using buddy allocator, the compound page is
probably split up to free pages. Under these circumstances, the compound
page should be destroyed by destroy_compound_page(). However, there is a
duplicate check to judge if the page is compound.
Remove the duplicate check since the compound_order() returns 0 when the
page doesn't have PG_head set in destroy_compound_page(). That is to say,
destroy_compound_page() needn't check PageHead().
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kautuk Consul [Sat, 21 Jul 2012 00:53:53 +0000 (10:53 +1000)]
xtensa/mm/fault.c: port OOM changes to do_page_fault
d065bd810b6de ("mm: retry page fault when blocking on disk transfer") and 37b23e0525d393d ("x86,mm: make pagefault killable")
The above commits introduced changes into the x86 pagefault handler for
making the page fault handler retryable as well as killable.
These changes reduce the mmap_sem hold time, which is crucial during OOM
killer invocation.
Port these changes to xtensa.
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Acked-by: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When the suid_dumpable sysctl is set to "2", and there is no core dump
pipe defined in the core_pattern sysctl, a local user can cause core files
to be written to root-writable directories, potentially with
user-controlled content. This means an admin can unknowningly reintroduce
a variation of CVE-2006-2451, allowing local users to gain root
privileges.
While cron has been fixed to abort reading a file when there is any parse
error, there are still other sensitive directories that will read any file
present and skip unparsable lines.
Instead of introducing a suid_dumpable=3 mode and breaking all users of
mode 2, this only disables the unsafe portion of mode 2 (writing to disk
via relative path). Most users of mode 2 (e.g. Chrome OS) already use a
core dump pipe handler, so this change will not break them. For the
situations where a pipe handler is not defined but mode 2 is still active,
crash dumps will only be written to fully qualified paths. If a relative
path is defined (e.g. the default "core" pattern), dump attempts will
trigger a printk yelling about the lack of a fully qualified path.
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alan Cox <alan@linux.intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: James Morris <james.l.morris@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
fs: push rcu_barrier() from deactivate_locked_super() to filesystems
There's no reason to call rcu_barrier() on every
deactivate_locked_super(). We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.
Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths. E.g. on my machine exit_group() of a last process in IPC
namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When a file is truncated with truncate()/ftruncate() and then closed,
iversion is not updated. This patch uses ATTR_SIZE flag as an indication
to increment iversion.
Mimi said:
On fput(), i_version is used to detect and flag files that have changed
and need to be re-measured in the IMA measurement policy. When a file
is truncated with truncate()/ftruncate() and then closed, i_version is
not updated. As a result, although the file has changed, it will not be
re-measured and added to the IMA measurement list on subsequent access.
Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@intel.com> Acked-by: Mimi Zohar <zohar@us.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Martin Michlmayr [Sat, 21 Jul 2012 00:53:51 +0000 (10:53 +1000)]
drivers/scsi/atp870u.c: fix bad use of udelay
The ACARD driver calls udelay() with a value > 2000, which leads to
to the following compilation error on ARM:
ERROR: "__bad_udelay" [drivers/scsi/atp870u.ko] undefined!
make[1]: *** [__modpost] Error 1
This is because udelay is defined on ARM, roughly speaking, as
The argument to __const_udelay is the number of jiffies to wait divided by
4, but this does not work unless the multiplication does not overflow, and
that is what the build error is designed to prevent. The intended
behavior can be achieved by using mdelay to call udelay multiple times in
a loop.
[jn: adding context] Signed-off-by: Martin Michlmayr <tbm@cyrius.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Sat, 21 Jul 2012 00:53:50 +0000 (10:53 +1000)]
ocfs2: use bitmap_weight()
Use bitmap_weight() instead of reinventing the wheel.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Sat, 21 Jul 2012 00:53:50 +0000 (10:53 +1000)]
ocfs2: use find_last_bit()
We already have find_last_bit(). So just use it as described in the
comment.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ARM: exynos: add thermal sensor driver platform data support
Add necessary default platform data support needed for TMU driver. This
dt/non-dt values are tested for origen exynos4210 and smdk exynos5250
platforms.
Signed-off-by: Amit Daniel Kachhap <amit.kachhap@linaro.org> Cc: Donggeun Kim <dg77.kim@samsung.com> Acked-by: Guenter Roeck <guenter.roeck@ericsson.com> Cc: SangWook Ju <sw.ju@samsung.com> Cc: Durgadoss <durgadoss.r@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Jean Delvare <khali@linux-fr.org> Cc: Kyungmin Park <kmpark@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
thermal: exynos: register the tmu sensor with the kernel thermal layer
This code added creates a link between temperature sensors, linux thermal
framework and cooling devices for samsung exynos platform. This layer
monitors the temperature from the sensor and informs the generic thermal
layer to take the necessary cooling action.
[akpm@linux-foundation.org: fix comment layout] Signed-off-by: Amit Daniel Kachhap <amit.kachhap@linaro.org> Cc: Donggeun Kim <dg77.kim@samsung.com> Acked-by: Guenter Roeck <guenter.roeck@ericsson.com> Cc: SangWook Ju <sw.ju@samsung.com> Cc: Durgadoss <durgadoss.r@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Jean Delvare <khali@linux-fr.org> Cc: Kyungmin Park <kmpark@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
thermal: exynos5: add exynos5 thermal sensor driver support
Insert exynos5 TMU sensor changes into the thermal driver. Some exynos4
changes are made generic for exynos series.
[akpm@linux-foundation.org: fix comment layout] Signed-off-by: SangWook Ju <sw.ju@samsung.com> Signed-off-by: Amit Daniel Kachhap <amit.kachhap@linaro.org> Cc: Donggeun Kim <dg77.kim@samsung.com> Acked-by: Guenter Roeck <guenter.roeck@ericsson.com> Cc: Durgadoss <durgadoss.r@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Jean Delvare <khali@linux-fr.org> Cc: Kyungmin Park <kmpark@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hwmon: exynos4: move thermal sensor driver to driver/thermal directory
This movement is needed because the hwmon entries and corresponding sysfs
interface is a duplicate of utilities already provided by
driver/thermal/thermal_sys.c. The goal is to place it in thermal folder
and add necessary functions to use the in-kernel thermal interfaces.
Signed-off-by: Amit Daniel Kachhap <amit.kachhap@linaro.org> Signed-off-by: Donggeun Kim <dg77.kim@samsung.com> Acked-by: Guenter Roeck <guenter.roeck@ericsson.com> Cc: SangWook Ju <sw.ju@samsung.com> Cc: Durgadoss <durgadoss.r@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Jean Delvare <khali@linux-fr.org> Cc: Kyungmin Park <kmpark@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patchset introduces a new generic cooling device based on cpufreq
that can be used on non-ACPI platforms. As a proof of concept, we have
drivers for the following platforms using this mechanism now:
* Samsung Exynos (Exynos4 and Exynos5) in the current patchset.
* TI OMAP (git://git.linaro.org/people/amitdanielk/linux.git omap4460_thermal)
* Freescale i.MX (git://git.linaro.org/people/amitdanielk/linux.git imx6q_thermal)
There is a small change in cpufreq cooling registration APIs, so a minor
change is needed for OMAP and Freescale platforms.
Brief Description:
1) The generic cooling devices code is placed inside driver/thermal/*
as placing inside acpi folder will need un-necessary enabling of acpi
code. This codes is architecture independent.
2) This patchset adds generic cpu cooling low level implementation
through frequency clipping. In future, other cpu related cooling
devices may be added here. An ACPI version of this already exists
(drivers/acpi/processor_thermal.c) . But this will be useful for
platforms like ARM using the generic thermal interface along with the
generic cpu cooling devices. The cooling device registration API's
return cooling device pointers which can be easily binded with the
thermal zone trip points. The important APIs exposed are,
a) struct thermal_cooling_device *cpufreq_cooling_register(
struct freq_clip_table *tab_ptr, unsigned int tab_size)
b) void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
3) Samsung exynos platform thermal implementation is done using the
generic cpu cooling APIs and the new trip type. The temperature sensor
driver present in the hwmon folder(registered as hwmon driver) is moved
to thermal folder and registered as a thermal driver.
A simple data/control flow diagrams is shown below,
Core Linux thermal <-----> Exynos thermal interface <----- Temperature Sensor
| |
\|/ |
Cpufreq cooling device <---------------
TODO:
*Will send the DT enablement patches later after the driver is merged.
This patch:
Add support for generic cpu thermal cooling low level implementations
using frequency scaling up/down based on the registration parameters.
Different cpu related cooling devices can be registered by the user and
the binding of these cooling devices to the corresponding trip points can
be easily done as the registration APIs return the cooling device pointer.
The user of these APIs are responsible for passing clipping frequency .
The drivers can also register to recieve notification about any cooling
action called.
[akpm@linux-foundation.org: fix comment layout] Signed-off-by: Amit Daniel Kachhap <amit.kachhap@linaro.org> Cc: Donggeun Kim <dg77.kim@samsung.com> Cc: Guenter Roeck <guenter.roeck@ericsson.com> Cc: SangWook Ju <sw.ju@samsung.com> Cc: Durgadoss <durgadoss.r@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Jean Delvare <khali@linux-fr.org> Cc: Kyungmin Park <kmpark@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch add basic Renesas R-Car thermal sensor support.
It was tested on R-Car H1 Marzen board.
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Cc: Len Brown <len.brown@intel.com> Cc: Joe Perches <joe@perches.com> Cc: Jean Delvare <khali@linux-fr.org> Cc: Guenter Roeck <guenter.roeck@ericsson.com> Cc: Magnus Damm <magnus.damm@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
temp_crit.name and temp_input.name have a length of 16 bytes. Using
THERMAL_NAME_LENGTH (20) as length parameter for snprintf() may result in
out-of-bounds memory accesses. Replace it with sizeof().
Addresses Coverity #115679
Signed-off-by: Guenter Roeck <linux@roeck-us.net> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
How is the compiler even handling exported functions that are marked
inline? Anyway, these shouldn't be inline because of that, so remove that
marking.
Based on a larger patch by Mark Charlebois to get LLVM to build the
kernel.
Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mark Charlebois <mcharleb@qualcomm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: hank <pyu@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>