]> git.karo-electronics.de Git - karo-tx-linux.git/log
karo-tx-linux.git
15 years agomm: add zone nr_pages helper function
KOSAKI Motohiro [Thu, 8 Jan 2009 02:08:16 +0000 (18:08 -0800)]
mm: add zone nr_pages helper function

Add zone_nr_pages() helper function.

It is used by a later patch.  This patch doesn't have any functional
change.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomm: introduce zone_reclaim struct
KOSAKI Motohiro [Thu, 8 Jan 2009 02:08:15 +0000 (18:08 -0800)]
mm: introduce zone_reclaim struct

Add zone_reclam_stat struct for later enhancement.

A later patch uses this.  This patch doesn't any behavior change (yet).

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoinactive_anon_is_low: move to vmscan
KOSAKI Motohiro [Thu, 8 Jan 2009 02:08:14 +0000 (18:08 -0800)]
inactive_anon_is_low: move to vmscan

The inactive_anon_is_low() is called only vmscan.  Then it can move to
vmscan.c

This patch doesn't have any functional change.

Reviewd-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: hierarchy avoid unnecessary reclaim
Daisuke Nishimura [Thu, 8 Jan 2009 02:08:13 +0000 (18:08 -0800)]
memcg: hierarchy avoid unnecessary reclaim

If hierarchy is not used, no tree-walk is necessary.

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: swapout refcnt fix
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:08:13 +0000 (18:08 -0800)]
memcg: swapout refcnt fix

css's refcnt is dropped before end of following access.
Hold it until end of access.

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: memory swap controller: fix limit check
Daisuke Nishimura [Thu, 8 Jan 2009 02:08:12 +0000 (18:08 -0800)]
memcg: memory swap controller: fix limit check

There are scatterd calls of res_counter_check_under_limit(), and most of
them don't take mem+swap accounting into account.

define mem_cgroup_check_under_limit() and avoid direct use of
res_counter_check_limit().

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: check group leader fix
Nikanth Karthikesan [Thu, 8 Jan 2009 02:08:11 +0000 (18:08 -0800)]
memcg: check group leader fix

Remove unnecessary codes (...fragments of not-implemented
functionalilty...)

Reported-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: revert gfp mask fix
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:08:10 +0000 (18:08 -0800)]
memcg: revert gfp mask fix

My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch changed gfp_mask
of callers of charge to be GFP_HIGHUSER_MOVABLE for showing what will
happen at memory reclaim.

But in recent discussion, it's NACKed because it sounds ugly.

This patch is for reverting it and add some clean up to gfp_mask of
callers of charge.  No behavior change but need review before generating
HUNK in deep queue.

This patch also adds explanation to meaning of gfp_mask passed to charge
functions in memcontrol.h.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: fix reclaim result checks
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:08:09 +0000 (18:08 -0800)]
memcg: fix reclaim result checks

check_under_limit logic was wrong and this check should be against
mem_over_limit rather than mem.

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Jan Blunck <jblunck@suse.de>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: avoid unnecessary system-wide-oom-killer
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:08:08 +0000 (18:08 -0800)]
memcg: avoid unnecessary system-wide-oom-killer

Current mmtom has new oom function as pagefault_out_of_memory().  It's
added for select bad process rathar than killing current.

When memcg hit limit and calls OOM at page_fault, this handler called and
system-wide-oom handling happens.  (means kernel panics if panic_on_oom is
true....)

To avoid overkill, check memcg's recent behavior before starting
system-wide-oom.

And this patch also fixes to guarantee "don't accnout against process with
TIF_MEMDIE".  This is necessary for smooth OOM.

[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Jan Blunck <jblunck@suse.de>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcontrol: rcu_read_lock() to protect mm_match_cgroup()
Lai Jiangshan [Thu, 8 Jan 2009 02:08:07 +0000 (18:08 -0800)]
memcontrol: rcu_read_lock() to protect mm_match_cgroup()

mm_match_cgroup() calls cgroup_subsys_state().

We must use rcu_read_lock() to protect cgroup_subsys_state().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: memory cgroup hierarchy feature selector
Balbir Singh [Thu, 8 Jan 2009 02:08:07 +0000 (18:08 -0800)]
memcg: memory cgroup hierarchy feature selector

Don't enable multiple hierarchy support by default.  This patch introduces
a features element that can be set to enable the nested depth hierarchy
feature.  This feature can only be enabled when the cgroup for which the
feature this is enabled, has no children.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: memory cgroup hierarchical reclaim
Balbir Singh [Thu, 8 Jan 2009 02:08:06 +0000 (18:08 -0800)]
memcg: memory cgroup hierarchical reclaim

This patch introduces hierarchical reclaim.  When an ancestor goes over
its limit, the charging routine points to the parent that is above its
limit.  The reclaim process then starts from the last scanned child of the
ancestor and reclaims until the ancestor goes below its limit.

[akpm@linux-foundation.org: coding-style fixes]
[d-nishimura@mtf.biglobe.ne.jp: mem_cgroup_from_res_counter should handle both mem->res and mem->memsw]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: memory cgroup resource counters for hierarchy
Balbir Singh [Thu, 8 Jan 2009 02:08:05 +0000 (18:08 -0800)]
memcg: memory cgroup resource counters for hierarchy

Add support for building hierarchies in resource counters.  Cgroups allows
us to build a deep hierarchy, but we currently don't link the resource
counters belonging to the memory controller control groups, in the same
fashion as the corresponding cgroup entries in the cgroup hierarchy.  This
patch provides the infrastructure for resource counters that have the same
hiearchy as their cgroup counter parts.

These set of patches are based on the resource counter hiearchy patches
posted by Pavel Emelianov.

NOTE: Building hiearchies is expensive, deeper hierarchies imply charging
the all the way up to the root.  It is known that hiearchies are
expensive, so the user needs to be careful and aware of the trade-offs
before creating very deep ones.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: memory cgroup hierarchy documentation
Balbir Singh [Thu, 8 Jan 2009 02:08:03 +0000 (18:08 -0800)]
memcg: memory cgroup hierarchy documentation

Documentation updates for hierarchy support

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: add mem_cgroup_disabled()
Hirokazu Takahashi [Thu, 8 Jan 2009 02:08:02 +0000 (18:08 -0800)]
memcg: add mem_cgroup_disabled()

We check mem_cgroup is disabled or not by checking
mem_cgroup_subsys.disabled.  I think it has more references than expected,
now.

replacing
   if (mem_cgroup_subsys.disabled)
with
   if (mem_cgroup_disabled())

give us good look, I think.

[kamezawa.hiroyu@jp.fujitsu.com: fix typo]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: synchronized LRU
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:08:01 +0000 (18:08 -0800)]
memcg: synchronized LRU

A big patch for changing memcg's LRU semantics.

Now,
  - page_cgroup is linked to mem_cgroup's its own LRU (per zone).

  - LRU of page_cgroup is not synchronous with global LRU.

  - page and page_cgroup is one-to-one and statically allocated.

  - To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as
    - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

  - SwapCache is handled.

And, when we handle LRU list of page_cgroup, we do following.

pc = lookup_page_cgroup(page);
lock_page_cgroup(pc); .....................(1)
mz = page_cgroup_zoneinfo(pc);
spin_lock(&mz->lru_lock);
.....add to LRU
spin_unlock(&mz->lru_lock);
unlock_page_cgroup(pc);

But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock.
So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct.

This is a trial to remove this dirty nesting of locks.
This patch changes mz->lru_lock to be zone->lru_lock.
Then, above sequence will be written as

        spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
mem_cgroup_add/remove/etc_lru() {
pc = lookup_page_cgroup(page);
mz = page_cgroup_zoneinfo(pc);
if (PageCgroupUsed(pc)) {
....add to LRU
}
        spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

This is much simpler.
(*) We're safe even if we don't take lock_page_cgroup(pc). Because..
    1. When pc->mem_cgroup can be modified.
       - at charge.
       - at account_move().
    2. at charge
       the PCG_USED bit is not set before pc->mem_cgroup is fixed.
    3. at account_move()
       the page is isolated and not on LRU.

Pros.
  - easy for maintenance.
  - memcg can make use of laziness of pagevec.
  - we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup.
  - LRU status of memcg will be synchronized with global LRU's one.
  - # of locks are reduced.
  - account_move() is simplified very much.
Cons.
  - may increase cost of LRU rotation.
    (no impact if memcg is not configured.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: mem+swap controller core
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:08:00 +0000 (18:08 -0800)]
memcg: mem+swap controller core

This patch implements per cgroup limit for usage of memory+swap.  However
there are SwapCache, double counting of swap-cache and swap-entry is
avoided.

Mem+Swap controller works as following.
  - memory usage is limited by memory.limit_in_bytes.
  - memory + swap usage is limited by memory.memsw_limit_in_bytes.

This has following benefits.
  - A user can limit total resource usage of mem+swap.

    Without this, because memory resource controller doesn't take care of
    usage of swap, a process can exhaust all the swap (by memory leak.)
    We can avoid this case.

    And Swap is shared resource but it cannot be reclaimed (goes back to memory)
    until it's used. This characteristic can be trouble when the memory
    is divided into some parts by cpuset or memcg.
    Assume group A and group B.
    After some application executes, the system can be..

    Group A -- very large free memory space but occupy 99% of swap.
    Group B -- under memory shortage but cannot use swap...it's nearly full.

    Ability to set appropriate swap limit for each group is required.

Maybe someone wonder "why not swap but mem+swap ?"

  - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
    to move account from memory to swap...there is no change in usage of
    mem+swap.

    In other words, when we want to limit the usage of swap without affecting
    global LRU, mem+swap limit is better than just limiting swap.

Accounting target information is stored in swap_cgroup which is
per swap entry record.

Charge is done as following.
  map
    - charge  page and memsw.

  unmap
    - uncharge page/memsw if not SwapCache.

  swap-out (__delete_from_swap_cache)
    - uncharge page
    - record mem_cgroup information to swap_cgroup.

  swap-in (do_swap_page)
    - charged as page and memsw.
      record in swap_cgroup is cleared.
      memsw accounting is decremented.

  swap-free (swap_free())
    - if swap entry is freed, memsw is uncharged by PAGE_SIZE.

There are people work under never-swap environments and consider swap as
something bad. For such people, this mem+swap controller extension is just an
overhead.  This overhead is avoided by config or boot option.
(see Kconfig. detail is not in this patch.)

TODO:
 - maybe more optimization can be don in swap-in path. (but not very safe.)
   But we just do simple accounting at this stage.

[nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
[hugh@veritas.com: memswap controller core swapcache fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: swap cgroup for remembering usage
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:07:58 +0000 (18:07 -0800)]
memcg: swap cgroup for remembering usage

For accounting swap, we need a record per swap entry, at least.

This patch adds following function.
  - swap_cgroup_swapon() .... called from swapon
  - swap_cgroup_swapoff() ... called at the end of swapoff.

  - swap_cgroup_record() .... record information of swap entry.
  - swap_cgroup_lookup() .... lookup information of swap entry.

This patch just implements "how to record information".  No actual method
for limit the usage of swap.  These routine uses flat table to record and
lookup.  "wise" lookup system like radix-tree requires requires memory
allocation at new records but swap-out is usually called under memory
shortage (or memcg hits limit.) So, I used static allocation.  (maybe
dynamic allocation is not very hard but it adds additional memory
allocation in memory shortage path.)

Note1: In this, we use pointer to record information and this means
      8bytes per swap entry. I think we can reduce this when we
      create "id of cgroup" in the range of 0-65535 or 0-255.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Hugh Dickins <hugh@veritas.com>
Reported-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: mem+swap controller Kconfig
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:07:57 +0000 (18:07 -0800)]
memcg: mem+swap controller Kconfig

Config and control variable for mem+swap controller.

This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
(memory resource controller swap extension.)

For accounting swap, it's obvious that we have to use additional memory to
remember "who uses swap".  This adds more overhead.  So, it's better to
offer "choice" to users.  This patch adds 2 choices.

This patch adds 2 parameters to enable swap extension or not.
  - CONFIG
  - boot option

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: handle swap caches
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:07:56 +0000 (18:07 -0800)]
memcg: handle swap caches

SwapCache support for memory resource controller (memcg)

Before mem+swap controller, memcg itself should handle SwapCache in proper
way.  This is cut-out from it.

In current memcg, SwapCache is just leaked and the user can create tons of
SwapCache.  This is a leak of account and should be handled.

SwapCache accounting is done as following.

  charge (anon)
- charged when it's mapped.
  (because of readahead, charge at add_to_swap_cache() is not sane)
  uncharge (anon)
- uncharged when it's dropped from swapcache and fully unmapped.
  means it's not uncharged at unmap.
  Note: delete from swap cache at swap-in is done after rmap information
        is established.
  charge (shmem)
- charged at swap-in. this prevents charge at add_to_page_cache().

  uncharge (shmem)
- uncharged when it's dropped from swapcache and not on shmem's
  radix-tree.

  at migration, check against 'old page' is modified to handle shmem.

Comparing to the old version discussed (and caused troubles), we have
advantages of
  - PCG_USED bit.
  - simple migrating handling.

So, situation is much easier than several months ago, maybe.

[hugh@veritas.com: memcg: handle swap caches build fix]
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: new force_empty to free pages under group
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:07:55 +0000 (18:07 -0800)]
memcg: new force_empty to free pages under group

By memcg-move-all-accounts-to-parent-at-rmdir.patch, there is no leak of
memory usage and force_empty is removed.

This patch adds "force_empty" again, in reasonable manner.

memory.force_empty file works when

  #echo 0 (or some) > memory.force_empty
  and have following function.

  1. only works when there are no task in this cgroup.
  2. free all page under this cgroup as much as possible.
  3. page which cannot be freed will be moved up to parent.
  4. Then, memcg will be empty after above echo returns.

This is much better behavior than old "force_empty" which just forget
all accounts. This patch also check signal_pending() and above "echo"
can be stopped by "Ctrl-C".

[akpm@linux-foundation.org: cleanup]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: reduce size of mem_cgroup by using nr_cpu_ids
Jan Blunck [Thu, 8 Jan 2009 02:07:53 +0000 (18:07 -0800)]
memcg: reduce size of mem_cgroup by using nr_cpu_ids

As Jan Blunck <jblunck@suse.de> pointed out, allocating per-cpu stat for
memcg to the size of NR_CPUS is not good.

This patch changes mem_cgroup's cpustat allocation not based on NR_CPUS
but based on nr_cpu_ids.

Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: move all acccounting to parent at rmdir()
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:07:53 +0000 (18:07 -0800)]
memcg: move all acccounting to parent at rmdir()

This patch provides a function to move account information of a page
between mem_cgroups and rewrite force_empty to make use of this.

This moving of page_cgroup is done under
 - lru_lock of source/destination mem_cgroup is held.
 - lock_page_cgroup() is held.

Then, a routine which touches pc->mem_cgroup without lock_page_cgroup()
should confirm pc->mem_cgroup is still valid or not.  Typical code can be
following.

(while page is not under lock_page())
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc)
spin_lock_irqsave(&mz->lru_lock);
if (pc->mem_cgroup == mem)
...../* some list handling */
spin_unlock_irqrestore(&mz->lru_lock);

Of course, better way is
lock_page_cgroup(pc);
....
unlock_page_cgroup(pc);

But you should confirm the nest of lock and avoid deadlock.

If you treats page_cgroup from mem_cgroup's LRU under mz->lru_lock,
you don't have to worry about what pc->mem_cgroup points to.
moved pages are added to head of lru, not to tail.

Expected users of this routine is:
  - force_empty (rmdir)
  - moving tasks between cgroup (for moving account information.)
  - hierarchy (maybe useful.)

force_empty(rmdir) uses this move_account and move pages to its parent.
This "move" will not cause OOM (I added "oom" parameter to try_charge().)

If the parent is busy (not enough memory), force_empty calls try_to_free_page()
and reduce usage.

Purpose of this behavior is
  - Fix "forget all" behavior of force_empty and avoid leak of accounting.
  - By "moving first, free if necessary", keep pages on memory as much as
    possible.

Adding a switch to change behavior of force_empty to
  - free first, move if necessary
  - free all, if there is mlocked/busy pages, return -EBUSY.
is under consideration. (I'll add if someone requtests.)

This patch also removes memory.force_empty file, a brutal debug-only interface.

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: do not recalculate section unnecessarily in init_section_page_cgroup
Fernando Luis Vazquez Cao [Thu, 8 Jan 2009 02:07:51 +0000 (18:07 -0800)]
memcg: do not recalculate section unnecessarily in init_section_page_cgroup

In init_section_page_cgroup() the section a given pfn belongs to is
calculated at the top of the function and, despite the fact that the
pfn/section correspondence does not change, it is recalculated further
down the same function.  By computing this just once and reusing that
value we save some bytes in the object file and do not waste CPU cycles.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: simple migration handling
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:07:50 +0000 (18:07 -0800)]
memcg: simple migration handling

Now, management of "charge" under page migration is done under following
manner. (Assume migrate page contents from oldpage to newpage)

 before
  - "newpage" is charged before migration.
 at success.
  - "oldpage" is uncharged at somewhere(unmap, radix-tree-replace)
 at failure
  - "newpage" is uncharged.
  - "oldpage" is charged if necessary (*1)

But (*1) is not reliable....because of GFP_ATOMIC.

This patch tries to change behavior as following by charge/commit/cancel ops.

 before
  - charge PAGE_SIZE (no target page)
 success
  - commit charge against "newpage".
 failure
  - commit charge against "oldpage".
    (PCG_USED bit works effectively to avoid double-counting)
  - if "oldpage" is obsolete, cancel charge of PAGE_SIZE.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: fix gfp_mask of callers of charge
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:07:49 +0000 (18:07 -0800)]
memcg: fix gfp_mask of callers of charge

Fix misuse of gfp_kernel.

Now, most of callers of mem_cgroup_charge_xxx functions uses GFP_KERNEL.

I think that this is from the fact that page_cgroup *was* dynamically
allocated.

But now, we allocate all page_cgroup at boot.  And
mem_cgroup_try_to_free_pages() reclaim memory from GFP_HIGHUSER_MOVABLE +
specified GFP_RECLAIM_MASK.

  * This is because we just want to reduce memory usage.
    "Where we should reclaim from ?" is not a problem in memcg.

This patch modifies gfp masks to be GFP_HIGUSER_MOVABLE if possible.

Note: This patch is not for fixing behavior but for showing sane information
      in source code.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: introduce charge-commit-cancel style of functions
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:07:48 +0000 (18:07 -0800)]
memcg: introduce charge-commit-cancel style of functions

There is a small race in do_swap_page().  When the page swapped-in is
charged, the mapcount can be greater than 0.  But, at the same time some
process (shares it ) call unmap and make mapcount 1->0 and the page is
uncharged.

      CPUA  CPUB
       mapcount == 1.
   (1) charge if mapcount==0     zap_pte_range()
                                (2) mapcount 1 => 0.
        (3) uncharge(). (success)
   (4) set page's rmap()
       mapcount 0=>1

Then, this swap page's account is leaked.

For fixing this, I added a new interface.
  - charge
   account to res_counter by PAGE_SIZE and try to free pages if necessary.
  - commit
   register page_cgroup and add to LRU if necessary.
  - cancel
   uncharge PAGE_SIZE because of do_swap_page failure.

     CPUA
  (1) charge (always)
  (2) set page's rmap (mapcount > 0)
  (3) commit charge was necessary or not after set_pte().

This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
Usual mem_cgroup_charge_common() does charge -> commit at a time.

And this patch also adds following function to clarify all charges.

  - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
called against newly allocated anon pages.

  - mem_cgroup_charge_migrate_fixup()
        called only from remove_migration_ptes().
we'll have to rewrite this later.(this patch just keeps old behavior)
This function will be removed by additional patch to make migration
clearer.

Good for clarifying "what we do"

Then, we have 4 following charge points.
  - newpage
  - swap-in
  - add-to-cache.
  - migration.

[akpm@linux-foundation.org: add missing inline directives to stubs]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodevices cgroup: allow mkfifo
Serge E. Hallyn [Thu, 8 Jan 2009 02:07:46 +0000 (18:07 -0800)]
devices cgroup: allow mkfifo

The devcgroup_inode_permission() hook in the devices whitelist cgroup has
always bypassed access checks on fifos.  But the mknod hook did not.  The
devices whitelist is only about block and char devices, and fifos can't
even be added to the whitelist, so fifos can't be created at all except by
tasks which have 'a' in their whitelist (meaning they have access to all
devices).

Fix the behavior by bypassing access checks to mkfifo.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: James Morris <jmorris@namei.org>
Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Cc: <stable@kernel.org> [2.6.27.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodevcgroup: use list_for_each_entry_rcu()
Lai Jiangshan [Thu, 8 Jan 2009 02:07:45 +0000 (18:07 -0800)]
devcgroup: use list_for_each_entry_rcu()

We should use list_for_each_entry_rcu in RCU read site.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: make cgroup_path() RCU-safe
Paul Menage [Thu, 8 Jan 2009 02:07:44 +0000 (18:07 -0800)]
cgroups: make cgroup_path() RCU-safe

Fix races between /proc/sched_debug by freeing cgroup objects via an RCU
callback.  Thus any cgroup reference obtained from an RCU-safe source will
remain valid during the RCU section.  Since dentries are also RCU-safe,
this allows us to traverse up the tree safely.

Additionally, make cgroup_path() check for a NULL cgrp->dentry to avoid
trying to report a path for a partially-created cgroup.

[lizf@cn.fujitsu.com: call deactive_super() in cgroup_diput()]
Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: skip processes from other namespaces when listing a cgroup
Gowrishankar M [Thu, 8 Jan 2009 02:07:43 +0000 (18:07 -0800)]
cgroups: skip processes from other namespaces when listing a cgroup

Once tasks are populated from system namespace inside cgroup, container
replaces other namespace task with 0 while listing tasks, inside
container.

Though this is expected behaviour from container end, there is no use of
showing unwanted 0s.

In this patch, we check if a process is in same namespace before loading
into pid array.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Gowrishankar M <gowrishankar.m@in.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: introduce link_css_set() to remove duplicate code
Li Zefan [Thu, 8 Jan 2009 02:07:42 +0000 (18:07 -0800)]
cgroups: introduce link_css_set() to remove duplicate code

Add a common function link_css_set() to link a css_set to a cgroup.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: add inactive subsystems to rootnode.subsys_list
Li Zefan [Thu, 8 Jan 2009 02:07:42 +0000 (18:07 -0800)]
cgroups: add inactive subsystems to rootnode.subsys_list

Though for an inactive hierarchy, we have subsys->root == &rootnode, but
rootnode's subsys_list is always empty.

This conflicts with the code in find_css_set():

for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
...
if (ss->root->subsys_list.next == &ss->sibling) {
...
}
}
if (list_empty(&rootnode.subsys_list)) {
...
}

The above code assumes rootnode.subsys_list links all inactive
hierarchies.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: make root_list contains active hierarchies only
Li Zefan [Thu, 8 Jan 2009 02:07:41 +0000 (18:07 -0800)]
cgroups: make root_list contains active hierarchies only

Don't link rootnode to the root list, so root_list contains active
hierarchies only as the comment indicates.  And rename for_each_root() to
for_each_active_root().

Also remove redundant check in cgroup_kill_sb().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: remove rcu_read_lock() in cgroupstats_build()
Lai Jiangshan [Thu, 8 Jan 2009 02:07:40 +0000 (18:07 -0800)]
cgroups: remove rcu_read_lock() in cgroupstats_build()

cgroup_iter_* do not need rcu_read_lock().

In cgroup_enable_task_cg_lists(), do_each_thread() and while_each_thread()
are protected by RCU, it's OK, for write_lock(&css_set_lock) implies
rcu_read_lock() in non-RT kernel.

If we need explicit rcu_read_lock(), we should add rcu_read_lock() in
cgroup_enable_task_cg_lists(), not cgroup_iter_*.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: call find_css_set() safely in cgroup_attach_task()
Lai Jiangshan [Thu, 8 Jan 2009 02:07:39 +0000 (18:07 -0800)]
cgroups: call find_css_set() safely in cgroup_attach_task()

In cgroup_attach_task(), tsk maybe exit when we call find_css_set().  and
find_css_set() will access to invalid css_set.

This patch increases the count before get_css_set(), and decreases it
after find_css_set().

NOTE:

css_set's refcount is also taskcount, after this patch applied, taskcount
may be off-by-one WHEN cgroup_lock() is not held.  but I reviewed other
code which use taskcount, they are still correct.  No regression found by
reviewing and simply testing.

So I do not use two counters in css_set.  (one counter for taskcount, the
other for refcount.  like struct mm_struct) If this fix cause regression,
we will use two counters in css_set.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: use task_lock() for access tsk->cgroups safe in cgroup_clone()
Lai Jiangshan [Thu, 8 Jan 2009 02:07:38 +0000 (18:07 -0800)]
cgroups: use task_lock() for access tsk->cgroups safe in cgroup_clone()

Use task_lock() protect tsk->cgroups and get_css_set(tsk->cgroups).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: don't put struct cgroupfs_root protected by RCU
Lai Jiangshan [Thu, 8 Jan 2009 02:07:37 +0000 (18:07 -0800)]
cgroups: don't put struct cgroupfs_root protected by RCU

We don't access struct cgroupfs_root in fast path, so we should not put
struct cgroupfs_root protected by RCU

But the comment in struct cgroup_subsys.root confuse us.

struct cgroup_subsys.root is used in these places:

1 find_css_set(): if (ss->root->subsys_list.next == &ss->sibling)
2 rebind_subsystems(): if (ss->root != &rootnode)
                       rcu_assign_pointer(ss->root, root);
                       rcu_assign_pointer(subsys[i]->root, &rootnode);
3 cgroup_has_css_refs(): if (ss->root != cgrp->root)
4 cgroup_init_subsys(): ss->root = &rootnode;
5 proc_cgroupstats_show(): ss->name, ss->root->subsys_bits,
                           ss->root->number_of_cgroups, !ss->disabled);
6 cgroup_clone(): root = subsys->root;
                  if ((root != subsys->root) ||

All these place we have held cgroup_lock() or we don't dereference to
struct cgroupfs_root.  It's means wo don't need RCU when use struct
cgroup_subsys.root, and we should not put struct cgroupfs_root protected
by RCU.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: fix cgroup_iter_next() bug
Lai Jiangshan [Thu, 8 Jan 2009 02:07:36 +0000 (18:07 -0800)]
cgroups: fix cgroup_iter_next() bug

We access res->cgroups without the task_lock(), so res->cgroups may be
changed.  it's unreliable, and "if (l == &res->cgroups->tasks)" may be
false forever.

We don't need add any lock for fixing this bug.  we just access to struct
css_set by struct cg_cgroup_link, not by struct task_struct.

Since we hold css_set_lock, struct cg_cgroup_link is reliable.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: add lock for child->cgroups in cgroup_post_fork()
Lai Jiangshan [Thu, 8 Jan 2009 02:07:36 +0000 (18:07 -0800)]
cgroups: add lock for child->cgroups in cgroup_post_fork()

When cgroup_post_fork() is called, child is seen by find_task_by_vpid(),
so child->cgroups maybe be changed, It'll incorrect.

child->cgroups<old>'s refcnt is decreased
child->cgroups<new>'s refcnt is increased
but child->cg_list is added to child->cgroups<old>'s list.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: fix a typo in Kconfig
Li Zefan [Thu, 8 Jan 2009 02:07:35 +0000 (18:07 -0800)]
memcg: fix a typo in Kconfig

s/contoller/controller/

Signed-of-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agons_cgroup: remove unused spinlock
Li Zefan [Thu, 8 Jan 2009 02:07:34 +0000 (18:07 -0800)]
ns_cgroup: remove unused spinlock

I happened to find the spinlock in struct ns_cgroup is never used.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: remove some redundant NULL checks
Li Zefan [Thu, 8 Jan 2009 02:07:33 +0000 (18:07 -0800)]
cgroups: remove some redundant NULL checks

- In cgroup_clone(), if vfs_mkdir() returns successfully,
  dentry->d_fsdata will be the pointer to the newly created
  cgroup and won't be NULL.

- a cgroup file's dentry->d_fsdata won't be NULL, guaranteed
  by cgroup_add_file().

- When walking through the subsystems of a cgroup_fs (using
  for_each_subsys), cgrp->subsys[ss->subsys_id] won't be NULL,
  guaranteed by cgroup_create().

(Also remove 2 unused variables in cgroup_rmdir().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: documentation updates
Li Zefan [Thu, 8 Jan 2009 02:07:32 +0000 (18:07 -0800)]
cgroups: documentation updates

- remove 'releasable' since it has been moved to the debug subsys.
- update lock requirements of subsys callbacks.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: make cgroup config a submenu
KAMEZAWA Hiroyuki [Thu, 8 Jan 2009 02:07:30 +0000 (18:07 -0800)]
cgroups: make cgroup config a submenu

Making CGROUP related configs be a sub-menu.

This patch make CGROUP related configs be a sub-menu and makes 1st level
configs of "General Setup" shorter.

 including following additional changes
  - add help comment about CGROUPS and GROUP_SCHED.
  - moved MM_OWNER config to the bottom.
    (for good indent in menuconfig)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoquota: don't set grace time when user isn't above softlimit
Jan Kara [Thu, 8 Jan 2009 02:07:29 +0000 (18:07 -0800)]
quota: don't set grace time when user isn't above softlimit

do_set_dqblk() allowed SETDQBLK quotactl to set user's grace time even if
user was not above his softlimit.  This does not make much sence and by
coincidence causes quota code to omit softlimit warning when user really
exceeds softlimit.  This patch makes do_set_dqblk() reset user's grace
time if he has not exceeded softlimit.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocoda: fix fs/coda/sysctl.c build warnings when !CONFIG_SYSCTL
Richard A. Holden III [Thu, 8 Jan 2009 02:07:28 +0000 (18:07 -0800)]
coda: fix fs/coda/sysctl.c build warnings when !CONFIG_SYSCTL

Fix
fs/coda/sysctl.c:14: warning: 'fs_table_header' defined but not used
fs/coda/sysctl.c:44: warning: 'fs_table' defined but not used

these are only used when CONFIG_SYSCTL is defined.

Signed-off-by: Richard A. Holden III <aciddeath@gmail.com>
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agojbd: remove excess kernel-doc notation
Randy Dunlap [Thu, 8 Jan 2009 02:07:27 +0000 (18:07 -0800)]
jbd: remove excess kernel-doc notation

Remove excess kernel-doc from fs/jbd/transaction.c:

Warning(linux-2.6.28-git5//fs/jbd/transaction.c:764): Excess function parameter 'credits' description in 'journal_get_write_access'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoext3: tighten restrictions on inode flags
Duane Griffin [Thu, 8 Jan 2009 02:07:26 +0000 (18:07 -0800)]
ext3: tighten restrictions on inode flags

At the moment there are few restrictions on which flags may be set on
which inodes.  Specifically DIRSYNC may only be set on directories and
IMMUTABLE and APPEND may not be set on links.  Tighten that to disallow
TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
on non-regular file, non-directories.

Introduces a flags masking function which masks flags based on mode and
use it during inode creation and when flags are set via the ioctl to
facilitate future consistency.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoext3: don't inherit inappropriate inode flags from parent
Duane Griffin [Thu, 8 Jan 2009 02:07:26 +0000 (18:07 -0800)]
ext3: don't inherit inappropriate inode flags from parent

At present INDEX is the only flag that new ext3 inodes do NOT inherit from
their parent.  In addition prevent the flags DIRTY, ECOMPR, IMAGIC and
TOPDIR from being inherited.  List inheritable flags explicitly to prevent
future flags from accidentally being inherited.

This fixes the TOPDIR flag inheritance bug reported at
http://bugzilla.kernel.org/show_bug.cgi?id=9866.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoext3: allocate ->s_blockgroup_lock separately
Pekka Enberg [Thu, 8 Jan 2009 02:07:25 +0000 (18:07 -0800)]
ext3: allocate ->s_blockgroup_lock separately

As spotted by kmemtrace, struct ext3_sb_info is 17152 bytes on 64-bit
which makes it a very bad fit for SLAB allocators.  The culprit of the
wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
NR_CPUS >= 32.

To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2
page in the worst case, separately.  This shinks down struct ext3_sb_info
enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of
32 KB saving 15 KB of memory.

Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agojbd: improve fsync batching
Josef Bacik [Thu, 8 Jan 2009 02:07:24 +0000 (18:07 -0800)]
jbd: improve fsync batching

There is a flaw with the way jbd handles fsync batching.  If we fsync() a
file and we were not the last person to run fsync() on this fs then we
automatically sleep for 1 jiffie in order to wait for new writers to join
into the transaction before forcing the commit.  The problem with this is
that with really fast storage (ie a Clariion) the time it takes to commit
a transaction to disk is way faster than 1 jiffie in most cases, so
sleeping means waiting longer with nothing to do than if we just committed
the transaction and kept going.  Ric Wheeler noticed this when using
fs_mark with more than 1 thread, the throughput would plummet as he added
more threads.

This patch attempts to fix this problem by recording the average time in
nanoseconds that it takes to commit a transaction to disk, and what time
we started the transaction.  If we run an fsync() and we have been running
for less time than it takes to commit the transaction to disk, we sleep
for the delta amount of time and then commit to disk.  We acheive
sub-jiffie sleeping using schedule_hrtimeout.  This means that the wait
time is auto-tuned to the speed of the underlying disk, instead of having
this static timeout.  I weighted the average according to somebody's
comments (Andreas Dilger I think) in order to help normalize random
outliers where we take way longer or way less time to commit than the
average.  I also have a min() check in there to make sure we don't sleep
longer than a jiffie in case our storage is super slow, this was requested
by Andrew.

I unfortunately do not have access to a Clariion, so I had to use a
ramdisk to represent a super fast array.  I tested with a SATA drive with
barrier=1 to make sure there was no regression with local disks, I tested
with a 4 way multipathed Apple Xserve RAID array and of course the
ramdisk.  I ran the following command

fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i

where $i was 2, 4, 8, 16 and 32.  I mkfs'ed the fs each time.  Here are my
results

type threads with patch without patch
sata 2 24.6 26.3
sata 4 49.2 48.1
sata 8 70.1 67.0
sata 16 104.0 94.1
sata 32 153.6 142.7

xserve 2 246.4 222.0
xserve 4 480.0 440.8
xserve 8 829.5 730.8
xserve 16 1172.7 1026.9
xserve 32 1816.3 1650.5

ramdisk 2 2538.3 1745.6
ramdisk 4 2942.3 661.9
ramdisk 8 2882.5 999.8
ramdisk 16 2738.7 1801.9
ramdisk 32 2541.9 2394.0

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ric Wheeler <rwheeler@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoext2: tighten restrictions on inode flags
Duane Griffin [Thu, 8 Jan 2009 02:07:21 +0000 (18:07 -0800)]
ext2: tighten restrictions on inode flags

At the moment there are few restrictions on which flags may be set on
which inodes.  Specifically DIRSYNC may only be set on directories and
IMMUTABLE and APPEND may not be set on links.  Tighten that to disallow
TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
on non-regular file, non-directories.

Introduces a flags masking function which masks flags based on mode and
use it during inode creation and when flags are set via the ioctl to
facilitate future consistency.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoext2: don't inherit inappropriate inode flags from parent
Duane Griffin [Thu, 8 Jan 2009 02:07:20 +0000 (18:07 -0800)]
ext2: don't inherit inappropriate inode flags from parent

At present BTREE/INDEX is the only flag that new ext2 inodes do NOT
inherit from their parent.  In addition prevent the flags DIRTY, ECOMPR,
INDEX, IMAGIC and TOPDIR from being inherited.  List inheritable flags
explicitly to prevent future flags from accidentally being inherited.

This fixes the TOPDIR flag inheritance bug reported at
http://bugzilla.kernel.org/show_bug.cgi?id=9866.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoext2: allocate ->s_blockgroup_lock separately
Pekka J Enberg [Thu, 8 Jan 2009 02:07:19 +0000 (18:07 -0800)]
ext2: allocate ->s_blockgroup_lock separately

As spotted by kmemtrace, struct ext2_sb_info is 17024 bytes on 64-bit
which makes it a very bad fit for SLAB allocators.  The culprit of the
wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
NR_CPUS >= 32.

To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2
page in the worst case, separately.  This shinks down struct ext2_sb_info
enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of
32 KB saving 15 KB of memory.

Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoext2: fix ext2_splice_branch() comments
Qinghuang Feng [Thu, 8 Jan 2009 02:07:18 +0000 (18:07 -0800)]
ext2: fix ext2_splice_branch() comments

There is no argument named @chain in ext2_splice_branch, remove references
to it.

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agortc-ds1307: remove legacy probe() checks
Jüri Reitel [Thu, 8 Jan 2009 02:07:16 +0000 (18:07 -0800)]
rtc-ds1307: remove legacy probe() checks

Remove RTC register value checks from the rtc-ds1307 probe() function.
They were left over from the legacy style I2C driver, which had to defend
against finding a non-RTC chip when the driver was probed.

Also fix a minor glitch in the alarm support: DS1307 chips don't have
alarms, so name those methods after one of the chips which actually *do*
have alarms (DS1337).

Signed-off-by: Jüri Reitel <juri.reitel@liewenthal.ee>
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Sebastien Barre <sbarre@sdelcc.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Jean Delvare <khali@linux-fr.org>
Cc: Rodolfo Giometti <giometti@enneenne.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agortc-ds1307: SMBus compatibility
BARRE Sebastien [Thu, 8 Jan 2009 02:07:13 +0000 (18:07 -0800)]
rtc-ds1307: SMBus compatibility

Change i2c access functions to SMBus access functions in order to use the
ds1307 with SMBus adapter.

Signed-off-by: Sebastien Barre <sbarre@sdelcc.com>
Acked-by: David Brownell <david-b@pacbell.net>
Tested-by: David Brownell <david-b@pacbell.net>
Acked-by: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Jean Delvare <khali@linux-fr.org>
Cc: Rodolfo Giometti <giometti@enneenne.com>
Tested-by: Sebastien Barre <sbarre@sdelcc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoxen: add xenfs to allow usermode <-> Xen interaction
Alex Zeffertt [Thu, 8 Jan 2009 02:07:11 +0000 (18:07 -0800)]
xen: add xenfs to allow usermode <-> Xen interaction

The xenfs filesystem exports various interfaces to usermode.  Initially
this exports a file to allow usermode to interact with xenbus/xenstore.

Traditionally this appeared in /proc/xen.  Rather than extending procfs,
this patch adds a backward-compat mountpoint on /proc/xen, and provides
a xenfs filesystem which can be mounted there.

Signed-off-by: Alex Zeffertt <alex.zeffertt@eu.citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodrivers/xen/xenbus/xenbus_client.c: cleanup kerneldoc
Qinghuang Feng [Thu, 8 Jan 2009 02:07:10 +0000 (18:07 -0800)]
drivers/xen/xenbus/xenbus_client.c: cleanup kerneldoc

no argument named @xbt in xenbus_switch_state(), remove it.

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years ago[ARM] fix pxa930_trkball build errors
Russell King [Thu, 8 Jan 2009 12:27:00 +0000 (12:27 +0000)]
[ARM] fix pxa930_trkball build errors

drivers/input/mouse/pxa930_trkball.c: In function `pxa930_trkball_probe':
drivers/input/mouse/pxa930_trkball.c:189: error: `ret' undeclared (first use in this function)
drivers/input/mouse/pxa930_trkball.c:230: error: `ret' undeclared (first use in this function)

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] fix netx
Russell King [Thu, 8 Jan 2009 12:02:27 +0000 (12:02 +0000)]
[ARM] fix netx

2fcfe6b872b21639dcffbaf3ca2a84ec01d104e0 missed out on the cpumask
updates; update netx for these changes.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] fix pnx4008
Russell King [Thu, 8 Jan 2009 11:00:03 +0000 (11:00 +0000)]
[ARM] fix pnx4008

arch/arm/mach-pnx4008/include/mach/gpio.h:214: error: implicit declaration of function 'IO_ADDRESS'

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] fix pxa
Russell King [Thu, 8 Jan 2009 10:57:08 +0000 (10:57 +0000)]
[ARM] fix pxa

arch/arm/mach-pxa/pxa300.c:94: error: 'CKEN_MMC3' undeclared here (not in a function)

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] remove missed CLPS7500 defconfig
Russell King [Fri, 2 Jan 2009 13:56:55 +0000 (13:56 +0000)]
[ARM] remove missed CLPS7500 defconfig

635f0258e5ae526034486b4ae9020e64bfb7d27e missed removing the defconfig

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] clps711x: fix warning in edb7211-mm.c
Russell King [Fri, 2 Jan 2009 12:53:37 +0000 (12:53 +0000)]
[ARM] clps711x: fix warning in edb7211-mm.c

Fix:
include/asm-generic/pgtable.h:305: warning: 'struct vm_area_struct' declared inside parameter list
include/asm-generic/pgtable.h:305: warning: its scope is only this definition or declaration, which is probably not what you want
include/asm-generic/pgtable.h:317: warning: 'struct vm_area_struct' declared inside parameter list
include/asm-generic/pgtable.h:331: warning: 'struct vm_area_struct' declared inside parameter list

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] clps711x: fix warning in fortunet
Russell King [Fri, 2 Jan 2009 12:49:01 +0000 (12:49 +0000)]
[ARM] clps711x: fix warning in fortunet

Fix:

arch/arm/include/asm/irq.h:26: warning: 'struct pt_regs' declared inside parameter list

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] fix AT91, davinci, h720x, ks8695, msm, mx2, mx3, netx, omap1, omap2, pxa, s3c
Russell King [Thu, 8 Jan 2009 10:01:47 +0000 (10:01 +0000)]
[ARM] fix AT91, davinci, h720x, ks8695, msm, mx2, mx3, netx, omap1, omap2, pxa, s3c

arch/arm/mach-at91/at91cap9.c:337: error: 'NR_AIC_IRQS' undeclared here (not in a function)
arch/arm/mach-at91/at91rm9200.c:301: error: 'NR_AIC_IRQS' undeclared here (not in a function)
arch/arm/mach-at91/at91sam9260.c:351: error: 'NR_AIC_IRQS' undeclared here (not in a function)
arch/arm/mach-at91/at91sam9261.c:287: error: 'NR_AIC_IRQS' undeclared here (not in a function)
arch/arm/mach-at91/at91sam9263.c:312: error: 'NR_AIC_IRQS' undeclared here (not in a function)
arch/arm/mach-at91/at91sam9rl.c:304: error: 'NR_AIC_IRQS' undeclared here (not in a function)
arch/arm/mach-h720x/h7202-eval.c:38: error: implicit declaration of function 'IRQ_CHAINED_GPIOB'
arch/arm/mach-ks8695/devices.c:46: error: 'KS8695_IRQ_WAN_RX_STATUS' undeclared here (not in a function)
arch/arm/mach-msm/devices.c:28: error: 'INT_UART1' undeclared here (not in a function)
arch/arm/mach-mx2/devices.c:233: error: 'MXC_GPIO_IRQ_START' undeclared here (not in a function)
arch/arm/mach-mx3/devices.c:128: error: 'MXC_GPIO_IRQ_START' undeclared here (not in a function)
arch/arm/mach-omap1/mcbsp.c:140: error: 'INT_730_McBSP1RX' undeclared here (not in a function)
arch/arm/mach-omap1/mcbsp.c:165: error: 'INT_McBSP1RX' undeclared here (not in a function)
arch/arm/mach-omap1/mcbsp.c:200: error: 'INT_McBSP1RX' undeclared here (not in a function)
arch/arm/mach-omap2/board-apollon.c:286: error: implicit declaration of function 'omap_set_gpio_direction'
arch/arm/mach-omap2/mcbsp.c:154: error: 'INT_24XX_MCBSP1_IRQ_RX' undeclared here (not in a function)
arch/arm/mach-omap2/mcbsp.c:181: error: 'INT_24XX_MCBSP1_IRQ_RX' undeclared here (not in a function)
arch/arm/mach-pxa/e350.c:36: error: 'IRQ_BOARD_START' undeclared here (not in a function)
arch/arm/plat-s3c/dev-i2c0.c:32: error: 'IRQ_IIC' undeclared here (not in a function)
...

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] Fix realview build
Russell King [Thu, 8 Jan 2009 09:58:51 +0000 (09:58 +0000)]
[ARM] Fix realview build

arch/arm/mach-realview/platsmp.c:140: error: 'jiffies' undeclared (first use in this function)
drivers/amba/bus.c:246: error: 'NO_IRQ' undeclared (first use in this function)

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] 5357/1: Kirkwood: add missing ge01 tclk initialization
Nicolas Pitre [Tue, 6 Jan 2009 22:02:08 +0000 (23:02 +0100)]
[ARM] 5357/1: Kirkwood: add missing ge01 tclk initialization

Otherwise the mv643xx_eth driver will assume 133 MHz which is incorrect.

Signed-off-by: Nicolas Pitre <nico@marvell.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] 5358/1: AT2440EVB: Use new include path of mci.h
Ramax Lo [Wed, 7 Jan 2009 02:28:31 +0000 (03:28 +0100)]
[ARM] 5358/1: AT2440EVB: Use new include path of mci.h

Since mci.h has been moved, use the new include path.

Signed-off-by: Ramax Lo <ramaxlo@gmail.com>
Acked-by: Ben Dooks <ben-linux@fluff.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] 5361/1: mv78xx0: fix compilation error
Nicolas Pitre [Wed, 7 Jan 2009 03:58:23 +0000 (04:58 +0100)]
[ARM] 5361/1: mv78xx0: fix compilation error

Commit ba84be2338d3 broke the build.

Signed-off-by: Nicolas Pitre <nico@marvell.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] 5360/1: Orion: fix compilation error
Nicolas Pitre [Wed, 7 Jan 2009 03:52:58 +0000 (04:52 +0100)]
[ARM] 5360/1: Orion: fix compilation error

Commit ba84be2338d3 broke the build.

Signed-off-by: Nicolas Pitre <nico@marvell.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] 5359/1: Kirkwood: fix compilation error
Nicolas Pitre [Wed, 7 Jan 2009 03:47:02 +0000 (04:47 +0100)]
[ARM] 5359/1: Kirkwood: fix compilation error

Commit ba84be2338d3 broke the build.

Signed-off-by: Nicolas Pitre <nico@marvell.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
15 years ago[ARM] S3C64XX: Fix EINT group macro definition
Matt Hsu [Thu, 8 Jan 2009 08:27:30 +0000 (16:27 +0800)]
[ARM] S3C64XX: Fix EINT group macro definition

Fix IRQ_EINT_GROUP which has an extra _ in it and
an error in the IRQ offset.

Signed-off-by: Matt Hsu <matt_hsu@openmoko.org>
[ben-linux@fluff.org: rewrite description]
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
15 years ago[ARM] Ensure CONFIG_SERIAL_SAMSUNG_UARTS is always set.
Ben Dooks [Thu, 8 Jan 2009 13:21:17 +0000 (13:21 +0000)]
[ARM] Ensure CONFIG_SERIAL_SAMSUNG_UARTS is always set.

Always set CONFIG_SERIAL_SAMSUNG_UARTS when building any
of the S3C platforms as even if the driver is not selected
there it is still the facility for the machine files to
register configuration data for the possibility of the
driver being built.

Signed-off-by: Ben Dooks <ben-linux@fluff.org>
15 years ago[ARM] S3C24XX: Add gpio_to_irq implementation
Ben Dooks [Thu, 8 Jan 2009 12:40:50 +0000 (12:40 +0000)]
[ARM] S3C24XX: Add gpio_to_irq implementation

Add to_irq field handlers for the GPIO banks that are configurable
as interripts.

Signed-off-by: Ben Dooks <ben-linux@fluff.org>
15 years ago[ARM] S3C24XX: Add gpio_to_irq() facility
Ben Dooks [Thu, 8 Jan 2009 12:33:11 +0000 (12:33 +0000)]
[ARM] S3C24XX: Add gpio_to_irq() facility

Add gpio_to_irq() by re-directing the call to the
generic __gpio_to_irq() code in the gpiolib.

Signed-off-by: Ben Dooks <ben-linux@fluff.org>
15 years agoasync: Don't call async_synchronize_full_special() while holding sb_lock
Dave Kleikamp [Thu, 8 Jan 2009 15:46:31 +0000 (09:46 -0600)]
async: Don't call async_synchronize_full_special() while holding sb_lock

sync_filesystems() shouldn't be calling async_synchronize_full_special
while holding a spinlock.  The second while loop in that function is the
right place for this anyway.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Reported-by: Grissiom <chaos.proton@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years ago[ARM] footbridge: dc21285.c warning fixes
Ben Dooks [Thu, 8 Jan 2009 15:42:42 +0000 (15:42 +0000)]
[ARM] footbridge: dc21285.c warning fixes

The dc21285 requests a number of IRQs that it doesn't really
care whether they get added. Change to use a macro that ensures
that at-least the user gets warned if they fail to add, which
also stops the warnings from __unused_result on request_irq().

dc21285.c:337: warning: ignoring return value of 'request_irq', declared with attribute warn_unused_result
dc21285.c:339: warning: ignoring return value of 'request_irq', declared with attribute warn_unused_result
dc21285.c:341: warning: ignoring return value of 'request_irq', declared with attribute warn_unused_result
dc21285.c:343: warning: ignoring return value of 'request_irq', declared with attribute warn_unused_result
dc21285.c:345: warning: ignoring return value of 'request_irq', declared with attribute warn_unused_result

Signed-off-by: Ben Dooks <ben-linux@fluff.org>
15 years ago[ARM] footbridge: add isa_init_irq() to common header
Ben Dooks [Thu, 8 Jan 2009 15:42:41 +0000 (15:42 +0000)]
[ARM] footbridge: add isa_init_irq() to common header

isa_init_irq() is defined in arch/arm/mach-footbridge/isa-irq.c
and used in arch/arm/mach-footbridge/common.c but there is no
definition in any header. Move the definition in common.c to
common.h to stop the sparse warning:

isa-irq.c:118:13: warning: symbol 'isa_init_irq' was not declared.

Signed-off-by: Ben Dooks <ben-linux@fluff.org>
15 years ago[ARM] arch/arm/kernel/isa.c: missing definition of register_isa_ports
Ben Dooks [Thu, 8 Jan 2009 15:42:40 +0000 (15:42 +0000)]
[ARM] arch/arm/kernel/isa.c: missing definition of register_isa_ports

arch/arm/kernel/isa.c should include <linux/io.h> to get the
definition of register_io_ports() at-least when compiling for
footbridge to fix the following sparse warning:

isa.c:68:1: warning: symbol 'register_isa_ports' was not declared.

Signed-off-by: Ben Dooks <ben-linux@fluff.org>
15 years agoserial: Add driver for the Cell Network Processor serial port NWP device
Benjamin Krill [Wed, 7 Jan 2009 09:32:38 +0000 (10:32 +0100)]
serial: Add driver for the Cell Network Processor serial port NWP device

Add support for the nwp serial device which is connected to a DCR bus. It
uses the of_serial device driver to determine necessary properties from
the device tree.  The supported device is added as serial port number 85.

NWP stands for network processor and it is part of the QPACE - Quantum
Chromodynamics Parallel Computing on the Cell Broadband Engine project.
The implementation is a lightweight uart implementation with the focus
to consume as little resources as possible and it is connected to a
DCR bus.

Signed-off-by: Benjamin Krill <ben@codiert.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agopowerpc: enable dynamic ftrace
Steven Rostedt [Tue, 6 Jan 2009 18:49:17 +0000 (18:49 +0000)]
powerpc: enable dynamic ftrace

This patch enables dynamic ftrace. The PowerPC port was dependent on
other code not yet in mainline. Now that the code is, we can now
let PowerPC compile with dynamic ftrace.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agopowerpc/cell: Fix the prototype of create_vma_map()
Stephen Rothwell [Tue, 6 Jan 2009 13:58:22 +0000 (13:58 +0000)]
powerpc/cell: Fix the prototype of create_vma_map()

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agopowerpc/mm: Make clear_fixmap() actually work
Anton Vorontsov [Mon, 29 Dec 2008 06:40:35 +0000 (06:40 +0000)]
powerpc/mm: Make clear_fixmap() actually work

The clear_fixmap() routine issues map_page() with flags set to 0.
Currently this causes a BUG_ON() inside the map_page(), as it assumes
that a PTE should be clear before mapping.

This patch makes the map_page() to trigger the BUG_ON() only if the
flags were set.

Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Acked-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agopowerpc/kdump: Use ppc_save_regs() in crash_setup_regs()
Anton Vorontsov [Wed, 17 Dec 2008 10:09:35 +0000 (10:09 +0000)]
powerpc/kdump: Use ppc_save_regs() in crash_setup_regs()

The patch replaces internal registers dump implementation with
ppc_save_regs(). From now on PPC64 and PPC32 are using the same
code for crash_setup_regs().

NOTE: The old regs dump implementation was capturing SP (r1) directly
as is, so you could see crash_kexec() function on top of the back-trace.
But ppc_save_regs() goes up one stack frame, so you'll not see it
anymore, at the top-level you'll see who actually triggered the crash
dump instead.

Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agopowerpc: Export cacheable_memzero as its now used in a driver
Kumar Gala [Wed, 7 Jan 2009 05:00:05 +0000 (23:00 -0600)]
powerpc: Export cacheable_memzero as its now used in a driver

The Freescale PowerPC specific gianfar driver (gig-e) uses
cacheable_memzero for performance reasons we need to export
the symbol to allow the driver to be built as a module.

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agopowerpc: Fix missing semicolons in mmu_decl.h
Benjamin Herrenschmidt [Tue, 6 Jan 2009 17:56:51 +0000 (17:56 +0000)]
powerpc: Fix missing semicolons in mmu_decl.h

This is a brown paper bag from one of my earlier patches that
breaks build on 40x and 8xx.

And yes, I've now added 40x and 8xx to my list of test configs :-)

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agopowerpc/pasemi: local_irq_save uses an unsigned long
Ingo Molnar [Tue, 6 Jan 2009 14:06:28 +0000 (14:06 +0000)]
powerpc/pasemi: local_irq_save uses an unsigned long

[Split from a larger patch - sfr]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agopowerpc/cell: Fix some u64 vs. long types
Ingo Molnar [Tue, 6 Jan 2009 14:03:44 +0000 (14:03 +0000)]
powerpc/cell: Fix some u64 vs. long types

in/out_be64() work on u64s.

The first parameter to ppc_md.ioremap is a phys_addr_t.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agopowerpc/cell: Use correct types in beat files
Ingo Molnar [Tue, 6 Jan 2009 14:01:23 +0000 (14:01 +0000)]
powerpc/cell: Use correct types in beat files

Only pass the address of a u64 if that is what the function requires.

[Split out of a larger patch - sfr]
[update comment - sfr]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agopowerpc: Use correct type in prom_init.c
Ingo Molnar [Tue, 6 Jan 2009 13:56:52 +0000 (13:56 +0000)]
powerpc: Use correct type in prom_init.c

tce_entryp is a "u64 *" not an "unsigned long *".

[Split from a large patch -sfr]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agopowerpc: Remove unnecessary casts
Stephen Rothwell [Tue, 6 Jan 2009 13:54:25 +0000 (13:54 +0000)]
powerpc: Remove unnecessary casts

of_get_flat_dt_prop() returns a "void *", so we don't need to cast when
assigning its result to a pointer variable.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agomtd/ps3vram: Use _PAGE_NO_CACHE in memory ioremap
Geoff Levand [Thu, 8 Jan 2009 01:22:07 +0000 (17:22 -0800)]
mtd/ps3vram: Use _PAGE_NO_CACHE in memory ioremap

Use _PAGE_NO_CACHE for gpu memory ioremap.  Also,
add __iomem attribute to gpu memory pointer and
change use of memset() to memset_io().

Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agomtd/ps3vram: Use msleep in waits
Geoff Levand [Thu, 8 Jan 2009 01:22:02 +0000 (17:22 -0800)]
mtd/ps3vram: Use msleep in waits

Replace the use of udelay() with msleep() in the looping wait routines
ps3vram_notifier_wait() and ps3vram_wait_ring().

Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agomtd/ps3vram: Use proper kernel types
Geoff Levand [Tue, 6 Jan 2009 11:32:28 +0000 (11:32 +0000)]
mtd/ps3vram: Use proper kernel types

Replace the use of stdint.h types with kernel types
in the ps3vram driver.

Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agomtd/ps3vram: Cleanup ps3vram driver messages
Geoff Levand [Tue, 6 Jan 2009 11:32:21 +0000 (11:32 +0000)]
mtd/ps3vram: Cleanup ps3vram driver messages

Cleanup the ps3vram driver messages.  Add a new struct device pointer
variable dev to struct ps3vram_priv and use dev_dbg(), pr_dbg(), etc.
where appropriate.

Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
15 years agomtd/ps3vram: Remove ps3vram debug routines
Geoff Levand [Tue, 6 Jan 2009 11:32:15 +0000 (11:32 +0000)]
mtd/ps3vram: Remove ps3vram debug routines

Remove the ps3vram debug routines ps3vram_dump_ring() and
ps3vram_dump_reports().  These routines are not needed.

Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>