My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch changed gfp_mask
of callers of charge to be GFP_HIGHUSER_MOVABLE for showing what will
happen at memory reclaim.
But in recent discussion, it's NACKed because it sounds ugly.
This patch is for reverting it and add some clean up to gfp_mask of
callers of charge. No behavior change but need review before generating
HUNK in deep queue.
This patch also adds explanation to meaning of gfp_mask passed to charge
functions in memcontrol.h.
Current mmtom has new oom function as pagefault_out_of_memory(). It's
added for select bad process rathar than killing current.
When memcg hit limit and calls OOM at page_fault, this handler called and
system-wide-oom handling happens. (means kernel panics if panic_on_oom is
true....)
To avoid overkill, check memcg's recent behavior before starting
system-wide-oom.
And this patch also fixes to guarantee "don't accnout against process with
TIF_MEMDIE". This is necessary for smooth OOM.
Balbir Singh [Thu, 8 Jan 2009 02:08:07 +0000 (18:08 -0800)]
memcg: memory cgroup hierarchy feature selector
Don't enable multiple hierarchy support by default. This patch introduces
a features element that can be set to enable the nested depth hierarchy
feature. This feature can only be enabled when the cgroup for which the
feature this is enabled, has no children.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Balbir Singh [Thu, 8 Jan 2009 02:08:06 +0000 (18:08 -0800)]
memcg: memory cgroup hierarchical reclaim
This patch introduces hierarchical reclaim. When an ancestor goes over
its limit, the charging routine points to the parent that is above its
limit. The reclaim process then starts from the last scanned child of the
ancestor and reclaims until the ancestor goes below its limit.
[akpm@linux-foundation.org: coding-style fixes]
[d-nishimura@mtf.biglobe.ne.jp: mem_cgroup_from_res_counter should handle both mem->res and mem->memsw] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Balbir Singh [Thu, 8 Jan 2009 02:08:05 +0000 (18:08 -0800)]
memcg: memory cgroup resource counters for hierarchy
Add support for building hierarchies in resource counters. Cgroups allows
us to build a deep hierarchy, but we currently don't link the resource
counters belonging to the memory controller control groups, in the same
fashion as the corresponding cgroup entries in the cgroup hierarchy. This
patch provides the infrastructure for resource counters that have the same
hiearchy as their cgroup counter parts.
These set of patches are based on the resource counter hiearchy patches
posted by Pavel Emelianov.
NOTE: Building hiearchies is expensive, deeper hierarchies imply charging
the all the way up to the root. It is known that hiearchies are
expensive, so the user needs to be careful and aware of the trade-offs
before creating very deep ones.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now,
- page_cgroup is linked to mem_cgroup's its own LRU (per zone).
- LRU of page_cgroup is not synchronous with global LRU.
- page and page_cgroup is one-to-one and statically allocated.
- To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as
- lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);
- SwapCache is handled.
And, when we handle LRU list of page_cgroup, we do following.
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc); .....................(1)
mz = page_cgroup_zoneinfo(pc);
spin_lock(&mz->lru_lock);
.....add to LRU
spin_unlock(&mz->lru_lock);
unlock_page_cgroup(pc);
But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock.
So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct.
This is a trial to remove this dirty nesting of locks.
This patch changes mz->lru_lock to be zone->lru_lock.
Then, above sequence will be written as
spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
mem_cgroup_add/remove/etc_lru() {
pc = lookup_page_cgroup(page);
mz = page_cgroup_zoneinfo(pc);
if (PageCgroupUsed(pc)) {
....add to LRU
}
spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
This is much simpler.
(*) We're safe even if we don't take lock_page_cgroup(pc). Because..
1. When pc->mem_cgroup can be modified.
- at charge.
- at account_move().
2. at charge
the PCG_USED bit is not set before pc->mem_cgroup is fixed.
3. at account_move()
the page is isolated and not on LRU.
Pros.
- easy for maintenance.
- memcg can make use of laziness of pagevec.
- we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup.
- LRU status of memcg will be synchronized with global LRU's one.
- # of locks are reduced.
- account_move() is simplified very much.
Cons.
- may increase cost of LRU rotation.
(no impact if memcg is not configured.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch implements per cgroup limit for usage of memory+swap. However
there are SwapCache, double counting of swap-cache and swap-entry is
avoided.
Mem+Swap controller works as following.
- memory usage is limited by memory.limit_in_bytes.
- memory + swap usage is limited by memory.memsw_limit_in_bytes.
This has following benefits.
- A user can limit total resource usage of mem+swap.
Without this, because memory resource controller doesn't take care of
usage of swap, a process can exhaust all the swap (by memory leak.)
We can avoid this case.
And Swap is shared resource but it cannot be reclaimed (goes back to memory)
until it's used. This characteristic can be trouble when the memory
is divided into some parts by cpuset or memcg.
Assume group A and group B.
After some application executes, the system can be..
Group A -- very large free memory space but occupy 99% of swap.
Group B -- under memory shortage but cannot use swap...it's nearly full.
Ability to set appropriate swap limit for each group is required.
Maybe someone wonder "why not swap but mem+swap ?"
- The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
to move account from memory to swap...there is no change in usage of
mem+swap.
In other words, when we want to limit the usage of swap without affecting
global LRU, mem+swap limit is better than just limiting swap.
Accounting target information is stored in swap_cgroup which is
per swap entry record.
Charge is done as following.
map
- charge page and memsw.
unmap
- uncharge page/memsw if not SwapCache.
swap-out (__delete_from_swap_cache)
- uncharge page
- record mem_cgroup information to swap_cgroup.
swap-in (do_swap_page)
- charged as page and memsw.
record in swap_cgroup is cleared.
memsw accounting is decremented.
swap-free (swap_free())
- if swap entry is freed, memsw is uncharged by PAGE_SIZE.
There are people work under never-swap environments and consider swap as
something bad. For such people, this mem+swap controller extension is just an
overhead. This overhead is avoided by config or boot option.
(see Kconfig. detail is not in this patch.)
TODO:
- maybe more optimization can be don in swap-in path. (but not very safe.)
But we just do simple accounting at this stage.
[nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
[hugh@veritas.com: memswap controller core swapcache fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For accounting swap, we need a record per swap entry, at least.
This patch adds following function.
- swap_cgroup_swapon() .... called from swapon
- swap_cgroup_swapoff() ... called at the end of swapoff.
- swap_cgroup_record() .... record information of swap entry.
- swap_cgroup_lookup() .... lookup information of swap entry.
This patch just implements "how to record information". No actual method
for limit the usage of swap. These routine uses flat table to record and
lookup. "wise" lookup system like radix-tree requires requires memory
allocation at new records but swap-out is usually called under memory
shortage (or memcg hits limit.) So, I used static allocation. (maybe
dynamic allocation is not very hard but it adds additional memory
allocation in memory shortage path.)
Note1: In this, we use pointer to record information and this means
8bytes per swap entry. I think we can reduce this when we
create "id of cgroup" in the range of 0-65535 or 0-255.
Config and control variable for mem+swap controller.
This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
(memory resource controller swap extension.)
For accounting swap, it's obvious that we have to use additional memory to
remember "who uses swap". This adds more overhead. So, it's better to
offer "choice" to users. This patch adds 2 choices.
This patch adds 2 parameters to enable swap extension or not.
- CONFIG
- boot option
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SwapCache support for memory resource controller (memcg)
Before mem+swap controller, memcg itself should handle SwapCache in proper
way. This is cut-out from it.
In current memcg, SwapCache is just leaked and the user can create tons of
SwapCache. This is a leak of account and should be handled.
SwapCache accounting is done as following.
charge (anon)
- charged when it's mapped.
(because of readahead, charge at add_to_swap_cache() is not sane)
uncharge (anon)
- uncharged when it's dropped from swapcache and fully unmapped.
means it's not uncharged at unmap.
Note: delete from swap cache at swap-in is done after rmap information
is established.
charge (shmem)
- charged at swap-in. this prevents charge at add_to_page_cache().
uncharge (shmem)
- uncharged when it's dropped from swapcache and not on shmem's
radix-tree.
at migration, check against 'old page' is modified to handle shmem.
Comparing to the old version discussed (and caused troubles), we have
advantages of
- PCG_USED bit.
- simple migrating handling.
So, situation is much easier than several months ago, maybe.
By memcg-move-all-accounts-to-parent-at-rmdir.patch, there is no leak of
memory usage and force_empty is removed.
This patch adds "force_empty" again, in reasonable manner.
memory.force_empty file works when
#echo 0 (or some) > memory.force_empty
and have following function.
1. only works when there are no task in this cgroup.
2. free all page under this cgroup as much as possible.
3. page which cannot be freed will be moved up to parent.
4. Then, memcg will be empty after above echo returns.
This is much better behavior than old "force_empty" which just forget
all accounts. This patch also check signal_pending() and above "echo"
can be stopped by "Ctrl-C".
[akpm@linux-foundation.org: cleanup] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch provides a function to move account information of a page
between mem_cgroups and rewrite force_empty to make use of this.
This moving of page_cgroup is done under
- lru_lock of source/destination mem_cgroup is held.
- lock_page_cgroup() is held.
Then, a routine which touches pc->mem_cgroup without lock_page_cgroup()
should confirm pc->mem_cgroup is still valid or not. Typical code can be
following.
(while page is not under lock_page())
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc)
spin_lock_irqsave(&mz->lru_lock);
if (pc->mem_cgroup == mem)
...../* some list handling */
spin_unlock_irqrestore(&mz->lru_lock);
Of course, better way is
lock_page_cgroup(pc);
....
unlock_page_cgroup(pc);
But you should confirm the nest of lock and avoid deadlock.
If you treats page_cgroup from mem_cgroup's LRU under mz->lru_lock,
you don't have to worry about what pc->mem_cgroup points to.
moved pages are added to head of lru, not to tail.
Expected users of this routine is:
- force_empty (rmdir)
- moving tasks between cgroup (for moving account information.)
- hierarchy (maybe useful.)
force_empty(rmdir) uses this move_account and move pages to its parent.
This "move" will not cause OOM (I added "oom" parameter to try_charge().)
If the parent is busy (not enough memory), force_empty calls try_to_free_page()
and reduce usage.
Purpose of this behavior is
- Fix "forget all" behavior of force_empty and avoid leak of accounting.
- By "moving first, free if necessary", keep pages on memory as much as
possible.
Adding a switch to change behavior of force_empty to
- free first, move if necessary
- free all, if there is mlocked/busy pages, return -EBUSY.
is under consideration. (I'll add if someone requtests.)
This patch also removes memory.force_empty file, a brutal debug-only interface.
memcg: do not recalculate section unnecessarily in init_section_page_cgroup
In init_section_page_cgroup() the section a given pfn belongs to is
calculated at the top of the function and, despite the fact that the
pfn/section correspondence does not change, it is recalculated further
down the same function. By computing this just once and reusing that
value we save some bytes in the object file and do not waste CPU cycles.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, management of "charge" under page migration is done under following
manner. (Assume migrate page contents from oldpage to newpage)
before
- "newpage" is charged before migration.
at success.
- "oldpage" is uncharged at somewhere(unmap, radix-tree-replace)
at failure
- "newpage" is uncharged.
- "oldpage" is charged if necessary (*1)
But (*1) is not reliable....because of GFP_ATOMIC.
This patch tries to change behavior as following by charge/commit/cancel ops.
before
- charge PAGE_SIZE (no target page)
success
- commit charge against "newpage".
failure
- commit charge against "oldpage".
(PCG_USED bit works effectively to avoid double-counting)
- if "oldpage" is obsolete, cancel charge of PAGE_SIZE.
Now, most of callers of mem_cgroup_charge_xxx functions uses GFP_KERNEL.
I think that this is from the fact that page_cgroup *was* dynamically
allocated.
But now, we allocate all page_cgroup at boot. And
mem_cgroup_try_to_free_pages() reclaim memory from GFP_HIGHUSER_MOVABLE +
specified GFP_RECLAIM_MASK.
* This is because we just want to reduce memory usage.
"Where we should reclaim from ?" is not a problem in memcg.
This patch modifies gfp masks to be GFP_HIGUSER_MOVABLE if possible.
Note: This patch is not for fixing behavior but for showing sane information
in source code.
memcg: introduce charge-commit-cancel style of functions
There is a small race in do_swap_page(). When the page swapped-in is
charged, the mapcount can be greater than 0. But, at the same time some
process (shares it ) call unmap and make mapcount 1->0 and the page is
uncharged.
For fixing this, I added a new interface.
- charge
account to res_counter by PAGE_SIZE and try to free pages if necessary.
- commit
register page_cgroup and add to LRU if necessary.
- cancel
uncharge PAGE_SIZE because of do_swap_page failure.
CPUA
(1) charge (always)
(2) set page's rmap (mapcount > 0)
(3) commit charge was necessary or not after set_pte().
This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
Usual mem_cgroup_charge_common() does charge -> commit at a time.
And this patch also adds following function to clarify all charges.
- mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
called against newly allocated anon pages.
- mem_cgroup_charge_migrate_fixup()
called only from remove_migration_ptes().
we'll have to rewrite this later.(this patch just keeps old behavior)
This function will be removed by additional patch to make migration
clearer.
Good for clarifying "what we do"
Then, we have 4 following charge points.
- newpage
- swap-in
- add-to-cache.
- migration.
Serge E. Hallyn [Thu, 8 Jan 2009 02:07:46 +0000 (18:07 -0800)]
devices cgroup: allow mkfifo
The devcgroup_inode_permission() hook in the devices whitelist cgroup has
always bypassed access checks on fifos. But the mknod hook did not. The
devices whitelist is only about block and char devices, and fifos can't
even be added to the whitelist, so fifos can't be created at all except by
tasks which have 'a' in their whitelist (meaning they have access to all
devices).
Fix the behavior by bypassing access checks to mkfifo.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: James Morris <jmorris@namei.org> Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com> Cc: <stable@kernel.org> [2.6.27.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Menage [Thu, 8 Jan 2009 02:07:44 +0000 (18:07 -0800)]
cgroups: make cgroup_path() RCU-safe
Fix races between /proc/sched_debug by freeing cgroup objects via an RCU
callback. Thus any cgroup reference obtained from an RCU-safe source will
remain valid during the RCU section. Since dentries are also RCU-safe,
this allows us to traverse up the tree safely.
Additionally, make cgroup_path() check for a NULL cgrp->dentry to avoid
trying to report a path for a partially-created cgroup.
[lizf@cn.fujitsu.com: call deactive_super() in cgroup_diput()] Signed-off-by: Paul Menage <menage@google.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Thu, 8 Jan 2009 02:07:42 +0000 (18:07 -0800)]
cgroups: add inactive subsystems to rootnode.subsys_list
Though for an inactive hierarchy, we have subsys->root == &rootnode, but
rootnode's subsys_list is always empty.
This conflicts with the code in find_css_set():
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
...
if (ss->root->subsys_list.next == &ss->sibling) {
...
}
}
if (list_empty(&rootnode.subsys_list)) {
...
}
The above code assumes rootnode.subsys_list links all inactive
hierarchies.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Thu, 8 Jan 2009 02:07:41 +0000 (18:07 -0800)]
cgroups: make root_list contains active hierarchies only
Don't link rootnode to the root list, so root_list contains active
hierarchies only as the comment indicates. And rename for_each_root() to
for_each_active_root().
Also remove redundant check in cgroup_kill_sb().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Thu, 8 Jan 2009 02:07:40 +0000 (18:07 -0800)]
cgroups: remove rcu_read_lock() in cgroupstats_build()
cgroup_iter_* do not need rcu_read_lock().
In cgroup_enable_task_cg_lists(), do_each_thread() and while_each_thread()
are protected by RCU, it's OK, for write_lock(&css_set_lock) implies
rcu_read_lock() in non-RT kernel.
If we need explicit rcu_read_lock(), we should add rcu_read_lock() in
cgroup_enable_task_cg_lists(), not cgroup_iter_*.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Thu, 8 Jan 2009 02:07:39 +0000 (18:07 -0800)]
cgroups: call find_css_set() safely in cgroup_attach_task()
In cgroup_attach_task(), tsk maybe exit when we call find_css_set(). and
find_css_set() will access to invalid css_set.
This patch increases the count before get_css_set(), and decreases it
after find_css_set().
NOTE:
css_set's refcount is also taskcount, after this patch applied, taskcount
may be off-by-one WHEN cgroup_lock() is not held. but I reviewed other
code which use taskcount, they are still correct. No regression found by
reviewing and simply testing.
So I do not use two counters in css_set. (one counter for taskcount, the
other for refcount. like struct mm_struct) If this fix cause regression,
we will use two counters in css_set.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All these place we have held cgroup_lock() or we don't dereference to
struct cgroupfs_root. It's means wo don't need RCU when use struct
cgroup_subsys.root, and we should not put struct cgroupfs_root protected
by RCU.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Thu, 8 Jan 2009 02:07:36 +0000 (18:07 -0800)]
cgroups: fix cgroup_iter_next() bug
We access res->cgroups without the task_lock(), so res->cgroups may be
changed. it's unreliable, and "if (l == &res->cgroups->tasks)" may be
false forever.
We don't need add any lock for fixing this bug. we just access to struct
css_set by struct cg_cgroup_link, not by struct task_struct.
Since we hold css_set_lock, struct cg_cgroup_link is reliable.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Thu, 8 Jan 2009 02:07:33 +0000 (18:07 -0800)]
cgroups: remove some redundant NULL checks
- In cgroup_clone(), if vfs_mkdir() returns successfully,
dentry->d_fsdata will be the pointer to the newly created
cgroup and won't be NULL.
- a cgroup file's dentry->d_fsdata won't be NULL, guaranteed
by cgroup_add_file().
- When walking through the subsystems of a cgroup_fs (using
for_each_subsys), cgrp->subsys[ss->subsys_id] won't be NULL,
guaranteed by cgroup_create().
(Also remove 2 unused variables in cgroup_rmdir().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch make CGROUP related configs be a sub-menu and makes 1st level
configs of "General Setup" shorter.
including following additional changes
- add help comment about CGROUPS and GROUP_SCHED.
- moved MM_OWNER config to the bottom.
(for good indent in menuconfig)
Jan Kara [Thu, 8 Jan 2009 02:07:29 +0000 (18:07 -0800)]
quota: don't set grace time when user isn't above softlimit
do_set_dqblk() allowed SETDQBLK quotactl to set user's grace time even if
user was not above his softlimit. This does not make much sence and by
coincidence causes quota code to omit softlimit warning when user really
exceeds softlimit. This patch makes do_set_dqblk() reset user's grace
time if he has not exceeded softlimit.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
coda: fix fs/coda/sysctl.c build warnings when !CONFIG_SYSCTL
Fix
fs/coda/sysctl.c:14: warning: 'fs_table_header' defined but not used
fs/coda/sysctl.c:44: warning: 'fs_table' defined but not used
these are only used when CONFIG_SYSCTL is defined.
Signed-off-by: Richard A. Holden III <aciddeath@gmail.com> Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Duane Griffin [Thu, 8 Jan 2009 02:07:26 +0000 (18:07 -0800)]
ext3: tighten restrictions on inode flags
At the moment there are few restrictions on which flags may be set on
which inodes. Specifically DIRSYNC may only be set on directories and
IMMUTABLE and APPEND may not be set on links. Tighten that to disallow
TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
on non-regular file, non-directories.
Introduces a flags masking function which masks flags based on mode and
use it during inode creation and when flags are set via the ioctl to
facilitate future consistency.
Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Duane Griffin [Thu, 8 Jan 2009 02:07:26 +0000 (18:07 -0800)]
ext3: don't inherit inappropriate inode flags from parent
At present INDEX is the only flag that new ext3 inodes do NOT inherit from
their parent. In addition prevent the flags DIRTY, ECOMPR, IMAGIC and
TOPDIR from being inherited. List inheritable flags explicitly to prevent
future flags from accidentally being inherited.
This fixes the TOPDIR flag inheritance bug reported at
http://bugzilla.kernel.org/show_bug.cgi?id=9866.
Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pekka Enberg [Thu, 8 Jan 2009 02:07:25 +0000 (18:07 -0800)]
ext3: allocate ->s_blockgroup_lock separately
As spotted by kmemtrace, struct ext3_sb_info is 17152 bytes on 64-bit
which makes it a very bad fit for SLAB allocators. The culprit of the
wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
NR_CPUS >= 32.
To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2
page in the worst case, separately. This shinks down struct ext3_sb_info
enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of
32 KB saving 15 KB of memory.
Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Josef Bacik [Thu, 8 Jan 2009 02:07:24 +0000 (18:07 -0800)]
jbd: improve fsync batching
There is a flaw with the way jbd handles fsync batching. If we fsync() a
file and we were not the last person to run fsync() on this fs then we
automatically sleep for 1 jiffie in order to wait for new writers to join
into the transaction before forcing the commit. The problem with this is
that with really fast storage (ie a Clariion) the time it takes to commit
a transaction to disk is way faster than 1 jiffie in most cases, so
sleeping means waiting longer with nothing to do than if we just committed
the transaction and kept going. Ric Wheeler noticed this when using
fs_mark with more than 1 thread, the throughput would plummet as he added
more threads.
This patch attempts to fix this problem by recording the average time in
nanoseconds that it takes to commit a transaction to disk, and what time
we started the transaction. If we run an fsync() and we have been running
for less time than it takes to commit the transaction to disk, we sleep
for the delta amount of time and then commit to disk. We acheive
sub-jiffie sleeping using schedule_hrtimeout. This means that the wait
time is auto-tuned to the speed of the underlying disk, instead of having
this static timeout. I weighted the average according to somebody's
comments (Andreas Dilger I think) in order to help normalize random
outliers where we take way longer or way less time to commit than the
average. I also have a min() check in there to make sure we don't sleep
longer than a jiffie in case our storage is super slow, this was requested
by Andrew.
I unfortunately do not have access to a Clariion, so I had to use a
ramdisk to represent a super fast array. I tested with a SATA drive with
barrier=1 to make sure there was no regression with local disks, I tested
with a 4 way multipathed Apple Xserve RAID array and of course the
ramdisk. I ran the following command
Signed-off-by: Josef Bacik <jbacik@redhat.com> Cc: Andreas Dilger <adilger@sun.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Ric Wheeler <rwheeler@redhat.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Duane Griffin [Thu, 8 Jan 2009 02:07:21 +0000 (18:07 -0800)]
ext2: tighten restrictions on inode flags
At the moment there are few restrictions on which flags may be set on
which inodes. Specifically DIRSYNC may only be set on directories and
IMMUTABLE and APPEND may not be set on links. Tighten that to disallow
TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
on non-regular file, non-directories.
Introduces a flags masking function which masks flags based on mode and
use it during inode creation and when flags are set via the ioctl to
facilitate future consistency.
Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Duane Griffin [Thu, 8 Jan 2009 02:07:20 +0000 (18:07 -0800)]
ext2: don't inherit inappropriate inode flags from parent
At present BTREE/INDEX is the only flag that new ext2 inodes do NOT
inherit from their parent. In addition prevent the flags DIRTY, ECOMPR,
INDEX, IMAGIC and TOPDIR from being inherited. List inheritable flags
explicitly to prevent future flags from accidentally being inherited.
This fixes the TOPDIR flag inheritance bug reported at
http://bugzilla.kernel.org/show_bug.cgi?id=9866.
Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pekka J Enberg [Thu, 8 Jan 2009 02:07:19 +0000 (18:07 -0800)]
ext2: allocate ->s_blockgroup_lock separately
As spotted by kmemtrace, struct ext2_sb_info is 17024 bytes on 64-bit
which makes it a very bad fit for SLAB allocators. The culprit of the
wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
NR_CPUS >= 32.
To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2
page in the worst case, separately. This shinks down struct ext2_sb_info
enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of
32 KB saving 15 KB of memory.
Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jüri Reitel [Thu, 8 Jan 2009 02:07:16 +0000 (18:07 -0800)]
rtc-ds1307: remove legacy probe() checks
Remove RTC register value checks from the rtc-ds1307 probe() function.
They were left over from the legacy style I2C driver, which had to defend
against finding a non-RTC chip when the driver was probed.
Also fix a minor glitch in the alarm support: DS1307 chips don't have
alarms, so name those methods after one of the chips which actually *do*
have alarms (DS1337).
Alex Zeffertt [Thu, 8 Jan 2009 02:07:11 +0000 (18:07 -0800)]
xen: add xenfs to allow usermode <-> Xen interaction
The xenfs filesystem exports various interfaces to usermode. Initially
this exports a file to allow usermode to interact with xenbus/xenstore.
Traditionally this appeared in /proc/xen. Rather than extending procfs,
this patch adds a backward-compat mountpoint on /proc/xen, and provides
a xenfs filesystem which can be mounted there.
Signed-off-by: Alex Zeffertt <alex.zeffertt@eu.citrix.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Kleikamp [Thu, 8 Jan 2009 15:46:31 +0000 (09:46 -0600)]
async: Don't call async_synchronize_full_special() while holding sb_lock
sync_filesystems() shouldn't be calling async_synchronize_full_special
while holding a spinlock. The second while loop in that function is the
right place for this anyway.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Reported-by: Grissiom <chaos.proton@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Benjamin Krill [Wed, 7 Jan 2009 09:32:38 +0000 (10:32 +0100)]
serial: Add driver for the Cell Network Processor serial port NWP device
Add support for the nwp serial device which is connected to a DCR bus. It
uses the of_serial device driver to determine necessary properties from
the device tree. The supported device is added as serial port number 85.
NWP stands for network processor and it is part of the QPACE - Quantum
Chromodynamics Parallel Computing on the Cell Broadband Engine project.
The implementation is a lightweight uart implementation with the focus
to consume as little resources as possible and it is connected to a
DCR bus.
Signed-off-by: Benjamin Krill <ben@codiert.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Steven Rostedt [Tue, 6 Jan 2009 18:49:17 +0000 (18:49 +0000)]
powerpc: enable dynamic ftrace
This patch enables dynamic ftrace. The PowerPC port was dependent on
other code not yet in mainline. Now that the code is, we can now
let PowerPC compile with dynamic ftrace.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Vorontsov [Mon, 29 Dec 2008 06:40:35 +0000 (06:40 +0000)]
powerpc/mm: Make clear_fixmap() actually work
The clear_fixmap() routine issues map_page() with flags set to 0.
Currently this causes a BUG_ON() inside the map_page(), as it assumes
that a PTE should be clear before mapping.
This patch makes the map_page() to trigger the BUG_ON() only if the
flags were set.
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Acked-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Vorontsov [Wed, 17 Dec 2008 10:09:35 +0000 (10:09 +0000)]
powerpc/kdump: Use ppc_save_regs() in crash_setup_regs()
The patch replaces internal registers dump implementation with
ppc_save_regs(). From now on PPC64 and PPC32 are using the same
code for crash_setup_regs().
NOTE: The old regs dump implementation was capturing SP (r1) directly
as is, so you could see crash_kexec() function on top of the back-trace.
But ppc_save_regs() goes up one stack frame, so you'll not see it
anymore, at the top-level you'll see who actually triggered the crash
dump instead.
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Kumar Gala [Wed, 7 Jan 2009 05:00:05 +0000 (23:00 -0600)]
powerpc: Export cacheable_memzero as its now used in a driver
The Freescale PowerPC specific gianfar driver (gig-e) uses
cacheable_memzero for performance reasons we need to export
the symbol to allow the driver to be built as a module.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Ingo Molnar [Tue, 6 Jan 2009 14:06:28 +0000 (14:06 +0000)]
powerpc/pasemi: local_irq_save uses an unsigned long
[Split from a larger patch - sfr] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Ingo Molnar [Tue, 6 Jan 2009 14:01:23 +0000 (14:01 +0000)]
powerpc/cell: Use correct types in beat files
Only pass the address of a u64 if that is what the function requires.
[Split out of a larger patch - sfr]
[update comment - sfr] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Ingo Molnar [Tue, 6 Jan 2009 13:56:52 +0000 (13:56 +0000)]
powerpc: Use correct type in prom_init.c
tce_entryp is a "u64 *" not an "unsigned long *".
[Split from a large patch -sfr] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Geoff Levand [Tue, 6 Jan 2009 11:32:21 +0000 (11:32 +0000)]
mtd/ps3vram: Cleanup ps3vram driver messages
Cleanup the ps3vram driver messages. Add a new struct device pointer
variable dev to struct ps3vram_priv and use dev_dbg(), pr_dbg(), etc.
where appropriate.
Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Jim Paris [Tue, 6 Jan 2009 11:32:10 +0000 (11:32 +0000)]
mtd/ps3vram: Add ps3vram driver for accessing video RAM as MTD
Add ps3vram driver, which exposes unused video RAM on the PS3 as a MTD
device suitable for storage or swap. Fast data transfer is achieved
using a local cache in system RAM and DMA transfers via the GPU.
Signed-off-by: Vivien Chappelier <vivien.chappelier@free.fr> Signed-off-by: Jim Paris <jim@jtan.com> Acked-by: Geoff Levand <geoffrey.levand@am.sony.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Mohan Kumar M [Tue, 6 Jan 2009 00:23:01 +0000 (00:23 +0000)]
powerpc: Enable RELOCATABLE option for CRASH_DUMP
Enable RELOCATABLE option if user selects CRASH_DUMP option. Without this
patch user has to first select RELOCATABLE option and then has to enable
CRASH_DUMP option.
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Dave Liu [Tue, 30 Dec 2008 23:42:55 +0000 (23:42 +0000)]
powerpc: Remove the redundant _tlbil_pid at SMP case
Signed-off-by: Dave Liu <daveliu@freescale.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Stephen Rothwell [Tue, 30 Dec 2008 17:06:15 +0000 (17:06 +0000)]
powerpc/cell: Bitops work on unsigned longs
So change the flags member of struct spu from u64 to unsigned long.
This change will also prevent some warnings when we change u64 to unsigned
long long.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Matthias Fuchs [Tue, 30 Dec 2008 00:59:38 +0000 (00:59 +0000)]
powerpc: Add ioctls for RS485 mode control of serial drivers
These ioctls take a struct serial_rs485
(see linux/serial.h) as argument. They are already available
on x86. This patch adds them for the powerpc architecture.
Signed-off-by: Matthias Fuchs <mfuchs@ma-fu.de> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Paul Mackerras [Sun, 28 Dec 2008 14:12:57 +0000 (14:12 +0000)]
powerpc: Fix pciconfig_iobase system call on PCI-Express powermac
X has been failing to start on my quad G5 powermac since commit 1fd0f52583a85b21a394201b007bc1ee104b235d ("powerpc: Fix domain numbers
in /proc on 64-bit") went in. The reason is that the change allows X
to see the PCI-PCI bridge above the video card (previously it was
obscured by the fact that there were two "00" directories in
/proc/bus/pci), and the pciconfig_iobase system call on the bridge is
failing because of a hack that we have to return information about the
AGP bus when X asks about bus 0. This machine doesn't have an AGP bus
(it has PCI Express) and so the pciconfig_iobase call is returning -1,
which ultimately causes X to fail to start.
This fixes it by checking that we have an AGP bridge before
redirecting the pciconfig_iobase call to return information about the
AGP bus. With this, X starts successfully both on a quad G5 with
PCI Express and on an older dual G5 with AGP.
Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Nathan Lynch [Tue, 23 Dec 2008 18:55:54 +0000 (18:55 +0000)]
powerpc: Rewrite sysfs processor cache info code
The current code for providing processor cache information in sysfs
has the following deficiencies:
- several complex functions that are hard to understand
- implicit recursion (cache_desc_release -> kobject_put -> cache_desc_release)
- explicit recursion (create_cache_index_info)
- use of two per-cpu arrays when one would suffice
- duplication of work on systems where CPUs share cache
Also, when I looked at implementing support for a shared_cpu_map
attribute, it was pretty much impossible to handle hotplug without
checking every single online CPU's cache_desc list and fixing things
up... not that this is a hot path, but it would have introduced
O(n^2)-ish behavior during boot. Addressing this involved rethinking
the core data structures used, which didn't lend itself to an
incremental approach.
This implementation maintains a "forest" (potentially more than one
tree) of cache objects which reflects the system's cache topology.
Cache objects are instantiated as needed as CPUs come online. A
per-cpu array is used mainly for sysfs-related bookkeeping; the
objects in the array just point to the appropriate points in the
forest.
This maintains compatibility with the existing code and includes some
enhancements:
- Implement the shared_cpu_map attribute, which is essential for
enabling userspace to discover the system's overall cache topology.
- Use cache-block-size properties if cache-line-size is not available.
I chose to place this implementation in a new file since it would have
roughly doubled the size of sysfs.c, which is already kind of messy.
Signed-off-by: Nathan Lynch <ntl@pobox.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Grant Likely [Fri, 19 Dec 2008 14:57:20 +0000 (14:57 +0000)]
powerpc: Copy bootable images in the default install script
This patch makes the default install script (arch/powerpc/boot/install.sh)
copy the bootable image files into the install directory. Before this
patch only the vmlinux image file was copied.
This patch makes the default 'make install' command useful for embedded
development when $(INSTALL_PATH) is set in the environment.
As a side effect, this patch changes the calling convention of the
install.sh script. Instead of a single 5th parameter, the script is now
passed a list of all the target images stored in the $(image-y) Makefile
variable. This should be backwards compatible with existing install scripts
since it just adds additional arguments and does not change existing ones.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Acked-by: Josh Boyer <jwboyer@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Dave Hansen [Tue, 9 Dec 2008 08:21:35 +0000 (08:21 +0000)]
powerpc/mm: Make careful_allocation() return virtual addrs
Since we memset() the result in both of the uses here,
just make careful_alloc() return a virtual address.
Also, add a separate variable to store the physial
address that comes back from the lmb_alloc() functions.
This makes it less likely that someone will screw it up
forgetting to convert before returning since the vaddr
is always in a void* and the paddr is always in an
unsigned long.
I admit this is arbitrary since one of its users needs
a paddr and one a vaddr, but it does remove a good
number of casts.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we fail a bootmem allocation, the bootmem code itself
panics. No need to redo it here.
Also change the wording of the other panic. We don't
strictly have to allocate memory on the specified node.
It is just a hint and that node may not even *have* any
memory on it. In that case we can and do fall back to
other nodes.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Nicolas Palix [Wed, 3 Dec 2008 00:25:03 +0000 (00:25 +0000)]
powerpc/powermac: Add missing of_node_put
This patch fixes some unbalanced OF node references in the
powermac code
Signed-off-by: Nicolas Palix <npalix@diku.dk> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There's a problem on some embedded platforms when we re-assign
everything on PCI, such as 44x. The generic code tries to avoid
assigning devices to addresses overlapping the low legacy
addresses such as VGA hard decoded areas using constants that
are unfortunately no good for us, as they don't take into account
the address translation we do to access PCI busses.
Thus we end up allocating things like IO BARs to 0, which is
technically legal, but will shadow hard decoded ports for use
by things like VGA cards.
This works around it by attempting to reserve legacy regions
before we try to assign addresses.
NOTE: This may have nasty side effects in cases I haven't tested
yet:
- We try to use FW mappings (ie. powermac) and the FW has allocated
a conflicting address over those legacy regions. This will typically
happen. I would expect the new code to just fail with an informative
message without harm but I haven't had a chance to test that scenario
yet.
- A device with fixed BARs overlapping those legacy addresses such
as an IDE controller in legacy mode is in the system. I don't know
for sure yet what will happen there, I have to test :-)
Ideally, we should change PCIBIOS_MIN_IO/MIN_MEM accross the board
to take a bus pointer so they can provide appropriate per-bus translated
values to the generic code but that's a more invasive patch. I will
do that in the future, but in the meantime, this fixes the problem
locally
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6:
sparc64: Work around branch tracer warning.
sparc64: Fix unsigned long long warnings in drivers.
sparc64: Use unsigned long long for u64.
sparc: refactor code in fault_32.c
sparc64: refactor code in init_64.c
sparc64: refactor code in viohs.c
sparc: make proces_ver_nack a bit more readable
Linus Torvalds [Thu, 8 Jan 2009 01:21:24 +0000 (17:21 -0800)]
Merge branch 'for-2.6.29' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.29' of git://linux-nfs.org/~bfields/linux: (67 commits)
nfsd: get rid of NFSD_VERSION
nfsd: last_byte_offset
nfsd: delete wrong file comment from nfsd/nfs4xdr.c
nfsd: git rid of nfs4_cb_null_ops declaration
nfsd: dprint each op status in nfsd4_proc_compound
nfsd: add etoosmall to nfserrno
NFSD: FIDs need to take precedence over UUIDs
SUNRPC: The sunrpc server code should not be used by out-of-tree modules
svc: Clean up deferred requests on transport destruction
nfsd: fix double-locks of directory mutex
svc: Move kfree of deferral record to common code
CRED: Fix NFSD regression
NLM: Clean up flow of control in make_socks() function
NLM: Refactor make_socks() function
nfsd: Ensure nfsv4 calls the underlying filesystem on LOCKT
SUNRPC: Ensure the server closes sockets in a timely fashion
NFSD: Add documenting comments for nfsctl interface
NFSD: Replace open-coded integer with macro
NFSD: Fix a handful of coding style issues in write_filehandle()
NFSD: clean up failover sysctl function naming
...