Li Zefan [Thu, 2 Apr 2009 23:57:52 +0000 (16:57 -0700)]
cpuset: avoid changing cpuset's mems when errno returned
When writing to cpuset.mems, cpuset has to update its mems_allowed before
calling update_tasks_nodemask(), but this function might return -ENOMEM.
To avoid this rare case, we allocate the memory before changing
mems_allowed, and then pass to update_tasks_nodemask(). Similar to what
update_cpumask() does.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Thu, 2 Apr 2009 23:57:51 +0000 (16:57 -0700)]
cpuset: rewrite update_tasks_nodemask()
This patch uses cgroup_scan_tasks() to rebind tasks' vmas to new cpuset's
mems_allowed.
Not only simplify the code largely, but also avoid allocating an array to
hold mm pointers of all the tasks in the cpuset. This array can be big
(size > PAGESIZE) if we have lots of tasks in that cpuset, thus has a
chance to fail the allocation when under memory stress.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Thu, 2 Apr 2009 23:57:49 +0000 (16:57 -0700)]
cpuset: fix possible races in cpu/memory hotplug
Change to cpuset->cpus_allowed and cpuset->mems_allowed should be protected
by callback_mutex, otherwise the reader may read wrong cpus/mems. This is
cpuset's lock rule.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cgroups: use css id in swap cgroup for saving memory v5
Try to use CSS ID for records in swap_cgroup. By this, on 64bit machine,
size of swap_cgroup goes down to 2 bytes from 8bytes.
This means, when 2GB of swap is equipped, (assume the page size is 4096bytes)
From size of swap_cgroup = 2G/4k * 8 = 4Mbytes.
To size of swap_cgroup = 2G/4k * 2 = 1Mbytes.
Reduction is large. Of course, there are trade-offs. This CSS ID will
add overhead to swap-in/swap-out/swap-free.
But in general,
- swap is a resource which the user tend to avoid use.
- If swap is never used, swap_cgroup area is not used.
- Reading traditional manuals, size of swap should be proportional to
size of memory. Memory size of machine is increasing now.
I think reducing size of swap_cgroup makes sense.
Note:
- ID->CSS lookup routine has no locks, it's under RCU-Read-Side.
- memcg can be obsolete at rmdir() but not freed while refcnt from
swap_cgroup is available.
Changelog v4->v5:
- reworked on to memcg-charge-swapcache-to-proper-memcg.patch
Changlog ->v4:
- fixed not configured case.
- deleted unnecessary comments.
- fixed NULL pointer bug.
- fixed message in dmesg.
[nishimura@mxp.nes.nec.co.jp: css_tryget can be called twice in !PageCgroupUsed case] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This swap-in is one of the most complicated work. In do_swap_page(),
following events occur when pte is unchanged.
(1) the page (SwapCache) is looked up.
(2) lock_page()
(3) try_charge_swapin()
(4) reuse_swap_page() (may call delete_swap_cache())
(5) commit_charge_swapin()
(6) swap_free().
Considering following situation for example.
(A) The page has not been charged before (2) and reuse_swap_page()
doesn't call delete_from_swap_cache().
(B) The page has not been charged before (2) and reuse_swap_page()
calls delete_from_swap_cache().
(C) The page has been charged before (2) and reuse_swap_page() doesn't
call delete_from_swap_cache().
(D) The page has been charged before (2) and reuse_swap_page() calls
delete_from_swap_cache().
memory.usage/memsw.usage changes to this page/swp_entry will be
Case (A) (B) (C) (D)
Event
Before (2) 0/ 1 0/ 1 1/ 1 1/ 1
===========================================
(3) +1/+1 +1/+1 +1/+1 +1/+1
(4) - 0/ 0 - -1/ 0
(5) 0/-1 0/ 0 -1/-1 0/ 0
(6) - 0/-1 - 0/-1
===========================================
Result 1/ 1 1/ 1 1/ 1 1/ 1
In any cases, charges to this page should be 1/ 1.
In case of (D), mem_cgroup_try_get_from_swapcache() returns NULL
(because lookup_swap_cgroup() returns NULL), so "+1/+1" at (3) means
charges to the memcg("foo") to which the "current" belongs.
OTOH, "-1/0" at (4) and "0/-1" at (6) means uncharges from the memcg("baa")
to which the page has been charged.
So, if the "foo" and "baa" is different(for example because of task move),
this charge will be moved from "baa" to "foo".
I think this is an unexpected behavior.
This patch fixes this by modifying mem_cgroup_try_get_from_swapcache()
to return the memcg to which the swapcache has been charged if PCG_USED bit
is set.
IIUC, checking PCG_USED bit of swapcache is safe under page lock.
commit 4f98a2fee8acdb4ac84545df98cccecfd130f8db (vmscan: split LRU lists
into anon & file sets) removed mem_cgroup_reclaim_imbalance(), but there
are some leftovers in memcontrol.h.
Display memcg values like failcnt, usage and limit when an OOM occurs due
to memcg.
Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki,
Daisuke Nishimura and KOSAKI Motohiro for review.
Sample output
-------------
Task in /a/x killed as a result of limit of /a
memory: usage 1048576kB, limit 1048576kB, failcnt 4183
memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0
[akpm@linux-foundation.org: compilation fix]
[akpm@linux-foundation.org: fix kerneldoc and whitespace]
[akpm@linux-foundation.org: add printk facility level] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch tries to fix OOM Killer problems caused by hierarchy.
Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to
kill a task in memcg.
But, when hierarchy is used, it's broken and correct task cannot
be killed. For example, in following cgroup
/groupA/ hierarchy=1, limit=1G,
01 nolimit
02 nolimit
All tasks' memory usage under /groupA, /groupA/01, groupA/02 is limited to
groupA's 1Gbytes but OOM Killer just kills tasks in groupA.
This patch provides makes the bad process be selected from all tasks
under hierarchy. BTW, currently, oom_jiffies is updated against groupA
in above case. oom_jiffies of tree should be updated.
To see how oom_jiffies is used, please check mem_cgroup_oom_called()
callers.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: const fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memcg: fix shrinking memory to return -EBUSY by fixing retry algorithm
As pointed out, shrinking memcg's limit should return -EBUSY after
reasonable retries. This patch tries to fix the current behavior of
shrink_usage.
Before looking into "shrink should return -EBUSY" problem, we should fix
hierarchical reclaim code. It compares current usage and current limit,
but it only makes sense when the kernel reclaims memory because hit
limits. This is also a problem.
What this patch does are.
1. add new argument "shrink" to hierarchical reclaim. If "shrink==true",
hierarchical reclaim returns immediately and the caller checks the kernel
should shrink more or not.
(At shrinking memory, usage is always smaller than limit. So check for
usage < limit is useless.)
2. For adjusting to above change, 2 changes in "shrink"'s retry path.
2-a. retry_count depends on # of children because the kernel visits
the children under hierarchy one by one.
2-b. rather than checking return value of hierarchical_reclaim's progress,
compares usage-before-shrink and usage-after-shrink.
If usage-before-shrink <= usage-after-shrink, retry_count is
decremented.
Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clean up memory.stat file routine and show "total" hierarchical stat.
This patch does
- renamed get_all_zonestat to be get_local_zonestat.
- remove old mem_cgroup_stat_desc, which is only for per-cpu stat.
- add mcs_stat to cover both of per-cpu/per-lru stat.
- add "total" stat of hierarchy (*)
- add a callback system to scan all memcg under a root.
== "total" is added.
[kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat
cache 0
rss 0
pgpgin 0
pgpgout 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 50331648
hierarchical_memsw_limit 9223372036854775807
total_cache 65536
total_rss 192512
total_pgpgin 218
total_pgpgout 155
total_inactive_anon 0
total_active_anon 135168
total_inactive_file 61440
total_active_file 4096
total_unevictable 0
==
(*) maybe the user can do calc hierarchical stat by his own program
in userland but if it can be written in clean way, it's worth to be
shown, I think.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Thu, 2 Apr 2009 23:57:31 +0000 (16:57 -0700)]
debug cgroup: remove unneeded cgroup_lock
Since we are in cgroup write handler, so the cgrp is valid, so we don't
have to hold cgroup_mutex when calling cgroup_task_count(). One similar
example is in cgroup_tasks_open().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Thu, 2 Apr 2009 23:57:30 +0000 (16:57 -0700)]
cgroups: don't change release_agent when remount failed
Remount can fail in either case:
- wrong mount options is specified, or option 'noprefix' is changed.
- a to-be-added subsys is already mounted/active.
When using remount to change 'release_agent', for the above former failure
case, remount will return errno with release_agent unchanged, but for the
latter case, remount will return EBUSY with relase_agent changed, which is
unexpected I think:
# mount -t cgroup -o cpu xxx /cgrp1
# mount -t cgroup -o cpuset,release_agent=agent1 yyy /cgrp2
# cat /cgrp2/release_agent
agent1
# mount -t cgroup -o remount,cpuset,noprefix,release_agent=agent2 yyy /cgrp2
mount: /cgrp2 not mounted already, or bad option
# cat /cgrp2/release_agent
agent1 <-- ok
# mount -t cgroup -o remount,cpu,cpuset,release_agent=agent2 yyy /cgrp2
mount: /cgrp2 is busy
# cat /cgrp2/release_agent
agent2 <-- unexpected!
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Thu, 2 Apr 2009 23:57:29 +0000 (16:57 -0700)]
cgroups: show correct file mode
We have some read-only files and write-only files, but currently they are
all set to 0644, which is counter-intuitive and cause trouble for some
cgroup tools like libcgroup.
This patch adds 'mode' to struct cftype to allow cgroup subsys to set it's
own files' file mode, and for the most cases cft->mode can be default to 0
and cgroup will figure out proper mode.
Acked-by: Paul Menage <menage@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/groupA use_hierarchy==1
/01 some tasks
/02 some tasks
/03 some tasks
/04 empty
When tasks under 01/02/03 hit limit on /groupA, hierarchical reclaim
is triggered and the kernel walks tree under groupA. In this case,
rmdir /groupA/04 fails with -EBUSY frequently because of temporal
refcnt from the kernel.
In general. cgroup can be rmdir'd if there are no children groups and
no tasks. Frequent fails of rmdir() is not useful to users.
(And the reason for -EBUSY is unknown to users.....in most cases)
This patch tries to modify above behavior, by
- retries if css_refcnt is got by someone.
- add "return value" to pre_destroy() and allows subsystem to
say "we're really busy!"
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.
This patch attaches unique ID to each css and provides following.
- css_lookup(subsys, id)
returns pointer to struct cgroup_subysys_state of id.
- css_get_next(subsys, id, rootid, depth, foundid)
returns the next css under "root" by scanning
When cgroup_subsys->use_id is set, an id for css is maintained.
The cgroup framework only parepares
- css_id of root css for subsys
- id is automatically attached at creation of css.
- id is *not* freed automatically. Because the cgroup framework
don't know lifetime of cgroup_subsys_state.
free_css_id() function is provided. This must be called by subsys.
There are several reasons to develop this.
- Saving space .... For example, memcg's swap_cgroup is array of
pointers to cgroup. But it is not necessary to be very fast.
By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
reduce much amount of memory usage.
- Scanning without lock.
CSS_ID provides "scan id under this ROOT" function. By this, scanning
css under root can be written without locks.
ex)
do {
rcu_read_lock();
next = cgroup_get_next(subsys, id, root, &found);
/* check sanity of next here */
css_tryget();
rcu_read_unlock();
id = found + 1
} while(...)
Characteristics:
- Each css has unique ID under subsys.
- Lifetime of ID is controlled by subsys.
- css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
- Allowed ID is 1-65535, ID 0 is UNUSED ID.
Design Choices:
- scan-by-ID v.s. scan-by-tree-walk.
As /proc's pid scan does, scan-by-ID is robust when scanning is done
by following kind of routine.
scan -> rest a while(release a lock) -> conitunue from interrupted
memcg's hierarchical reclaim does this.
- When subsys->use_id is set, # of css in the system is limited to
65535.
[bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Grzegorz Nosek [Thu, 2 Apr 2009 23:57:23 +0000 (16:57 -0700)]
cgroups: relax ns_can_attach checks to allow attaching to grandchild cgroups
The ns_proxy cgroup allows moving processes to child cgroups only one
level deep at a time. This commit relaxes this restriction and makes it
possible to attach tasks directly to grandchild cgroups, e.g.:
($pid is in the root cgroup)
echo $pid > /cgroup/CG1/CG2/tasks
Previously this operation would fail with -EPERM and would have to be
performed as two steps:
echo $pid > /cgroup/CG1/tasks
echo $pid > /cgroup/CG1/CG2/tasks
Also, the target cgroup no longer needs to be empty to move a task there.
Signed-off-by: Grzegorz Nosek <root@localdomain.pl> Acked-by: Serge Hallyn <serue@us.ibm.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Xiaodong [Thu, 2 Apr 2009 23:57:21 +0000 (16:57 -0700)]
documentation: fix unix_dgram_qlen description
Previous description about system parameter in /proc/sys/net/unix/ is
wrong (or missed). Simply add a new description about unix_dgram_qlen
according to latest kernel.
Signed-off-by: Li Xiaodong <lixd@cn.fujitsu.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
documentation: update Documentation/filesystem/proc.txt and Documentation/sysctls
Now /proc/sys is described in many places and much information is
redundant. This patch updates the proc.txt and move the /proc/sys
desciption out to the files in Documentation/sysctls.
Details are:
merge
- 2.1 /proc/sys/fs - File system data
- 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
- 2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface
with Documentation/sysctls/fs.txt.
remove
- 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
since it's not better then the Documentation/binfmt_misc.txt.
merge
- 2.3 /proc/sys/kernel - general kernel parameters
with Documentation/sysctls/kernel.txt
remove
- 2.5 /proc/sys/dev - Device specific parameters
since it's obsolete the sysfs is used now.
remove
- 2.6 /proc/sys/sunrpc - Remote procedure calls
since it's not better then the Documentation/sysctls/sunrpc.txt
move
- 2.7 /proc/sys/net - Networking stuff
- 2.9 Appletalk
- 2.10 IPX
to newly created Documentation/sysctls/net.txt.
remove
- 2.8 /proc/sys/net/ipv4 - IPV4 settings
since it's not better then the Documentation/networking/ip-sysctl.txt.
add
- Chapter 3 Per-Process Parameters
to descibe /proc/<pid>/xxx parameters.
Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hppfs_read_file() may return (ssize_t) -ENOMEM, or -EFAULT. When stored
in size_t 'count', these errors will not be noticed, a large value will be
added to *ppos.
Jan Kara [Thu, 2 Apr 2009 23:57:17 +0000 (16:57 -0700)]
ext3: avoid false EIO errors
Sometimes block_write_begin() can map buffers in a page but later we
fail to copy data into those buffers (because the source page has been
paged out in the mean time). We then end up with !uptodate mapped
buffers. To add a bit more to the confusion, block_write_end() does
not commit any data (and thus does not any mark buffers as uptodate) if
we didn't succeed with copying all the data.
Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new
aops) missed these cases and thus we were inserting non-uptodate
buffers to transaction's list which confuses JBD code and it reports IO
errors, aborts a transaction and generally makes users afraid about
their data ;-P.
This patch fixes the problem by reorganizing ext3_..._write_end() code
to first call block_write_end() to mark buffers with valid data
uptodate and after that we file only uptodate buffers to transaction's
lists.
We also fix a problem where we could leave blocks allocated beyond i_size
(i_disksize in fact) because of failed write. We now add inode to orphan
list when write fails (to be safe in case we crash) and then truncate blocks
beyond i_size in a separate transaction.
Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext3: return -EIO not -ESTALE on directory traversal through deleted inode
ext3_iget() returns -ESTALE if invoked on a deleted inode, in order to
report errors to NFS properly. However, in ext[234]_lookup(), this
-ESTALE can be propagated to userspace if the filesystem is corrupted such
that a directory entry references a deleted inode. This leads to a
misleading error message - "Stale NFS file handle" - and confusion on the
part of the admin.
The bug can be easily reproduced by creating a new filesystem, making a
link to an unused inode using debugfs, then mounting and attempting to ls
-l said link.
This patch thus changes ext3_lookup to return -EIO if it receives -ESTALE
from ext3_iget(), as ext3 does for other filesystem metadata corruption;
and also invokes the appropriate ext*_error functions when this case is
detected.
Cyrus Massoumi [Thu, 2 Apr 2009 23:57:12 +0000 (16:57 -0700)]
ext3: remove the BKL in ext3/ioctl.c
Reformat ext3/ioctl.c to make it look more like ext4/ioctl.c and remove
the BKL around ext3_ioctl().
Signed-off-by: Cyrus Massoumi <cyrusm@gmx.net> Cc: <linux-ext4@vger.kernel.org> Acked-by: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michael Buesch [Thu, 2 Apr 2009 23:57:07 +0000 (16:57 -0700)]
spi-gpio: allow operation without CS signal
Change spi-gpio so that it is possible to drive SPI communications over
GPIO without the need for a chipselect signal.
This is useful in very small setups where there's only one slave device
on the bus.
This patch does not affect existing setups.
I use this for a tiny communication channel between an embedded device and
a microcontroller. There are not enough GPIOs available for chipselect
and it's not needed anyway in this case.
Signed-off-by: Michael Buesch <mb@bu3sch.de> Cc: David Brownell <david-b@pacbell.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Brownell [Thu, 2 Apr 2009 23:57:06 +0000 (16:57 -0700)]
gpio: gpio_{request,free}() now required (feature removal)
We want to phase out the GPIO "autorequest" mechanism in gpiolib and
require all callers to use gpio_request().
- Update feature-removal-schedule
- Update the documentation now
- Convert the relevant pr_warning() in gpiolib to a WARN()
so folk using this mechanism get a noisy stack dump
Some drivers and board init code will probably need to change.
Implementations not using gpiolib will still be fine; they are already
required to implement gpio_{request,free}() stubs.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow GPIOs in GPIOLIB chips to be named. This name is then used when the
GPIO is exported to sysfs, although it could be used elsewhere if deemed
useful.
Signed-off-by: Daniel Silverstone <dsilvers@simtec.co.uk> Cc: David Brownell <david-b@pacbell.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simon Kitching [Thu, 2 Apr 2009 23:57:00 +0000 (16:57 -0700)]
initramfs: prevent initramfs printk message being split by messages from other code.
initramfs uses printk without a linefeed, then does some work, then uses
printk to finish the message off. However if some other code does a
printk in between, then the messages get mixed together. Better for each
message to be an independent line...
Example of problem that this fixes:
checking if image is initramfs...<7>Switched to high resolution mode on CPU 1
Switched to high resolution mode on CPU 0
it is
Signed-off-by: Simon Kitching <skitching@apache.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Brownell [Thu, 2 Apr 2009 23:56:58 +0000 (16:56 -0700)]
memory_accessor: implement the new memory_accessor interfaces for SPI EEPROMs
- Define new setup() hook to export the accessor
- Implement accessor methods
Moves some error checking out of the sysfs interface code into the layer
below it, which is now shared by both sysfs and memory access code.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kevin Hilman [Thu, 2 Apr 2009 23:56:57 +0000 (16:56 -0700)]
memory_accessor: implement the new memory_accessor interface for I2C EEPROM
In the case of at24, the platform code registers a 'setup' callback with
the at24_platform_data. When the at24 driver detects an EEPROM, it fills
out the read and write functions of the memory_accessor and calls the
setup callback passing the memory_accessor struct. The platform code can
then use the read/write functions in the memory_accessor struct for
reading and writing the EEPROM.
Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com> Cc: David Brownell <dbrownell@users.sourceforge.net> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kevin Hilman [Thu, 2 Apr 2009 23:56:56 +0000 (16:56 -0700)]
memory_accessor: new interface for reading/writing persistent memory
Add an interface by which other kernel code can read/write persistent
memory such as I2C or SPI EEPROMs, or devices which provide NVRAM. Use
cases include storage of board-specific configuration data like Ethernet
addresses and sensor calibrations.
Original idea, review and improvement suggestions by David Brownell.
Acked-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jean Delvare [Thu, 2 Apr 2009 23:56:54 +0000 (16:56 -0700)]
workqueue: add to_delayed_work() helper function
It is a fairly common operation to have a pointer to a work and to need a
pointer to the delayed work it is contained in. In particular, all
delayed works which want to rearm themselves will have to do that. So it
would seem fair to offer a helper function for this operation.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jean Delvare <khali@linux-fr.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Greg KH <greg@kroah.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
arch/um/kernel/syscall.c: In function 'kernel_execve':
arch/um/kernel/syscall.c:130: warning: passing argument 1 of 'um_execve' discards qualifiers from pointer target type
arch/um/kernel/syscall.c:130: warning: passing argument 2 of 'um_execve' discards qualifiers from pointer target type
arch/um/kernel/syscall.c:130: warning: passing argument 3 of 'um_execve' discards qualifiers from pointer target type
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Jeff Dike <jdike@addtoit.com> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
uml: fix link error from prefixing of i386 syscalls with ptregs_
Fix the following link error:
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x11c): undefined reference to `ptregs_fork'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x140): undefined reference to `ptregs_execve'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x2cc): undefined reference to `ptregs_iopl'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x2d8): undefined reference to `ptregs_vm86old'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x2f0): undefined reference to `ptregs_sigreturn'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x2f4): undefined reference to `ptregs_clone'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x3ac): undefined reference to `ptregs_vm86'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x3c8): undefined reference to `ptregs_rt_sigreturn'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x3fc): undefined reference to `ptregs_sigaltstack'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x40c): undefined reference to `ptregs_vfork'
This was introduced by commit 253f29a4, "x86: pass in pt_regs pointer
for syscalls that need it"
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Brian Gerst <brgerst@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jeff Dike <jdike@addtoit.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
uml: fix compile error from net_device_ops conversion
Fix the following compile error:
arch/um/drivers/net_kern.c: In function 'uml_inetaddr_event':
arch/um/drivers/net_kern.c:760: error: 'struct net_device' has no member named 'open'
This was introduced by commit 8bb95b39, "uml: convert network device
to netdevice ops".
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: David S. Miller <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The missing device table means that the floppy module is not auto-loaded,
even when the appropriate PNP device (0700) is found.
We don't actually use the table in the module, since the device doesn't
have a struct pnp_driver, but it's sufficient to cause an alias in the
module that udev/modprobe will use.
Signed-off-by: Scott James Remnant <scott@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Philippe De Muyter <phdm@macqel.be> Acked-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A new "address_space flag"--AS_MM_ALL_LOCKS--was defined to use the next
available AS flag while the Unevictable LRU was under development. The
Unevictable LRU was using the same flag and "no one" noticed. Current
mainline, since 2.6.28, has same value for two symbolic flag names.
So, define a unique flag value for AS_UNEVICTABLE--up close to the other
flags, [at the cost of an additional #ifdef] so we'll notice next time.
Note that #ifdef is not actually required, if we don't mind having the
unused flag value defined.
Replace #defines with an enum.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: <stable@kernel.org> [2.6.28.x, 2.6.29.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The calculation of the value nr in do_xip_mapping_read is incorrect. If
the copy required more than one iteration in the do while loop the copies
variable will be non-zero. The maximum length that may be passed to the
call to copy_to_user(buf+copied, xip_mem+offset, nr) is len-copied but the
check only compares against (nr > len).
This bug is the cause for the heap corruption Carsten has been chasing
for so long:
With this bug fix the commit 0e4a9b59282914fe057ab17027f55123964bc2e2
"ext2/xip: refuse to change xip flag during remount with busy inodes" can
be removed again.
Cc: Carsten Otte <cotte@de.ibm.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Anton Blanchard [Thu, 2 Apr 2009 23:56:39 +0000 (16:56 -0700)]
mm: align vmstat_work's timer
Even though vmstat_work is marked deferrable, there are still benefits to
aligning it. For certain applications we want to keep OS jitter as low as
possible and aligning timers and work so they occur together can reduce
their overall impact.
Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jeff Layton [Thu, 2 Apr 2009 23:56:37 +0000 (16:56 -0700)]
writeback: guard against jiffies wraparound on inode->dirtied_when checks (try #3)
The dirtied_when value on an inode is supposed to represent the first time
that an inode has one of its pages dirtied. This value is in units of
jiffies. It's used in several places in the writeback code to determine
when to write out an inode.
The problem is that these checks assume that dirtied_when is updated
periodically. If an inode is continuously being used for I/O it can be
persistently marked as dirty and will continue to age. Once the time
compared to is greater than or equal to half the maximum of the jiffies
type, the logic of the time_*() macros inverts and the opposite of what is
needed is returned. On 32-bit architectures that's just under 25 days
(assuming HZ == 1000).
As the least-recently dirtied inode, it'll end up being the first one that
pdflush will try to write out. sync_sb_inodes does this check:
/* Was this inode dirtied after sync_sb_inodes was called? */
if (time_after(inode->dirtied_when, start))
break;
...but now dirtied_when appears to be in the future. sync_sb_inodes bails
out without attempting to write any dirty inodes. When this occurs,
pdflush will stop writing out inodes for this superblock. Nothing can
unwedge it until jiffies moves out of the problematic window.
This patch fixes this problem by changing the checks against dirtied_when
to also check whether it appears to be in the future. If it does, then we
consider the value to be far in the past.
This should shrink the problematic window of time to such a small period
(30s) as not to matter.
Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Ian Kent <raven@themaw.net> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
clear_inode() will switch inode state from I_FREEING to I_CLEAR, and do so
_outside_ of inode_lock. So any I_FREEING testing is incomplete without a
coupled testing of I_CLEAR.
So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
add_dquot_ref().
Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
reminds fixing the other two cases.
Masayoshi MIZUMA has a nice panic flow:
=====================================================================
[process A] | [process B]
| |
| prune_icache() | drop_pagecache()
| spin_lock(&inode_lock) | drop_pagecache_sb()
| inode->i_state |= I_FREEING; | |
| spin_unlock(&inode_lock) | V
| | | spin_lock(&inode_lock)
| V | |
| dispose_list() | |
| list_del() | |
| clear_inode() | |
| inode->i_state = I_CLEAR | |
| | | V
| | | if (inode->i_state & (I_FREEING|I_WILL_FREE))
| | | continue; <==== NOT MATCH
| | |
| | | (DANGER from here on! Accessing disposing inode!)
| | |
| | | __iget()
| | | list_move() <===== PANIC on poisoned list !!
V V |
(time)
=====================================================================
Reported-by: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Thu, 2 Apr 2009 23:56:32 +0000 (16:56 -0700)]
nommu: fix a number of issues with the per-MM VMA patch
Fix a number of issues with the per-MM VMA patch:
(1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
a NOMMU system with more than 2G pages. Makes no difference on a 32-bit
system.
(2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
lest it overflow.
(3) Move the allocation of the vm_area_struct slab back for fork.c.
(4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.
(5) Use BUG_ON() rather than if () BUG().
(6) Make the default validate_nommu_regions() a static inline rather than a
#define.
(7) Make free_page_series()'s objection to pages with a refcount != 1 more
informative.
(8) Adjust the __put_nommu_region() banner comment to indicate that the
semaphore must be held for writing.
(9) Limit the number of warnings about munmaps of non-mmapped regions.
Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Thu, 2 Apr 2009 23:56:30 +0000 (16:56 -0700)]
generic debug pagealloc: build fix
This fixes a build failure with generic debug pagealloc:
mm/debug-pagealloc.c: In function 'set_page_poison':
mm/debug-pagealloc.c:8: error: 'struct page' has no member named 'debug_flags'
mm/debug-pagealloc.c: In function 'clear_page_poison':
mm/debug-pagealloc.c:13: error: 'struct page' has no member named 'debug_flags'
mm/debug-pagealloc.c: In function 'page_poison':
mm/debug-pagealloc.c:18: error: 'struct page' has no member named 'debug_flags'
mm/debug-pagealloc.c: At top level:
mm/debug-pagealloc.c:120: error: redefinition of 'kernel_map_pages'
include/linux/mm.h:1278: error: previous definition of 'kernel_map_pages' was here
mm/debug-pagealloc.c: In function 'kernel_map_pages':
mm/debug-pagealloc.c:122: error: 'debug_pagealloc_enabled' undeclared (first use in this function)
by fixing
- debug_flags should be in struct page
- define DEBUG_PAGEALLOC config option for all architectures
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ilpo Järvinen [Wed, 1 Apr 2009 23:18:20 +0000 (23:18 +0000)]
tcp: miscounts due to tcp_fragment pcount reset
It seems that trivial reset of pcount to one was not sufficient
in tcp_retransmit_skb. Multiple counters experience a positive
miscount when skb's pcount gets lowered without the necessary
adjustments (depending on skb's sacked bits which exactly), at
worst a packets_out miscount can crash at RTO if the write queue
is empty!
Triggering this requires mss change, so bidir tcp or mtu probe or
like.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Tested-by: Uwe Bugla <uwe.bugla@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Jan Dumon [Wed, 1 Apr 2009 22:59:07 +0000 (22:59 +0000)]
hso: fix for the 'invalid frame length' messages
Some devices cannot send very short usb transfers. To get around this the
firmware adds a known pattern and flags the driver that it should check for
this pattern on short transfers. This flag was not taken into account by
the driver.
Signed-off-by: Jan Dumon <j.dumon@option.com> Signed-off-by: David S. Miller <davem@davemloft.net>
EDIDs should be backward compatible, so don't bail if we see a version
of 3 (which is out there now) and print a message if we see something
newer, but allow it to be parsed.
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Dave Airlie <airlied@redhat.com>
drm: sync the mode validation for INTERLACE/DBLSCAN
Check whether the INTERLACE/DBLSCAN is supported by output device. If
not, the mode containing the flag of INTERLACE/DBLSCAN will be marked
as unsupported.
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Add EXPORT_SYMBOL_GPL(fsl_pq_mdio_bus_name) for module builds
Signed-off-by: Segher Boessenkool <segher@kernel.crashing.org> Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Anton Vorontsov [Tue, 31 Mar 2009 08:33:52 +0000 (08:33 +0000)]
fsl_pq_mdio: Revive UCC MDIO support
commit 1577ecef766650a57fceb171acee2b13cbfaf1d3 ("netdev: Merge UCC
and gianfar MDIO bus drivers") introduced a regression so that UCC
MDIO buses no longer work.
This is because fsl_pq_mdio driver wrongly masks all non-TBI PHYs
for !fsl,gianfar-mdio buses, while it should do that only for
fsl,gianfar-tbi buses.
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Anton Vorontsov [Thu, 2 Apr 2009 08:26:07 +0000 (01:26 -0700)]
ucc_geth: Pass proper device to DMA routines, otherwise oops happens
The driver should pass a device that actually specifies internal DMA
ops, but currently it passes netdev's device, which is wrong and that
causes following oops:
This driver contains experimental NAPI code disabled by default.
The commit bea3348ee ("[NET]: Make NAPI polling independent of struct
net_device objects.") converted the NAPI path of this driver but that
conversion was not complete. This patch fixes a build error
introduced by the commit.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
GRO assumes that there is a one-to-one relationship between NAPI
structure and network device. Some devices like sky2 share multiple
devices on a single interrupt so only have one NAPI handler. Rather than
split GRO from NAPI, just have GRO assume if device changes that
it is a different flow.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Original comment (Karsten):
On a MSI MS-6702E mainboard, when in rtl8169_init_one() for the first time
after BIOS has run, IntrStatus reads 5 after chip has been reset.
IntrStatus should equal 0 there, so patch changes IntrStatus reset to happen
after chip reset instead of before.
Remark (Francois):
Assuming that the loglevel of the driver is increased above NETIF_MSG_INTR,
the bug reveals itself with a typical "interrupt 0025 in poll" message
at startup. In retrospect, the message should had been read as an hint of
an unexpected hardware state several months ago :o(
Fixes (at least part of) https://bugzilla.redhat.com/show_bug.cgi?id=460747
Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de> Signed-off-by: Francois Romieu <romieu@fr.zoreil.com> Tested-by: Josep <josep.puigdemont@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ixgbe: Fix potential memory leak/driver panic issue while setting up Tx & Rx ring parameters
While setting up the ring parameters using ethtool the driver can
panic or leak memory as ixgbe_open tries to setup tx & rx resources.
The updated logic will use ixgbe_down/up after successful allocation of
tx & rx resources
Signed-off-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CC: stable@kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
Don Skidmore [Tue, 31 Mar 2009 21:35:05 +0000 (21:35 +0000)]
ixgbe: fix ethtool -A|a behavior
We were basicly ignoring ethtool users request for FC autoneg
and replying to queries with a "best guess". This patch
enables the driver to store if we want to enable/disable
autoneg FC and do the correct behavior.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ixgbe: Patch to fix driver panic while freeing up tx & rx resources
When network interface is made active we were not handling the error
scenarios properly to clean up rx & tx resources which might result in
a driver panic.
Signed-off-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com> Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Tue, 31 Mar 2009 21:34:23 +0000 (21:34 +0000)]
ixgbe: refactor tx buffer processing to use skb_dma_map/unmap
This patch resolves an issue with map single being used to map a buffer and
then unmap page being used to unmap it. In addition it handles any error
conditions that may be detected using skb_dma_map.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com> Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
PJ Waskiewicz [Tue, 31 Mar 2009 21:34:05 +0000 (21:34 +0000)]
ixgbe: Fix 82598 MSI-X allocation on systems with more than 8 CPU cores
MSI-X allocation broke after the 82599 merge on systems with more than 8
CPU cores. 82598 drops back into MSI mode, which isn't sufficient to run
full, efficient 10G line rate.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Don Skidmore [Tue, 31 Mar 2009 21:33:44 +0000 (21:33 +0000)]
ixgbe: feature - driver to default with FC on.
In the past flow control wasn't enabled by default under the
incorrect assumption that this opened up us to a denial of
service attack. However since any switch that forwarded flow
control would be extremely msiconfigured and/or buggy, this
concern no longer out weighs the preformance gains from
having FC enabled.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Acked-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com> Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
PJ Waskiewicz [Tue, 31 Mar 2009 21:33:25 +0000 (21:33 +0000)]
ixgbe: Fix DCB netlink layer for 82599 to enable Priority Flow Control
The priority flow control settings from the netlink layer aren't taking
effect in the base driver. The boolean pfc_mode_enable in the dcb_config
struct isn't being set, so the hardware configuration code is never
reached.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Don Skidmore [Tue, 31 Mar 2009 21:33:02 +0000 (21:33 +0000)]
ixgbe: Fix ethtool output with advertised mode.
Ethtool tries to get advertised speed from phy.autoneg_advertised.
However for copper media this wasn't happening until later do to
an other fix which moved mac.ops.setup_link_speed placement in
ixgbe_link_config(). This patch will display the default advertised
speeds if it can't yet get this information from phy.autoneg_advertised.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Acked-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com> Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Tue, 31 Mar 2009 21:32:42 +0000 (21:32 +0000)]
ixgbe: fix build when DEBUG is defined
The ixgbe driver had issues when DEBUG was defined because the hw_dbg macro
was incomplete. This patch completes the code based off of the code that
already existed in the igb module.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>