git.karo-electronics.de Git - karo-tx-linux.git/log
karo-tx-linux.git
14 years ago Merge branch 'master' into for-2.6.33
Jens Axboe [Thu, 3 Dec 2009 12:49:39 +0000 (13:49 +0100)]
Merge branch 'master' into for-2.6.33

14 years ago GFS2: Fix glock refcount issues
Steven Whitehouse [Fri, 27 Nov 2009 10:31:11 +0000 (10:31 +0000)]
GFS2: Fix glock refcount issues

This patch fixes some ref counting issues. Firstly, by moving
the point at which we drop the ref count after a dlm lock
operation has completed, we ensure that we never call
gfs2_glock_hold() on a lock with a zero ref count.

Secondly, by using atomic_dec_and_lock() in gfs2_glock_put()
we ensure that at no time will a glock with zero ref count
appear on the lru_list. That means that we can remove the
check for this in our shrinker (which was racy).
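
The dec-and-lock idea described above can be sketched in userspace C. This is only an illustration of the pattern, not the GFS2 code: struct obj, lru_lock, dec_and_lock() and obj_put() are hypothetical names, and a C11 atomic_flag spinlock stands in for the kernel spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a glock: a reference count plus LRU membership. */
struct obj {
	atomic_int ref;
	bool on_lru;
};

/* Hypothetical spinlock protecting LRU membership. */
static atomic_flag lru_lock = ATOMIC_FLAG_INIT;

static void spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* busy-wait until the holder clears the flag */
}

static void spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/*
 * Userspace analogue of the dec-and-lock pattern: decrement the count and
 * return true, with the lock held, only if the count reached zero.
 */
static bool dec_and_lock(atomic_int *cnt, atomic_flag *lock)
{
	int old = atomic_load(cnt);

	/* Fast path: not the last reference, so no lock is needed. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(cnt, &old, old - 1))
			return false;
	}

	/* Slow path: this may be the last reference; decide under the lock. */
	spin_lock(lock);
	if (atomic_fetch_sub(cnt, 1) == 1)
		return true;	/* count is now zero, lock stays held */
	spin_unlock(lock);
	return false;
}

/* Drop a reference; returns true if this call dropped the last one. */
static bool obj_put(struct obj *o)
{
	if (!dec_and_lock(&o->ref, &lru_lock))
		return false;
	/* Zero was reached with lru_lock held: unlink before anyone can see it. */
	o->on_lru = false;
	spin_unlock(&lru_lock);
	return true;
}
```

Because the transition to zero happens only while the LRU lock is held, a walker that takes the same lock should never observe a zero-count object on the list, which is why a separate zero check in the walker becomes unnecessary in this sketch.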

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago writeback: remove unused nonblocking and congestion checks (gfs2)
Wu Fengguang [Wed, 18 Nov 2009 10:09:41 +0000 (18:09 +0800)]
writeback: remove unused nonblocking and congestion checks (gfs2)

No one is calling wb_writeback and write_cache_pages with
wbc.nonblocking=1 any more. And lumpy pageout will want to do
nonblocking writeback without the congestion wait.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: drop rindex glock to refresh rindex list
Benjamin Marzinski [Tue, 10 Nov 2009 18:54:56 +0000 (12:54 -0600)]
GFS2: drop rindex glock to refresh rindex list

When a gfs2 filesystem is grown, it needs to rebuild the rindex list to be able
to use the new space.  gfs2 does this when the rindex is marked not uptodate,
which happens when the rindex glock is dropped.  However, on a single node
setup, there is never any reason to drop the rindex glock, so gfs2 never
invalidates the rindex. This patch makes gfs2 automatically drop the
rindex glock after the filesystem grows, so it can refresh the rindex list.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Tag all metadata with jid
Steven Whitehouse [Fri, 6 Nov 2009 16:20:51 +0000 (16:20 +0000)]
GFS2: Tag all metadata with jid

There are two spare fields in the header common to all GFS2
metadata. One is just the right size to fit a journal id
in it, and this patch updates the journal code so that each
time a metadata block is modified, we tag it with the journal
id of the node which is performing the modification.

The reason for this is that it should make it much easier to
debug issues which arise if we can tell which node was the
last to modify a particular metadata block.

Since the field is updated before the block is written into
the journal, each journal should only contain metadata which
is tagged with its own journal id. The one exception to this
is the journal header block, which might have a different node's
id in it, if that journal was recovered by another node in the
cluster.

Thus each journal will contain a record of which nodes recovered
it, via the journal header.

The other field in the metadata header could potentially be
used to hold information about what kind of operation was
performed, but for the time being we just zero it on each
transaction so that if we use it for that in future, we'll
know that the information (where it exists) is reliable.

I did consider using the other field to hold the journal
sequence number, however since in GFS2's journaling we write
the modified data into the journal and not the original
data, this gives no information as to what action caused the
modification, so I think we can probably come up with a better
use for those 64 bits in the future.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago cfq-iosched: no dispatch limit for single queue
Shaohua Li [Thu, 3 Dec 2009 11:58:05 +0000 (12:58 +0100)]
cfq-iosched: no dispatch limit for single queue

Since commit 2f5cb7381b737e24c8046fd4aeab571fb71315f5, each queue can send
up to 4 * 4 requests if only one queue exists. I wonder why we have such a
limit. Devices that support tags can send more requests; for example, AHCI
can send 31 requests. A test (direct aio randread) shows the limit reduces
disk throughput by about 4%.
On the other hand, since we send one request at a time, if another queue
pops up while the current one is sending more than cfq_quantum requests, the
current queue will stop sending requests soon after one more request, so it
sounds like there is no big latency cost.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
14 years ago GFS2: Locking order fix in gfs2_check_blk_state
Steven Whitehouse [Fri, 6 Nov 2009 11:10:51 +0000 (11:10 +0000)]
GFS2: Locking order fix in gfs2_check_blk_state

In some cases we already have the rindex lock when
we enter this function.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Remove dirent_first() function
Steven Whitehouse [Fri, 6 Nov 2009 11:06:37 +0000 (11:06 +0000)]
GFS2: Remove dirent_first() function

This function only had one caller left, and that caller only
called it for leaf blocks, hence one branch of the "if" was
never taken. In addition the call to get_leaf had already
verified the metadata type, so the function can be reduced
to a single line of code in its caller.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Display nobarrier option in /proc/mounts
Steven Whitehouse [Fri, 30 Oct 2009 10:48:53 +0000 (10:48 +0000)]
GFS2: Display nobarrier option in /proc/mounts

Since the default is barriers on, this only displays the
nobarrier option when that is active.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: add barrier/nobarrier mount options
Christoph Hellwig [Fri, 30 Oct 2009 07:03:27 +0000 (08:03 +0100)]
GFS2: add barrier/nobarrier mount options

Currently gfs2 issues barriers unconditionally.  There are various reasons
to disable them, be that just for testing or for stupid devices flushing
large battery backed caches.  Add a nobarrier option that matches xfs and
btrfs for this.  Also add a symmetric barrier option to turn it back on
at remount time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: remove division from new statfs code
Benjamin Marzinski [Mon, 26 Oct 2009 18:29:47 +0000 (13:29 -0500)]
GFS2: remove division from new statfs code

It's not necessary to do any 64bit division for the statfs sync code, so
remove it.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Improve statfs and quota usability
Benjamin Marzinski [Tue, 20 Oct 2009 07:39:44 +0000 (02:39 -0500)]
GFS2: Improve statfs and quota usability

GFS2 now has three new mount options, statfs_quantum, quota_quantum and
statfs_percent.  statfs_quantum and quota_quantum simply allow you to
set the tunables of the same name.  Setting statfs_quantum to 0
will also turn on the statfs_slow tunable.  statfs_percent accepts an
integer between 0 and 100.  Numbers between 1 and 100 will cause GFS2 to
do an early sync when the local number of blocks free changes by at
least statfs_percent from the total number of blocks free.  Setting
statfs_percent to 0 disables this.
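
The statfs_percent rule described above can be read as a simple threshold test. The sketch below is one interpretation of that wording, not the GFS2 implementation: needs_early_sync() and its parameters are hypothetical names, and the comparison is done by cross-multiplying so that no division is required.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether the local change in free blocks is
 * large enough, relative to the total free blocks, to sync early.
 * A percent of 0 disables the check, as the commit message states.
 */
static bool needs_early_sync(int64_t local_change, uint64_t total_free,
			     unsigned int percent)
{
	uint64_t delta;

	if (percent == 0)
		return false;

	/* Magnitude of the change; unsigned negation is well defined. */
	delta = local_change < 0 ? -(uint64_t)local_change
				 : (uint64_t)local_change;

	/*
	 * delta / total_free >= percent / 100, written without division.
	 * Assumes the products fit in 64 bits, which holds for realistic
	 * block counts but is not checked here.
	 */
	return delta * 100 >= total_free * percent;
}
```

With statfs_percent=5 and 1000 blocks free in total, a local change of 50 blocks in either direction would trigger a sync under this reading, while a change of 49 would not.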

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Use dquot_send_warning()
Steven Whitehouse [Mon, 28 Sep 2009 11:49:15 +0000 (12:49 +0100)]
GFS2: Use dquot_send_warning()

This adds support to GFS2 to send quota warnings via netlink.
Also it removes a stray \r which was left over from when the
code used to print warnings on the console.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago VFS: Export dquot_send_warning
Steven Whitehouse [Mon, 28 Sep 2009 11:35:17 +0000 (12:35 +0100)]
VFS: Export dquot_send_warning

Sending a message to userspace in a generic format to warn
of events (e.g. quota exceeded) in the quota subsystem is
a generically useful feature. This patch makes some minor
changes to the send_message function from dquot.c renaming
it quota_send_message, moving it to quota.c and exporting it
for use by filesystems which do not use the dquot code.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Add set_xquota support
Steven Whitehouse [Wed, 23 Sep 2009 12:50:49 +0000 (13:50 +0100)]
GFS2: Add set_xquota support

This patch adds the ability to set GFS2 quota limit and
warning levels via the XFS quota API.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Add get_xquota support
Steven Whitehouse [Mon, 28 Sep 2009 10:52:16 +0000 (11:52 +0100)]
GFS2: Add get_xquota support

This adds support for viewing the current GFS2 quota settings
via the XFS quota API. The setting of quotas will be addressed
in a later patch. Fields which are not supported here are left
set to zero.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
14 years ago GFS2: Clean up gfs2_adjust_quota() and do_glock()
Steven Whitehouse [Tue, 15 Sep 2009 19:42:56 +0000 (20:42 +0100)]
GFS2: Clean up gfs2_adjust_quota() and do_glock()

Both of these functions contained confusing and in one case
duplicate code. This patch adds a new check in do_glock()
so that we report -ENOENT if we are asked to sync a quota
entry which doesn't exist. Due to the previous patch this is
now reported correctly to userspace.

Also there are a few new comments, and I hope that the code
is easier to understand now.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Remove constant argument from qd_get()
Steven Whitehouse [Tue, 15 Sep 2009 15:30:38 +0000 (16:30 +0100)]
GFS2: Remove constant argument from qd_get()

This function was only ever called with the "create"
argument set to true, so we can remove it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Remove constant argument from qdsb_get()
Steven Whitehouse [Tue, 15 Sep 2009 15:25:40 +0000 (16:25 +0100)]
GFS2: Remove constant argument from qdsb_get()

The "create" argument to qdsb_get() was only ever set to true,
so this patch removes that argument.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Add proper error reporting to quota sync via sysfs
Steven Whitehouse [Tue, 15 Sep 2009 15:20:30 +0000 (16:20 +0100)]
GFS2: Add proper error reporting to quota sync via sysfs

For some reason, the errors were not making it to userspace.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Add get_xstate quota function
Steven Whitehouse [Fri, 11 Sep 2009 14:57:27 +0000 (15:57 +0100)]
GFS2: Add get_xstate quota function

This allows querying of the quota state via the XFS quota
API.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Remove obsolete code in quota.c
Steven Whitehouse [Fri, 11 Sep 2009 14:21:56 +0000 (15:21 +0100)]
GFS2: Remove obsolete code in quota.c

There is no point in testing for GLF_DEMOTE here, we might as
well always release the glock at that point.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Hook gfs2_quota_sync into VFS via gfs2_quotactl_ops
Steven Whitehouse [Tue, 15 Sep 2009 08:59:02 +0000 (09:59 +0100)]
GFS2: Hook gfs2_quota_sync into VFS via gfs2_quotactl_ops

The plan is to add further operations to the gfs2_quotactl_ops
in future patches. The sync operation is easy, so we start with
that one.

We plan to use the XFS quota control functions because they more
closely match the GFS2 ones.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Alter arguments of gfs2_quota/statfs_sync
Steven Whitehouse [Fri, 11 Sep 2009 13:36:44 +0000 (14:36 +0100)]
GFS2: Alter arguments of gfs2_quota/statfs_sync

These two functions are altered so that gfs2_quota_sync may
in future be called directly from the VFS. The GFS2 superblock
changes to a VFS super block and there is an addition of an int
argument which is currently ignored.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago VFS: Use GFP_NOFS in posix_acl_from_xattr()
Steven Whitehouse [Tue, 29 Sep 2009 15:31:03 +0000 (16:31 +0100)]
VFS: Use GFP_NOFS in posix_acl_from_xattr()

GFS2 needs to call this from under a glock, so we need GFP_NOFS
and I suspect that other filesystems might require this too.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Add cached ACLs support
Steven Whitehouse [Tue, 29 Sep 2009 15:26:23 +0000 (16:26 +0100)]
GFS2: Add cached ACLs support

The other patches in this series have been building towards
being able to support cached ACLs like other filesystems. The
only real difference with GFS2 is that we have to invalidate
the cache when we drop a glock, but that is dealt with in earlier
patches.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Clean up ACLs
Steven Whitehouse [Fri, 2 Oct 2009 11:00:00 +0000 (12:00 +0100)]
GFS2: Clean up ACLs

To prepare for support for caching of ACLs, this cleans up the GFS2
ACL support by pushing the xattr code back into xattr.c and changing
the acl_get function into one which only returns ACLs so that we
can drop the caching function into it shortly.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Use gfs2_set_mode() instead of munge_mode()
Steven Whitehouse [Tue, 29 Sep 2009 11:40:19 +0000 (12:40 +0100)]
GFS2: Use gfs2_set_mode() instead of munge_mode()

These two functions do the same thing, so let's only use
one of them.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago GFS2: Use forget_all_cached_acls()
Steven Whitehouse [Fri, 2 Oct 2009 10:54:39 +0000 (11:54 +0100)]
GFS2: Use forget_all_cached_acls()

Invalidate all the cached ACLs when we drop the glock.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago VFS: Add forget_all_cached_acls()
Steven Whitehouse [Tue, 29 Sep 2009 11:27:23 +0000 (12:27 +0100)]
VFS: Add forget_all_cached_acls()

This is required for cluster filesystems which want to use
cached ACLs so that they can invalidate the cache when
required.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Alexander Viro <aviro@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
14 years ago GFS2: Fix up system xattrs
Steven Whitehouse [Fri, 2 Oct 2009 10:50:54 +0000 (11:50 +0100)]
GFS2: Fix up system xattrs

This code has been shamelessly stolen from XFS at the suggestion
of Christoph Hellwig. I've not added support for cached ACLs so
far... watch for that in a later patch, although this is designed
in such a way that they should be easy to add.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
14 years ago GFS2: Fix -o meta mounts for subsequent mounts (i.e. all but the first one)
Steven Whitehouse [Mon, 28 Sep 2009 09:30:49 +0000 (10:30 +0100)]
GFS2: Fix -o meta mounts for subsequent mounts (i.e. all but the first one)

We have a long term plan to use the "-o meta" flag to GFS2 mounts to
access the alternate root which is used to store metadata for a GFS2
filesystem. This will allow us to eventually remove support for the
gfs2meta filesystem type (which is in any case just a "front end" to
the gfs2 filesystem type with the meta/master root).

Currently the "-o meta" option is only taken into account on the
initial mount of the filesystem. Subsequent mounts of the same
filesystem (i.e. on the same device) result in basically the same
as bind mounting the root of the original mount.

This patch changes that by using what is more or less a copy
of get_sb_bdev() and extending it so that it will take into
account the alternate root in all cases. The main difference
is that we have to parse the mount options a bit earlier. We can
then use them to select the appropriate root towards the end of
the function.

In addition this also fixes a bug where it was possible (but certainly
not desirable) to set different ro/rw options for the meta root
when mounted via the gfs2meta fs compared with the original mount.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Alexander Viro <aviro@redhat.com>
14 years ago GFS2: Fix potential race in glock code
Steven Whitehouse [Tue, 22 Sep 2009 09:56:16 +0000 (10:56 +0100)]
GFS2: Fix potential race in glock code

We need to be careful of the ordering between clearing the
GLF_LOCK bit and scheduling the workqueue.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
14 years ago mutex: Fix missing conditions to build mutex_spin_on_owner()
Frederic Weisbecker [Wed, 2 Dec 2009 19:49:17 +0000 (20:49 +0100)]
mutex: Fix missing conditions to build mutex_spin_on_owner()

We don't need to build mutex_spin_on_owner() if we have
CONFIG_DEBUG_MUTEXES or CONFIG_HAVE_DEFAULT_NO_SPIN_MUTEXES as
it won't be used under such configs.

Use CONFIG_MUTEX_SPIN_ON_OWNER as it gathers all the necessary
checks before building it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1259783357-8542-2-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
14 years ago mutex: Better control mutex adaptive spinning config
Frederic Weisbecker [Wed, 2 Dec 2009 19:49:16 +0000 (20:49 +0100)]
mutex: Better control mutex adaptive spinning config

Introduce CONFIG_MUTEX_SPIN_ON_OWNER so that we can centralize
in a single place the conditions that determine its definition
and use.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1259783357-8542-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
14 years ago ASoC: au1x: dbdma2: plug memleak in pcm device creation error path
Manuel Lauss [Tue, 1 Dec 2009 17:10:35 +0000 (18:10 +0100)]
ASoC: au1x: dbdma2: plug memleak in pcm device creation error path

Free the allocated pcm platform device in the error path.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
Acked-by: Liam Girdwood <lrg@slimlogic.co.uk>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
14 years ago ASoC: au1x: dbdma2: fix oops on soc device removal.
Manuel Lauss [Tue, 1 Dec 2009 17:10:34 +0000 (18:10 +0100)]
ASoC: au1x: dbdma2: fix oops on soc device removal.

platform_device_unregister() frees resources for us, no need to
do it explicitly.  Fixes an oops when machine code removes the
soc-audio device.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
Acked-by: Liam Girdwood <lrg@slimlogic.co.uk>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
14 years ago x86, Calgary IOMMU quirk: Find nearest matching Calgary while walking up the PCI...
Darrick J. Wong [Wed, 2 Dec 2009 23:05:56 +0000 (15:05 -0800)]
x86, Calgary IOMMU quirk: Find nearest matching Calgary while walking up the PCI tree

On a multi-node x3950M2 system, there's a slight oddity in the
PCI device tree for all secondary nodes:

 30:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
  \-33:00.0 PCI bridge: IBM CalIOC2 PCI-E Root Port (rev 01)
     \-34:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)

...as compared to the primary node:

 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
  \-01:00.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
 03:00.0 PCI bridge: IBM CalIOC2 PCI-E Root Port (rev 01)
  \-04:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)

In both nodes, the LSI RAID controller hangs off a CalIOC2
device, but on the secondary nodes, the BIOS hides the VGA
device and substitutes the device tree ending with the disk
controller.

It would seem that Calgary devices don't necessarily appear at
the top of the PCI tree, which means that the current code to
find the Calgary IOMMU that goes with a particular device is
buggy.

Rather than walk all the way to the top of the PCI
device tree and try to match bus number with Calgary descriptor,
the code needs to examine each parent of the particular device;
if it encounters a Calgary with a matching bus number, simply
use that.

Otherwise, we BUG() when the bus number of the Calgary doesn't
match the bus number of whatever's at the top of the device tree.
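
The lookup described above, checking each parent in turn rather than only the top of the tree, can be sketched as follows. The types and names here (struct bus, struct iommu, find_nearest_iommu()) are hypothetical simplifications for illustration and are not the kernel's PCI or Calgary structures.

```c
#include <stddef.h>

/* Hypothetical, simplified bus node: a bus number and a link to its parent. */
struct bus {
	int number;
	struct bus *parent;
};

/* Hypothetical IOMMU descriptor, identified by the bus number it serves. */
struct iommu {
	int bus_number;
};

/*
 * Walk from the device's bus towards the root and return the first
 * descriptor whose bus number matches, or NULL if none is found.
 */
static struct iommu *find_nearest_iommu(struct bus *start,
					struct iommu *tbl, size_t n)
{
	for (struct bus *b = start; b != NULL; b = b->parent) {
		for (size_t i = 0; i < n; i++) {
			if (tbl[i].bus_number == b->number)
				return &tbl[i];
		}
	}
	return NULL;
}
```

Modelling the secondary-node topology shown earlier (bus 34 below bus 33 below bus 30) with a descriptor for bus 33, the walk from bus 34 stops at bus 33, whereas a lookup that only examined the top-level bus 30 would find nothing.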

Extra note: This patch appears to work correctly for the x3950
that came before the x3950 M2.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Jon D. Mason <jdmason@kudzu.us>
Cc: Corinna Schultz <coschult@us.ibm.com>
Cc: <stable@kernel.org>
LKML-Reference: <20091202230556.GG10295@tux1.beaverton.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
14 years ago rcu: Make RCU's CPU-stall detector be default
Paul E. McKenney [Wed, 2 Dec 2009 20:10:16 +0000 (12:10 -0800)]
rcu: Make RCU's CPU-stall detector be default

The RCU_CPU_STALL_DETECTOR costs almost nothing and has located
some bugs that might otherwise have been difficult to track
down.  Make it be default for the TREE RCU implementations.

The vmlinux size impact is limited (on 64-bit x86 defconfig):

   text    data     bss     dec     hex filename
   8440248 1260076  995588 10695912  a334e8 vmlinux.before
   8440774 1260060  995588 10696422  a336e6 vmlinux.after

+526 bytes - acceptable default cost.

For RAM starved systems, TINY_RCU does not support CPU-stall detection
and is much smaller, but then again it is a uniprocessor...

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12597846162906-git-send-email->
[ v2: added image size calculations to the changelog ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
14 years ago rcu: Add expedited grace-period support for preemptible RCU
Paul E. McKenney [Wed, 2 Dec 2009 20:10:15 +0000 (12:10 -0800)]
rcu: Add expedited grace-period support for preemptible RCU

Implement a synchronize_rcu_expedited() for preemptible RCU
that actually is expedited.  This uses
synchronize_sched_expedited() to force all threads currently
running in a preemptible-RCU read-side critical section onto the
appropriate ->blocked_tasks[] list, then takes a snapshot of all
of these lists and waits for them to drain.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1259784616158-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
14 years ago rcu: Enable fourth level of TREE_RCU hierarchy
Paul E. McKenney [Wed, 2 Dec 2009 20:10:14 +0000 (12:10 -0800)]
rcu: Enable fourth level of TREE_RCU hierarchy

Enable a fourth level of rcu_node hierarchy for TREE_RCU and
TREE_PREEMPT_RCU.  This is for stress-testing and experimental
purposes only, although in theory this would enable 16,777,216
CPUs on 64-bit systems, though only 1,048,576 CPUs on 32-bit
systems. Experimental use of this fourth level will
normally set CONFIG_RCU_FANOUT=2, requiring a 16-CPU system,
though the more adventurous (and more fortunate) experimenters
may wish to choose CONFIG_RCU_FANOUT=3 for 81-CPU systems or even
CONFIG_RCU_FANOUT=4 for 256-CPU systems.
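
The CPU counts quoted above are all of the form fanout raised to the number of levels. A small sketch of that arithmetic follows; rcu_max_cpus() is a hypothetical helper, and the maximum fanouts of 64 and 32 are inferred from the figures in the message rather than taken from the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of CPUs a tree with the given fanout and
 * depth can cover, i.e. fanout raised to the power of levels.
 */
static uint64_t rcu_max_cpus(unsigned int fanout, unsigned int levels)
{
	uint64_t n = 1;

	for (unsigned int i = 0; i < levels; i++)
		n *= fanout;
	return n;
}
```

For four levels this gives 64^4 = 16,777,216 and 32^4 = 1,048,576 for the theoretical limits, and 2^4 = 16, 3^4 = 81 and 4^4 = 256 for the experimental settings.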

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12597846161257-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
14 years ago rcu: Rename "quiet" functions
Paul E. McKenney [Wed, 2 Dec 2009 20:10:13 +0000 (12:10 -0800)]
rcu: Rename "quiet" functions

The number of "quiet" functions has grown recently, and the
names are no longer very descriptive.  The point of all of these
functions is to do some portion of the task of reporting a
quiescent state, so rename them accordingly:

o cpu_quiet() becomes rcu_report_qs_rdp(), which reports a
quiescent state to the per-CPU rcu_data structure.  If this
turns out to be a new quiescent state for this grace period,
then rcu_report_qs_rnp() will be invoked to propagate the
quiescent state up the rcu_node hierarchy.

o cpu_quiet_msk() becomes rcu_report_qs_rnp(), which reports
a quiescent state for a given CPU (or possibly a set of CPUs)
up the rcu_node hierarchy.

o cpu_quiet_msk_finish() becomes rcu_report_qs_rsp(), which
reports a full set of quiescent states to the global rcu_state
structure.

o task_quiet() becomes rcu_report_unblock_qs_rnp(), which reports
a quiescent state due to a task exiting an RCU read-side critical
section that had previously blocked in that same critical section.
As indicated by the new name, this type of quiescent state is
reported up the rcu_node hierarchy (using rcu_report_qs_rnp()
to do so).

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12597846163698-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
14 years ago ALSA: hda - Fix memory leaks in the previous patch
Takashi Iwai [Thu, 3 Dec 2009 09:14:10 +0000 (10:14 +0100)]
ALSA: hda - Fix memory leaks in the previous patch

The previous hack for replacing the codec name causes memory leaks in
error paths.  This patch fixes them.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
14 years ago ALSA: hda - Add ALC661/259, ALC892/888VD support
Kailang Yang [Thu, 3 Dec 2009 09:07:50 +0000 (10:07 +0100)]
ALSA: hda - Add ALC661/259, ALC892/888VD support

Fixed List:
   1. Add alc_read_coef_idx function
   2. Add ALC661 ALC259
   3. Add ALC892 ALC888VD

Signed-off-by: Kailang Yang <kailang@realtek.com.tw>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
14 years ago MAINTAINERS: add tree and file pattern for ARM IMX
Uwe Kleine-König [Mon, 30 Nov 2009 15:53:24 +0000 (16:53 +0100)]
MAINTAINERS: add tree and file pattern for ARM IMX

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
14 years ago block: Allow devices to indicate whether discarded blocks are zeroed
Martin K. Petersen [Thu, 3 Dec 2009 08:24:48 +0000 (09:24 +0100)]
block: Allow devices to indicate whether discarded blocks are zeroed

The discard ioctl is used by mkfs utilities to clear a block device
prior to putting metadata down.  However, not all devices return zeroed
blocks after a discard.  Some drives return stale data, potentially
containing old superblocks.  It is therefore important to know whether
discarded blocks are properly zeroed.

Both ATA and SCSI drives have configuration bits that indicate whether
zeroes are returned after a discard operation.  Implement a block level
interface that allows this information to be bubbled up the stack and
queried via a new block device ioctl.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
14 years ago pata_ali: Fix regression with old devices
Alan Cox [Mon, 30 Nov 2009 13:23:05 +0000 (13:23 +0000)]
pata_ali: Fix regression with old devices

Making the new stuff work broke some of the old chipsets. We need to go
back to the old set up values for these it seems. Unfortunately even with
documentation this is basically a mix of cargoculting and guesswork.

Chased down to the exact line by Gianluca.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years ago [libata] PATA: Update experimental tags
Alan Cox [Mon, 30 Nov 2009 13:23:00 +0000 (13:23 +0000)]
[libata] PATA: Update experimental tags

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years ago pata_cmd64x: implement serialization as per notes
Alan Cox [Mon, 30 Nov 2009 13:22:54 +0000 (13:22 +0000)]
pata_cmd64x: implement serialization as per notes

Daniela Engert pointed out that there are some implementation notes for the
643 and 646 that deal with certain serialization rules. In theory we don't
need them because they apply when the motherboard decides not to retry PCI
requests for long enough and the chip is busy doing a DMA transfer on the
other channel.

The rule basically is "don't touch the taskfile of the other channel while
a DMA is in progress". To implement that we need to

- not issue a command on a channel when there is a DMA command queued
- not issue a DMA command on a channel when there are PIO commands queued
- use the alternative access to the interrupt source so that we do not
  touch altstatus or status on shared IRQ.

Updated to remove the extra conditional check Bartlomiej noted and to remove
the variables for irq checks as the CMD648 doesn't have the underlying problem.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years ago pata_sis: Implement MWDMA for the UDMA 133 capable chips
Alan Cox [Mon, 30 Nov 2009 13:22:49 +0000 (13:22 +0000)]
pata_sis: Implement MWDMA for the UDMA 133 capable chips

Bartlomiej pointed out that while this got fixed in the old driver whoever
did it didn't port it across.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years ago pata_via: Blacklist some combinations of Transcend Flash and via
Alan Cox [Mon, 30 Nov 2009 13:22:43 +0000 (13:22 +0000)]
pata_via: Blacklist some combinations of Transcend Flash and via

Reported by Mikulas Patocka.

VIA VT82C586B + Transcend TS64GSSD25-M v0826 does not work in UDMA mode

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years ago libata/sff: Use ops->bmdma_stop instead of ata_bmdma_stop()
Benjamin Herrenschmidt [Wed, 2 Dec 2009 00:36:28 +0000 (11:36 +1100)]
libata/sff: Use ops->bmdma_stop instead of ata_bmdma_stop()

In libata-sff, ata_sff_post_internal_cmd() directly calls ata_bmdma_stop()
instead of ap->ops->bmdma_stop(). This can be a problem for controllers
that use their own bmdma_stop for which the generic sff one isn't suitable.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years ago libata: add translation for SCSI WRITE SAME (aka TRIM support)
Christoph Hellwig [Tue, 17 Nov 2009 15:00:47 +0000 (10:00 -0500)]
libata: add translation for SCSI WRITE SAME (aka TRIM support)

Add support for the ATA TRIM command in libata.  We translate a WRITE SAME 16
command with the unmap bit set into an ATA TRIM command and export enough
information in READ CAPACITY 16 and the block limits EVPD page so that the new
SCSI layer discard support will drive this for us.

Note that I hardcode the WRITE_SAME_16 opcode for now as the patch to introduce
the symbolic name is not in 2.6.32 yet but only in the SCSI tree - as soon as it is
merged we can fix it up to properly use the symbolic name.
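
As a rough illustration of the translation described above, the sketch below decodes a WRITE SAME(16) CDB with the UNMAP bit set into a single 64-bit TRIM range entry (48-bit LBA in the low bits, 16-bit block count in the high bits). It is not the libata code: ws16_to_trim_entry() and the macro names are hypothetical, the opcode is hardcoded as 0x93 in the same spirit as the patch, and requests that would not fit in one range entry are simply rejected.

```c
#include <stdbool.h>
#include <stdint.h>

#define WS16_OPCODE	0x93	/* WRITE SAME(16), hardcoded for this sketch */
#define WS16_UNMAP	0x08	/* UNMAP bit in byte 1 of the CDB */

/*
 * Hypothetical helper: build one TRIM range entry from a WRITE SAME(16)
 * CDB. Returns false if the CDB is not an unmap request or if the range
 * does not fit in a single entry.
 */
static bool ws16_to_trim_entry(const uint8_t cdb[16], uint64_t *entry)
{
	uint64_t lba = 0;
	uint32_t nblocks = 0;

	if (cdb[0] != WS16_OPCODE || !(cdb[1] & WS16_UNMAP))
		return false;

	/* Bytes 2..9 hold the big-endian starting LBA. */
	for (int i = 2; i <= 9; i++)
		lba = (lba << 8) | cdb[i];

	/* Bytes 10..13 hold the big-endian number of blocks. */
	for (int i = 10; i <= 13; i++)
		nblocks = (nblocks << 8) | cdb[i];

	/* A range entry carries a 48-bit LBA and a 16-bit count. */
	if ((lba >> 48) != 0 || nblocks == 0 || nblocks > 0xffff)
		return false;

	*entry = ((uint64_t)nblocks << 48) | lba;
	return true;
}
```

A real translation would also have to handle larger ranges, for instance by emitting several entries, which this sketch deliberately leaves out.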

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years ago libata: retry failed FLUSH if device didn't fail it
Tejun Heo [Thu, 19 Nov 2009 06:36:45 +0000 (15:36 +0900)]
libata: retry failed FLUSH if device didn't fail it

If an ATA device fails a FLUSH, it means that the device failed to write
out some amount of data and the error needs to be reported to upper
layers. As retries can't recover the lost data, FLUSH failures need to
be reported immediately in general.

However, if FLUSH fails due to transmission errors, the FLUSH needs to
be retried; otherwise, filesystems may switch to RO mode and/or raid
array may drop a drive for a random transmission glitch.

This condition can be rather easily reproduced on certain ahci
controllers which go through a PHY event after powersave mode switch +
ext4 combination.  Powersave mode switch is often closely followed by
flush from the filesystem failing the FLUSH with ATA bus error which
makes the filesystem code believe that data is lost and drop to RO
mode.  This was reported in the following bugzilla bug.

  http://bugzilla.kernel.org/show_bug.cgi?id=14543

This patch makes libata EH retry FLUSH if it wasn't failed by the
device.
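
The decision described above can be sketched roughly as follows. The error
flags and the tries_left counter are hypothetical simplifications, not the
actual libata EH data structures.

```c
#include <stdbool.h>

/* Hypothetical error classes, loosely modelled on an error mask. */
enum {
    ERR_DEV     = 1 << 0,  /* the device itself reported the failure */
    ERR_BUS     = 1 << 1,  /* transmission error on the link */
    ERR_TIMEOUT = 1 << 2   /* command timed out */
};

/*
 * A FLUSH failed by the device means data was lost, so retrying cannot
 * help and the error must be reported. Any other failure is treated as a
 * transport glitch and retried while attempts remain.
 */
static bool should_retry_flush(unsigned int err_mask, int tries_left)
{
    if (err_mask & ERR_DEV)
        return false;
    return err_mask != 0 && tries_left > 0;
}
```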

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrey Vihrov <andrey.vihrov@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agosata_fsl: Add asynchronous notification support
ashish kalra [Wed, 1 Jul 2009 15:29:43 +0000 (20:59 +0530)]
sata_fsl: Add asynchronous notification support

Enable device hot-plug support on Port multiplier fan-out ports

Signed-off-by: Ashish Kalra <Ashish.Kalra@freescale.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agopata_hpt{37x,3x2n}: add debounce delay to cable detection methods
Bartlomiej Zolnierkiewicz [Thu, 19 Nov 2009 19:31:31 +0000 (20:31 +0100)]
pata_hpt{37x,3x2n}: add debounce delay to cable detection methods

Alan Cox reported that cable detection sometimes works unreliably
for HPT3xxN and that the issue is fixed by adding debounce delay
as used by the vendor driver.

Sergei Shtylyov also noticed that debounce delay is needed for all
HPT37x and HPT3xxN chipsets according to vendor drivers.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agopata_hpt3x2n: fix cable detection
Bartlomiej Zolnierkiewicz [Thu, 19 Nov 2009 17:38:11 +0000 (18:38 +0100)]
pata_hpt3x2n: fix cable detection

The detection was reversed between primary and secondary ports.

Fix it to match hpt366 and the vendor driver.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agoata: Don't require newlines for link_power_management_policy
Matthew Garrett [Tue, 17 Nov 2009 16:09:03 +0000 (11:09 -0500)]
ata: Don't require newlines for link_power_management_policy

sysfs attributes shouldn't require newlines. Make it possible to set the
link power management policy without a trailing newline.
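
One way to accept input with or without the trailing newline is to trim it
before comparing, as in this sketch. The policy table and lookup_policy()
are illustrative only and are not taken from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative policy names; the index doubles as the policy id. */
static const char *const policies[] = {
    "min_power",
    "max_performance",
    "medium_power",
};

/*
 * Look up a policy written to a sysfs-style store callback. "buf" holds
 * "len" bytes and may or may not end in '\n'. Returns the policy index,
 * or -1 if the name is unknown.
 */
static int lookup_policy(const char *buf, size_t len)
{
    if (len > 0 && buf[len - 1] == '\n')
        len--;  /* ignore one trailing newline */

    for (size_t i = 0; i < sizeof(policies) / sizeof(policies[0]); i++) {
        if (strlen(policies[i]) == len && memcmp(policies[i], buf, len) == 0)
            return (int)i;
    }
    return -1;
}
```

With this, both `echo min_power` and `echo -n min_power` would select the
same policy.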

Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agopata-it821x: use PCI_DEVICE_ID_RDC_D1010 define
Otavio Salvador [Tue, 17 Nov 2009 13:11:16 +0000 (11:11 -0200)]
pata-it821x: use PCI_DEVICE_ID_RDC_D1010 define

Signed-off-by: Otavio Salvador <otavio@ossystems.com.br>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agopata_hpt37x: unify ->pre_reset methods
Bartlomiej Zolnierkiewicz [Thu, 19 Nov 2009 18:12:24 +0000 (19:12 +0100)]
pata_hpt37x: unify ->pre_reset methods

We can use the same ->pre_reset method for all HPT37x chipsets now.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agopata_hpt37x: add proper cable detection methods
Bartlomiej Zolnierkiewicz [Thu, 19 Nov 2009 18:10:44 +0000 (19:10 +0100)]
pata_hpt37x: add proper cable detection methods

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agoahci: disable SNotification capability for ich8
Shaohua Li [Mon, 16 Nov 2009 01:56:05 +0000 (09:56 +0800)]
ahci: disable SNotification capability for ich8

I observed that sata_async_notification() is called for every ahci
interrupt, but the async notification does nothing (this is a hard
disk drive with no pmp). This causes the cpu to waste some time on sntf
register accesses.

It appears ICH AHCI doesn't support SNotification register, but the
controller reports it does. After quirking it, the async notification
disappears.

PS. According to the ICH manual, it appears that no ICH supports the
SNotification register; I don't know if we need to quirk all ICHs. I don't
have machines with all kinds of ICH.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agosata_sil24: MSI support, disabled by default
Vivek Mahajan [Mon, 16 Nov 2009 06:19:22 +0000 (11:49 +0530)]
sata_sil24: MSI support, disabled by default

The following patch adds MSI support. Some platforms
may have broken MSI, so those are defaulted to use
legacy PCI interrupts.

Signed-off-by: Vivek Mahajan <vivek.mahajan@freescale.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agolibata: remove experimental tag on PATA drivers
Robert Hancock [Fri, 13 Nov 2009 02:13:40 +0000 (20:13 -0600)]
libata: remove experimental tag on PATA drivers

Remove the experimental tag on Parallel ATA drivers. Though some of the
individual PATA drivers are still marked as experimental, as a group they can
hardly be considered to be, given they've been used in various distros for some
time.

Signed-off-by: Robert Hancock <hancockrwd@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agosata_mv: Clean up hard coded array size calculation.
Thiago Farina [Sun, 8 Nov 2009 19:30:57 +0000 (14:30 -0500)]
sata_mv: Clean up hard coded array size calculation.

Use ARRAY_SIZE macro of kernel api instead.
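
For illustration, a simplified version of the macro and its use is shown
below. The kernel's own ARRAY_SIZE additionally rejects non-array arguments
at build time, and the table contents here are made up.

```c
#include <stddef.h>

/* Simplified form: element count of a true array (not a pointer). */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Made-up table; adding an entry needs no change to count_ports(). */
static const unsigned int port_offsets[] = { 0x2000, 0x4000, 0x6000, 0x8000 };

static size_t count_ports(void)
{
    return ARRAY_SIZE(port_offsets);
}
```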

Signed-off-by: Thiago Farina <tfransosi@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agopata_via: fix double put on isa bridge
Jiri Slaby [Wed, 4 Nov 2009 16:11:03 +0000 (17:11 +0100)]
pata_via: fix double put on isa bridge

In via_init_one, when via_isa_bridges iterator reaches
PCI_DEVICE_ID_VIA_ANON and last but one via_isa_bridges bridge is
found but rev doesn't match, pci_dev_put(isa) is called twice.

Do pci_dev_put only once.
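
The invariant being restored is one put per get. A toy model of such a
search loop, with hypothetical names and a plain integer refcount, might
look like this:

```c
#include <stddef.h>

struct dev {
    int refcount;
};

static struct dev *dev_get(struct dev *d)
{
    d->refcount++;
    return d;
}

static void dev_put(struct dev *d)
{
    if (d)
        d->refcount--;
}

/*
 * Walk the candidates and return the first one whose revision matches.
 * On a match the caller owns the reference; on a mismatch the reference
 * is dropped exactly once before moving on.
 */
static struct dev *find_bridge(struct dev *cands, const int *rev_ok, int n)
{
    for (int i = 0; i < n; i++) {
        struct dev *isa = dev_get(&cands[i]);

        if (rev_ok[i])
            return isa;
        dev_put(isa);
    }
    return NULL;
}
```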

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agopata_cs5536: use 32-bit BM DMA template instead of 16-bit.
Krzysztof Halasa [Tue, 10 Nov 2009 23:58:16 +0000 (00:58 +0100)]
pata_cs5536: use 32-bit BM DMA template instead of 16-bit.

Tested on IXP425 + CS5536.

Signed-off-by: Krzysztof Hałasa <khc@pm.waw.pl>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agolibata-acpi: missing _SDD is not an error
Tejun Heo [Wed, 18 Nov 2009 13:24:21 +0000 (22:24 +0900)]
libata-acpi: missing _SDD is not an error

Missing _SDD is not an error.  Don't treat it as one.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
14 years agoKVM: VMX: Fix comparison of guest efer with stale host value
Avi Kivity [Wed, 2 Dec 2009 10:28:47 +0000 (12:28 +0200)]
KVM: VMX: Fix comparison of guest efer with stale host value

update_transition_efer() masks out some efer bits when deciding whether
to switch the msr during guest entry; for example, NX is emulated using the
mmu so we don't need to disable it, and LMA/LME are handled by the hardware.

However, with shared msrs, the comparison is made against a stale value;
at the time of the guest switch we may be running with another guest's efer.

Fix by deferring the mask/compare to the actual point of guest entry.
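
A simplified sketch of the mask/compare follows. The EFER bit positions are
architectural, but the function name and the fixed ignore mask are
assumptions made for the example. The point is that the second argument must
be the value loaded in the MSR at guest entry, not a cached host value.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFER_SCE (1ULL << 0)   /* syscall enable */
#define EFER_LME (1ULL << 8)   /* long mode enable */
#define EFER_LMA (1ULL << 10)  /* long mode active */
#define EFER_NX  (1ULL << 11)  /* no-execute enable */

/*
 * Decide whether the MSR must be switched for the guest. Bits that are
 * emulated (NX) or handled by hardware (LMA/LME) are masked out before
 * comparing.
 */
static bool efer_needs_switch(uint64_t guest_efer, uint64_t loaded_efer)
{
    const uint64_t ignore = EFER_NX | EFER_LMA | EFER_LME;

    return ((guest_efer ^ loaded_efer) & ~ignore) != 0;
}
```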

Noted by Marcelo.

Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: s390: Fix prefix register checking in arch/s390/kvm/sigp.c
Carsten Otte [Mon, 30 Nov 2009 16:14:41 +0000 (17:14 +0100)]
KVM: s390: Fix prefix register checking in arch/s390/kvm/sigp.c

This patch corrects the checking of the new address for the prefix register.
On s390, the prefix register is used to address the cpu's lowcore (address
0...8k). This check is supposed to verify that the memory is readable and
present.
copy_from_guest is a helper function, that can be used to read from guest
memory. It applies prefixing, adds the start address of the guest memory in
user, and then calls copy_from_user. Previous code was obviously broken for
two reasons:
- prefixing should not be applied here. The current prefix register is
  going to be updated soon, and the address we're looking for will be
  0..8k after we've updated the register
- we're adding the guest origin (gmsor) twice: once in subject code
  and once in copy_from_guest

With kuli, we did not hit this problem because (a) we were lucky with
previous prefix register content, and (b) our guest memory was mmaped
very low into user address space.

Cc: stable@kernel.org
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reported-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: Drop user return notifier when disabling virtualization on a cpu
Avi Kivity [Sat, 28 Nov 2009 12:18:47 +0000 (14:18 +0200)]
KVM: Drop user return notifier when disabling virtualization on a cpu

This way, we don't leave a dangling notifier on cpu hotunplug or module
unload.  In particular, module unload leaves the notifier pointing into
freed memory.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: VMX: Disable unrestricted guest when EPT disabled
Sheng Yang [Fri, 27 Nov 2009 08:46:26 +0000 (16:46 +0800)]
KVM: VMX: Disable unrestricted guest when EPT disabled

Otherwise, using ept=0 on processors that support unrestricted guest would
cause a VMEntry failure.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86 emulator: limit instructions to 15 bytes
Avi Kivity [Tue, 24 Nov 2009 13:20:15 +0000 (15:20 +0200)]
KVM: x86 emulator: limit instructions to 15 bytes

While we are never normally passed an instruction that exceeds 15 bytes,
smp games can cause us to attempt to interpret one, which will cause
large latencies in non-preempt hosts.
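
The idea can be sketched as a fetch helper that refuses to go past 15 bytes,
so a decoder loop terminates even on a stream of endless prefixes. The
context struct and helper names are hypothetical, and only the 0x66 prefix
is handled here.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_INSN_LEN 15  /* architectural x86 instruction length limit */

struct fetch_ctx {
    const uint8_t *code;  /* bytes at the instruction pointer */
    size_t avail;         /* bytes available in "code" */
    size_t fetched;       /* bytes consumed for this instruction */
};

/* Fetch one byte; fail once the instruction would exceed the limit. */
static int fetch_byte(struct fetch_ctx *c, uint8_t *out)
{
    if (c->fetched >= MAX_INSN_LEN || c->fetched >= c->avail)
        return -1;
    *out = c->code[c->fetched++];
    return 0;
}

/*
 * Consume 0x66 prefix bytes up to and including the first non-prefix
 * byte. Returns the number of prefixes seen, or -1 if the limit was hit
 * before any other byte showed up.
 */
static int count_prefixes(struct fetch_ctx *c)
{
    uint8_t b;
    int n = 0;

    for (;;) {
        if (fetch_byte(c, &b) != 0)
            return -1;
        if (b != 0x66)
            return n;
        n++;
    }
}
```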

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: s390: Make psw available on all exits, not just a subset
Carsten Otte [Thu, 19 Nov 2009 13:21:16 +0000 (14:21 +0100)]
KVM: s390: Make psw available on all exits, not just a subset

This patch moves s390 processor status word into the base kvm_run
struct and keeps it up to date on all userspace exits.

The userspace ABI is broken by this, however there are no applications
in the wild using this.  A capability check is provided so users can
verify the updated API exists.

Cc: stable@kernel.org
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: x86: Add KVM_GET/SET_VCPU_EVENTS
Jan Kiszka [Thu, 12 Nov 2009 00:04:25 +0000 (01:04 +0100)]
KVM: x86: Add KVM_GET/SET_VCPU_EVENTS

This new IOCTL exports all states related to exceptions, interrupts, and
NMIs that were so far invisible to user space. Together with appropriate
user space changes, this fixes sporadic problems with VM save/restore, live
migration, and system reset.

[avi: future-proof abi by adding a flags field]

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: VMX: Report unexpected simultaneous exceptions as internal errors
Avi Kivity [Wed, 4 Nov 2009 09:59:01 +0000 (11:59 +0200)]
KVM: VMX: Report unexpected simultaneous exceptions as internal errors

These happen when we trap an exception when another exception is being
delivered; we only expect these with MCEs and page faults.  If something
unexpected happens, things probably went south and we're better off reporting
an internal error and freezing.

Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: Allow internal errors reported to userspace to carry extra data
Avi Kivity [Wed, 4 Nov 2009 09:54:59 +0000 (11:54 +0200)]
KVM: Allow internal errors reported to userspace to carry extra data

Usually userspace will freeze the guest so we can inspect it, but some
internal state is not available.  Add extra data to internal error
reporting so we can expose it to the debugger.  Extra data is specific
to the suberror.

Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: Reorder IOCTLs in main kvm.h
Jan Kiszka [Mon, 2 Nov 2009 16:20:28 +0000 (17:20 +0100)]
KVM: Reorder IOCTLs in main kvm.h

Obviously, people tend to extend this header at the bottom - more or
less blindly. Ensure that deprecated stuff gets its own corner again by
moving things to the top. Also add some comments and reindent IOCTLs to
make them more readable and reduce the risk of number collisions.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG
Jan Kiszka [Fri, 30 Oct 2009 11:46:59 +0000 (12:46 +0100)]
KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG

Decouple KVM_GUESTDBG_INJECT_DB and KVM_GUESTDBG_INJECT_BP from
KVM_GUESTDBG_ENABLE, they are actually orthogonal. While at it,
avoid triggering the WARN_ON in kvm_queue_exception if there is already
an exception pending and reject such invalid requests.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: only clear irq_source_id if irqchip is present
Marcelo Tosatti [Thu, 29 Oct 2009 15:44:17 +0000 (13:44 -0200)]
KVM: only clear irq_source_id if irqchip is present

Otherwise kvm might attempt to dereference a NULL pointer.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic
Marcelo Tosatti [Thu, 29 Oct 2009 15:44:16 +0000 (13:44 -0200)]
KVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic

Otherwise kvm might attempt to dereference a NULL pointer.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: x86: disallow multiple KVM_CREATE_IRQCHIP
Marcelo Tosatti [Thu, 29 Oct 2009 15:44:15 +0000 (13:44 -0200)]
KVM: x86: disallow multiple KVM_CREATE_IRQCHIP

Otherwise kvm will leak memory on multiple KVM_CREATE_IRQCHIP.
Also serialize multiple accesses with kvm->lock.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: VMX: Remove vmx->msr_offset_efer
Avi Kivity [Thu, 29 Oct 2009 09:00:16 +0000 (11:00 +0200)]
KVM: VMX: Remove vmx->msr_offset_efer

This variable is used to communicate between a caller and a callee; switch
to a function argument instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: MMU: update invlpg handler comment
Marcelo Tosatti [Mon, 26 Oct 2009 18:50:14 +0000 (16:50 -0200)]
KVM: MMU: update invlpg handler comment

Large page translations are always synchronized (either in level 3
or level 2), so it's not necessary to properly deal with them
in the invlpg handler.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: VMX: move CR3/PDPTR update to vmx_set_cr3
Marcelo Tosatti [Mon, 26 Oct 2009 18:48:33 +0000 (16:48 -0200)]
KVM: VMX: move CR3/PDPTR update to vmx_set_cr3

GUEST_CR3 is updated via kvm_set_cr3 whenever CR3 is modified from
outside guest context. Similarly pdptrs are updated via load_pdptrs.

Let kvm_set_cr3 perform the update, removing it from the vcpu_run
fast path.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: remove duplicated task_switch check
Gleb Natapov [Sun, 25 Oct 2009 15:45:07 +0000 (17:45 +0200)]
KVM: remove duplicated task_switch check

Probably introduced by a bad merge.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: powerpc: Fix BUILD_BUG_ON condition
Hollis Blanchard [Fri, 23 Oct 2009 00:35:30 +0000 (00:35 +0000)]
KVM: powerpc: Fix BUILD_BUG_ON condition

The old BUILD_BUG_ON implementation didn't work with __builtin_constant_p().
Fixing that revealed this test had been inverted for a long time without
anybody noticing...

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: VMX: Use shared msr infrastructure
Avi Kivity [Mon, 7 Sep 2009 08:14:12 +0000 (11:14 +0300)]
KVM: VMX: Use shared msr infrastructure

Instead of reloading syscall MSRs on every preemption, use the new shared
msr infrastructure to reload them at the last possible minute (just before
exit to userspace).

Improves vcpu/idle/vcpu switches by about 2000 cycles (when EFER needs to be
reloaded as well).

[jan: fix slot index missing indirection]

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: x86 shared msr infrastructure
Avi Kivity [Mon, 7 Sep 2009 08:12:18 +0000 (11:12 +0300)]
KVM: x86 shared msr infrastructure

The various syscall-related MSRs are fairly expensive to switch.  Currently
we switch them on every vcpu preemption, which is far too often:

- if we're switching to a kernel thread (idle task, threaded interrupt,
  kernel-mode virtio server (vhost-net), for example) and back, then
  there's no need to switch those MSRs since kernel threads won't
  be exiting to userspace.

- if we're switching to another guest running an identical OS, most likely
  those MSRs will have the same value, so there's little point in reloading
  them.

- if we're running the same OS on the guest and host, the MSRs will have
  identical values and reloading is unnecessary.

This patch uses the new user return notifiers to implement last-minute
switching, and checks the msr values to avoid unnecessary reloading.
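
A toy model of the scheme, with a counter standing in for the expensive MSR
writes, could look like the following. The struct layout and function names
are invented for the illustration and do not mirror the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_SHARED_MSRS 2

struct shared_msrs {
    uint64_t host[NR_SHARED_MSRS];  /* values the host expects */
    uint64_t curr[NR_SHARED_MSRS];  /* values currently loaded */
    bool notifier_armed;            /* restore pending on return to user */
    unsigned int hw_writes;         /* stands in for wrmsr executions */
};

/* Load a guest value, skipping the write if it is already in place. */
static void set_shared_msr(struct shared_msrs *s, int slot, uint64_t val)
{
    if (s->curr[slot] == val)
        return;
    s->curr[slot] = val;
    s->hw_writes++;
    s->notifier_armed = true;
}

/* Runs just before returning to userspace: restore only what differs. */
static void on_user_return(struct shared_msrs *s)
{
    for (int slot = 0; slot < NR_SHARED_MSRS; slot++) {
        if (s->curr[slot] != s->host[slot]) {
            s->curr[slot] = s->host[slot];
            s->hw_writes++;
        }
    }
    s->notifier_armed = false;
}
```

Switching between two guests that use the same value costs no writes at
all, and the host values are restored at most once per return to userspace.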

Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: VMX: Move MSR_KERNEL_GS_BASE out of the vmx autoload msr area
Avi Kivity [Sun, 6 Sep 2009 12:55:37 +0000 (15:55 +0300)]
KVM: VMX: Move MSR_KERNEL_GS_BASE out of the vmx autoload msr area

Currently MSR_KERNEL_GS_BASE is saved and restored as part of the
guest/host msr reloading.  Since we wish to lazy-restore all the other
msrs, save and reload MSR_KERNEL_GS_BASE explicitly instead of using
the common code.

Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: SVM: init_vmcb(): remove redundant save->cr0 initialization
Eduardo Habkost [Sat, 24 Oct 2009 04:50:00 +0000 (02:50 -0200)]
KVM: SVM: init_vmcb(): remove redundant save->cr0 initialization

The svm_set_cr0() call will initialize save->cr0 properly even when npt is
enabled, clearing the NW and CD bits as expected, so we don't need to
initialize it manually for npt_enabled anymore.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: SVM: Reset cr0 properly on vcpu reset
Eduardo Habkost [Sat, 24 Oct 2009 04:49:59 +0000 (02:49 -0200)]
KVM: SVM: Reset cr0 properly on vcpu reset

svm_vcpu_reset() was not properly resetting the contents of the guest-visible
cr0 register, causing the following issue:
https://bugzilla.redhat.com/show_bug.cgi?id=525699

Without resetting cr0 properly, the vcpu was running the SIPI bootstrap routine
with paging enabled, making the vcpu get a pagefault exception while trying to
run it.

Instead of setting vmcb->save.cr0 directly, the new code just resets
kvm->arch.cr0 and calls kvm_set_cr0(). The bits that were set/cleared on
vmcb->save.cr0 (PG, WP, !CD, !NW) will be set properly by svm_set_cr0().

kvm_set_cr0() is used instead of calling svm_set_cr0() directly to make sure
kvm_mmu_reset_context() is called to reset the mmu to nonpaging mode.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: VMX: Use macros instead of hex value on cr0 initialization
Eduardo Habkost [Sat, 24 Oct 2009 04:49:58 +0000 (02:49 -0200)]
KVM: VMX: Use macros instead of hex value on cr0 initialization

This should have no effect, it is just to make the code clearer.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: Enable 32bit dirty log pointers on 64bit host
Arnd Bergmann [Thu, 22 Oct 2009 12:19:27 +0000 (14:19 +0200)]
KVM: Enable 32bit dirty log pointers on 64bit host

With big endian userspace, we can't quite figure out if a pointer
is 32 bit (shifted >> 32) or 64 bit when we read a 64 bit pointer.

This is what happens with dirty logging. To get the pointer interpreted
correctly, we thus need Arnd's patch to implement a compat layer for
the ioctl:

A better way to do this is to add a separate compat_ioctl() method that
converts this for you.

Based on initial patch from Arnd Bergmann.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: allow userspace to adjust kvmclock offset
Glauber Costa [Fri, 16 Oct 2009 19:28:36 +0000 (15:28 -0400)]
KVM: allow userspace to adjust kvmclock offset

When we migrate a kvm guest that uses pvclock between two hosts, we may
suffer a large skew. This is because there can be significant differences
between the monotonic clock of the hosts involved. When a new host with
a much larger monotonic time starts running the guest, the view of time
will be significantly impacted.

Situation is much worse when we do the opposite, and migrate to a host with
a smaller monotonic clock.

This proposed ioctl will allow userspace to inform us of the monotonic
clock value in the source host, so we can keep the time skew short and,
more importantly, ensure time never goes backwards. Userspace may also need to trigger
the current data, since from the first migration onwards, it won't be
reflected by a simple call to clock_gettime() anymore.
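
The arithmetic behind this can be sketched as an offset applied to the
host's monotonic clock. The struct, the function names, and the explicit
clamp against going backwards are assumptions for illustration, not the
interface added by the patch.

```c
#include <stdint.h>

struct guest_clock {
    uint64_t offset_ns;  /* source clock minus host clock, modulo 2^64 */
    uint64_t last_ns;    /* last value handed to the guest */
};

/* On the destination: adopt the clock value reported by the source. */
static void clock_set(struct guest_clock *c, uint64_t source_ns,
                      uint64_t host_mono_ns)
{
    c->offset_ns = source_ns - host_mono_ns;
    c->last_ns = source_ns;
}

/* Guest-visible time: host monotonic time plus offset, never decreasing. */
static uint64_t clock_get(struct guest_clock *c, uint64_t host_mono_ns)
{
    uint64_t now = host_mono_ns + c->offset_ns;

    if (now < c->last_ns)
        now = c->last_ns;
    c->last_ns = now;
    return now;
}
```

Unsigned wraparound makes the same code work whether the destination's
monotonic clock is larger or smaller than the source's.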

[marcelo: future-proof abi with a flags field]
[jan: fix KVM_GET_CLOCK by clearing flags field instead of checking it]

Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: fix irq_source_id size verification
Marcelo Tosatti [Sun, 18 Oct 2009 01:47:23 +0000 (22:47 -0300)]
KVM: fix irq_source_id size verification

find_first_zero_bit works with bit numbers, not bytes.
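
The mix-up is easy to reproduce with a small stand-in for the bitmap search
(first_zero_bit() below is a simplified userspace imitation, not the kernel
helper): passing sizeof() gives a size in bytes, so only the first few bits
are ever examined.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Index of the first clear bit, or nbits if all nbits bits are set. */
static size_t first_zero_bit(const unsigned long *map, size_t nbits)
{
    for (size_t i = 0; i < nbits; i++) {
        if (!(map[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG))))
            return i;
    }
    return nbits;
}

/* The bitmap is full only when no clear bit exists below nbits. */
static int bitmap_is_full(const unsigned long *map, size_t nbits)
{
    return first_zero_bit(map, nbits) == nbits;
}
```

With the low sizeof(unsigned long) bits set, bitmap_is_full(&map,
sizeof(map)) wrongly reports a full bitmap, while passing sizeof(map) *
CHAR_BIT finds the free bits above them.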

Fixes

https://sourceforge.net/tracker/?func=detail&aid=2847560&group_id=180599&atid=893831

Reported-by: "Xu, Jiajun" <jiajun.xu@intel.com>
Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: SVM: Cleanup NMI singlestep
Jan Kiszka [Sun, 18 Oct 2009 11:24:54 +0000 (13:24 +0200)]
KVM: SVM: Cleanup NMI singlestep

Push the NMI-related singlestep variable into vcpu_svm. It's dealing
with an AMD-specific deficit, nothing generic for x86.

Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
 arch/x86/include/asm/kvm_host.h |    1 -
 arch/x86/kvm/svm.c              |   12 +++++++-----
 2 files changed, 7 insertions(+), 6 deletions(-)
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86: Fix guest single-stepping while interruptible
Jan Kiszka [Sun, 18 Oct 2009 11:24:44 +0000 (13:24 +0200)]
KVM: x86: Fix guest single-stepping while interruptible

Commit 705c5323 opened the doors of hell by unconditionally injecting
single-step flags as long as guest_debug signaled this. This doesn't
work when the guest branches into some interrupt or exception handler
and triggers a vmexit with flag reloading.

Fix it by saving cs:rip when user space requests single-stepping and
restricting the trace flag injection to this guest code position.
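
In outline, the fix records where single-stepping was requested and only
sets the trap flag when the guest is about to resume at that exact position.
The sketch below uses hypothetical names and a bare cs/rip pair rather than
the real vcpu state.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_TF (1ULL << 8)  /* trap flag */

struct vcpu_dbg {
    bool singlestep;  /* user space asked for single-stepping */
    uint16_t ss_cs;   /* code segment when it was requested */
    uint64_t ss_rip;  /* instruction pointer when it was requested */
};

/* Remember the guest code position at request time. */
static void request_singlestep(struct vcpu_dbg *d, uint16_t cs, uint64_t rip)
{
    d->singlestep = true;
    d->ss_cs = cs;
    d->ss_rip = rip;
}

/*
 * Flags to load on guest entry: inject TF only at the recorded position,
 * so a handler the guest branched into is not traced by accident.
 */
static uint64_t flags_for_entry(const struct vcpu_dbg *d, uint16_t cs,
                                uint64_t rip, uint64_t rflags)
{
    if (d->singlestep && cs == d->ss_cs && rip == d->ss_rip)
        rflags |= X86_EFLAGS_TF;
    return rflags;
}
```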

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Xen PV-on-HVM guest support
Ed Swierk [Thu, 15 Oct 2009 22:21:43 +0000 (15:21 -0700)]
KVM: Xen PV-on-HVM guest support

Support for Xen PV-on-HVM guests can be implemented almost entirely in
userspace, except for handling one annoying MSR that maps a Xen
hypercall blob into guest address space.

A generic mechanism to delegate MSR writes to userspace seems overkill
and risks encouraging similar MSR abuse in the future.  Thus this patch
adds special support for the Xen HVM MSR.

I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell
KVM which MSR the guest will write to, as well as the starting address
and size of the hypercall blobs (one each for 32-bit and 64-bit) that
userspace has loaded from files.  When the guest writes to the MSR, KVM
copies one page of the blob from userspace to the guest.

I've tested this patch with a hacked-up version of Gerd's userspace
code, booting a number of guests (CentOS 5.3 i386 and x86_64, and
FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.

[jan: fix i386 build warning]
[avi: future proof abi with a flags field]

Signed-off-by: Ed Swierk <eswierk@aristanetworks.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: x86: Drop unneeded CONFIG_HAS_IOMEM check
Jan Kiszka [Mon, 12 Oct 2009 06:51:40 +0000 (08:51 +0200)]
KVM: x86: Drop unneeded CONFIG_HAS_IOMEM check

This (broken) check dates back to the days when this code was shared
across architectures. x86 has IOMEM, so drop it.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>