The code that hashes and unhashes connections from the connection table
is missing locking of the connection being modified, which opens up a
race condition and results in memory corruption when this race condition
is hit.
Here is what happens in pretty verbose form:
CPU 0 CPU 1
------------ ------------
An active connection is terminated and
we schedule ip_vs_conn_expire() on this
CPU to expire this connection.
IRQ assignment is changed to this CPU,
but the expire timer stays scheduled on
the other CPU.
New connection from same ip:port comes
in right before the timer expires, we
find the inactive connection in our
connection table and get a reference to
it. We proper lock the connection in
tcp_state_transition() and read the
connection flags in set_tcp_state().
ip_vs_conn_expire() gets called, we
unhash the connection from our
connection table and remove the hashed
flag in ip_vs_conn_unhash(), without
proper locking!
While still holding proper locks we
write the connection flags in
set_tcp_state() and this sets the hashed
flag again.
ip_vs_conn_expire() fails to expire the
connection, because the other CPU has
incremented the reference count. We try
to re-insert the connection into our
connection table, but this fails in
ip_vs_conn_hash(), because the hashed
flag has been set by the other CPU. We
re-schedule execution of
ip_vs_conn_expire(). Now this connection
has the hashed flag set, but isn't
actually hashed in our connection table
and has a dangling list_head.
We drop the reference we held on the
connection and schedule the expire timer
for timeouting the connection on this
CPU. Further packets won't be able to
find this connection in our connection
table.
ip_vs_conn_expire() gets called again,
we think it's already hashed, but the
list_head is dangling and while removing
the connection from our connection table
we write to the memory location where
this list_head points to.
The result will probably be a kernel oops at some other point in time.
This race condition is pretty subtle, but it can be triggered remotely.
It needs the IRQ assignment change or another circumstance where packets
coming from the same ip:port for the same service are being processed on
different CPUs. And it involves hitting the exact time at which
ip_vs_conn_expire() gets called. It can be avoided by making sure that
all packets from one connection are always processed on the same CPU and
can be made harder to exploit by changing the connection timeouts to
some custom values.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The notifier for address down should only be called if address is completely
gone, not just being marked as tentative on link transition. The code
in net-next would case bonding/sctp/s390 to see address disappear on link
down, but they would never see it reappear on link up.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Recent changes preserve IPv6 address when link goes down (good).
But would cause address to point to dead dst entry (bad).
The simplest fix is to just not delete route if address is
being held for later use.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix subsequent suspends by issuing tpm_continue_selftest during resume.
Otherwise, the tpm chip seems to be not fully initialized and will reject
the save state command during suspend, thus preventing the whole system
to suspend.
You generally have two cases where DDC lines are shared:
- HDMI + VGA
- HDMI + DVI-D
HDMI + VGA is easy to deal with because you can check the EDID for the
to see if the attached monitor is digital. A shared DDC line with two
digital connectors is more complex. You can't use the hdmi bits in the
EDID since they may not be there with DVI<->HDMI adapters. In this case
all we can do is check the HPD pins to see which is connected as we have
no way of knowing using the EDID.
Reported-by: trapdoor6@gmail.com Signed-off-by: Alex Deucher <alexdeucher@gmail.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Code did not handle projected 2d and depth coordinates, meaning potentially
set 3d or cube special handling might stick.
(Not sure what depth coord actually does, but I guess handling it
like a normal coordinate is the right thing to do.)
Might be related to https://bugs.freedesktop.org/show_bug.cgi?id=26428
Signed-off-by: sroland@vmware.com Signed-off-by: Alex Deucher <alexdeucher@gmail.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fixes an Ironlake laptop with a 68.940MHz 1280x800 panel and 120MHz SSC
reference clock.
More generally, the 0.488% tolerance used before is just too tight to
reliably find a PLL setting. I extracted the search algorithm and
modified it to find the dot clocks with maximum error over the valid
range for the given output type:
A lot of 945GMs have had stability issues for a long time, this manifested as X hangs, blitter engine hangs, and lots of crashes.
one such report is at:
https://bugs.freedesktop.org/show_bug.cgi?id=20560
along with numerous distro bugzillas.
This only took a week of digging and hair ripping to figure out.
Tracked down and tested on a 945GM Lenovo T60,
previously running
x11perf -copypixwin500
or
x11perf -copywinpix500
repeatedly would cause the GPU to wedge within 4 or 5 tries, with random busy bits set.
After this patch no hangs were observed.
Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The hibernate issues that got fixed in commit 985b823b9192 ("drm/i915:
fix hibernation since i915 self-reclaim fixes") turn out to have been
incomplete. Vefa Bicakci tested lots of hibernate cycles, and without
the __GFP_RECLAIMABLE flag the system eventually fails to resume.
With the flag added, Vefa can apparently hibernate forever (or until he
gets bored running his automated scripts, whichever comes first).
The reclaimable flag was there originally, and was one of the flags that
were dropped (unintentionally) by commit 4bdadb978569 ("drm/i915:
Selectively enable self-reclaim") that introduced all these problems,
but I didn't want to just blindly add back all the flags in commit 985b823b9192, and it looked like __GFP_RECLAIM wasn't necessary. It
clearly was.
I still suspect that there is some subtle reason we're missing that
causes the problems, but __GFP_RECLAIMABLE is certainly not wrong to use
in this context, and is what the code historically used. And we have no
idea what the causes the corruption without it.
Reported-and-tested-by: M. Vefa Bicakci <bicave@superonline.com> Cc: Dave Airlie <airlied@gmail.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The register offset for FW_BLC_SELF is a totally different set of bits
on Broadwater (it's actually MI_RDRET_STATE), so don't treat it like
FW_BLC_SELF on 965G chips.
Since commit 4bdadb9785696439c6e2b3efe34aa76df1149c83 ("drm/i915:
Selectively enable self-reclaim"), we've been passing GFP_MOVABLE to the
i915 page allocator where we weren't before due to some over-eager
removal of the page mapping gfp_flags games the code used to play.
This caused hibernate on Intel hardware to result in a lot of memory
corruptions on resume. See for example
http://bugzilla.kernel.org/show_bug.cgi?id=13811
Reported-by: Evengi Golov (in bugzilla) Signed-off-by: Dave Airlie <airlied@redhat.com> Tested-by: M. Vefa Bicakci <bicave@superonline.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Move the call to ddebug_remove_module() down into free_module(). In this
way it should be called from all error paths. Currently, we are missing
the remove if the module init routine fails.
Signed-off-by: Jason Baron <jbaron@redhat.com> Reported-by: Thomas Renninger <trenn@suse.de> Tested-by: Thomas Renninger <trenn@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Based on Intel Vol3b (March 2010), the event
SNOOPQ_REQUEST_OUTSTANDING is restricted to counters 0,1 so
update the event table for Intel Westmere accordingly.
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
ocfs2_zero_extend() does its zeroing block by block, but it calls a
function named ocfs2_write_zero_page(). Let's have
ocfs2_write_zero_page() handle the page level. From
ocfs2_zero_extend()'s perspective, it is now page-at-a-time.
Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When ocfs2 fills a hole, it does so by allocating clusters. When a
cluster is larger than the write, ocfs2 must zero the portions of the
cluster outside of the write. If the clustersize is smaller than a
pagecache page, this is handled by the normal pagecache mechanisms, but
when the clustersize is larger than a page, ocfs2's write code will zero
the pages adjacent to the write. This makes sure the entire cluster is
zeroed correctly.
Currently ocfs2 behaves exactly the same when writing past i_size.
However, this means ocfs2 is writing zeroed pages for portions of a new
cluster that are beyond i_size. The page writeback code isn't expecting
this. It treats all pages past the one containing i_size as left behind
due to a previous truncate operation.
Thankfully, ocfs2 calculates the number of pages it will be working on
up front. The rest of the write code merely honors the original
calculation. We can simply trim the number of pages to only cover the
actual file data.
Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
1. The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check
whether the donor file is append-only before writing to it.
2. The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer
overflow that allows a user to specify an out-of-bounds range to copy
from the source file (if off + len wraps around). I haven't been able
to successfully exploit this, but I'd imagine that a clever attacker
could use this to read things he shouldn't. Even if it's not
exploitable, it couldn't hurt to be safe.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch adds a missing element of the ReadPubEK command output,
that prevents future overflow of this buffer when copying the
TPM output result into it.
Prevents a kernel panic in case the user tries to read the
pubek from sysfs.
Use an irq spinlock to hold off the IRQ handler until
enough early card init is complete such that the handler
can run without faulting.
Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If bit 29 is set, MAC H/W can attempt to decrypt the received aggregate
with WEP or TKIP, eventhough the received frame may be a CRC failed
corrupted frame. If this bit is set, H/W obeys key type in keycache.
If it is not set and if the key type in keycache is neither open nor
AES, H/W forces key type to be open. But bit 29 should be set to 1
for AsyncFIFO feature to encrypt/decrypt the aggregate with WEP or TKIP.
Reported-by: Johan Hovold <johan.hovold@lundinova.se> Signed-off-by: Vivek Natarajan <vnatarajan@atheros.com> Signed-off-by: Ranga Rao Ravuri <ranga.ravuri@atheros.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Compiling in the MPC5200 sound drivers results in the following build error:
sound/soc/fsl/mpc5200_psc_ac97.o: In function `to_psc_dma_stream':
mpc5200_psc_ac97.c:(.text+0x0): multiple definition of `to_psc_dma_stream'
sound/soc/fsl/mpc5200_dma.o:mpc5200_dma.c:(.text+0x0): first defined here
sound/soc/fsl/efika-audio-fabric.o: In function `to_psc_dma_stream':
efika-audio-fabric.c:(.text+0x0): multiple definition of `to_psc_dma_stream'
sound/soc/fsl/mpc5200_dma.o:mpc5200_dma.c:(.text+0x0): first defined here
make[3]: *** [sound/soc/fsl/built-in.o] Error 1
make[2]: *** [sound/soc/fsl] Error 2
make[1]: *** [sound/soc] Error 2
make: *** [sound] Error 2
This patch fixes it by declaring the inline function in the header file to
also be a static.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Jon Smirl <jonsmirl@gmail.com> Tested-by: John Hilmar Linkhorst <John.Linkhorst@rwth-aachen.de> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Peter Korsgaard <jacmet@sunsite.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If the attempt to read the calldir fails, then instead of storing the read
bytes, we currently discard them. This leads to a garbage final result when
upon re-entry to the same routine, we read the remaining bytes.
Fixes the regression in bugzilla number 16213. Please see
https://bugzilla.kernel.org/show_bug.cgi?id=16213
When implementing the test_iqr() method, I forgot that this driver is not an
ordinary PCI driver and also needs to support VLB variant of the chip. Moreover,
'hwif->dev' should be NULL, potentially causing oops in pci_read_config_byte().
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The kernel's math-emu code contains a macro _FP_FROM_INT() which is
used to convert an integer to a raw normalized floating-point value.
It does this basically in three steps:
1. Compute the exponent from the number of leading zero bits.
2. Downshift large fractions to put the MSB in the right position
for normalized fractions.
3. Upshift small fractions to put the MSB in the right position.
There is an boundary error in step 2, causing a fraction with its
MSB exactly one bit above the normalized MSB position to not be
downshifted. This results in a non-normalized raw float, which when
packed becomes a massively inaccurate representation for that input.
The impact of this depends on a number of arch-specific factors,
but it is known to have broken emulation of FXTOD instructions
on UltraSPARC III, which was originally reported as GCC bug 44631
<http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44631>.
Any arch which uses math-emu to emulate conversions from integers to
same-size floats may be affected.
The fix is simple: the exponent comparison used to determine if the
fraction should be downshifted must be "<=" not "<".
I'm sending a kernel module to test this as a reply to this message.
There are also SPARC user-space test cases in the GCC bug entry.
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When configuring DMVPN (GRE + openNHRP) and a GRE remote
address is configured a kernel Oops is observed. The
obserseved Oops is caused by a NULL header_ops pointer
(neigh->dev->header_ops) in neigh_update_hhs() when
is executed. The dev associated with the NULL header_ops is
the GRE interface. This patch guards against the
possibility that header_ops is NULL.
This Oops was first observed in kernel version 2.6.26.8.
Signed-off-by: Doug Kehn <rdkehn@yahoo.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It can happen that there are no packets in queue while calling
tcp_xmit_retransmit_queue(). tcp_write_queue_head() then returns
NULL and that gets deref'ed to get sacked into a local var.
There is no work to do if no packets are outstanding so we just
exit early.
This oops was introduced by 08ebd1721ab8fd (tcp: remove tp->lost_out
guard to make joining diff nicer).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reported-by: Lennart Schulte <lennart.schulte@nets.rwth-aachen.de> Tested-by: Lennart Schulte <lennart.schulte@nets.rwth-aachen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix problem in reading the tx_queue recorded in a socket. In
dev_pick_tx, the TX queue is read by doing a check with
sk_tx_queue_recorded on the socket, followed by a sk_tx_queue_get.
The problem is that there is not mutual exclusion across these
calls in the socket so it it is possible that the queue in the
sock can be invalidated after sk_tx_queue_recorded is called so
that sk_tx_queue get returns -1, which sets 65535 in queue_index
and thus dev_pick_tx returns 65536 which is a bogus queue and
can cause crash in dev_queue_xmit.
We fix this by only calling sk_tx_queue_get which does the proper
checks. The interface is that sk_tx_queue_get returns the TX queue
if the sock argument is non-NULL and TX queue is recorded, else it
returns -1. sk_tx_queue_recorded is no longer used so it can be
completely removed.
Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sky2_phy_reinit is called by the ethtool helpers sky2_set_settings,
sky2_nway_reset and sky2_set_pauseparam when netif_running.
However, at the end of sky2_phy_init GM_GP_CTRL has GM_GPCR_RX_ENA and
GM_GPCR_TX_ENA cleared. So, doing these commands causes the device to
stop working:
$ ethtool -r eth0
$ ethtool -A eth0 autoneg off
Fix this issue by enabling Rx/Tx after running sky2_phy_init in
sky2_phy_reinit.
Signed-off-by: Brandon Philips <bphilips@suse.de> Tested-by: Brandon Philips <bphilips@suse.de> Cc: stable@kernel.org Tested-by: Mike McCormack <mikem@ring3k.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Multicast settings will be lost on reset, so restore them.
Signed-off-by: Mike McCormack <mikem@ring3k.org> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If the call to phy_connect fails, we will return directly instead of freeing
the previously allocated struct net_device.
Signed-off-by: Florian Fainelli <florian@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Many codecs now clear the pin controls at suspend via snd_hda_shutup_pins()
for reducing the click noise at power-off. But this leaves some pins
uninitialized, and they'll be never recovered after resume.
This patch adds the proper recovery of cleared pin controls on resume.
Also it adds a check of bus->shutdown so that pins won't be cleared at
module unloading.
Fix the security problem in the CIFS filesystem DNS lookup code in which a
malicious redirect could be installed by a random user by simply adding a
result record into one of their keyrings with add_key() and then invoking a
CIFS CFS lookup [CVE-2010-2524].
This is done by creating an internal keyring specifically for the caching of
DNS lookups. To enforce the use of this keyring, the module init routine
creates a set of override credentials with the keyring installed as the thread
keyring and instructs request_key() to only install lookup result keys in that
keyring.
The override is then applied around the call to request_key().
This has some additional benefits when a kernel service uses this module to
request a key:
(1) The result keys are owned by root, not the user that caused the lookup.
(2) The result keys don't pop up in the user's keyrings.
(3) The result keys don't come out of the quota of the user that caused the
lookup.
The keyring can be viewed as root by doing cat /proc/keys:
# keyctl list 0x2a0ca6c3
1 key in keyring: 726766307: --alswrv 0 0 dns_resolver: foo.bar.com
Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com> Acked-by: Steve French <smfrench@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This bug appears to be the result of a cut-and-paste mistake from the
NTLMv1 code. The function to generate the MAC key was commented out, but
not the conditional above it. The conditional then ended up causing the
session setup key not to be copied to the buffer unless this was the
first session on the socket, and that made all but the first NTLMv2
session setup fail.
Fix this by removing the conditional and all of the commented clutter
that made it difficult to see.
The IT8720F has no VIN7 pin, so VCCH should always be routed
internally to VIN7 with an internal divider. Curiously, there still
is a configuration bit to control this, which means it can be set
incorrectly. And even more curiously, many boards out there are
improperly configured, even though the IT8720F datasheet claims that
the internal routing of VCCH to VIN7 is the default setting. So we
force the internal routing in this case.
It turns out that all boards with the wrong setting are from Gigabyte,
so I suspect a BIOS bug. But it's easy enough to workaround in the
driver, so let's do it.
i5k_amb.ko uses dynamically allocated memory (by kmalloc) for
attributes passed to sysfs. So, sysfs_attr_init() should be called
for working happy with lockdep.
Reported temperature for ASB1 CPUs is too high.
Add ASB1 CPU revisions (these are also non-desktop variants) to the
list of CPUs for which the temperature fixup is not required.
Example: (from LENOVO ThinkPad Edge 13, 01972NG, system was idle)
Current kernel reports
$ sensors
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +74.0 C
Core0 Temp: +70.0 C
Core1 Temp: +69.0 C
Core1 Temp: +70.0 C
With this patch I have
$ sensors
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +54.0 C
Core0 Temp: +51.0 C
Core1 Temp: +48.0 C
Core1 Temp: +49.0 C
Cc: Rudolf Marek <r.marek@assembler.cz> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit a2e066bba2aad6583e3ff648bf28339d6c9f0898 introduced core
swapping for CPU models 64 and later. I recently had a report about
a Sempron 3200+, model 95, for which this patch broke temperature
reading. It happens that this is a single-core processor, so the
effect of the swapping was to read a temperature value for a core
that didn't exist, leading to an incorrect value (-49 degrees C.)
Disabling core swapping on singe-core processors should fix this.
Additional comment from Andreas:
The BKDG says
Thermal Sensor Core Select (ThermSenseCoreSel)-Bit 2. This bit
selects the CPU whose temperature is reported in the CurTemp
field. This bit only applies to dual core processors. For
single core processors CPU0 Thermal Sensor is always selected.
k8temp_probe() correctly detected that SEL_CORE can't be used on single
core CPU. Thus k8temp did never update the temperature values stored
in temp[1][x] and -49 degrees was reported. For single core CPUs we
must use the values read into temp[0][x].
Signed-off-by: Jean Delvare <khali@linux-fr.org> Tested-by: Rick Moritz <rhavin@gmx.net> Acked-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Christoph Fritz [Sun, 11 Jul 2010 23:26:15 +0000 (18:26 -0500)]
ssb: Handle Netbook devices where the SPROM address is changed
For some Netbook computers with Broadcom BCM4312 wireless interfaces,
the SPROM has been moved to a new location. When the ssb driver tries to
read the old location, the systems hangs when trying to read a
non-existent location. Such freezes are particularly bad as they do not
log the failure.
This patch is modified from commit da1fdb02d9200ff28b6f3a380d21930335fe5429 with some pieces from other
mainline changes so that it can be applied to stable 2.6.34.Y.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
For some reason one of the changes to sys_perf_event_open() got
mis-applied, thus breaking (at least) error handling paths (pointed
out by means of a compiler warning).
Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
netdev_printk() follows the net_device's parent device pointer, so
we must set that earlier than we previously did.
Reported-by: Luís Picciochi Oliveira <pitxyoki@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Its better to make a route lookup in appropriate namespace.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit 33ad798c924b4a (tcp: options clean up) introduced a problem
if MD5+SACK+timestamps were used in initial SYN message.
Some stacks (old linux for example) try to negotiate MD5+SACK+TSTAMP
sessions, but since 40 bytes of tcp options space are not enough to
store all the bits needed, we chose to disable timestamps in this case.
We send a SYN-ACK _without_ timestamp option, but socket has timestamps
enabled and all further outgoing messages contain a TS block, all with
the initial timestamp of the remote peer.
Fix is to really disable timestamps option for the whole session.
Reported-by: Bijay Singh <Bijay.Singh@guavus.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Realtek confirmed that a 20us delay is needed after mdio_read and
mdio_write operations. Reduce the delay in mdio_write, and add it
to mdio_read too. Also add a comment that the 20us is from hw specs.
Signed-off-by: Timo Teräs <timo.teras@iki.fi> Acked-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Some configurations need delay between the "write completed" indication
and new write to work reliably.
Realtek driver seems to use longer delay when polling the "write complete"
bit, so it waits long enough between writes with high probability (but
could probably break too). This patch adds a new udelay to make sure we
wait unconditionally some time after the write complete indication.
This caused a regression with XID 18000000 boards when the board specific
phy configuration writing many mdio registers was added in commit 2e955856ff (r8169: phy init for the 8169scd). Some of the configration
mdio writes would almost always fail, and depending on failure might leave
the PHY in non-working state.
Signed-off-by: Timo Teräs <timo.teras@iki.fi> Acked-off-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit f4f914b5 (net: ipv6 bind to device issue) caused
a regression with Mobile IPv6 when it changed the meaning
of fl->oif to become a strict requirement of the route
lookup. Instead, only force strict mode when
sk->sk_bound_dev_if is set on the calling socket, getting
the intended behavior and fixing the regression.
Tested-by: Arnaud Ebalard <arno@natisbad.org> Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When GRO produces fraglist entries, and the resulting skb hits
an interface that is incapable of TSO but capable of FRAGLIST,
we end up producing a bogus packet with gso_size non-zero.
This was reported in the field with older versions of KVM that
did not set the TSO bits on tuntap.
This patch fixes that.
Reported-by: Igor Zhang <yugzhang@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Because MIPS's EDQUOT value is 1133(0x46d).
It's larger than u8.
Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It is common in end-node, non STP bridges to set forwarding
delay to zero; which causes the forwarding database cleanup
to run every clock tick. Change to run only as soon as needed
or at next ageing timer interval which ever is sooner.
Use round_jiffies_up macro rather than attempting round up
by changing value.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We currently fill all of RX ring, then add_buf
returns ENOSPC, which gets mis-detected as an out of
memory condition and causes us to reschedule the work,
and so on forever. Fix this by oom = err == -ENOMEM;
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
add_buf returns ring size on out of memory,
this is not what devices expect.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
virtio-pci resets the device at startup by writing to the status
register, but this does not clear the pci config space,
specifically msi enable status which affects register
layout.
This breaks things like kdump when they try to use e.g. virtio-blk.
Fix by forcing msi off at startup. Since pci.c already has
a routine to do this, we export and use it instead of duplicating code.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: linux-pci@vger.kernel.org Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Clear the floating point exception flag before returning to
user space. This is needed, else the libc trampoline handler
may hit the same SIGFPE again while building up a trampoline
to a signal handler.
Joerg Roedel [Wed, 5 May 2010 14:04:45 +0000 (16:04 +0200)]
KVM: SVM: Don't allow nested guest to VMMCALL into host
This patch disables the possibility for a l2-guest to do a
VMMCALL directly into the host. This would happen if the
l1-hypervisor doesn't intercept VMMCALL and the l2-guest
executes this instruction.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
(Cherry-picked from commit 0d945bd9351199744c1e89d57a70615b6ee9f394)
Joerg Roedel [Thu, 6 May 2010 09:38:43 +0000 (11:38 +0200)]
KVM: x86: Inject #GP with the right rip on efer writes
This patch fixes a bug in the KVM efer-msr write path. If a
guest writes to a reserved efer bit the set_efer function
injects the #GP directly. The architecture dependent wrmsr
function does not see this, assumes success and advances the
rip. This results in a #GP in the guest with the wrong rip.
This patch fixes this by reporting efer write errors back to
the architectural wrmsr function.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
(Cherry-picked from commit b69e8caef5b190af48c525f6d715e7b7728a77f6)
Avi Kivity [Tue, 4 May 2010 12:00:37 +0000 (15:00 +0300)]
KVM: Fix wallclock version writing race
Wallclock writing uses an unprotected global variable to hold the version;
this can cause one guest to interfere with another if both write their
wallclock at the same time.
Acked-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
(Cherry-picked from commit 9ed3c444ab8987c7b219173a2f7807e3f71e234e)
Shane Wang [Thu, 29 Apr 2010 16:09:01 +0000 (12:09 -0400)]
KVM: VMX: enable VMXON check with SMX enabled (Intel TXT)
Per document, for feature control MSR:
Bit 1 enables VMXON in SMX operation. If the bit is clear, execution
of VMXON in SMX operation causes a general-protection exception.
Bit 2 enables VMXON outside SMX operation. If the bit is clear, execution
of VMXON outside SMX operation causes a general-protection exception.
This patch is to enable this kind of check with SMX for VMXON in KVM.
Signed-off-by: Shane Wang <shane.wang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
(Cherry-picked from commit cafd66595d92591e4bd25c3904e004fc6f897e2d)
Avi Kivity [Wed, 12 May 2010 08:48:18 +0000 (11:48 +0300)]
KVM: MMU: Segregate shadow pages with different cr0.wp
When cr0.wp=0, we may shadow a gpte having u/s=1 and r/w=0 with an spte
having u/s=0 and r/w=1. This allows excessive access if the guest sets
cr0.wp=1 and accesses through this spte.
Fix by making cr0.wp part of the base role; we'll have different sptes for
the two cases and the problem disappears.
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
(Cherry-picked from commit 3dbe141595faa48a067add3e47bba3205b79d33c)
Glauber Costa [Tue, 11 May 2010 16:17:40 +0000 (12:17 -0400)]
x86, paravirt: Add a global synchronization point for pvclock
In recent stress tests, it was found that pvclock-based systems
could seriously warp in smp systems. Using ingo's time-warp-test.c,
I could trigger a scenario as bad as 1.5mi warps a minute in some systems.
(to be fair, it wasn't that bad in most of them). Investigating further, I
found out that such warps were caused by the very offset-based calculation
pvclock is based on.
This happens even on some machines that report constant_tsc in its tsc flags,
specially on multi-socket ones.
Two reads of the same kernel timestamp at approx the same time, will likely
have tsc timestamped in different occasions too. This means the delta we
calculate is unpredictable at best, and can probably be smaller in a cpu
that is legitimately reading clock in a forward ocasion.
Some adjustments on the host could make this window less likely to happen,
but still, it pretty much poses as an intrinsic problem of the mechanism.
A while ago, I though about using a shared variable anyway, to hold clock
last state, but gave up due to the high contention locking was likely
to introduce, possibly rendering the thing useless on big machines. I argue,
however, that locking is not necessary.
We do a read-and-return sequence in pvclock, and between read and return,
the global value can have changed. However, it can only have changed
by means of an addition of a positive value. So if we detected that our
clock timestamp is less than the current global, we know that we need to
return a higher one, even though it is not exactly the one we compared to.
OTOH, if we detect we're greater than the current time source, we atomically
replace the value with our new readings. This do causes contention on big
boxes (but big here means *BIG*), but it seems like a good trade off, since
it provide us with a time source guaranteed to be stable wrt time warps.
After this patch is applied, I don't see a single warp in time during 5 days
of execution, in any of the machines I saw them before.
Signed-off-by: Glauber Costa <glommer@redhat.com> Acked-by: Zachary Amsden <zamsden@redhat.com> CC: Jeremy Fitzhardinge <jeremy@goop.org> CC: Avi Kivity <avi@redhat.com> CC: Marcelo Tosatti <mtosatti@redhat.com> CC: Zachary Amsden <zamsden@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
(Cherry-picked from commit 489fb490dbf8dab0249ad82b56688ae3842a79e8)
KVM: SVM: Report emulated SVM features to userspace
This patch implements the reporting of the emulated SVM
features to userspace instead of the real hardware
capabilities. Every real hardware capability needs emulation
in nested svm so the old behavior was broken.
Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
(Cherry-picked from commit c2c63a493924e09a1984d1374a0e60dfd54fc0b0)
KVM: x86: Add callback to let modules decide over some supported cpuid bits
This patch adds the get_supported_cpuid callback to
kvm_x86_ops. It will be used in do_cpuid_ent to delegate the
decission about some supported cpuid bits to the
architecture modules.
Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
(Cherry-picked from commit d4330ef2fb2236a1e3a176f0f68360f4c0a8661b)
Joerg Roedel [Fri, 19 Feb 2010 15:23:01 +0000 (16:23 +0100)]
KVM: SVM: Fix wrong interrupt injection in enable_irq_windows
The nested_svm_intr() function does not execute the vmexit
anymore. Therefore we may still be in the nested state after
that function ran. This patch changes the nested_svm_intr()
function to return wether the irq window could be enabled.
Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
(Cherry-picked from commit 8fe546547cf6857a9d984bfe2f2194910f3fc5d0)
Joerg Roedel [Fri, 19 Feb 2010 15:23:06 +0000 (16:23 +0100)]
KVM: SVM: Don't sync nested cr8 to lapic and back
This patch makes syncing of the guest tpr to the lapic
conditional on !nested. Otherwise a nested guest using the
TPR could freeze the guest.
Another important change this patch introduces is that the
cr8 intercept bits are no longer ORed at vmrun emulation if
the guest sets VINTR_MASKING in its VMCB. The reason is that
nested cr8 accesses need alway be handled by the nested
hypervisor because they change the shadow version of the
tpr.
Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
(Cherry-picked from commit 88ab24adc7142506c8583ac36a34fa388300b750)
Joerg Roedel [Fri, 19 Feb 2010 15:23:05 +0000 (16:23 +0100)]
KVM: SVM: Fix nested msr intercept handling
The nested_svm_exit_handled_msr() function maps only one
page of the guests msr permission bitmap. This patch changes
the code to use kvm_read_guest to fix the bug.
Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
(Cherry-picked from commit 4c7da8cb43c09e71a405b5aeaa58a1dbac3c39e9)
Joerg Roedel [Fri, 19 Feb 2010 15:23:03 +0000 (16:23 +0100)]
KVM: SVM: Sync all control registers on nested vmexit
Currently the vmexit emulation does not sync control
registers were the access is typically intercepted by the
nested hypervisor. But we can not count on that intercepts
to sync these registers too and make the code
architecturally more correct.
Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
(Cherry-picked from commit cdbbdc1210223879450555fee04c29ebf116576b)
Joerg Roedel [Fri, 19 Feb 2010 15:23:00 +0000 (16:23 +0100)]
KVM: SVM: Don't use kmap_atomic in nested_svm_map
Use of kmap_atomic disables preemption but if we run in
shadow-shadow mode the vmrun emulation executes kvm_set_cr3
which might sleep or fault. So use kmap instead for
nested_svm_map.
Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
(Cherry-picked from commit 7597f129d8b6799da7a264e6d6f7401668d3a36d)
J.R. Okajima reports that the call to sync_inode() in nfs_wb_page() can
deadlock with other writeback flush calls. It boils down to the fact
that we cannot ever call writeback_single_inode() while holding a page
lock (even if we do set nr_to_write to zero) since another process may
already be waiting in the call to do_writepages(), and so will deny us
the I_SYNC lock.
If we exit from nfs_commit_inode() without ensuring that the COMMIT rpc
call has been completed, we must re-mark the inode as dirty. Otherwise,
future calls to sync_inode() with the WB_SYNC_ALL flag set will fail to
ensure that the data is on the disk.