IPv6 Routing table becomes broken once we do ifdown, ifup of the loopback(lo)
interface. After down-up, routes of other interface's IPv6 addresses through
'lo' are lost.
IPv6 addresses assigned to all interfaces are routed through 'lo' for internal
communication. Once 'lo' is down, those routing entries are removed from routing
table. But those removed entries are not being re-created properly when 'lo' is
brought up. So IPv6 addresses of other interfaces becomes unreachable from the
same machine. Also this breaks communication with other machines because of
NDISC packet processing failure.
This patch fixes this issue by reading all interface's IPv6 addresses and adding
them to IPv6 routing table while bringing up 'lo'.
==Testing==
Before applying the patch:
$ route -A inet6
Kernel IPv6 routing table
Destination Next Hop Flag Met Ref Use If
2000::20/128 :: U 256 0 0 eth0
fe80::/64 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
::1/128 :: Un 0 1 0 lo
2000::20/128 :: Un 0 1 0 lo
fe80::xxxx:xxxx:xxxx:xxxx/128 :: Un 0 1 0 lo
ff00::/8 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
$ sudo ifdown lo
$ sudo ifup lo
$ route -A inet6
Kernel IPv6 routing table
Destination Next Hop Flag Met Ref Use If
2000::20/128 :: U 256 0 0 eth0
fe80::/64 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
::1/128 :: Un 0 1 0 lo
ff00::/8 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
$
After applying the patch:
$ route -A inet6
Kernel IPv6 routing
table
Destination Next Hop Flag Met Ref Use If
2000::20/128 :: U 256 0 0 eth0
fe80::/64 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
::1/128 :: Un 0 1 0 lo
2000::20/128 :: Un 0 1 0 lo
fe80::xxxx:xxxx:xxxx:xxxx/128 :: Un 0 1 0 lo
ff00::/8 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
$ sudo ifdown lo
$ sudo ifup lo
$ route -A inet6
Kernel IPv6 routing table
Destination Next Hop Flag Met Ref Use If
2000::20/128 :: U 256 0 0 eth0
fe80::/64 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
::1/128 :: Un 0 1 0 lo
2000::20/128 :: Un 0 1 0 lo
fe80::xxxx:xxxx:xxxx:xxxx/128 :: Un 0 1 0 lo
ff00::/8 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
$
Signed-off-by: Balakumaran Kannan <Balakumaran.Kannan@ap.sony.com> Signed-off-by: Maruthi Thotad <Maruthi.Thotad@ap.sony.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
currently cbq works incorrectly for limits > 10% real link bandwidth,
and practically does not work for limits > 50% real link bandwidth.
Below are results of experiments taken on 1 Gbit link
In shaper | Actual Result
-----------+---------------
100M | 108 Mbps
200M | 244 Mbps
300M | 412 Mbps
500M | 893 Mbps
This happen because of q->now changes incorrectly in cbq_dequeue():
when it is called before real end of packet transmitting,
L2T is greater than real time delay, q_now gets an extra boost
but never compensate it.
To fix this problem we prevent change of q->now until its synchronization
with real time.
Signed-off-by: Vasily Averin <vvs@openvz.org> Reviewed-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
This is my example conversion of a few existing mmap users. The HPET
case is simple, widely available, and easy to test (Clemens Ladisch sent
a trivial test-program for it).
Test-program-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
This is my example conversion of a few existing mmap users. The
fb_mmap() case is a good example because it is a bit more complicated
than some: fb_mmap() mmaps one of two different memory areas depending
on the page offset of the mmap (but happily there is never any mixing of
the two, so the helper function still works).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: fold in the relevant part of commit 314e51b9851b
'mm: kill vma flag VM_RESERVED and mm->reserved_vm counter'] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Various drivers end up replicating the code to mmap() their memory
buffers into user space, and our core memory remapping function may be
very flexible but it is unnecessarily complicated for the common cases
to use.
Our internal VM uses pfn's ("page frame numbers") which simplifies
things for the VM, and allows us to pass physical addresses around in a
denser and more efficient format than passing a "phys_addr_t" around,
and having to shift it up and down by the page size. But it just means
that drivers end up doing that shifting instead at the interface level.
It also means that drivers end up mucking around with internal VM things
like the vma details (vm_pgoff, vm_start/end) way more than they really
need to.
So this just exports a function to map a certain physical memory range
into user space (using a phys_addr_t based interface that is much more
natural for a driver) and hides all the complexity from the driver.
Some drivers will still end up tweaking the vm_page_prot details for
things like prefetching or cacheability etc, but that's actually
relevant to the driver, rather than caring about what the page offset of
the mapping is into the particular IO memory region.
Commit b4cbb197c7e7 ("vm: add vm_iomap_memory() helper function") added
a helper function wrapper around io_remap_pfn_range(), and every other
architecture defined it in <asm/pgtable.h>.
The s390 choice of <asm/io.h> may make sense, but is not very convenient
for this case, and gratuitous differences like that cause unexpected errors like this:
mm/memory.c: In function 'vm_iomap_memory':
mm/memory.c:2439:2: error: implicit declaration of function 'io_remap_pfn_range' [-Werror=implicit-function-declaration]
Glory be the kbuild test robot who noticed this, bisected it, and
reported it to the guilty parties (ie me).
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: the macro was not defined, so this is an addition
and not a move] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The valid mask for both offcore_response_0 and
offcore_response_1 was wrong for SNB/SNB-EP,
IVB/IVB-EP. It was possible to write to
reserved bit and cause a GP fault crashing
the kernel.
This patch fixes the problem by correctly marking the
reserved bits in the valid mask for all the processors
mentioned above.
A distinction between desktop and server parts is introduced
because bits 24-30 are only available on the server parts.
This version of the patch is just a rebase to perf/urgent tree
and should apply to older kernels as well.
Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: gregkh@linuxfoundation.org Cc: security@kernel.org Cc: ak@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.2: adjust context; drop the IVB case] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Trinity discovered that we fail to check all 64 bits of
attr.config passed by user space, resulting to out-of-bounds
access of the perf_swevent_enabled array in
sw_perf_event_destroy().
Introduced in commit b0a873ebb ("perf: Register PMU
implementations").
Signed-off-by: Tommi Rantala <tt.rantala@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Link: http://lkml.kernel.org/r/1365882554-30259-1-git-send-email-tt.rantala@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We first tried to avoid updating atime/mtime entirely (commit b0de59b5733d: "TTY: do not update atime/mtime on read/write"), and then
limited it to only update it occasionally (commit 37b7f3c76595: "TTY:
fix atime/mtime regression"), but it turns out that this was both
insufficient and overkill.
It was insufficient because we let people attach to the shared ptmx node
to see activity without even reading atime/mtime, and it was overkill
because the "only once a minute" means that you can't really tell an
idle person from an active one with 'w'.
So this tries to fix the problem properly. It marks the shared ptmx
node as un-notifiable, and it lowers the "only once a minute" to a few
seconds instead - still long enough that you can't time individual
keystrokes, but short enough that you can tell whether somebody is
active or not.
Reported-by: Simon Kirby <sim@hostway.ca> Acked-by: Jiri Slaby <jslaby@suse.cz> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In commit b0de59b5733d ("TTY: do not update atime/mtime on read/write")
we removed timestamps from tty inodes to fix a security issue and waited
if something breaks. Well, 'w', the utility to find out logged users
and their inactivity time broke. It shows that users are inactive since
the time they logged in.
To revert to the old behaviour while still preventing attackers to
guess the password length, we update the timestamps in one-minute
intervals by this patch.
Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: For 3.2, use Greg's backported version] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
On http://vladz.devzero.fr/013_ptmx-timing.php, we can see how to find
out length of a password using timestamps of /dev/ptmx. It is
documented in "Timing Analysis of Keystrokes and Timing Attacks on
SSH". To avoid that problem, do not update time when reading
from/writing to a TTY.
I am afraid of regressions as this is a behavior we have since 0.97
and apps may expect the time to be current, e.g. for monitoring
whether there was a change on the TTY. Now, there is no change. So
this would better have a lot of testing before it goes upstream.
Reviwed-by: Michel Dänzer <michel.daenzer@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Trying to run an application which was trying to put data into half of
memory using shmget(), we found that having a shmall value below 8EiB-8TiB
would prevent us from using anything more than 8TiB. By setting
kernel.shmall greater than 8EiB-8TiB would make the job work.
In the newseg() function, ns->shm_tot which, at 8TiB is INT_MAX.
[akpm@linux-foundation.org: make ipc/shm.c:newseg()'s numpages size_t, not int] Signed-off-by: Robin Holt <holt@sgi.com> Reported-by: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
select_parent() populates the dispose list with dentries which
shrink_dentry_list() then deletes. select_parent() carefully uses
need_resched() to avoid doing too much work at once. But neither
shrink_dcache_parent() nor its called functions call cond_resched(). So
once need_resched() is set select_parent() will return single dentry
dispose list which is then deleted by shrink_dentry_list(). This is
inefficient when there are a lot of dentry to process. This can cause
softlockup and hurts interactivity on non preemptable kernels.
This change adds cond_resched() in shrink_dcache_parent(). The benefit
of this is that need_resched() is quickly cleared so that future calls
to select_parent() are able to efficiently return a big batch of dentry.
These additional cond_resched() do not seem to impact performance, at
least for the workload below.
Here is a program which can cause soft lockup if other system activity
sets need_resched().
int main()
{
struct rlimit rlim;
int i;
int f[100000];
char buf[20];
struct timeval t1, t2;
double diff;
When we run the crackerjack testsuite, the inotify_add_watch test is
stalled.
This is caused by the invalid mask 0 - the task is waiting for the event
but it never comes. inotify_add_watch() should return -EINVAL as it did
before commit 676a0675cf92 ("inotify: remove broken mask checks causing
unmount to be EINVAL"). That commit removes the invalid mask check, but
that check is needed.
Check the mask's ALL_INOTIFY_BITS before the inotify_arg_to_mask() call.
If none are set, just return -EINVAL.
Because IN_UNMOUNT is in ALL_INOTIFY_BITS, this change will not trigger
the problem that above commit fixed.
[akpm@linux-foundation.org: fix build] Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Acked-by: Jim Somerville <Jim.Somerville@windriver.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Maintenance of a bad-block-list currently defaults to 'enabled'
and is then disabled when it cannot be supported.
This is backwards and causes problem for dm-raid which didn't know
to disable it.
So fix the defaults, and only enabled for v1.x metadata which
explicitly has bad blocks enabled.
The problem with dm-raid has been present since badblock support was
added in v3.1, so this patch is suitable for any -stable from 3.1
onwards.
Reported-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
There's a bug where rtc alarms are ignored after the rtc cmos suspends
but before the system finishes suspend. Since hpet emulation is
disabled and it still handles the interrupts, a wake event is never
registered which is done from the rtc layer.
This patch reverts commit d1b2efa83fbf ("rtc: disable hpet emulation on
suspend") which disabled hpet emulation. To fix the problem mentioned
in that commit, hpet_rtc_timer_init() is called directly on resume.
Signed-off-by: Derek Basehore <dbasehore@chromium.org> Cc: Maxim Levitsky <maximlevitsky@gmail.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
There is a kernel memory leak observed when the proc file
/proc/fs/fscache/stats is read.
The reason is that in fscache_stats_open, single_open is called and the
respective release function is not called during release. Hence fix
with correct release function - single_release().
In case a machine supports memory hotplug all active memory increments
present at IPL time have been initialized with a "usecount" of 1.
This is wrong if the memory increment size is larger than the memory
section size of the memory hotplug code. If that is the case the
usecount must be initialized with the number of memory sections that
fit into one memory increment.
Otherwise it is possible to put a memory increment into standby state
even if there are still active sections.
Afterwards addressing exceptions might happen which cause the kernel
to panic.
However even worse, if a memory increment was put into standby state
and afterwards into active state again, it's contents would have been
zeroed, leading to memory corruption.
This was only an issue for machines that support standby memory and
have at least 256GB memory.
This is broken since commit fdb1bb15 "[S390] sclp/memory hotplug: fix
initial usecount of increments".
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In __after_prom_start we copy the kernel down to zero in two calls to
copy_and_flush. After the first call (copy from 0 to copy_to_here:)
we jump to the newly copied code soon after.
Unfortunately there's no isync between the copy of this code and the
jump to it. Hence it's possible that stale instructions could still be
in the icache or pipeline before we branch to it.
We've seen this on real machines and it's results in no console output
after:
calling quiesce...
returning from prom_init
The below adds an isync to ensure that the copy and flushing has
completed before any branching to the new instructions occurs.
Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Previously, the ixgbe_msix_other was writing the full 32bits of the set
interrupts, instead of only the ones which the ixgbe_msix_other is
handling. This resulted in a loss of performance when the X540's PPS feature is
enabled due to sometimes clearing queue interrupts which resulted in the driver
not getting the interrupt for cleaning the q_vector rings often enough. The fix
is to simply mask the lower 16bits off so that this handler does not write them
in the EICR, which causes them to remain high and be properly handled by the
clean_rings interrupt routine as normal.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Vitaliy reported that a per cpu HPET timer interrupt crashes the
system during hibernation. What happens is that the per cpu HPET timer
gets shut down when the nonboot cpus are stopped. When the nonboot
cpus are onlined again the HPET code sets up the MSI interrupt which
fires before the clock event device is registered. The event handler
is still set to hrtimer_interrupt, which then crashes the machine due
to highres mode not being active.
See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=700333
There is no real good way to avoid that in the HPET code. The HPET
code alrady has a mechanism to detect spurious interrupts when event
handler == NULL for a similar reason.
We can handle that in the clockevent/tick layer and replace the
previous functional handler with a dummy handler like we do in
tick_setup_new_device().
The original clockevents code did this in clockevents_exchange_device(),
but that got removed by commit 7c1e76897 (clockevents: prevent
clockevent event_handler ending up handler_noop) which forgot to fix
it up in tick_shutdown(). Same issue with the broadcast device.
Reported-by: Vitaliy Fillipov <vitalif@yourcmc.ru> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: 700333@bugs.debian.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
An early draft of the PHC patch series included an alarm in the
gianfar driver. During the review process, the alarm code was dropped,
but the capability removal was overlooked. This patch fixes the issue
by advertising zero alarms.
This patch should be applied to every 3.x stable kernel.
Signed-off-by: Richard Cochran <richardcochran@gmail.com> Reported-by: Chris LaRocque <clarocq@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The USB_DT_CS_ENDPOINT class-specific endpoint descriptor is usually
stuffed directly after the standard USB endpoint descriptor, and this is
where the driver currently expects it to be.
There are, however, devices in the wild that have it the other way
around in their descriptor sets, so the USB_DT_CS_ENDPOINT comes
*before* the standard enpoint. Devices known to implement it that way
are "Sennheiser BTD-500" and Plantronics USB headsets.
When the driver can't find the USB_DT_CS_ENDPOINT, it won't be able to
change sample rates, as the bitmask for the validity of this command is
storen in bmAttributes of that descriptor.
Fix this by searching the entire interface instead of just the extra
bytes of the first endpoint, in case the latter fails.
Signed-off-by: Daniel Mack <zonque@gmail.com> Reported-and-tested-by: Torstein Hegge <hegge@resisty.net> Reported-and-tested-by: Yves G <alsa-user@vivigatt.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In commit 85fe402 (fs: do not assign default i_ino in new_inode), the
initialisation of i_ino was removed from new_inode() and pushed down
into the callers. However spufs_new_inode() was not updated.
This exhibits as no files appearing in /spu, because all our dirents
have a zero inode, which readdir() seems to dislike.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
fbcon: don't lose the console font across generic->chip driver switch
uses a pointer in vc->vc_font.data to load font into the new driver.
However if the font is actually freed, we need to clear the data
so that we don't reload font from dangling pointer.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=892340 Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The seconds field of an nfstime4 structure is 64bit, but we are assuming
that the first 32bits are zero-filled. So if the client tries to set
atime to a value before the epoch (touch -t 196001010101), then the
server will save the wrong value on disk.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The TX_FIFO register is 10 bits wide. The lower 8 bits are the data to be
written, while the upper two bits are flags to indicate stop/start.
The driver apparently attempted to optimize write access, by only writing a
byte in those cases where the stop/start bits are zero. However, we have
seen cases where the lower byte is duplicated onto the upper byte by the
hardware, which causes inadvertent stop/starts.
This patch changes the write access to the transmit FIFO to always be 16 bits
wide.
Signed off by: Steven A. Falco <sfalco@harris.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Many cards based on CY7C68300A/B/C use the USB ID 04b4:6830 but only the
B and C variants (EZ-USB AT2LP) support the ATA Command Block
functionality, according to the data sheets. The A variant (EZ-USB AT2)
locks up if ATACB is attempted, until a typical 30 seconds timeout runs
out and a USB reset is performed.
https://bugs.launchpad.net/bugs/428469
It seems that one way to spot a CY7C68300A (at least where the card
manufacturer left Cypress' EEPROM default vaules, against Cypress'
recommendations) is to look at the USB string descriptor indices.
A http://media.digikey.com/pdf/Data%20Sheets/Cypress%20PDFs/CY7C68300A.pdf
B http://www.farnell.com/datasheets/43456.pdf
C http://www.cypress.com/?rID=14189
Note that a CY7C68300B/C chip appears as CY7C68300A if it is running
in Backward Compatibility Mode, and if ATACB would be supported in this
case there is anyway no way to tell which chip it really is.
For 5 years my external USB drive has been locking up for half a minute
when plugged in and ata_id is run by udev, or anytime hdparm or similar
is run on it.
Finally looking at the /correct/ datasheet I think I found the reason. I
am aware the quirk in this patch is a bit hacky, but the hardware
manufacturers haven't made it easy for us.
The current ST Micro Connect Lite uses the FT4232H hi-speed quad USB
UART FTDI chip. It is also possible to drive STM reference targets
populated with an on-board JTAG debugger based on the FT2232H chip with
the same STMicroelectronics tools.
For this reason, the ST Micro Connect Lite PIDs should be
ST_STMCLT_2232_PID: 0x3746
ST_STMCLT_4232_PID: 0x3747
Signed-off-by: Adrian Thomasset <adrian.thomasset@st.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fox the Kconfig documentation for CONFIG_EXT4_DEBUG to match the
change made by commit a0b30c1229: ext4: use module parameters instead
of debugfs for mballoc_debug
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
After a server reboot, the reclaimer thread will recover all the existing
locks. For locks that are blocked, however, it will change the value
of block->b_status to nlm_lck_denied_grace_period in order to signal that
they need to wake up and resend the original blocking lock request.
Due to a bug, however, the block->b_status never gets reset after the
blocked locks have been woken up, and so the process goes into an
infinite loop of resends until the blocked lock is satisfied.
Reported-by: Marc Eshel <eshel@us.ibm.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
As discussed in this thread
http://lists.freedesktop.org/archives/dri-devel/2013-April/037411.html
GMBUS based DVO transmitter detection seems to be unreliable which could
result in an unusable DVO port.
The attached patch fixes this by falling back to bit banging mode for
the time DVO transmitter detection is in progress.
Signed-off-by: David Müller <d.mueller@elsoft.ch> Tested-by: David Müller <d.mueller@elsoft.ch> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When the stations are being restored because of unassoc
RXON, the LQ cmd may not have been initialized because it
is initialized only after association.
Sending zeroed LQ_CMD makes the fw unhappy: it raises
SYSASSERT_2078.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com>
[move zero_lq and make static const] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
[bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Backlight cleanup in the eDP connector destroy callback caused the
backlight device to be removed on some systems that first initialized LVDS
and then attempted to initialize eDP. Prevent multiple backlight
initializations, and ensure backlight cleanup is only done once by moving
it to modeset cleanup.
A small wrinkle is the introduced asymmetry in backlight
setup/cleanup. This could be solved by adding refcounting, but it seems
overkill considering that there should only ever be one backlight device.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=55701 Signed-off-by: Jani Nikula <jani.nikula@intel.com> Tested-by: Peter Verthez <peter.verthez@skynet.be> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
[bwh: Backported to 3.2:
- Adjust context
- s/dev_priv->backlight\.device/dev_priv->backlight/] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In order to fully serialize access to the fenced region and the update
to the fence register we need to take extreme measures on SNB+, and
manually flush writes to memory prior to writing the fence register in
conjunction with the memory barriers placed around the register write.
Fixes i-g-t/gem_fence_thrash
v2: Bring a bigger gun
v3: Switch the bigger gun for heavier bullets (Arjan van de Ven)
v4: Remove changes for working generations.
v5: Reduce to a per-cpu wbinvd() call prior to updating the fences.
v6: Rewrite comments to ellide forgotten history.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=62191 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jon Bloomfield <jon.bloomfield@intel.com> Tested-by: Jon Bloomfield <jon.bloomfield@intel.com> (v2) Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
[bwh: Backported to 3.2: insert the cache flush in i915_gem_object_get_fence()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When usbfs receives a ctrl-request from userspace it calls check_ctrlrecip,
which for a request with USB_RECIP_ENDPOINT tries to map this to an interface
to see if this interface is claimed, except for ctrl-requests with a type of
USB_TYPE_VENDOR.
When trying to use this device: http://www.akaipro.com/eiepro
redirected to a Windows vm running on qemu on top of Linux.
The windows driver makes a ctrl-req with USB_TYPE_CLASS and
USB_RECIP_ENDPOINT with index 0, and the mapping of the endpoint (0) to
the interface fails since ep 0 is the ctrl endpoint and thus never is
part of an interface.
This patch fixes this ctrl-req failing by skipping the checkintf call for
USB_RECIP_ENDPOINT ctrl-reqs on the ctrl endpoint.
Reported-by: Dave Stikkolorum <d.r.stikkolorum@hhs.nl> Tested-by: Dave Stikkolorum <d.r.stikkolorum@hhs.nl> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The hardware revision of the codec is based at 0x40. Subtract that
before convering to ASCII. The same as it is done for 98095.
Signed-off-by: Dylan Reid <dgreid@chromium.org> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
smpboot: Booting Node 0 Processor 1 APIC 0x2
installing Xen timer for CPU 1
BUG: sleeping function called from invalid context at /home/konrad/ssd/konrad/linux/mm/slab.c:3179
in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/1
Pid: 0, comm: swapper/1 Not tainted 3.9.0-rc6upstream-00001-g3884fad #1
Call Trace:
[<ffffffff810c1fea>] __might_sleep+0xda/0x100
[<ffffffff81194617>] __kmalloc_track_caller+0x1e7/0x2c0
[<ffffffff81303758>] ? kasprintf+0x38/0x40
[<ffffffff813036eb>] kvasprintf+0x5b/0x90
[<ffffffff81303758>] kasprintf+0x38/0x40
[<ffffffff81044510>] xen_setup_timer+0x30/0xb0
[<ffffffff810445af>] xen_hvm_setup_cpu_clockevents+0x1f/0x30
[<ffffffff81666d0a>] start_secondary+0x19c/0x1a8
The solution to that is use kasprintf in the CPU hotplug path
that 'online's the CPU. That is, do it in in xen_hvm_cpu_notify,
and remove the call to in xen_hvm_setup_cpu_clockevents.
Unfortunatly the later is not a good idea as the bootup path
does not use xen_hvm_cpu_notify so we would end up never allocating
timer%d interrupt lines when booting. As such add the check for
atomic() to continue.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
While we don't use the spinlock interrupt line (see for details
commit f10cd522c5fbfec9ae3cc01967868c9c2401ed23 -
xen: disable PV spinlocks on HVM) - we should still do the proper
init / deinit sequence. We did not do that correctly and for the
CPU init for PVHVM guest we would allocate an interrupt line - but
failed to deallocate the old interrupt line.
This resulted in leakage of an irq_desc but more importantly this splat
as we online an offlined CPU:
There is the oddity of the 'spinlock1' still being present.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In the PVHVM path when we do CPU online/offline path we would
leak the timer%d IRQ line everytime we do a offline event. The
online path (xen_hvm_setup_cpu_clockevents via
x86_cpuinit.setup_percpu_clockev) would allocate a new interrupt
line for the timer%d.
But we would still use the old interrupt line leading to:
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When a full scan 2.4 and 5 GHz scan is scheduled, but then the 2.4 GHz
part of the scan disables a 5.2 GHz channel due to, e.g. receiving
country or frequency information, that 5.2 GHz channel might already
be in the list of channels to scan next. Then, when the driver checks
if it should do a passive scan, that will return false and attempt an
active scan. This is not only wrong but can also lead to the iwlwifi
device firmware crashing since it checks regulatory as well.
Fix this by not setting the channel flags to just disabled but rather
OR'ing in the disabled flag. That way, even if the race happens, the
channel will be scanned passively which is still (mostly) correct.
Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Commit b51306c (PCI: Set device power state to PCI_D0 for device
without native PM support) modified pci_platform_power_transition()
by adding code causing dev->current_state for devices that don't
support native PCI PM but are power-manageable by the platform to be
changed to PCI_D0 regardless of the value returned by the preceding
platform_pci_set_power_state(). In particular, that also is done
if the platform_pci_set_power_state() has been successful, which
causes the correct power state of the device set by
pci_update_current_state() in that case to be overwritten by PCI_D0.
Fix that mistake by making the fallback to PCI_D0 only happen if
the platform_pci_set_power_state() has returned an error.
[bhelgaas: folded in Yinghai's simplification, added URL & stable info]
Reference: http://lkml.kernel.org/r/27806FC4E5928A408B78E88BBC67A2306F466BBA@ORSMSX101.amr.corp.intel.com Reported-by: Chris J. Benenati <chris.j.benenati@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The serial core uses device_find_child() but does not drop the reference to
the retrieved child after using it. This patch add the missing put_device().
What I have done to test this issue.
I used a machine with an AMBA PL011 serial driver. I tested the patch on
next-20120408 because the last branch [next-20120415] does not boot on this
board.
For test purpose, I added some pr_info() messages to print the refcount
after device_find_child() (lines: 1937,2009), and after put_device()
(lines: 1947, 2021).
Commit 88a8516a2128 (ALSA: usbaudio: implement USB autosuspend)
introduced autopm for all USB audio/MIDI devices. However, many MIDI
devices, such as synthesizers, do not merely transmit MIDI messages but
use their MIDI inputs to control other functions. With autopm, these
devices would get powered down as soon as the last MIDI port device is
closed on the host.
Even some plain MIDI interfaces could get broken: they automatically
send Active Sensing messages while powered up, but as soon as these
messages cease, the receiving device would interpret this as an
accidental disconnection.
Commit f5f165418cab (ALSA: usb-audio: Fix missing autopm for MIDI input)
introduced another regression: some devices (e.g. the Roland GAIA SH-01)
are self-powered but do a reset whenever the USB interface's power state
changes.
To work around all this, just disable autopm for all USB MIDI devices.
Reported-by: Laurens Holst Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When recording at 176.2KHz or 192Khz, the device adds a 32-bit length
header to the capture packets, which obviously needs to be ignored for
recording to work properly.
Also, the last byte of the length header being interpreted as L0 of
the first sample caused spikes every 0.5ms, resulting in a loud 16KHz
tone (about the highest 'B' on a piano) being present throughout
captures.
Tested at all sample rates on an E-Mu 0404USB, and tested for
regressions on a generic USB headset.
Signed-off-by: Calvin Owens <jcalvinowens@gmail.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
[bwh: Backported to 3.2: adjust filenames, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Don't actually close any opens until we don't need them at all.
This means being left with write access when it's not really necessary,
but that's better than putting a file that might still have posix locks
held on it, as we have been.
Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
One can trigger an overflow when using ktime_add_ns() on a 32bit
architecture not supporting CONFIG_KTIME_SCALAR.
When passing a very high value for u64 nsec, e.g. 7881299347898368000
the do_div() function converts this value to seconds (7881299347) which
is still to high to pass to the ktime_set() function as long. The result
in is a negative value.
The problem on my system occurs in the tick-sched.c,
tick_nohz_stop_sched_tick() when time_delta is set to
timekeeping_max_deferment(). The check for time_delta < KTIME_MAX is
valid, thus ktime_add_ns() is called with a too large value resulting in
a negative expire value. This leads to an endless loop in the ticker code:
time_delta: 7881299347898368000
expires = ktime_add_ns(last_update, time_delta)
expires: negative value
This fix caps the value to KTIME_MAX.
This error doesn't occurs on 64bit or architectures supporting
CONFIG_KTIME_SCALAR (e.g. ARM, x86-32).
Signed-off-by: David Engraf <david.engraf@sysgo.com>
[jstultz: Minor tweaks to commit message & header] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When we change the system time to a low value like this, the value of
timekeeper->offs_real will be a negative value.
It seems that the WARN occurs because an hrtimer has been started in the time
between the releasing of the timekeeper lock and the IPI call (via a call to
on_each_cpu) in clock_was_set() in the do_settimeofday() code. The end result
is that a REALTIME_CLOCK timer has been added with softexpires = expires =
KTIME_MAX. The hrtimer_interrupt() fires/is called and the loop at
kernel/hrtimer.c:1289 is executed. In this loop the code subtracts the
clock base's offset (which was set to timekeeper->offs_real in
do_settimeofday()) from the current hrtimer_cpu_base->expiry value (which
was KTIME_MAX):
KTIME_MAX - (a negative value) = overflow
A simple check for an overflow can resolve this problem. Using KTIME_MAX
instead of the overflow value will result in the hrtimer function being run,
and the reprogramming of the timer after that.
Cc: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
[jstultz: Tweaked commit subject] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The "Mobile Sandy Bridge CPUs" in the Fujitsu Esprimo Q900
mini desktop PCs are probably misleading the LVDS detection
code in intel_lvds_supported. Nothing is connected to the
LVDS ports in these systems.
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In order to demonstrace this issue one should mount ext4 with mount -o
discard option on SSD disk. This makes callback longer and race
window becomes wider.
In order to fix this we should mark transaction as finished only after
callbacks have completed
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
[bwh: Backported to 3.2: s/jbd2_journal_free_transaction/kfree/] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions, that the tid number space might wrap,
causing tid_geq()'s calculations to fail.
Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().
Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time. To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tid's.
As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started. This
should have a small (but probably not measurable) improvement for
ext4's scalability.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Ben Hutchings <ben@decadent.org.uk> Reported-by: George Barnett <gbarnett@atlassian.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
we should return error status directly when nfs4_preprocess_stateid_op
return error.
Signed-off-by: fanchaoting <fanchaoting@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The inode->i_mutex isn't hold when updating filp->f_pos
in read()/write(), so the filp->f_pos might be read as
0 or 1 in readdir() when there is concurrent read()/write()
on this same file, then may cause use after free in readdir().
The bug can be reproduced with Li Zefan's test code on the
link:
This patch fixes the use after free under this situation.
Reported-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.2: file position is child inode number, not hash] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
But this was hideously wrong. First of all these initializations
are now down far too late. Specifically after all the other cpus
have been brought up and initialized their own CMC vectors from
smp_callin(). Also ia64_mca_late_init() may be called from any cpu
so the line:
ia64_mca_cmc_vector_setup(); /* Setup vector on BSP */
is generally not executed on the BSP, and so the CMC vector isn't
setup at all on that processor.
Make use of the arch_early_irq_init() hook to get this code executed
at just the right moment: not too early, not too late.
Reported-by: Fred Hartnett <fred.hartnett@hp.com> Tested-by: Fred Hartnett <fred.hartnett@hp.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Monitors seem to prefer it. Fixes:
https://bugs.freedesktop.org/show_bug.cgi?id=37696
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[bwh: Backported to 3.2:
- Adjust context
- Add to pll->flags, not radeon_crtc->pll_flags] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Found problem on system that firmware that could handle pci aer.
Firmware get error reporting after pci injecting error, before os boots.
But after os boots, firmware can not get report anymore, even pci=noaer
is passed.
Root cause: BIOS _OSC has problem with query bit checking.
It turns out that BIOS vendor is copying example code from ACPI Spec.
In ACPI Spec 5.0, page 290:
If (Not(And(CDW1,1))) // Query flag clear?
{ // Disable GPEs for features granted native control.
If (And(CTRL,0x01)) // Hot plug control granted?
{
Store(0,HPCE) // clear the hot plug SCI enable bit
Store(1,HPCS) // clear the hot plug SCI status bit
}
...
}
When Query flag is set, And(CDW1,1) will be 1, Not(1) will return 0xfffffffe.
So it will get into code path that should be for control set only.
BIOS acpi code should be changed to "If (LEqual(And(CDW1,1), 0)))"
Current kernel code is using _OSC query to notify firmware about support
from OS and then use _OSC to set control bits.
During query support, current code is using all possible controls.
So will execute code that should be only for control set stage.
That will have problem when pci=noaer or aer firmware_first is used.
As firmware have that control set for os aer already in query support stage,
but later will not os aer handling.
We should avoid passing all possible controls, just use osc_control_set
instead.
That should workaround BIOS bugs with affected systems on the field
as more bios vendors are copying sample code from ACPI spec.
Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The 3rd parameter of flex_array_prealloc() is the number of elements,
not the index of the last element.
The effect of the bug is, when opening cgroup.procs, a flex array will
be allocated and all elements of the array is allocated with
GFP_KERNEL flag, but the last one is GFP_ATOMIC, and if we fail to
allocate memory for it, it'll trigger a BUG_ON().
Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The list of output registers is
: "=r"(ret) : "r"(iha), "r"(pte):"memory");
The constraint "=r" means that the GCC has to maintain that these vars
are in registers and contain valid info when the program flow leaves
the assembly block (output registers).
But "=r" also means that GCC can put them in registers that are used
as input registers. Input registers are iha, pte on the example.
If the predicate p7 is true, the 8th assembly instruction
"(p7) mov %0=r0;"
is the first one which writes to a register which is maintained by the
register constraints; it sets %0. %0 means the first register operand;
it is ret here.
This instruction might overwrite the %2 register (pte) which is needed
by the next instruction:
"(p7) st8 [%2]=r9;;"
Whether it really happens depends on how GCC decides what registers it
uses and how it optimizes the code.
The attached patch fixes the register operand constraints in
arch/ia64/kvm/vtlb.c.
The register constraints should be
: "=&r"(ret) : "r"(iha), "r"(pte):"memory");
The & means that GCC must not use any of the input registers to place
this output register in.
This is Debian bug#702639
(http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=702639).
The patch is applicable on Kernel 3.9-rc1, 3.2.35 and many other versions.
Signed-off-by: Stephan Schreiber <info@fs-driver.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The list of output registers is
: "=r" (r8), "=r" (prev)
The constraint "=r" means that the GCC has to maintain that these vars
are in registers and contain valid info when the program flow leaves
the assembly block (output registers).
But "=r" also means that GCC can put them in registers that are used
as input registers. Input registers are uaddr, newval, oldval on the
example.
The second assembly instruction
" mov %0=r0 \n"
is the first one which writes to a register; it sets %0 to 0. %0 means
the first register operand; it is r8 here. (The r0 is read-only and
always 0 on the Itanium; it can be used if an immediate zero value is
needed.)
This instruction might overwrite one of the other registers which are
still needed.
Whether it really happens depends on how GCC decides what registers it
uses and how it optimizes the code.
The objdump utility can give us disassembly.
The futex_atomic_cmpxchg_inatomic() function is inline, so we have to
look for a module that uses the funtion. This is the
cmpxchg_futex_value_locked() function in
kernel/futex.c:
static int cmpxchg_futex_value_locked(u32 *curval, u32 __user *uaddr,
u32 uval, u32 newval)
{
int ret;
pagefault_disable();
ret = futex_atomic_cmpxchg_inatomic(curval, uaddr, uval, newval);
pagefault_enable();
return ret;
}
Now the disassembly. At first from the Kernel package 3.2.23 which has
been compiled with GCC 4.4, remeber this Kernel seemed to work:
objdump -d linux-3.2.23/debian/build/build_ia64_none_mckinley/kernel/futex.o
The lines
2b0: 0a 00 00 00 22 00 [MMI] mf;;
2b6: 80 00 00 00 42 00 mov r8=r0
2bc: 00 00 04 00 nop.i 0x0
2c0: 0b 00 20 40 2a 04 [MMI] mov.m ar.ccv=r8;;
2c6: 10 1a 85 22 20 00 cmpxchg4.acq r33=[r33],r35,ar.ccv
2cc: 00 00 04 00 nop.i 0x0;;
are the instructions of the assembly block.
The line
2b6: 80 00 00 00 42 00 mov r8=r0
sets the r8 register to 0 and after that
2c0: 0b 00 20 40 2a 04 [MMI] mov.m ar.ccv=r8;;
prepares the 'oldvalue' for the cmpxchg but it takes it from r8. This
is wrong.
What happened here is what I explained above: An input register is
overwritten which is still needed.
The register operand constraints in futex.h are wrong.
(The problem doesn't occur when the Kernel is compiled with GCC 4.6.)
The attached patch fixes the register operand constraints in futex.h.
The code after patching of it:
static inline int
futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
u32 oldval, u32 newval)
{
if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
return -EFAULT;
I also initialized the 'r8' var with the C programming language.
The _asm qualifier on the definition of the 'r8' var forces GCC to use
the r8 processor register for it.
I don't believe that we should use inline assembly for zeroing out a
local variable.
The constraint is
"+r" (r8)
what means that it is both an input register and an output register.
Note that the page fault handler will modify the r8 register which
will be the return value of the function.
The real fix is
"=&r" (prev)
The & means that GCC must not use any of the input registers to place
this output register in.
Patched the Kernel 3.2.23 and compiled it with GCC4.4:
Much better.
There is a
270: 05 40 00 00 00 e1 [MLX] mov r8=r0
which was generated by C code r8 = 0. Below
2b6: 00 10 81 54 08 00 mov.m ar.ccv=r34
what means that oldval is no longer overwritten.
This is Debian bug#702641
(http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=702641).
The patch is applicable on Kernel 3.9-rc1, 3.2.23 and many other versions.
Signed-off-by: Stephan Schreiber <info@fs-driver.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
ftrace_dump() had a lot of issues. What ftrace_dump() does, is when
ftrace_dump_on_oops is set (via a kernel parameter or sysctl), it
will dump out the ftrace buffers to the console when either a oops,
panic, or a sysrq-z occurs.
This was written a long time ago when ftrace was fragile to recursion.
But it wasn't written well even for that.
There's a possible deadlock that can occur if a ftrace_dump() is happening
and an NMI triggers another dump. This is because it grabs a lock
before checking if the dump ran.
It also totally disables ftrace, and tracing for no good reasons.
As the ring_buffer now checks if it is read via a oops or NMI, where
there's a chance that the buffer gets corrupted, it will disable
itself. No need to have ftrace_dump() do the same.
ftrace_dump() is now cleaned up where it uses an atomic counter to
make sure only one dump happens at a time. A simple atomic_inc_return()
is enough that is needed for both other CPUs and NMIs. No need for
a spinlock, as if one CPU is running the dump, no other CPU needs
to do it too.
The tracing_on variable is turned off and not turned on. The original
code did this, but it wasn't pretty. By just disabling this variable
we get the result of not seeing traces that happen between crashes.
For sysrq-z, it doesn't get turned on, but the user can always write
a '1' to the tracing_on file. If they are using sysrq-z, then they should
know about tracing_on.
The new code is much easier to read and less error prone. No more
deadlock possibility when an NMI triggers here.
Reported-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the depth reported in the stack tracer stack_trace file
does not match the stack_max_size file. This is because the stack_max_size
includes the overhead of stack tracer itself while the depth does not.
The first time a max is triggered, a calculation is not performed that
figures out the overhead of the stack tracer and subtracts it from
the stack_max_size variable. The overhead is stored and is subtracted
from the reported stack size for comparing for a new max.
Now the stack_max_size corresponds to the reported depth:
While testing against and older gcc on x86 that uses mcount instead
of fentry, I found that pasing in ip + MCOUNT_INSN_SIZE let the
stack trace show one more function deep which was missing before.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When gcc 4.6 on x86 is used, the function tracer will use the new
option -mfentry which does a call to "fentry" at every function
instead of "mcount". The significance of this is that fentry is
called as the first operation of the function instead of the mcount
usage of being called after the stack.
This causes the stack tracer to show some bogus results for the size
of the last function traced, as well as showing "ftrace_call" instead
of the function. This is due to the stack frame not being set up
by the function that is about to be traced.
The 216 size for ftrace_call includes both the ftrace_call stack
(which includes the saving of registers it does), as well as the
stack size of the parent.
To fix this, if CC_USING_FENTRY is defined, then the stack_tracer
will reserve the first item in stack_dump_trace[] array when
calling save_stack_trace(), and it will fill it in with the parent ip.
Then the code will look for the parent pointer on the stack and
give the real size of the parent's stack pointer:
I'm Cc'ing stable, although it's not urgent, as it only shows bogus
size for item #0, the rest of the trace is legit. It should still be
corrected in previous stable releases.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Use the stack of stack_trace_call() instead of check_stack() as
the test pointer for max stack size. It makes it a bit cleaner
and a little more accurate.
Adding stable, as a later fix depends on this patch.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
dprintk() shouldn't access @ring after it's unmapped.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: keep the second argument to kunmap_atomic()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The current code does not set the msg_namelen member to 0 and therefore
makes net/socket.c leak the local sockaddr_storage variable to userland
-- 128 bytes of kernel stack memory. Fix that.
Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. Missing try_to_wake_up_local() can stall workqueue
execution but such stalls are likely to be finite either by
another work item being queued or the one blocked getting
unblocked. There's no reason to trigger BUG while holding rq
lock crashing the whole system.
Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130318192234.GD3042@htj.dyndns.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
This patch adds support for kvm_gfn_to_hva_cache_init functions for
reads and writes that will cross a page. If the range falls within
the same memslot, then this will be a fast operation. If the range
is split between two memslots, then the slower kvm_read_guest and
kvm_write_guest are used.
Tested: Test against kvm_clock unit tests.
Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
[bwh: Backported to 3.2:
- Drop change in lapic.c
- Keep using __gfn_to_memslot() in kvm_gfn_to_hva_cache_init()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
If the guest specifies a IOAPIC_REG_SELECT with an invalid value and follows
that with a read of the IOAPIC_REG_WINDOW KVM does not properly validate
that request. ioapic_read_indirect contains an
ASSERT(redir_index < IOAPIC_NUM_PINS), but the ASSERT has no effect in
non-debug builds. In recent kernels this allows a guest to cause a kernel
oops by reading invalid memory. In older kernels (pre-3.3) this allows a
guest to read from large ranges of host memory.
Tested: tested against apic unit tests.
Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
There is a potential use after free issue with the handling of
MSR_KVM_SYSTEM_TIME. If the guest specifies a GPA in a movable or removable
memory such as frame buffers then KVM might continue to write to that
address even after it's removed via KVM_SET_USER_MEMORY_REGION. KVM pins
the page in memory so it's unlikely to cause an issue, but if the user
space component re-purposes the memory previously used for the guest, then
the guest will be able to corrupt that memory.
Tested: Tested against kvmclock unit test
Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
[bwh: Backported to 3.2:
- Adjust context
- We do not implement the PVCLOCK_GUEST_STOPPED flag] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
If the guest sets the GPA of the time_page so that the request to update the
time straddles a page then KVM will write onto an incorrect page. The
write is done byusing kmap atomic to get a pointer to the page for the time
structure and then performing a memcpy to that page starting at an offset
that the guest controls. Well behaved guests always provide a 32-byte aligned
address, however a malicious guest could use this to corrupt host kernel
memory.
Tested: Tested against kvmclock unit test.
Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Okay so Alan's patch handled the case where there was no registered fbcon,
however the other path entered in set_con2fb_map pit.
In there we called fbcon_takeover, but we also took the console lock in a couple
of places. So push the console lock out to the callers of set_con2fb_map,
this means fbmem and switcheroo needed to take the lock around the fb notifier
entry points that lead to this.
This should fix the efifb regression seen by Maarten.
Tested-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> Tested-by: Lu Hua <huax.lu@intel.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
This code was broken because it assumed that all MTD devices were map-based.
Disable it for now, until it can be fixed properly for the next merge window.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
It would cause no link after suspending or shutdowning when the
nic changes the speed to 10M and connects to a link partner which
forces the speed to 100M.
Check the link partner ability to determine which speed to set.
Signed-off-by: Hayes Wang <hayeswang@realtek.com> Acked-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions
that use a value generated by queue_var_store independent of whether
that value was set or not.
block/blk-sysfs.c: In function 'queue_store_nonrot':
block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]
Unlike most other such warnings, this one is not a false positive,
writing any non-number string into the sysfs files indeed has
an undefined result, rather than returning an error.
* Added the device ID to the modalias list and assinged ALC662 patches
for it
* Added 4 port support for the device ID 0671 in alc662_parse_auto_config