Commit 46034dca515bc4ddca0399ae58106d1f5f0d809f (USB: musb_gadget_ep0: stop
abusing musb_gadget_set_halt()) forgot to restart a queued request after
clearing the endpoint halt feature. This results in a couple of USB resets
while enumerating the file-backed storage gadget due to CSW packet not being
sent for the MODE SENSE(10) command.
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
For shared fifo hw endpoint(with FIFO_TXRX style), only ep_in
field of musb_hw_ep is intialized in musb_g_init_endpoints, and
ep_out is not initialized, but musb_g_rx and rxstate may access
ep_out field of musb_hw_ep by the method below:
musb_ep = &musb->endpoints[epnum].ep_out
which can cause the kernel panic[1] below, this patch fixes the issue
by getting 'musb_ep' from '&musb->endpoints[epnum].ep_in' for shared fifo
endpoint.
Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Sergei Shtylyov <sshtylyov@mvista.com> Cc: David Brownell <dbrownell@users.sourceforge.net> Cc: Anand Gadiyar <gadiyar@ti.com> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Recent changes in the usbhid layer exposed a bug in usbcore. If
CONFIG_USB_DYNAMIC_MINORS is enabled then an interface may be assigned
a minor number of 0. However interfaces that aren't registered as USB
class devices also have their minor number set to 0, during
initialization. As a result usb_find_interface() may return the
wrong interface, leading to a crash.
This patch (as1418) fixes the problem by initializing every
interface's minor number to -1. It also cleans up the
usb_register_dev() function, which besides being somewhat awkwardly
written, does not unwind completely on all its error paths.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Tested-by: Philip J. Turmel <philip@turmel.org> Tested-by: Gabriel Craciunescu <nix.or.die@googlemail.com> Tested-by: Alex Riesen <raa.lkml@gmail.com> Tested-by: Matthias Bayer <jackdachef@gmail.com> CC: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When a driver module is unloaded and the last still open file is a raw
MIDI device, the card and its devices will be actually freed in the
snd_card_file_remove() call when that file is closed. Afterwards, rmidi
and rmidi->card point into freed memory, so the module pointer is likely
to be garbage.
(This was introduced by commit 9a1b64caac82aa02cb74587ffc798e6f42c6170a.)
Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Reported-by: Krzysztof Foltman <wdev@foltman.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The snd_ctl_new() function in sound/core/control.c allocates space for a
snd_kcontrol struct by performing arithmetic operations on a
user-provided size without checking for integer overflow. If a user
provides a large enough size, an overflow will occur, the allocated
chunk will be too small, and a second user-influenced value will be
written repeatedly past the bounds of this chunk. This code is
reachable by unprivileged users who have permission to open
a /dev/snd/controlC* device (on many distros, this is group "audio") via
the SNDRV_CTL_IOCTL_ELEM_ADD and SNDRV_CTL_IOCTL_ELEM_REPLACE ioctls.
On the HT-Omega Claro halo card, the ADC data must be captured from the
second I2S input. Using the default first input, which isn't connected
to anything, would result in silence.
Signed-off-by: Erik J. Staab <ejs@insightbb.com> Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The SNDRV_HDSP_IOCTL_GET_CONFIG_INFO and
SNDRV_HDSP_IOCTL_GET_CONFIG_INFO ioctls in hdspm.c and hdsp.c allow
unprivileged users to read uninitialized kernel stack memory, because
several fields of the hdsp{m}_config_info structs declared on the stack
are not altered or zeroed before being copied back to the user. This
patch takes care of it.
James Dingwall [Mon, 27 Sep 2010 08:37:17 +0000 (09:37 +0100)]
Xen: fix typo in previous patch
Correctly name the irq_chip structure to fix an immediate failure when booting
as a xen pv_ops guest with a NULL pointer exception. The missing 'x' was
introduced in commit [fb412a178502dc498430723b082a932f797e4763] applied to
2.6.3[25]-stable trees. The commit to mainline was
[aaca49642b92c8a57d3ca5029a5a94019c7af69f] which did not have the problem.
[ Backport to .32 by Tomáš Janoušek <tomi@nomi.cz> ]
xchg() and cmpxchg() modify their memory operands, not merely read
them. For some versions of gcc the "memory" clobber has apparently
dealt with the situation, but not for all.
Originally-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Glauber Costa <glommer@redhat.com> Cc: Avi Kivity <avi@redhat.com> Cc: Peter Palfrader <peter@palfrader.org> Cc: Greg KH <gregkh@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Zachary Amsden <zamsden@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com>
LKML-Reference: <4C4F7277.8050306@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When compiling alpha generic build get errors such as:
arch/alpha/kernel/err_marvel.c: In function ‘marvel_print_err_cyc’:
arch/alpha/kernel/err_marvel.c:119: error: format ‘%ld’ expects type ‘long int’, but argument 6 has type ‘u64’
Replaced a number of %ld format specifiers with %lld since u64
is unsigned long long.
Signed-off-by: Michael Cree <mcree@orcon.net.nz> Signed-off-by: Matt Turner <mattst88@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
SIS 760 is listed in the device tables for both amd64-agp and sis-agp.
amd64-agp is apparently preferable since it has workarounds for some
BIOS misconfigurations that sis-agp doesn't handle.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
On Monday 04 January 2010 02:30:24 pm Russell King wrote:
> Found the problem - getting rid of the read of the alt status register
> after the command has been written fixes the UDMA CRC errors on write:
>
> @@ -676,7 +676,8 @@ void ata_sff_exec_command(struct ata_port *ap, const struct
> ata_taskfile *tf)
> DPRINTK("ata%u: cmd 0x%X\n", ap->print_id, tf->command);
>
> iowrite8(tf->command, ap->ioaddr.command_addr);
> - ata_sff_pause(ap);
> + ndelay(400);
> +// ata_sff_pause(ap);
> }
> EXPORT_SYMBOL_GPL(ata_sff_exec_command);
>
>
> This rather makes sense. The PDC20247 handles the UDMA part of the
> protocol. It has no way to tell the PDC20246 to wait while it suspends
> UDMA, so that a normal register access can take place - the 246 ploughs
> on with the register access without any regard to the state of the 247.
>
> If the drive immediately starts the UDMA protocol after a write to the
> command register (as it probably will for the DMA WRITE command), then
> we'll be accessing the taskfile in the middle of the UDMA setup, which
> can't be good. It's certainly a violation of the ATA specs.
Fix it by adding custom ->sff_exec_command method for UDMA33 chipsets.
Debugged-by: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Several MIPS platforms don't set pci_controller::io_map_base for their
PCI bridges. This results in a panic in pci_iomap(). (The panic is
conditional on CONFIG_PCI_DOMAINS, but that is now enabled for all PCI
MIPS systems.)
Input core displays capabilities bitmasks in form of one or more longs printed
in hex form and separated by spaces. Unfortunately it does not work well
for 32-bit applications running on 64-bit kernels since applications expect
that number is "worth" only 32 bits when kernel advances by 64 bits.
Fix that by ensuring that output produced for compat tasks uses 32-bit units.
"hostap: Protect against initialization interrupt" (which reinstated
"wireless: hostap, fix oops due to early probing interrupt")
reintroduced Bug 16111. This is because hostap_pci wasn't setting
dev->base_addr, which is now checked in prism2_interrupt. As a result,
initialization was failing for PCI-based hostap devices. This corrects
that oversight.
Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When GRO produces fraglist entries, and the resulting skb hits
an interface that is incapable of TSO but capable of FRAGLIST,
we end up producing a bogus packet with gso_size non-zero.
This was reported in the field with older versions of KVM that
did not set the TSO bits on tuntap.
This patch fixes that.
Reported-by: Igor Zhang <yugzhang@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Since commit 98962465ed9e6ea99c38e0af63fe1dcb5a79dc25 ("nohz: Prevent
clocksource wrapping during idle"), the CPU of an R2D board never goes
to idle. This commit assumes that mult and shift are assigned before
the clocksource is registered. As a consequence the safe maximum sleep
time is negative and the CPU never goes into idle.
This patch fixes the problem by moving mult and shift initialization
from sh_tmu_clocksource_enable() to sh_tmu_register_clocksource().
Partition boundary calculation fails for DASD FBA disks under the
following conditions:
- disk is formatted with CMS FORMAT with a blocksize of more than
512 bytes
- all of the disk is reserved to a single CMS file using CMS RESERVE
- the disk is accessed using the DIAG mode of the DASD driver
Under these circumstances, the partition detection code tries to
read the CMS label block containing partition-relevant information
from logical block offset 1, while it is in fact located at physical
block offset 1.
Fix this problem by using the correct CMS label block location
depending on the device type as determined by the DASD SENSE ID
information.
Signed-off-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
[bwh: Adjust for 2.6.32] Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Setting new MAC address only worked when device was set to promiscuous mode.
Fix MAC address by writing new address to device using undocumented command
AX_CMD_READ_NODE_ID+1. Patch is tested with AX88772 device.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Acked-by: David Hollis <dhollis@davehollis.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The driver attempts to select an IRQ for the NIC automatically by
testing which of the supported IRQs are available and then probing
each available IRQ with probe_irq_{on,off}(). There are obvious race
conditions here, besides which:
1. The test for availability is done by passing a NULL handler, which
now always returns -EINVAL, thus the device cannot be opened:
<http://bugs.debian.org/566522>
2. probe_irq_off() will report only the first ISA IRQ handled,
potentially leading to a false negative.
There was another bug that meant it ignored all error codes from
request_irq() except -EBUSY, so it would 'succeed' despite this
(possibly causing conflicts with other ISA devices). This was fixed
by ab08999d6029bb2c79c16be5405d63d2bedbdfea 'WARNING: some
request_irq() failures ignored in el2_open()', which exposed bug 1.
This patch:
1. Replaces the use of probe_irq_{on,off}() with a real interrupt handler
2. Adds a delay before checking the interrupt-seen flag
3. Disables interrupts on all failure paths
4. Distinguishes error codes from the second request_irq() call,
consistently with the first
Compile-tested only.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sctp_packet_config() is called when getting the packet ready
for appending of chunks. The function should not touch the
current state, since it's possible to ping-pong between two
transports when sending, and that can result packet corruption
followed by skb overlfow crash.
Reported-by: Thomas Dreibholz <dreibh@iem.uni-due.de> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
id = fork();
if (id == 0) { sleep(1); exit(0); }
kill(id, SIGSTOP);
alarm(1);
waitid(P_PID, id, &infop, WCONTINUED);
return 0;
}
to call waitid() on a stopped process results in access to the child task's
credentials without the RCU read lock being held - which may be replaced in the
meantime - eliciting the following warning:
This is fixed by holding the RCU read lock in wait_task_continued() to ensure
that the task's current credentials aren't destroyed between us reading the
cred pointer and us reading the UID from those credentials.
Furthermore, protect wait_task_stopped() in the same way.
We don't need to keep holding the RCU read lock once we've read the UID from
the credentials as holding the RCU read lock doesn't stop the target task from
changing its creds under us - so the credentials may be outdated immediately
after we've read the pointer, lock or no lock.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
pa-risc and ia64 have stacks that grow upwards. Check that
they do not run into other mappings. By making VM_GROWSUP
0x0 on architectures that do not ever use it, we can avoid
some unpleasant #ifdefs in check_stack_guard_page().
Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: dann frazier <dannf@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When allocating a page, the system uses NR_FREE_PAGES counters to
determine if watermarks would remain intact after the allocation was made.
This check is made without interrupts disabled or the zone lock held and
so is race-prone by nature. Unfortunately, when pages are being freed in
batch, the counters are updated before the pages are added on the list.
During this window, the counters are misleading as the pages do not exist
yet. When under significant pressure on systems with large numbers of
CPUs, it's possible for processes to make progress even though they should
have been stalled. This is particularly problematic if a number of the
processes are using GFP_ATOMIC as the min watermark can be accidentally
breached and in extreme cases, the system can livelock.
This patch updates the counters after the pages have been added to the
list. This makes the allocator more cautious with respect to preserving
the watermarks and mitigates livelock possibilities.
[akpm@linux-foundation.org: avoid modifying incoming args] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained
both periodically and when the delta is above a threshold. On large CPU
systems, the difference between the estimated and real value of
NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than
number of real free page in buddy, the VM can allocate pages below min
watermark, at worst reducing the real number of pages to zero. Even if
the OOM killer kills some victim for freeing memory, it may not free
memory if the exit path requires a new page resulting in livelock.
This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate view of an arbitrary vmstat
counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
the watermark being accidentally broken. The estimate is not perfect and
may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.
Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When under significant memory pressure, a process enters direct reclaim
and immediately afterwards tries to allocate a page. If it fails and no
further progress is made, it's possible the system will go OOM. However,
on systems with large amounts of memory, it's possible that a significant
number of pages are on per-cpu lists and inaccessible to the calling
process. This leads to a process entering direct reclaim more often than
it should increasing the pressure on the system and compounding the
problem.
This patch notes that if direct reclaim is making progress but allocations
are still failing that the system is already under heavy pressure. In
this case, it drains the per-cpu lists and tries the allocation a second
time before continuing.
Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
queue restart tasklets need to be stopped after napi handlers are stopped
since the latter can restart them. So stop them after stopping napi.
Signed-off-by: Divy Le Ray <divy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brandon Philips <bphilips@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If ->open() and ->close() are called multiple times, the same napi structs
will be added to dev->napi_list multiple times, corrupting the dev->napi_list.
This causes free_netdev() to hang during rmmod.
We fix this by calling netif_napi_del() during ->close().
Also, bnx2_init_napi() must not be in the __devinit section since it is
called by ->open().
Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: Benjamin Li <benli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The bnx2 driver calls netif_napi_add() for all the NAPI structs during
->probe() time but not all of them will be used if we're not in MSI-X
mode. This creates a problem for netpoll since it will poll all the
NAPI structs in the dev_list whether or not they are scheduled, resulting
in a crash when we access structure fields not initialized for that vector.
We fix it by moving the netif_napi_add() call to ->open() after the number
of IRQ vectors has been determined.
Signed-off-by: Benjamin Li <benli@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Here is the _BCM method of this laptop:
Method (_BCM, 1, NotSerialized)
{
If (LGreaterEqual (OSFG, OSVT))
{
If (LNotEqual (OSFG, OSW7))
{
Store (One, BCMD)
Store (GCBL (Arg0), Local0)
Subtract (0x0F, Local0, LBTN)
^^^SBRG.EC0.STBR ()
...
}
Else
{
DBGR (0x0B, Zero, Zero, Arg0)
Store (Arg0, LBTN)
^^^SBRG.EC0.STBR ()
...
}
}
}
LBTN is used to store the index of the brightness level in the _BCL.
GCBL is a method that convert the percentage value to the index value.
If _OSI(Windows 2009) is not disabled, LBTN is stored a percentage
value which is surely beyond the end of _BCL package.
http://bugzilla.kernel.org/show_bug.cgi?id=14753
Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Cc: maximilian attems <max@stro.at> Cc: Paolo Ornati <ornati@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The VIAFB_GET_INFO device ioctl allows unprivileged users to read 246
bytes of uninitialized stack memory, because the "reserved" member of
the viafb_ioctl_info struct declared on the stack is not altered or
zeroed before being copied back to the user. This patch takes care of
it.
The XFS_IOC_FSGETXATTR ioctl allows unprivileged users to read 12
bytes of uninitialized stack memory, because the fsxattr struct
declared on the stack in xfs_ioc_fsgetxattr() does not alter (or zero)
the 12-byte fsx_pad member before copying it back to the user. This
patch takes care of it.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com> Cc: dann frazier <dannf@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix a bug in keyctl_session_to_parent() whereby it tries to check the ownership
of the parent process's session keyring whether or not the parent has a session
keyring [CVE-2010-2960].
If the system is using pam_keyinit then it mostly protected against this as all
processes derived from a login will have inherited the session keyring created
by pam_keyinit during the log in procedure.
To test this, pam_keyinit calls need to be commented out in /etc/pam.d/.
There's an protected access to the parent process's credentials in the middle
of keyctl_session_to_parent(). This results in the following RCU warning:
Tony's fix (f574c843191728d9407b766a027f779dcd27b272) has a small bug,
it incorrectly uses "r3" as a scratch register in the first of the two
unlock paths ... it is also inefficient. Optimize the fast path again.
Signed-off-by: Petr Tesarik <ptesarik@suse.cz> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When ia64 converted to using ticket locks, an inline implementation
of trylock/unlock in fsys.S was missed. This was not noticed because
in most circumstances it simply resulted in using the slow path because
the siglock was apparently not available (under old spinlock rules).
Problems occur when the ticket spinlock has value 0x0 (when first
initialised, or when it wraps around). At this point the fsys.S
code acquires the lock (changing the 0x0 to 0x1. If another process
attempts to get the lock at this point, it will change the value from
0x1 to 0x2 (using new ticket lock rules). Then the fsys.S code will
free the lock using old spinlock rules by writing 0x0 to it. From
here a variety of bad things can happen.
Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
I may have an explanation for the LSI 1068 HBA hangs provoked by ATA
pass-through commands, in particular by smartctl.
First, my version of the symptoms. On an LSI SAS1068E B3 HBA running
01.29.00.00 firmware, with SATA disks, and with smartd running, I'm seeing
occasional task, bus, and host resets, some of which lead to hard faults of
the HBA requiring a reboot. Abusively looping the smartctl command,
# while true; do smartctl -a /dev/sdb > /dev/null; done
dramatically increases the frequency of these failures to nearly one per
minute. A high IO load through the HBA while looping smartctl seems to
improve the chance of a full scsi host reset or a non-recoverable hang.
I reduced what smartctl was doing down to a simple test case which
causes the hang with a single IO when pointed at the sd interface. See
the code at the bottom of this e-mail. It uses an SG_IO ioctl to issue
a single pass-through ATA identify device command. If the buffer
userspace gives for the read data has certain alignments, the task is
issued to the HBA but the HBA fails to respond. If run against the sg
interface, neither the test code nor smartctl causes a hang.
sd and sg handle the SG_IO ioctl slightly differently. Unless you
specifically set a flag to do direct IO, sg passes a buffer of its own,
which is page-aligned, to the block layer and later copies the result
into the userspace buffer regardless of its alignment. sd, on the other
hand, always does direct IO unless the userspace buffer fails an
alignment test at block/blk-map.c line 57, in which case a page-aligned
buffer is created and used for the transfer.
The alignment test currently checks for word-alignment, the default
setup by scsi_lib.c; therefore, userspace buffers of almost any
alignment are given directly to the HBA as DMA targets. The LSI 1068
hardware doesn't seem to like at least a couple of the alignments which
cross a page boundary (see the test code below). Curiously, many
page-boundary-crossing alignments do work just fine.
So, either the hardware has an bug handling certain alignments or the
hardware has a stricter alignment requirement than the driver is
advertising. If stricter alignment is required, then in no case should
misaligned buffers from userspace be allowed through without being
bounced or at least causing an error to be returned.
It seems the mptsas driver could use blk_queue_dma_alignment() to advertise
a stricter alignment requirement. If it does, sd does the right thing and
bounces misaligned buffers (see block/blk-map.c line 57). The following
patch to 2.6.34-rc5 makes my symptoms go away. I'm sure this is the wrong
place for this code, but it gets my idea across.
Tavis Ormandy pointed out that do_io_submit does not do proper bounds
checking on the passed-in iocb array:
if (unlikely(nr < 0))
return -EINVAL;
if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(iocbpp)))))
return -EFAULT; ^^^^^^^^^^^^^^^^^^
The attached patch checks for overflow, and if it is detected, the
number of iocbs submitted is scaled down to a number that will fit in
the long. This is an ok thing to do, as sys_io_submit is documented as
returning the number of iocbs submitted, so callers should handle a
return value of less than the 'nr' argument passed in.
pcpu_first/last_unit_cpu are used to track which cpu has the first and
last units assigned. This in turn is used to determine the span of a
chunk for man/unmap cache flushes and whether an address belongs to
the first chunk or not in per_cpu_ptr_to_phys().
When the number of possible CPUs isn't power of two, a chunk may
contain unassigned units towards the end of a chunk. The logic to
determine pcpu_last_unit_cpu was incorrect when there was an unused
unit at the end of a chunk. It failed to ignore the unused unit and
assigned the unused marker NR_CPUS to pcpu_last_unit_cpu.
This was discovered through kdump failure which was caused by
malfunctioning per_cpu_ptr_to_phys() on a kvm setup with 50 possible
CPUs by CAI Qian.
The FBIOGET_VBLANK device ioctl allows unprivileged users to read 16 bytes
of uninitialized stack memory, because the "reserved" member of the
fb_vblank struct declared on the stack is not altered or zeroed before
being copied back to the user. This patch takes care of it.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Cc: Thomas Winischhofer <thomas@winischhofer.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
drivers/pci/intel-iommu.c: In function `__iommu_calculate_agaw':
drivers/pci/intel-iommu.c:437: sorry, unimplemented: inlining failed in call to 'width_to_agaw': function body not available
drivers/pci/intel-iommu.c:445: sorry, unimplemented: called from here
Move the offending function (and its siblings) to top-of-file, remove the
forward declaration.
These devices don't do any writeback but their device inodes still can get
dirty so mark bdi appropriately so that bdi code does the right thing and files
inodes to lists of bdi carrying the device inodes.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch adds CPU type detection for the Intel Celeron 540, which is
part of the Core 2 family according to Wikipedia; the family and ID pair
is absent from the Volume 3B table referenced in the source code
comments. I have tested this patch on an Intel Celeron 540 machine
reporting itself as Family 6 Model 22, and OProfile runs on the machine
without issue.
We have 32-bit variable overflow possibility when multiply in
task_times() and thread_group_times() functions. When the
overflow happens then the scaled utime value becomes erroneously
small and the scaled stime becomes i erroneously big.
It turns out that the setpgid() system call fails to enter an RCU
read-side critical section before doing a PID-to-task_struct translation.
This commit therefore does rcu_read_lock() before the translation, and
also does rcu_read_unlock() after the last use of the returned pointer.
Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The members of struct llc_sock are unsigned so if we pass a negative
value for "opt" it can cause a sign bug. Also it can cause an integer
overflow when we multiply "opt * HZ".
Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
"param->u.wpa_associate.wpa_ie_len" comes from the user. We should
check it so that the copy_from_user() doesn't overflow the buffer.
Also further down in the function, we assume that if
"param->u.wpa_associate.wpa_ie_len" is set then "abyWPAIE[0]" is
initialized. To make that work, I changed the test here to say that if
"wpa_ie_len" is set then "wpa_ie" has to be a valid pointer or we return
-EINVAL.
Oddly, we only use the first element of the abyWPAIE[] array. So I
suspect there may be some other issues in this function.
Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It was recently brought to my attention that 802.3ad mode bonds would no
longer form when using some network hardware after a driver update.
After snooping around I realized that the particular hardware was using
page-based skbs and found that skb->data did not contain a valid LACPDU
as it was not stored there. That explained the inability to form an
802.3ad-based bond. For balance-alb mode bonds this was also an issue
as ARPs would not be properly processed.
This patch fixes the issue in my tests and should be applied to 2.6.36
and as far back as anyone cares to add it to stable.
Thanks to Alexander Duyck <alexander.h.duyck@intel.com> and Jesse
Brandeburg <jesse.brandeburg@intel.com> for the suggestions on this one.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The EQL_GETMASTRCFG device ioctl allows unprivileged users to read 16
bytes of uninitialized stack memory, because the "master_name" member of
the master_config_t struct declared on the stack in eql_g_master_cfg()
is not altered or zeroed before being copied back to the user. This
patch takes care of it.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The CHELSIO_GET_QSET_NUM device ioctl allows unprivileged users to read
4 bytes of uninitialized stack memory, because the "addr" member of the
ch_reg struct declared on the stack in cxgb_extension_ioctl() is not
altered or zeroed before being copied back to the user. This patch
takes care of it.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The TIOCGICOUNT device ioctl allows unprivileged users to read
uninitialized stack memory, because the "reserved" member of the
serial_icounter_struct struct declared on the stack in hso_get_count()
is not altered or zeroed before being copied back to the user. This
patch takes care of it.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This is based upon a report by Meelis Roos showing that it's possible
that we'll try to fetch a property that is 32K in size with some
devices. With the current fixed 3K buffer we use for moving data in
and out of the firmware during PROM calls, that simply won't work.
In fact, it will scramble random kernel data during bootup.
The reasoning behind the temporary buffer is entirely historical. It
used to be the case that we had problems referencing dynamic kernel
memory (including the stack) early in the boot process before we
explicitly told the firwmare to switch us over to the kernel trap
table.
So what we did was always give the firmware buffers that were locked
into the main kernel image.
But we no longer have problems like that, so get rid of all of this
indirect bounce buffering.
Besides fixing Meelis's bug, this also makes the kernel data about 3K
smaller.
It was also discovered during these conversions that the
implementation of prom_retain() was completely wrong, so that was
fixed here as well. Currently that interface is not in use.
Reported-by: Meelis Roos <mroos@linux.ee> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Realtek confirmed that a 20us delay is needed after mdio_read and
mdio_write operations. Reduce the delay in mdio_write, and add it
to mdio_read too. Also add a comment that the 20us is from hw specs.
Signed-off-by: Timo Teräs <timo.teras@iki.fi> Acked-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Some configurations need delay between the "write completed" indication
and new write to work reliably.
Realtek driver seems to use longer delay when polling the "write complete"
bit, so it waits long enough between writes with high probability (but
could probably break too). This patch adds a new udelay to make sure we
wait unconditionally some time after the write complete indication.
This caused a regression with XID 18000000 boards when the board specific
phy configuration writing many mdio registers was added in commit 2e955856ff (r8169: phy init for the 8169scd). Some of the configration
mdio writes would almost always fail, and depending on failure might leave
the PHY in non-working state.
Signed-off-by: Timo Teräs <timo.teras@iki.fi> Acked-off-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We assumed that unix_autobind() never fails if kzalloc() succeeded.
But unix_autobind() allows only 1048576 names. If /proc/sys/fs/file-max is
larger than 1048576 (e.g. systems with more than 10GB of RAM), a local user can
consume all names using fork()/socket()/bind().
If all names are in use, those who call bind() with addr_len == sizeof(short)
or connect()/sendmsg() with setsockopt(SO_PASSCRED) will continue
while (1)
yield();
loop at unix_autobind() till a name becomes available.
This patch adds a loop counter in order to give up after 1048576 attempts.
Calling yield() for once per 256 attempts may not be sufficient when many names
are already in use, for __unix_find_socket_byname() can take long time under
such circumstance. Therefore, this patch also adds cond_resched() call.
Note that currently a local user can consume 2GB of kernel memory if the user
is allowed to create and autobind 1048576 UNIX domain sockets. We should
consider adding some restriction for autobind operation.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If peer uses tiny MSS (say, 75 bytes) and similarly tiny advertised
window, the SWS logic will packetize to half the MSS unnecessarily.
This causes problems with some embedded devices.
However for large MSS devices we do want to half-MSS packetize
otherwise we never get enough packets into the pipe for things
like fast retransmit and recovery to work.
Be careful also to handle the case where MSS > window, otherwise
we'll never send until the probe timer.
Reported-by: ツ Leandro Melo de Sales <leandroal@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
struct rds_rdma_notify contains a 32 bits hole on 64bit arches,
make sure it is zeroed before copying it to user.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
tcp_read_sock() can have a eat skbs without immediately advancing copied_seq.
This can cause a panic in tcp_collapse() if it is called as a result
of the recv_actor dropping the socket lock.
A userspace program that splices data from a socket to either another
socket or to a file can trigger this bug.
Signed-off-by: Steven J. Magnani <steve@digidescorp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Any time we call into the IP stack we have to make sure the state
there is as expected by the ipv4 code.
With help from Eric Dumazet and Herbert Xu.
Reported-by: Brandan Das <brandan.das@stratus.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The bridge protocol lives dangerously by having incestuous relations
with the IP stack. In this instance an abomination has been created
where a bogus IPCB area from a bridged packet leads to a crash in
the IP stack because it's interpreted as IP options.
This patch papers over the problem by clearing the IPCB area in that
particular spot. To fix this properly we'd also need to parse any
IP options if present but I'm way too lazy for that.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
As discovered by Anton Blanchard, current code to autotune
tcp_death_row.sysctl_max_tw_buckets, sysctl_tcp_max_orphans and
sysctl_max_syn_backlog makes little sense.
The bigger a page is, the less tcp_max_orphans is : 4096 on a 512GB
machine in Anton's case.
(tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket))
is much bigger if spinlock debugging is on. Its wrong to select bigger
limits in this case (where kernel structures are also bigger)
bhash_size max is 65536, and we get this value even for small machines.
A better ground is to use size of ehash table, this also makes code
shorter and more obvious.
Based on a patch from Anton, and another from David.
Reported-and-tested-by: Anton Blanchard <anton@samba.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
As reported by Anton Blanchard when we use
percpu_counter_read_positive() to make our orphan socket limit checks,
the check can be off by up to num_cpus_online() * batch (which is 32
by default) which on a 128 cpu machine can be as large as the default
orphan limit itself.
Fix this by doing the full expensive sum check if the optimized check
triggers.
Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This issue come from ruby language community. Below test program
hang up when only run on Linux.
% uname -mrsv
Linux 2.6.26-2-486 #1 Sat Dec 26 08:37:39 UTC 2009 i686
% ruby -rsocket -ve '
BasicSocket.do_not_reverse_lookup = true
serv = TCPServer.open("127.0.0.1", 0)
s1 = TCPSocket.open("127.0.0.1", serv.addr[1])
s2 = serv.accept
s2.close
s1.write("a") rescue p $!
s1.write("a") rescue p $!
Thread.new {
s1.write("a")
}.join'
ruby 1.9.3dev (2010-07-06 trunk 28554) [i686-linux]
#<Errno::EPIPE: Broken pipe>
[Hang Here]
FreeBSD, Solaris, Mac doesn't. because Ruby's write() method call
select() internally. and tcp_poll has a bug.
SUS defined 'ready for writing' of select() as following.
| A descriptor shall be considered ready for writing when a call to an output
| function with O_NONBLOCK clear would not block, whether or not the function
| would transfer data successfully.
That said, EPIPE situation is clearly one of 'ready for writing'.
We don't have read-side issue because tcp_poll() already has read side
shutdown care.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If irda_open_tsap() fails, the irda_bind() code tries to destroy
the ->ias_obj object by hand, but does so wrongly.
In particular, it fails to a) release the hashbin attached to the
object and b) reset the self->ias_obj pointer to NULL.
Fix both problems by using irias_delete_object() and explicitly
setting self->ias_obj to NULL, just as irda_release() does.
Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The patch: "gro: fix different skb headrooms" in its part:
"2) allocate a minimal skb for head of frag_list" is buggy. The copied
skb has p->data set at the ip header at the moment, and skb_gro_offset
is the length of ip + tcp headers. So, after the change the length of
mac header is skipped. Later skb_set_mac_header() sets it into the
NET_SKB_PAD area (if it's long enough) and ip header is misaligned at
NET_SKB_PAD + NET_IP_ALIGN offset. There is no reason to assume the
original skb was wrongly allocated, so let's copy it as it was.
Packets entering GRO might have different headrooms, even for a given
flow (because of implementation details in drivers, like copybreak).
We cant force drivers to deliver packets with a fixed headroom.
1) fix skb_segment()
skb_segment() makes the false assumption headrooms of fragments are same
than the head. When CHECKSUM_PARTIAL is used, this can give csum_start
errors, and crash later in skb_copy_and_csum_dev()
2) allocate a minimal skb for head of frag_list
skb_gro_receive() uses netdev_alloc_skb(headroom + skb_gro_offset(p)) to
allocate a fresh skb. This adds NET_SKB_PAD to a padding already
provided by netdevice, depending on various things, like copybreak.
Use alloc_skb() to allocate an exact padding, to reduce cache line
needs:
NET_SKB_PAD + NET_IP_ALIGN
The TIOCGICOUNT device ioctl in both mos7720.c and mos7840.c allows
unprivileged users to read uninitialized stack memory, because the
"reserved" member of the serial_icounter_struct struct declared on the
stack is not altered or zeroed before being copied back to the user.
This patch takes care of it.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Otherwise when disabling the output we switch to the new fb (which is
likely NULL) and skip the call to mode_set -- leaking driver private
state on the old_fb.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29857 Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Dave Airlie <airlied@redhat.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Arguably this is a bug in drm-core in that we should not be called twice
in succession with DPMS_ON, however this is still occuring and we see
FDI link training failures on the second call leading to the occassional
blank display. For the time being ignore the repeated call.
Original patch by Dave Airlie <airlied@redhat.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
copy_to_user() returns the number of bytes remaining to be copied and
I'm pretty sure we want to return a negative error code here.
Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
copy_to_user returns the number of bytes remaining to be copied, but we
want to return a negative error code here. These are returned to
userspace.
Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If rpc_queue_upcall() adds a new upcall to the rpci->pipe list just
after rpc_pipe_release calls rpc_purge_list(), but before it calls
gss_pipe_release (as rpci->ops->release_pipe(inode)), then the latter
will free a message without deleting it from the rpci->pipe list.
We will be left with a freed object on the rpc->pipe list. Most
frequent symptoms are kernel crashes in rpc.gssd system calls on the
pipe in question.
Reported-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
All bits in the values read from registers to be used for the next
write were getting overwritten, avoid doing so to not mess with the
current configuration.
The spec notes that fan0 and fan1 control mode bits are located in bits
7-6 and 5-4 respectively, but the FAN_CTRL_MODE macro was making the
bits shift by 5 instead of by 4.
If a signal hits us outside of a syscall and another gets delivered
when we are in sigreturn (e.g. because it had been in sa_mask for
the first one and got sent to us while we'd been in the first handler),
we have a chance of returning from the second handler to location one
insn prior to where we ought to return. If r0 happens to contain -513
(-ERESTARTNOINTR), sigreturn will get confused into doing restart
syscall song and dance.
Incredible joy to debug, since it manifests as random, infrequent and
very hard to reproduce double execution of instructions in userland
code...
The fix is simple - mark it "don't bother with restarts" in wrapper,
i.e. set r8 to 0 in sys_sigreturn and sys_rt_sigreturn wrappers,
suppressing the syscall restart handling on return from these guys.
They can't legitimately return a restart-worthy error anyway.
Since ALC259/269 use the same parser of ALC268, the pin 0x1b was ignored
as an invalid widget. Just add this NID to handle properly.
This will add the missing mixer controls for some devices.
When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are
enabled we can call cpuacct_update_stats with values much larger
than percpu_counter_batch. This means the call to
percpu_counter_add will always add to the global count which is
protected by a spinlock and we end up with a global spinlock in
the scheduler.
Based on an idea by KOSAKI Motohiro, this patch scales the batch
value by cputime_one_jiffy such that we have the same batch
limit as we would if CONFIG_VIRT_CPU_ACCOUNTING was disabled.
His patch did this once at boot but that initialisation happened
too early on PowerPC (before time_init) and it was never updated
at runtime as a result of a hotplug cpu add/remove.
This patch instead scales percpu_counter_batch by
cputime_one_jiffy at runtime, which keeps the batch correct even
after cpu hotplug operations. We cap it at INT_MAX in case of
overflow.
For architectures that do not support
CONFIG_VIRT_CPU_ACCOUNTING, cputime_one_jiffy is the constant 1
and gcc is smart enough to optimise min(s32
percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
least on x86 and PowerPC. So there is no need to add an #ifdef.
On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark
is 234x faster and almost matches a CONFIG_CGROUP_CPUACCT
disabled kernel:
Issues in the current select_idle_sibling() logic in select_task_rq_fair()
in the context of a task wake-up:
a) Once we select the idle sibling, we use that domain (spanning the cpu that
the task is currently woken-up and the idle sibling that we found) in our
wake_affine() decisions. This domain is completely different from the
domain(we are supposed to use) that spans the cpu that the task currently
woken-up and the cpu where the task previously ran.
b) We do select_idle_sibling() check only for the cpu that the task is
currently woken-up on. If select_task_rq_fair() selects the previously run
cpu for waking the task, doing a select_idle_sibling() check
for that cpu also helps and we don't do this currently.
c) In the scenarios where the cpu that the task is woken-up is busy but
with its HT siblings are idle, we are selecting the task be woken-up
on the idle HT sibling instead of a core that it previously ran
and currently completely idle. i.e., we are not taking decisions based on
wake_affine() but directly selecting an idle sibling that can cause
an imbalance at the SMT/MC level which will be later corrected by the
periodic load balancer.
Fix this by first going through the load imbalance calculations using
wake_affine() and once we make a decision of woken-up cpu vs previously-ran cpu,
then choose a possible idle sibling for waking up the task on.
Dave reported that his large SPARC machines spend lots of time in
hweight64(), try and optimize some of those needless cpumask_weight()
invocations (esp. with the large offstack cpumasks these are very
expensive indeed).
Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Don't bother with selection when the current cpu is idle. Recent load
balancing changes also make it no longer necessary to check wake_affine()
success before returning the selected sibling, so we now always use it.
Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301369.6785.36.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't
enabled, leading to many cache misses on large machines as we traverse
looking for an idle shared cache to wake to. Change the enabler of
select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the
sibling domain level.
Reported-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1262612696.15495.15.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Instead of only considering SD_WAKE_AFFINE | SD_PREFER_SIBLING
domains also allow all SD_PREFER_SIBLING domains below a
SD_WAKE_AFFINE domain to change the affinity target.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091112145610.909723612@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>