Randy Dunlap [Thu, 20 Mar 2008 00:00:40 +0000 (17:00 -0700)]
mm: fix various kernel-doc comments
Fix various kernel-doc notation in mm/:
filemap.c: add function short description; convert 2 to kernel-doc
fremap.c: change parameter 'prot' to @prot
pagewalk.c: change "-" in function parameters to ":"
slab.c: fix short description of kmem_ptr_validate()
swap.c: fix description & parameters of put_pages_list()
swap_state.c: fix function parameters
vmalloc.c: change "@returns" to "Returns:" since that is not a parameter
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Quentin Barnes [Thu, 20 Mar 2008 00:00:39 +0000 (17:00 -0700)]
aio: bad AIO race in aio_complete() leads to process hang
My group ran into an AIO process hang on a 2.6.24 kernel, with the process
sleeping indefinitely in io_getevents(2), waiting for a last wakeup that
never came.
We ran the tests on x86_64 SMP. The hang only occurred on a Xeon box
("Clovertown") but not a Core2Duo ("Conroe"). On the Xeon, the L2 cache isn't
shared between all eight processors, but the L2 cache is shared between the
two processors on the Core2Duo we use.
My analysis of the hang is if you go down to the second while-loop
in read_events(), what happens on processor #1:
1) add_wait_queue_exclusive() adds thread to ctx->wait
2) aio_read_evt() to check tail
3) if aio_read_evt() returned 0, call [io_]schedule() and sleep
In aio_complete() with processor #2:
A) info->tail = tail;
B) waitqueue_active(&ctx->wait)
C) if waitqueue_active() returned non-0, call wake_up()
The way the code is written, step 1 must be visible to all other processors
before processor 1 checks for pending events in step 2 (events that were
recorded by step A), and step A on processor 2 must be visible to all other
processors (it is checked in step 2) before step B is done.
The race I believed I was seeing is that steps 1 and 2 were
effectively swapped because the __list_add() was delayed in an L2
cache that is not shared with some of the other processors. Imagine:
proc 2: just before step A
proc 1, step 1: adds to ctx->wait, but is not visible by other processors yet
proc 1, step 2: checks tail and sees no pending events
proc 2, step A: updates tail
proc 1, step 3: calls [io_]schedule() and sleeps
proc 2, step B: checks ctx->wait, but sees no one waiting, skips wakeup
so proc 1 sleeps indefinitely
My patch adds a memory barrier between steps A and B. It ensures that the
update in step 1 gets seen on processor 2 before continuing. If processor 1
was just before step 1, the memory barrier makes sure that step A (update
tail) gets seen by the time processor 1 makes it to step 2 (check tail).
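For illustration, a minimal sketch of where the barrier sits in aio_complete()
(context trimmed; not the verbatim hunk):

    /* step A: publish the new tail */
    info->tail = tail;

    /* make the tail update (step A) visible before the waitqueue_active()
     * test in step B, so a sleeper that already checked the ring and is
     * about to sleep cannot miss the wakeup */
    smp_mb();

    /* steps B and C */
    if (waitqueue_active(&ctx->wait))
            wake_up(&ctx->wait);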
Before the patch our AIO process would hang virtually 100% of the time. After
the patch, we have yet to see the process ever hang.
Signed-off-by: Quentin Barnes <qbarnes+linux@yahoo-inc.com> Reviewed-by: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: <stable@kernel.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ We should probably disallow that "if (waitqueue_active()) wake_up()"
coding pattern, because it's so often buggy wrt memory ordering ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 19 Mar 2008 04:34:48 +0000 (21:34 -0700)]
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx:
async_tx: avoid the async xor_zero_sum path when src_cnt > device->max_xor
fsldma: Fix the DMA halt when using DMA_INTERRUPT async_tx transfer.
Linus Torvalds [Wed, 19 Mar 2008 04:26:24 +0000 (21:26 -0700)]
IDE: Make taskfile interface more robust wrt unexpected end-of-command
Now that we handle all the special commands using REQ_TYPE_ATA_TASKFILE
rather than using the old REQ_TYPE_ATA_CMD model, we need to also
emulate the lack of full taskfile data that comes with the old command
model (ie when commands are generated with the HDIO_DRIVE_CMD ioctl
rather than using the HDIO_DRIVE_TASK[FILE] ioctls).
In particular, this means that we should handle command completion the
more relaxed way that the old drive_cmd_intr() code did. It allows
commands to finish early even if they don't use up all the data that we
thought we had for them.
This fixes a regression seen by Anders Eriksson where some SMART
commands sent by smartd would cause a boot-time system hang on his
machine because the IDE command handling code didn't realize that the
command had completed.
Ingo Molnar [Sun, 16 Mar 2008 10:14:30 +0000 (11:14 +0100)]
sched: tune multi-core idle balancing
WAKE_IDLE is too aggressive on multi-core CPUs with the new
wake-affine code, keep it on for SMT/HT balancing alone
(where there's no cache affinity at all between logical CPUs).
Ingo Molnar [Sat, 15 Mar 2008 16:10:34 +0000 (17:10 +0100)]
sched: wakeup-buddy tasks are cache-hot
Wakeup-buddy tasks are cache-hot - this makes it a bit harder
for the load-balancer to tear them apart. (but it's still possible,
if the load is sufficiently asymmetric)
Ingo Molnar [Wed, 19 Mar 2008 00:42:00 +0000 (01:42 +0100)]
sched: improve affine wakeups
improve affine wakeups. Maintain the 'overlap' metric based on CFS's
sum_exec_runtime - which means the amount of time a task executes
after it wakes up some other task.
Use the 'overlap' for the wakeup decisions: if the 'overlap' is short,
it means there's strong workload coupling between this task and the
woken up task. If the 'overlap' is large then the workload is decoupled
and the scheduler will move them to separate CPUs more easily.
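A rough sketch of the bookkeeping described above; the field names are
assumptions, not necessarily what the patch uses:

    /* remember where the waker's runtime stood when it last woke somebody,
     * and treat a small delta as tight waker/wakee coupling */
    static void note_wakeup(struct sched_entity *se)
    {
            se->last_wakeup = se->sum_exec_runtime;
    }

    static int tightly_coupled(struct sched_entity *se, u64 threshold)
    {
            u64 overlap = se->sum_exec_runtime - se->last_wakeup;

            return overlap < threshold;
    }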
( Also slightly move the preempt_check within try_to_wake_up() - this has
no effect on functionality but allows 'early wakeups' (for still-on-rq
tasks) to be correctly accounted as well.)
(the md5's changed because stack slots changed and some registers
get scheduled by gcc in a different order - but otherwise the before
and after assembly is instruction for instruction equivalent.)
Dan Williams [Wed, 19 Mar 2008 04:23:59 +0000 (21:23 -0700)]
async_tx: avoid the async xor_zero_sum path when src_cnt > device->max_xor
If the channel cannot perform the operation in one call to
->device_prep_dma_zero_sum, then fall back to the xor+page_is_zero path.
This only affects users with arrays larger than 16 devices on iop13xx or
32 devices on iop3xx.
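A hedged sketch of the guard (identifiers follow the async_tx/dmaengine API
of that era and should be treated as assumptions):

    struct dma_chan *chan = async_tx_find_channel(depend_tx, DMA_ZERO_SUM);
    struct dma_device *device = chan ? chan->device : NULL;

    if (device && src_cnt <= device->max_xor) {
            /* one hardware descriptor can cover every source:
             * take the ->device_prep_dma_zero_sum() path */
    } else {
            /* too many sources for the engine: fall back to
             * async_xor() followed by a page_is_zero() check */
    }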
Cc: <stable@kernel.org> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Zhang Wei [Wed, 19 Mar 2008 01:45:00 +0000 (18:45 -0700)]
fsldma: Fix the DMA halt when using DMA_INTERRUPT async_tx transfer.
The DMA_INTERRUPT async_tx is a NULL transfer, so the BCR (count register)
is 0. When a transfer is started with a byte count of zero, the DMA
controller triggers a PE (programming error) event and halts, rather than
raising a normal interrupt. This patch adds special handling for the PE
event and for DMA_INTERRUPT async_tx testing.
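Roughly, the interrupt handler learns to tell the benign case apart from a
real programming error; the status-bit and helper names below are
illustrative assumptions rather than the driver's exact identifiers:

    if (stat & FSL_DMA_SR_PE) {     /* "programming error", assumed bit name */
            if (get_bcr(fsl_chan) == 0) {
                    /* zero-byte DMA_INTERRUPT descriptor: expected, treat
                     * it like a completed transfer instead of halting */
                    dev_dbg(fsl_chan->dev, "PE from a null DMA_INTERRUPT transfer\n");
            } else {
                    dev_err(fsl_chan->dev, "DMA programming error, channel halted\n");
            }
            stat &= ~FSL_DMA_SR_PE;
    }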
Signed-off-by: Zhang Wei <wei.zhang@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Linus Torvalds [Tue, 18 Mar 2008 14:49:59 +0000 (07:49 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: update key codes for Apple aluminium
HID: fix comment in hid_input_report()
HID: BADPAD entry for NATSU Playstation USB adapter
HID: Use DIV_ROUND_UP
HID: remove HID_QUIRK_APPLE_ISO_KEYBOARD for 4th generation macbook
Linus Torvalds [Tue, 18 Mar 2008 14:43:14 +0000 (07:43 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
Revert "unexport bio_{,un}map_user"
relay: fix subbuf_splice_actor() adding too many pages
The ps2esdi driver was marked as BROKEN more than two years ago due to being
Linus Torvalds [Tue, 18 Mar 2008 14:32:23 +0000 (07:32 -0700)]
Merge branch 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6
* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm/ati_pcigart: fix the PCIGART to use drm_pci to allocate GART table.
drm/radeon: fixup RV550 chip family
drm/via: attempt again to stabilise the AGP DMA command submission.
drm: Fix race that can lockup the kernel
F5 and F6 have no second function printed on them. Thus their definitions have
been removed from the table.
KEY_CYCLEWINDOWS doesn't properly name the function of Mac OS X's Expose,
and because we couldn't find a better key code, we decided to use KEY_FN_F4
instead.
We also changed KEY_BACK and KEY_FORWARD, which apply to browser functions, to
KEY_PREVIOUSSONG and KEY_NEXTSONG, since the keys are intended to control a
music player.
Signed-off-by: Michael Hanselmann <linux-kernel@hansmi.ch> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Holger Macht [Wed, 12 Mar 2008 00:07:27 +0000 (01:07 +0100)]
ACPI: Set flag DOCK_UNDOCKING when triggered via sysfs
begin_undock() is only called when triggered via an acpi notify handler
(pressing the undock button on the dock station), but complete_undock() is
always called after the eject. So if an undock is triggered through a sysfs
write, the flag DOCK_UNDOCKING has to be set for the dock station,
too. Otherwise this will freeze the system hard.
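A minimal sketch of the fix shape; the function and variable names follow
drivers/acpi/dock.c loosely and are assumptions here:

    static ssize_t write_undock(struct device *dev, struct device_attribute *attr,
                                const char *buf, size_t count)
    {
            int ret;

            /* mark the station as undocking before ejecting, just like the
             * notify-handler path does */
            begin_undock(dock_station);
            ret = handle_eject_request(dock_station, ACPI_NOTIFY_EJECT_REQUEST);
            return ret ? ret : count;
    }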
Signed-off-by: Holger Macht <hmacht@suse.de> Acked-by: Kristen Carlson Accardi <kristen.c.accardi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com>
Julia Lawall [Tue, 4 Mar 2008 23:00:13 +0000 (15:00 -0800)]
asus_acpi: remove misleading mask
led_out is boolean, so there is no functional change here,
but apparently an extra mask with 1 caused some style checkers
to flag this as a logic bug.
Signed-off-by: Julia Lawall <julia@diku.dk> Acked-by: Luca Tettamanti <kronos.it@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com>
ACPI: battery: Don't return -EFAIL on broken packages.
Acer BIOS has a bug which is exposed when a dead battery is present.
The package template that is used to describe battery status is
over-written with sane values when the battery is live.
But when the battery is dead, a bogus reference in the template
is used. In this case, Linux returns a fault, when instead
it should simply report that it doesn't know the missing value.
Mark Lord [Mon, 17 Mar 2008 20:04:23 +0000 (16:04 -0400)]
pciehp: don't enable slot unless forced
This fixes a 2.6.25 regression reported by Alex Chiang.
Invoke pciehp_enable_slot() at startup only when pciehp_force=1.
Some HP equipment apparently cannot cope with it otherwise.
This restores the (previously working) 2.6.24 behaviour here,
while allowing machines that need a kick to use pciehp_force=1.
This was the original design back in October 2007,
but Kristen suggested we try without it first:
Kristen Carlson Accardi wrote:
>I think it would be ok to try allowing the slot to be enabled when not
>using pciehp_force mode. We can wrap it later if it proves to break things
This ended up breaking one of Alex's setups,
so it's time to put the wrapper back in now.
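The wrapper amounts to something like the following in the probe path (a
sketch; t_slot, the hpc_ops accessor and the status variable are
assumptions):

    /* only kick the slot at startup when the user explicitly asked for it,
     * i.e. restore the 2.6.24 behaviour by default */
    if (pciehp_force) {
            t_slot->hpc_ops->get_adapter_status(t_slot, &occupied);
            if (occupied)
                    pciehp_enable_slot(t_slot);
    }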
Signed-off-by: Mark Lord <mlord@pobox.com> Acked-by: Alex Chiang <achiang@hp.com> Acked-by: Kristen Carlson Accardi <kristen.c.accardi@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jens Axboe [Mon, 17 Mar 2008 20:14:40 +0000 (21:14 +0100)]
Revert "unexport bio_{,un}map_user"
Outside users like asmlib use the mapping functions. API-wise, the
export is definitely sane. It's a better idea to keep this export
than to require external users to open-code this piece of code instead.
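In effect the revert restores the two exports in fs/bio.c:

    EXPORT_SYMBOL(bio_map_user);
    EXPORT_SYMBOL(bio_unmap_user);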
slub page alloc fallback: Enable interrupts for GFP_WAIT.
The fallback path needs to enable interrupts, as is done for
the other page allocator calls. This was not necessary with
the alternate fast path, since we handled irq enable/disable in
the slow path. The regular fastpath handles irq enable/disable
around calls to the slow path, so we need to restore the proper
status before calling the page allocator from the slowpath.
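A hedged sketch of what the fallback now does around the page allocator call
(names only approximate the SLUB code of that era):

    /* the slow path runs with interrupts disabled; give the page allocator
     * a sleepable context when the caller allows it */
    if (gfpflags & __GFP_WAIT)
            local_irq_enable();

    page = alloc_pages(gfpflags, get_order(size));

    if (gfpflags & __GFP_WAIT)
            local_irq_disable();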
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Linus Torvalds [Mon, 17 Mar 2008 16:52:24 +0000 (09:52 -0700)]
Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
ahci: Add Marvell 6121 SATA support
pata_ali: use atapi_cmd_type() to determine cmd type instead of transfer size
ahci: implement skip_host_reset parameter
ahci: request all PCI BARs
devres: implement pcim_iomap_regions_request_all()
libata-acpi: improve dock event handling
Tejun Heo [Tue, 11 Mar 2008 02:35:00 +0000 (11:35 +0900)]
pata_ali: use atapi_cmd_type() to determine cmd type instead of transfer size
pata_ali was using qc->nbytes to determine whether a command is a
data-transfer type or not. Now that qc->nbytes can be extended by
padding and draining buffers, these tests are no longer useful.
Use atapi_cmd_type() instead.
Signed-off-by: Tejun Heo <htejun@gmail.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Jeff Garzik <jeff@garzik.org>
Tejun Heo [Mon, 10 Mar 2008 01:25:25 +0000 (10:25 +0900)]
ahci: implement skip_host_reset parameter
Under certain circumstances (SSP turned off by the BIOS) and for
debugging purposes, skipping global controller reset is helpful. Add
a kernel parameter for it.
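The parameter wiring is the usual module_param pattern, roughly (the variable
name and how it is consulted later are assumptions):

    static int ahci_skip_host_reset;
    module_param_named(skip_host_reset, ahci_skip_host_reset, int, 0444);
    MODULE_PARM_DESC(skip_host_reset, "skip global host reset (0=don't skip, 1=skip)");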
Signed-off-by: Tejun Heo <htejun@gmail.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
Tejun Heo [Tue, 11 Mar 2008 10:52:31 +0000 (19:52 +0900)]
ahci: request all PCI BARs
ahci is often implemented with an accompanying SFF-compatible interface,
and a legacy IDE driver may attach to the legacy IO ports when the
controller is already claimed by ahci, and vice versa. This patch
makes ahci use pcim_iomap_regions_request_all() so that all IO regions
are claimed on attach.
Signed-off-by: Tejun Heo <htejun@gmail.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
Some drivers need to reserve all PCI BARs to prevent other drivers
misusing unoccupied BARs. pcim_iomap_regions_request_all() requests
all BARs and iomaps the specified BARs.
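Driver usage then looks roughly like the ahci case: every BAR gets requested,
and the mask selects which ones also get iomapped (a sketch; AHCI_PCI_BAR is
the ahci ABAR index and error handling is trimmed):

    rc = pcim_iomap_regions_request_all(pdev, 1 << AHCI_PCI_BAR, DRV_NAME);
    if (rc == -EBUSY)
            pcim_pin_device(pdev);
    if (rc)
            return rc;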
Signed-off-by: Tejun Heo <htejun@gmail.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Jeff Garzik <jeff@garzik.org> Signed-off-by: Jeff Garzik <jeff@garzik.org>
The problem can be triggered with a high amount of host->guest traffic.
I think it's the following race:
poll says netif_rx_complete
poll calls enable_cb
enable_cb opens the interrupt mask
a new packet comes, an interrupt is triggered----\
enable_cb sees that there is more work            |
enable_cb disables the interrupt                  |
        .                                         V
        .                                interrupt is delivered
        .                                skb_recv_done does atomic napi test, ok
        .              some waiting disable_cb is called->check fails->bang!
        .
poll would do napi check
poll would do disable_cb
The fix is to let enable_cb not disable the interrupt again, but expect the
caller to do the cleanup if it returns false. In that case, the interrupt is
only disabled if the napi test_set_bit was successful.
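The resulting caller pattern in the poll routine looks roughly like this (a
sketch against the virtio_net of that era, so the exact names are
assumptions):

    netif_rx_complete(vi->dev, napi);
    if (unlikely(!vi->rvq->vq_ops->enable_cb(vi->rvq))
        && napi_schedule_prep(napi)) {
            /* more work arrived in the window: disable callbacks ourselves
             * and go poll again */
            vi->rvq->vq_ops->disable_cb(vi->rvq);
            __netif_rx_schedule(vi->dev, napi);
    }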
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (cleaned up doco)
Rusty Russell [Tue, 18 Mar 2008 03:58:15 +0000 (22:58 -0500)]
virtio: handle > 2 billion page balloon targets
If the host asks for a huge target, towards_target() can overflow, and
we end up oopsing as we try to release more pages than we have. The simple
fix is to use a 64-bit value.
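The gist of the fix, sketched with the config accessor and field names
treated as assumptions:

    /* return a signed 64-bit delta so a huge host target cannot wrap */
    static s64 towards_target(struct virtio_balloon *vb)
    {
            u32 v;

            vb->vdev->config->get(vb->vdev,
                                  offsetof(struct virtio_balloon_config, num_pages),
                                  &v, sizeof(v));
            return (s64)v - vb->num_pages;
    }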
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Anthony Liguori [Sun, 2 Mar 2008 22:37:48 +0000 (16:37 -0600)]
virtio: Use spin_lock_irqsave/restore for virtio-pci
virtio-pci acquires its spin lock in an interrupt context so it's necessary
to use spin_lock_irqsave/restore variants. This patch fixes guest SMP when
using virtio devices in KVM.
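The change pattern, sketched (the lock and list names are assumptions):

    unsigned long flags;

    /* this lock is also taken from the vring interrupt handler, so
     * process-context users must block local interrupts while holding it */
    spin_lock_irqsave(&vp_dev->lock, flags);
    list_add(&info->node, &vp_dev->virtqueues);
    spin_unlock_irqrestore(&vp_dev->lock, flags);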
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Thomas Hellstrom [Mon, 17 Mar 2008 00:07:20 +0000 (10:07 +1000)]
drm/via: attempt again to stabilise the AGP DMA command submission.
It's worth remembering that all new bright ideas on how to make this command
reader work properly and according to the docs will probably fail :( Bring in
some old code.
Also allow a larger SG-DMA download stride, and remove unnecessary waits for
command regulator pauses.
Mike Isely [Thu, 13 Mar 2008 20:30:35 +0000 (15:30 -0500)]
drm: Fix race that can lockup the kernel
The i915_vblank_swap() function schedules an automatic buffer swap
upon receipt of the vertical sync interrupt. Such an operation is
lengthy so it can't be allowed to happen in normal interrupt context,
thus the DRM implements this by scheduling the work in a kernel
softirq-scheduled tasklet. In order for the buffer swap to work
safely, the DRM's central lock must be taken, via a call to
drm_lock_take() located in drivers/char/drm/drm_irq.c within the
function drm_locked_tasklet_func(). The lock-taking logic uses a
non-interrupt-blocking spinlock to implement the manipulations needed
to take the lock. This semantic would be safe if all attempts to use
the spinlock only happen from process context. However this buffer
swap happens from softirq context which is really a form of interrupt
context. Thus we have an unsafe situation, in that
drm_locked_tasklet_func() can block on a spinlock already taken by a
thread in process context which will never get scheduled again because
of the blocked softirq tasklet. This wedges the kernel hard.
To trigger this bug, run a dual-head cloned mode configuration which
uses the i915 drm, then execute an opengl application which
synchronizes buffer swaps against the vertical sync interrupt. In my
testing, a lockup always results after running anywhere from 5 minutes
to an hour and a half. I believe dual-head is needed to really
trigger the problem because then the vertical sync interrupt handling
is no longer predictable (due to being interrupt-sourced from two
different heads running at different speeds). This raises the
probability of the tasklet trying to run while the userspace DRI is
doing things to the GPU (and manipulating the DRM lock).
The fix is to change the relevant spinlock semantics to be the
interrupt-blocking form. After this change I am no longer able to
trigger the lockup; the longest test run so far was 20 hours (test
stopped after that point).
Note: I have examined the places where this spinlock is being
employed; all are reasonably short bounded sequences and should be
suitable for interrupts being blocked without impacting overall kernel
interrupt response latency.
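The fix boils down to switching those lock-taking sites to the
interrupt-blocking form, for example (a sketch; the spinlock's exact location
in the DRM structures is an assumption):

    unsigned long irqflags;

    /* the buffer-swap tasklet takes this lock from softirq context, so
     * process-context users must block interrupts while they hold it */
    spin_lock_irqsave(&dev->lock.spinlock, irqflags);
    /* ... test-and-set the DRM hardware lock bits ... */
    spin_unlock_irqrestore(&dev->lock.spinlock, irqflags);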
Signed-off-by: Mike Isely <isely@pobox.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Helge Deller [Wed, 26 Dec 2007 17:07:01 +0000 (18:07 +0100)]
[PARISC] head.S: section mismatch fixes
- move boot_args[] into the init section
- move $global$ into the read_mostly section
- fix the following two section mismatches:
WARNING: vmlinux.o(.text+0x9c): Section mismatch: reference to .init.text:start_kernel (between '$pgt_fill_loop' and '$is_pa20')
WARNING: vmlinux.o(.text+0xa0): Section mismatch: reference to .init.text:start_kernel (between '$pgt_fill_loop' and '$is_pa20')
This is bogus on parisc, since page zero in kernel virtual space is the
gateway page for syscall entry, and should not be read from the kernel.
(That, and we really don't like the kernel faulting on its own address
space...)
Kyle McMartin [Sat, 1 Mar 2008 18:30:19 +0000 (10:30 -0800)]
[PARISC] clean up show_stack
When we show_regs, we obviously have a struct pt_regs of the calling
frame. Use these in show_stack so we don't have the entire bogus call trace
up to the show_stack call.
Adrian Bunk [Tue, 26 Feb 2008 19:55:17 +0000 (21:55 +0200)]
[PARISC] move defconfig to arch/parisc/configs/
This patch moves the default parisc defconfig to
arch/parisc/configs/generic_defconfig where it belongs and selects it as
the default defconfig through KBUILD_DEFCONFIG.
Signed-off-by: Adrian Bunk <adrian.bunk@movial.fi> Signed-off-by: Kyle McMartin <kyle@mcmartin.ca>
Kyle McMartin [Tue, 19 Feb 2008 07:34:34 +0000 (23:34 -0800)]
[PARISC] pdc_console: fix bizarre panic on boot
Commit 721fdf34167580ff98263c74cead8871d76936e6 introduced a subtle bug
by accidentally removing the "static" from iodc_dbuf. This resulted in what
appeared to be a trap without *current set to a task, probably the result of
a trap in real mode while calling firmware.
Also do other misc cleanups. Since the only input from firmware is
non-blocking, share iodc_dbuf between input and output, and spinlock the
only callers.
Kyle McMartin [Tue, 19 Feb 2008 07:26:46 +0000 (23:26 -0800)]
[PARISC] dump_stack in show_regs
Originally, show_stack was used in BUG() output. However, a recent commit
changed it to print register state (no idea what that's supposed to help,
really...) and parisc was missing a backtrace because of it.
It did ugly things to the init sequence to populate the rootfs image
early, but that just ended up showing other problems with the whole
approach. The fact is, the VFS layer simply isn't initialized this
early, and the relevant ACPI code should either run much later, or this
shouldn't be done at all.
For 2.6.25, we'll just pick the latter option. We can revisit this
concept later if necessary.
Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Tilman Schmidt <tilman@imap.cc> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Renninger <trenn@suse.de> Cc: Eric Piel <eric.piel@tremplin-utc.net> Cc: Len Brown <len.brown@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Markus Gaugusch <dsdt@gaugusch.at> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ingo Molnar [Fri, 14 Mar 2008 21:17:08 +0000 (22:17 +0100)]
sched: simplify sched_slice()
Use the existing calc_delta_mine() calculation for sched_slice(). This
saves a divide and simplifies the code because we share it with the
other /cfs_rq->load users.
It also improves code size:
text data bss dec hex filename
42659 2740 144 45543 b1e7 sched.o.before
42093 2740 144 44977 afb1 sched.o.after
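The simplified function is essentially a one-liner built on the existing
helper (quoted loosely; treat it as a sketch rather than the exact hunk):

    static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
    {
            /* weight this entity's share of the period against the runqueue
             * load, reusing calc_delta_mine() instead of an open-coded divide */
            return calc_delta_mine(__sched_period(cfs_rq->nr_running),
                                   se->load.weight, &cfs_rq->load);
    }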
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Peter Zijlstra [Fri, 14 Mar 2008 20:12:12 +0000 (21:12 +0100)]
sched: fix overload performance: buddy wakeups
Currently we schedule to the leftmost task in the runqueue. When the
runtimes are very short because of some server/client ping-pong,
especially in over-saturated workloads, this will cycle through all
tasks trashing the cache.
Reduce cache trashing by keeping dependent tasks together by running
newly woken tasks first. However, by not running the leftmost task first
we could starve tasks because the wakee can gain unlimited runtime.
Therefore we only run the wakee if it's within a small
(wakeup_granularity) window of the leftmost task. This preserves
fairness, but does alternate server/client task groups.
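A sketch of the pick logic this implies (the 'next' buddy field and the
helper are assumptions):

    static struct sched_entity *pick_buddy_or_leftmost(struct cfs_rq *cfs_rq,
                                                       struct sched_entity *leftmost)
    {
            s64 behind;

            if (!cfs_rq->next)
                    return leftmost;

            /* run the freshly woken buddy only if it has not fallen more than
             * one wakeup granularity behind the leftmost task */
            behind = (s64)(cfs_rq->next->vruntime - leftmost->vruntime);
            if (behind > (s64)sysctl_sched_wakeup_granularity)
                    return leftmost;

            return cfs_rq->next;
    }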
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Fri, 14 Mar 2008 19:55:51 +0000 (20:55 +0100)]
sched: min_vruntime fix
Current min_vruntime tracking is incorrect and will cause serious
problems when we don't run the leftmost task for some reason.
min_vruntime does two things: 1) it's used to determine a forward
direction when the u64 vruntime wraps, 2) it's used to track the
leftmost vruntime, to position newly enqueued tasks from.
The current logic advances min_vruntime whenever the current task's
vruntime advances. Because the current task may pass the leftmost task
still waiting, we fail the second goal. This causes new tasks to be
placed too far ahead and thus penalizes their runtime.
Fix this by making min_vruntime the min_vruntime of the waiting tasks by
tracking it in enqueue/dequeue, and compare against current's vruntime
to obtain the absolute minimum when placing new tasks.
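The actual patch tracks this in the enqueue/dequeue paths; the helper below
is only a sketch of the invariant being established (helper names are
assumptions):

    static void update_min_vruntime(struct cfs_rq *cfs_rq)
    {
            u64 vruntime = cfs_rq->min_vruntime;

            if (cfs_rq->curr)
                    vruntime = cfs_rq->curr->vruntime;

            /* take the leftmost waiting task into account, not just whatever
             * the current task has consumed */
            if (first_fair(cfs_rq)) {
                    struct sched_entity *leftmost = __pick_next_entity(cfs_rq);

                    if (cfs_rq->curr)
                            vruntime = min_vruntime(vruntime, leftmost->vruntime);
                    else
                            vruntime = leftmost->vruntime;
            }

            /* never let min_vruntime move backwards, so the u64 wrap
             * detection keeps working */
            cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
    }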
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix a hard to trigger crash seen in the -rt kernel that also affects
the vanilla scheduler.
There is a race condition between schedule() and some dequeue/enqueue
functions; rt_mutex_setprio(), __setscheduler() and sched_move_task().
When scheduling to idle, idle_balance() is called to pull tasks from
other busy processors. It might drop the rq lock, which means that those
three functions can encounter on_rq=0 and running=1. The current task
should be put while it is running.
The current process of CPU1 (P1) is scheduling. P1 has been deactivated,
and the scheduler looks for another process on other CPUs' runqueues
because CPU1 will be idle. idle_balance(), load_balance_newidle() and
double_lock_balance() are called, and double_lock_balance() could drop
the rq lock. On the other hand, CPU0 is trying to boost the priority of
P1. The result of the boost is that only P1's prio and sched_class are
changed to RT; the sched entities of P1 and P1's group are never put.
This makes the cfs_rq invalid, because the cfs_rq has a curr but no
leaf; when pick_next_task_fair() is then called, the kernel panics.
J. Bruce Fields [Fri, 14 Mar 2008 23:37:11 +0000 (19:37 -0400)]
nfsd: fix oops on access from high-numbered ports
This bug was always here, but before my commit 6fa02839bf9412e18e77
("recheck for secure ports in fh_verify"), it could only be triggered by
failure of a kmalloc(). After that commit it could be triggered by a
client making a request from a non-reserved port for access to an export
marked "secure". (Exports are "secure" by default.)
The result is a struct svc_export with a reference count one too low,
resulting in likely oopses next time the export is accessed.
The reference counting here is not straightforward; a later patch will
clean up fh_verify().
Thanks to Lukas Hejtmanek for the bug report and followup.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Lukas Hejtmanek <xhejtman@ics.muni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>