git.karo-electronics.de Git - linux-beck.git/log

sched: don't allow rt_runtime_us to be zero for groups having rt tasks

This patch checks whether rt_runtime_us can be set to 0. If there is a
realtime task in the group, we don't want to set rt_runtime_us to 0,
or bad things will happen (that task won't get any CPU time despite
being TASK_RUNNING).
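
A minimal sketch of the check described above, as plain userspace C. The
struct, the function name and the -EBUSY return value are assumptions made
for illustration; they are not taken from the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a task group; not the kernel's type. */
struct demo_group {
	uint64_t rt_runtime_us;    /* realtime budget per period, in microseconds */
	unsigned int nr_rt_tasks;  /* realtime tasks currently in the group */
};

/*
 * Refuse a zero budget while the group still holds realtime tasks.
 * Returning -EBUSY is an assumption made for this sketch.
 */
int demo_set_rt_runtime(struct demo_group *g, uint64_t runtime_us)
{
	assert(g != NULL);
	if (runtime_us == 0 && g->nr_rt_tasks > 0)
		return -EBUSY;
	g->rt_runtime_us = runtime_us;
	return 0;
}
```

With one realtime task in the group, a request for 0 is rejected and the old
budget is kept; once the group has no realtime tasks, 0 is accepted.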

Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: rt-group: fixup schedulability constraints calculation

it was only possible to configure the rt-group scheduling parameters
beyond the default value in a very small range.

that's because div64_64() has a different calling convention than
do_div() :/

fix a few untidy spots while we are here: sysctl_sched_rt_period may
overflow due to that multiplication, so cast to u64 first. Also, that
RUNTIME_INF juggling makes little sense, although it's an effective NOP.
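
The overflow mentioned above can be sketched in userspace C. This is an
illustration under assumptions, not the patched kernel code: the function
names are hypothetical, the period is assumed to be a 32-bit microsecond
value converted to nanoseconds, and the 16-bit fixed-point shift in the
ratio helper is likewise an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_USEC 1000u

/* Wrapping variant: the product is computed in 32 bits, then widened. */
uint64_t demo_period_ns_wrapping(uint32_t period_us)
{
	return period_us * DEMO_NSEC_PER_USEC;
}

/* Widen first, so the multiplication itself happens in 64 bits. */
uint64_t demo_period_ns(uint32_t period_us)
{
	return (uint64_t)period_us * DEMO_NSEC_PER_USEC;
}

/* Runtime/period as a fixed-point fraction; the 16-bit shift is an assumption. */
uint64_t demo_ratio(uint64_t period_ns, uint64_t runtime_ns)
{
	assert(period_ns != 0);
	return (runtime_ns << 16) / period_ns;
}
```

A period of 5 seconds (5000000 us) no longer fits in 32 bits once it is
expressed in nanoseconds, which is why the cast has to come before the
multiplication rather than after it.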

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: fix the wrong time slice value for SCHED_FIFO tasks

The function sys_sched_rr_get_interval returns the wrong time slice value
for SCHED_FIFO tasks. The time slice for SCHED_FIFO tasks should be 0.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: export task_nice

The API is trivial, and so is the implementation.

Signed-off-by: Pavel Roskin <proski@gnu.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: balance RT task resched only on runqueue

Sripathi Kodi reported a crash in the -rt kernel:

https://bugzilla.redhat.com/show_bug.cgi?id=435674

This is due to a code path that can reschedule a task without holding
the task's runqueue lock. This was caused by the RT balancing code
that pulls RT tasks to the current run queue and will reschedule the
current task.

There's a slight chance that the pulling of the RT tasks will release
the current runqueue's lock and retake it (in the double_lock_balance).
During this time that the runqueue is released, the current task can
migrate to another runqueue.

In the prio_changed_rt code, after the pull, if the current task is of
lesser priority than one of the RT tasks pulled, resched_task is called
on the current task. If the current task had migrated in that small
window, resched_task will be called without holding the runqueue lock
for the runqueue that the task is on.

This race condition also exists in the mainline kernel and this patch
adds a check to make sure the task hasn't migrated before calling
resched_task.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Tested-by: Sripathi Kodi <sripathik@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: retain vruntime

Kei Tokunaga reported an interactivity problem when moving tasks
between control groups.

Tasks would retain their old vruntime when moved between groups, which
can cause funny lags. Reset the vruntime on group move to fit within
the new tree.

Reported-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6.25

* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6.25:
  sh: Fix up the sh64 build.
  sh: Fix up SH7710 VoIP-GW build.
  sh: Flag PMB support as EXPERIMENTAL.
  sh: Update r7780mp defconfig.
  fb: hitfb: Balance probe/remove section annotations.
  sh: hp6xx: Fix up hp6xx_apm build failure.
  fb: pvr2fb: Fix up remaining section mismatch.
  sh: Fix up section mismatches.
  sh: hp6xx: Correct APM output.
  sh: update se7780 defconfig
  sh: replace remaining __FUNCTION__ occurrences
  sh: export copy-page() to modules
  sh_ksyms_32.c update for gcc 4.3
  sh/mm/pg-sh7705.c must #include <linux/fs.h>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/blackfin-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/blackfin-2.6:
  [Blackfin] arch: current_l1_stack_save is a pointer, so use NULL rather than 0
  [Blackfin] arch: fix atomic and32/xor32 comments and ENDPROC markings
  [Blackfin] arch: fix bug - allow SDH driver to be used as module
  [Blackfin] arch: to kill syscalls missing warning by adding new timerfd syscalls

Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] kprobes arch consolidation build fix
  [IA64] update efi region debugging to use MB, GB and TB as well as KB
  [IA64] use dev_printk in video quirk
  [IA64] remove remaining __FUNCTION__ occurrences
  [IA64] remove unnecessary nfs includes from sys_ia32.c
  [IA64] remove CONFIG_SMP ifdef in ia64_send_ipi()
  [IA64] arch_ptrace() cleanup
  [IA64] remove duplicate code from arch_ptrace()
  [IA64] convert sys_ptrace to arch_ptrace
  [IA64] remove find_thread_for_addr()
  [IA64] do not sync RBS when changing PT_AR_BSP or PT_CFM
  [IA64] access user RBS directly

[IA64] kprobes arch consolidation build fix

ia64 named their handler kprobes_fault_handler while all other
arches used kprobe_fault_handler. Change the function definition
and header declaration.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>

[IA64] update efi region debugging to use MB, GB and TB as well as KB

When EFI_DEBUG is defined to a non-zero value in arch/ia64/kernel/efi.c,
the efi memory regions are displayed. This patch enhances the
display code in a few ways:

1. Use TB, GB and MB as well as KB as units.
   Although this introduces rounding errors (KB doesn't, as
   size is always a multiple of 4KB), it does make
   things a lot more readable.

   Also as the range is also shown, it is possible to note the exact size
   if it is important. In my experience, the size field is mostly useful
   for getting a general idea of the size of a region.

   On the rx2620 that I use, there actually is an 8TB region (though not
   backed by physical memory), and 8TB really is a lot more readable than
   8589934592KB.

2. pad the size field with leading spaces to further improve readability

   ...
   ... (   8MB)
   ... ( 928MB)
   ... (   3MB)
   ...

   vs

   ...
   ... (8MB)
   ... (928MB)
   ... (3MB)
   ...

3. Pad the attr field out to 64bits using leading zeros,
   to further improve readability.

   ...
   mem05: type= 2, attr=0x0000000000000008, range=[0x0000000004000000-0x000000000481f000) (   8MB)
   mem06: type= 7, attr=0x0000000000000008, range=[0x000000000481f000-0x000000003e876000) ( 928MB)
   mem07: type= 5, attr=0x8000000000000008, range=[0x000000003e876000-0x000000003eb8e000) (   3MB)
   mem08: type= 4, attr=0x0000000000000008, range=[0x000000003eb8e000-0x000000003ee7a000) (   2MB)
   ...

   vs

   ...
   mem05: type= 2, attr=0x8, range=[0x0000000004000000-0x000000000481f000) (   8MB)
   mem06: type= 7, attr=0x8, range=[0x000000000481f000-0x000000003e876000) ( 928MB)
   mem07: type= 5, attr=0x8000000000000008, range=[0x000000003e876000-0x000000003eb8e000) (   3MB)
   mem08: type= 4, attr=0x8, range=[0x000000003eb8e000-0x000000003ee7a000) (   2MB)
   ...

4. Use %d instead of %u for the index field, as i is a signed int.

N.B.: This code is not compiled unless EFI_DEBUG is non-zero.
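
The unit selection and padding from points 1 and 2 can be sketched as
follows. This is a hedged userspace illustration, not the code in
arch/ia64/kernel/efi.c: the function name is hypothetical and the loop is
just one way to pick the largest unit that keeps the value non-zero,
truncating as in the sample output above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format a byte count as "(NNNNuu)" with the value right-aligned in 4 columns. */
int demo_format_size(char *buf, size_t len, uint64_t bytes)
{
	static const char *const units[] = { "KB", "MB", "GB", "TB" };
	uint64_t value = bytes >> 10;  /* start in KB */
	size_t unit = 0;

	assert(buf != NULL);
	/* Step up one unit at a time while the value is at least 1024. */
	while (unit < 3 && value >= 1024) {
		value >>= 10;
		unit++;
	}
	return snprintf(buf, len, "(%4llu%s)", (unsigned long long)value, units[unit]);
}
```

For the mem05 range shown above (0x4000000 to 0x481f000), this yields the
padded 8MB form, and an 8TB region comes out as 8TB rather than
8589934592KB.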

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Tony Luck <tony.luck@intel.com>

[IA64] use dev_printk in video quirk

Convert quirk printks to dev_printk().

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>

[IA64] remove remaining __FUNCTION__ occurrences

__FUNCTION__ is gcc-specific, use __func__

Long lines have been kept where they exist, some small spacing changes
have been done.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>

[IA64] remove unnecessary nfs includes from sys_ia32.c

Compilation of 2.6.25-rc2-mm1 on ia64 generates many warnings.

IA64 supports two ELF formats (IA64 binaries and IA32 binaries),
so including both ELF-related headers causes many warnings or errors.

About two weeks ago, J. Bruce Fields proposed a patch that fixes this problem.
(http://marc.info/?l=linux-ia64&m=120329313305695&w=2)

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>

[IA64] remove CONFIG_SMP ifdef in ia64_send_ipi()

When !CONFIG_SMP, cpu_physical_id() is ia64_get_lid(), which is
functionally identical to

(ia64_getreg(_IA64_REG_CR_LID) >> 16) & 0xffff

so there's no need for two versions of this code.
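
For readers unfamiliar with the idiom, the expression above is a plain
shift-and-mask that selects bits 16..31 of the register value. A small
userspace sketch follows; the function name is hypothetical and the value
passed in is an ordinary integer, not a real register read.

```c
#include <stdint.h>

/* Select bits 16..31 of a 64-bit register image: shift down, then mask. */
uint16_t demo_bits_16_to_31(uint64_t reg)
{
	return (uint16_t)((reg >> 16) & 0xffff);
}
```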

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  [CRYPTO] xcbc: Fix crash with IPsec
  [CRYPTO] xts: Use proper alignment
  [CRYPTO] digest: Include internal.h for prototypes
  [CRYPTO] authenc: Add missing Kconfig dependency on BLKCIPHER
  [CRYPTO] skcipher: Move chainiv/seqiv into crypto_blkcipher module

Merge branch 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6

* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6:
[XFS] fix inode leak in xfs_iget_core()
[XFS] xfsaild causing too many wakeups

Really unexport asm/page.h

Commit ed7b1889da256977574663689b598d88950bbd23 removed page.h from
include/asm-generic/Kbuild so that it shouldn't get exported.

However, it was redundantly listed in asm-mn10300/Kbuild and
asm-x86/Kbuild too. Remove those as well, so it really stops being
exported on those architectures. Also remove the redundant listing of
ptrace.h and termios.h from mn10300.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[CRYPTO] xcbc: Fix crash with IPsec

When using aes-xcbc-mac for authentication in IPsec,
the kernel crashes. It seems this algorithm doesn't
account for the space IPsec may make in scatterlist for authtag.
Thus when crypto_xcbc_digest_update2() gets called,
nbytes may be less than sg[i].length.
Since nbytes is an unsigned number, it wraps
at the end of the loop, allowing us to go back
into the loop and causing a crash in memcpy.
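
The wrap described above can be illustrated with a small userspace sketch.
It is not the xcbc code: the segment struct and function name are
hypothetical, and clamping each step to the bytes still wanted is simply
one common way to keep an unsigned counter from going below zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one scatterlist entry. */
struct demo_seg {
	size_t length;
};

/*
 * Walk the segments, consuming at most nbytes in total. Each step is
 * clamped to the bytes still wanted, so the subtraction cannot wrap.
 */
size_t demo_consume(const struct demo_seg *segs, size_t nsegs, size_t nbytes)
{
	size_t done = 0;
	size_t i;

	assert(segs != NULL || nsegs == 0);
	for (i = 0; i < nsegs && nbytes > 0; i++) {
		size_t len = segs[i].length < nbytes ? segs[i].length : nbytes;

		done += len;
		nbytes -= len;  /* len <= nbytes, so this never wraps */
	}
	return done;
}
```

With segments of 16 and 64 bytes and nbytes of 20, an unclamped loop would
compute 4 - 64 on the second pass and end up with a huge unsigned value;
the clamped version stops after exactly 20 bytes.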

I used update function in digest.c to model this fix.
Please let me know if it looks ok.

Signed-off-by: Joy Latten <latten@austin.ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

[CRYPTO] xts: Use proper alignment

The XTS blockmode uses a copy of the IV which is saved on the stack
and may or may not be properly aligned. If it is not, it will break
hardware ciphers such as the geode or padlock.
This patch encrypts the IV in place so we don't have to worry about
alignment.

Signed-off-by: Sebastian Siewior <sebastian@breakpoint.cc>
Tested-by: Stefan Hellermann <stefan@the2masters.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

sh: Fix up the sh64 build.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

sh: Fix up SH7710 VoIP-GW build.

The only board-specific bits that existed here were for setting up the
IRQs, which are now handled by the SH7710 CPU support code instead. As
there's nothing else to do for setup, kill off the board support code
and have the defconfig use the generic machvec instead.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

sh: Flag PMB support as EXPERIMENTAL.

There's still work that needs to be done here, and this should not be
enabled by default on existing boards.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

sh: Update r7780mp defconfig.

This disables the PMB/32BIT=y by default in r7780mp, as turning this on
presently results in build errors (for an admittedly experimental
feature).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

[XFS] fix inode leak in xfs_iget_core()

If the radix_tree_preload() fails, we need to destroy the inode we just
read in before trying again. This could leak xfs_vnode structures when
there is memory pressure. Noticed by Christoph Hellwig.

SGI-PV: 977823
SGI-Modid: xfs-linux-melb:xfs-kern:30606a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>

[XFS] xfsaild causing too many wakeups

Idle state is not being detected properly by the xfsaild push code. The
current idle state is detected by an empty list, which may never happen
with a mostly idle filesystem or one using lazy superblock counters. A
single dirty item in the list that exists beyond the push target can
result in repeated looping attempting to push up to the target, because
the code fails to check whether the push target has been achieved.

Fix by considering a dirty list with everything past the target as an idle
state and set the timeout appropriately.

SGI-PV: 977545
SGI-Modid: xfs-linux-melb:xfs-kern:30532a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>

fb: hitfb: Balance probe/remove section annotations.

hitfb presently has probe using __init whilst remove uses __devexit.
As this device can't possibly be hotplugged, switch to __exit and
__exit_p() instead.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

sh: hp6xx: Fix up hp6xx_apm build failure.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

fb: pvr2fb: Fix up remaining section mismatch.

Building with CONFIG_DEBUG_SECTION_MISMATCH=y reports:

CC drivers/video/pvr2fb.o
LD drivers/video/built-in.o
WARNING: drivers/video/built-in.o(.text+0xb9b0): Section mismatch in reference from the function pvr2fb_check_var() to the variable .devinit.data:pvr2_fix
The function pvr2fb_check_var() references
the variable __devinitdata pvr2_fix.
This is often because pvr2fb_check_var lacks a __devinitdata
annotation or the annotation of pvr2_fix is wrong.

This is obviously crap as no such reference exists, but it's a bit
closer to reality than older versions, which blamed the PCI table. The
real problem was a reference to pvr2_var.vmode from pvr2fb_check_var(),
as pvr2_var is flagged as __devinitdata (pvr2_fix is also, so at least
that part is right).

pvr2_var.vmode is just a fancy way of saying FB_VMODE_NONINTERLACED, so
we just reference that explicitly instead.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

sh: Fix up section mismatches.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

sh: hp6xx: Correct APM output.

This patch fixes the old non-verbose hp6xx apm code and enables some
very basic apm output. We now get percentage (battery) output
and basic time estimate.

Signed-off-by: Kristoffer Ericson <kristoffer.ericson@gmail.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

sh: update se7780 defconfig

This patch updates se7780_defconfig

Signed-off-by: Yusuke Goda <goda.yusuke@renesas.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

sh: replace remaining __FUNCTION__ occurrences

__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

sh: export copy-page() to modules

ERROR: "copy_page" [fs/unionfs/unionfs.ko] undefined!

like all the other architectures.

Cc: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

sh_ksyms_32.c update for gcc 4.3

This patch fixes the following build error with landisk_defconfig when
using gcc 4.3:

<--  snip  -->

...
  MODPOST 50 modules
ERROR: "__udivsi3_i4i" [net/sunrpc/sunrpc.ko] undefined!
ERROR: "__udivsi3_i4i" [net/appletalk/appletalk.ko] undefined!
ERROR: "__udivsi3_i4i" [fs/ufs/ufs.ko] undefined!
ERROR: "__udivsi3_i4i" [fs/ntfs/ntfs.ko] undefined!
ERROR: "__sdivsi3_i4i" [fs/ntfs/ntfs.ko] undefined!
ERROR: "__udivsi3_i4i" [fs/nfsd/nfsd.ko] undefined!
ERROR: "__sdivsi3_i4i" [fs/nfsd/nfsd.ko] undefined!
ERROR: "__udivsi3_i4i" [fs/nfs/nfs.ko] undefined!
ERROR: "__udivsi3_i4i" [fs/lockd/lockd.ko] undefined!
ERROR: "__udivsi3_i4i" [drivers/usb/storage/usb-storage.ko] undefined!
ERROR: "__sdivsi3_i4i" [drivers/usb/serial/pl2303.ko] undefined!
ERROR: "__udivsi3_i4i" [drivers/usb/serial/pl2303.ko] undefined!
ERROR: "__sdivsi3_i4i" [drivers/usb/serial/ftdi_sio.ko] undefined!
ERROR: "__udivsi3_i4i" [drivers/usb/misc/sisusbvga/sisusbvga.ko] undefined!
ERROR: "__sdivsi3_i4i" [drivers/usb/misc/sisusbvga/sisusbvga.ko] undefined!
ERROR: "__udivsi3_i4i" [drivers/media/video/v4l1-compat.ko] undefined!
ERROR: "__sdivsi3_i4i" [drivers/media/video/v4l1-compat.ko] undefined!
ERROR: "__sdivsi3_i4i" [drivers/media/video/usbvideo/vicam.ko] undefined!
ERROR: "__udivsi3_i4i" [drivers/media/video/usbvideo/usbvideo.ko] undefined!
ERROR: "__sdivsi3_i4i" [drivers/media/video/usbvideo/usbvideo.ko] undefined!
ERROR: "__udivsi3_i4i" [drivers/media/video/sn9c102/sn9c102.ko] undefined!
ERROR: "__sdivsi3_i4i" [drivers/media/video/sn9c102/sn9c102.ko] undefined!
ERROR: "__sdivsi3_i4i" [drivers/media/video/se401.ko] undefined!
ERROR: "__sdivsi3_i4i" [drivers/media/video/pwc/pwc.ko] undefined!
ERROR: "__udivsi3_i4i" [drivers/md/raid0.ko] undefined!
ERROR: "__udivsi3_i4i" [drivers/md/md-mod.ko] undefined!
ERROR: "__sdivsi3_i4i" [drivers/md/md-mod.ko] undefined!
ERROR: "__udivsi3_i4i" [drivers/md/linear.ko] undefined!
ERROR: "__sdivsi3_i4i" [drivers/hid/usbhid/usbhid.ko] undefined!
make[2]: *** [__modpost] Error 1

<--  snip  -->

Signed-off-by: Adrian Bunk <adrian.bunk@movial.fi>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

sh/mm/pg-sh7705.c must #include <linux/fs.h>

This patch fixes the following compile error:

<--  snip  -->

...
  CC      arch/sh/mm/pg-sh7705.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/arch/sh/mm/pg-sh7705.c: In function 'ptep_get_and_clear':
/home/bunk/linux/kernel-2.6/git/linux-2.6/arch/sh/mm/pg-sh7705.c:130: error: implicit declaration of function 'mapping_writably_mapped'
make[2]: *** [arch/sh/mm/pg-sh7705.o] Error 1

<--  snip  -->

Signed-off-by: Adrian Bunk <adrian.bunk@movial.fi>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

[Blackfin] arch: current_l1_stack_save is a pointer, so use NULL rather than 0

Signed-off-by: Mike Frysinger <vapier.adi@gmail.com>
Signed-off-by: Bryan Wu <cooloney@kernel.org>

Merge branch 'for-linus' of git://git.infradead.org/~dedekind/ubi-2.6

* 'for-linus' of git://git.infradead.org/~dedekind/ubi-2.6:
  UBI: mtd/ubi/vtbl.c: fix memory leak
  UBI: fix sparse errors in ubi.h
  UBI: fix error message
  UBI: silence warning

parisc: fix IOMMU's device boundary overflow bug on 32bits arch

On 32-bit boxes, boundary_size becomes zero due to an overflow and we
hit the BUG_ON in iommu_is_span_boundary.
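
One way such an overflow can arise is sketched below in userspace C. The
assumption, not confirmed by the text above, is that the boundary size is
derived by adding one to an all-ones 32-bit segment boundary mask; the
function names and the 4KB page shift are hypothetical.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12

/* 32-bit arithmetic: mask + 1 wraps to 0 when the mask is 0xffffffff. */
uint32_t demo_boundary_pages_32(uint32_t seg_boundary_mask)
{
	return (seg_boundary_mask + 1) >> DEMO_PAGE_SHIFT;
}

/* Widen before adding, so the carry out of bit 31 is kept. */
uint64_t demo_boundary_pages_64(uint32_t seg_boundary_mask)
{
	return ((uint64_t)seg_boundary_mask + 1) >> DEMO_PAGE_SHIFT;
}
```

The 32-bit variant returns 0 for a 4GB boundary, which is the kind of value
a sanity check on the boundary size would reject; the widened variant
returns the expected page count.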

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Acked-by: Grant Grundler <grundler@parisc-linux.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cpusets: fix obsolete comment

mm migration is no longer done in cpuset_update_task_memory_state() so it
can no longer take current->mm->mmap_sem, so fix the obsolete comment.

[ This changed in commit 04c19fa6f16047abff2288ddbc1f0798ede5a849
("cpuset: migrate all tasks in cpuset at once") when the mm migration
was moved from cpuset_update_task_memory_state() to update_nodemask() ]

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6: (27 commits)
  [SCSI] mpt fusion: don't oops if NumPhys==0
  [SCSI] iscsi class: regression - fix races with state manipulation and blocking/unblocking
  [SCSI] qla4xxx: regression - add start scan callout
  [SCSI] qla4xxx: fix host reset dpc race
  [SCSI] tgt: fix build errors when dprintk is defined
  [SCSI] tgt: set the data length properly
  [SCSI] tgt: stop zero'ing scsi_cmnd
  [SCSI] ibmvstgt: set up scsi_host properly before __scsi_alloc_queue
  [SCSI] docbook: fix fusion source files
  [SCSI] docbook: fix scsi source file
  [SCSI] qla2xxx: Update version number to 8.02.00-k9.
  [SCSI] qla2xxx: Correct usage of inconsistent timeout values while issuing ELS commands.
  [SCSI] qla2xxx: Correct discrepancies during OVERRUN handling on FWI2-capable cards.
  [SCSI] qla2xxx: Correct needless clean-up resets during shutdown.
  [SCSI] arcmsr: update version and changelog
  [SCSI] ps3rom: disable clustering
  [SCSI] ps3rom: fix wrong resid calculation bug
  [SCSI] mvsas: fix phy sas address
  [SCSI] gdth: fix to internal commands execution
  [SCSI] gdth: bugfix for the at-exit problems
  ...

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6:
NFS: use new LSM interfaces to explicitly set mount options
LSM/SELinux: Interfaces to allow FS to control mount options

Merge branch 'fixes-25' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq

* 'fixes-25' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
  [CPUFREQ] fix section mismatch warnings
  [CPUFREQ] Remove debugging message from e_powersaver
  [CPUFREQ] Fix missing cpufreq_cpu_put() call in ->store
  [CPUFREQ] Fix missing cpufreq_cpu_put() call in ->show

Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6

* 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6:
  [S390] incorrect reipl nss name.
  [S390] Load disabled wait psw if reipl fails.
  [S390] Fix IPL from NSS.
  [S390] zcrypt: fix ap_device_list handling
  [S390] sclp_vt220: speed up console output for interactive work
  [S390] dasd: fix reference counting in display method for proc/dasd/devices
  [S390] dasd: let dasd erp matching recognize alias recovery
  [S390] Get rid of memcpy gcc warning workaround.
  [S390] idle: Fix machine check handling in idle loop.
  [S390] Update default configuration.

[IA64] arch_ptrace() cleanup

Remove duplicate code, clean up goto's and indentation.

Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Tony Luck <tony.luck@intel.com>

[IA64] remove duplicate code from arch_ptrace()

Remove all code which does exactly the same thing as ptrace_request().

Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Tony Luck <tony.luck@intel.com>

[IA64] convert sys_ptrace to arch_ptrace

Convert sys_ptrace() to arch_ptrace().

Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Tony Luck <tony.luck@intel.com>

[IA64] remove find_thread_for_addr()

find_thread_for_addr() is no longer needed. It was only used to find
the correct kernel RBS for a given memory address, but since the kernel
RBS is not needed any longer, this function can go away.

Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Tony Luck <tony.luck@intel.com>

[IA64] do not sync RBS when changing PT_AR_BSP or PT_CFM

Syncing is no longer needed, because user RBS is already
up-to-date. Actually, if a debugger modified the contents
of the original RBS prior to changing PT_AR_BSP, the
modifications would get overwritten.

Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Tony Luck <tony.luck@intel.com>

[IA64] access user RBS directly

Because the user RBS of a process is now completely stored in
user-mode when the process is ptrace-stopped, accesses to the
RBS should no longer augment any part of the kernel RBS.

This means we can get rid of most ia64_peek() and ia64_poke()
calls.

Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Tony Luck <tony.luck@intel.com>

NFS: use new LSM interfaces to explicitly set mount options

NFS and SELinux worked together previously because SELinux had NFS
specific knowledge built in.  This design was approved by both groups
back in 2004 but the recent NFS changes to use nfs_parsed_mount_data and
the usage of nfs_clone_mount_data showed this to be a poor fragile
solution.  This patch fixes the NFS functionality regression by making
use of the new LSM interfaces to allow an FS to explicitly set its own
mount options.

The explicit setting of mount options is done in the nfs get_sb
functions which are called before the generic vfs hooks try to set mount
options for filesystems which use text mount data.

This does not currently support NFSv4 as that functionality did not
exist in previous kernels and thus there is no regression.  I will be
adding the needed code, which I believe to be the exact same as the v3
code, in nfs4_get_sb for 2.6.26.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: James Morris <jmorris@namei.org>

LSM/SELinux: Interfaces to allow FS to control mount options

Introduce new LSM interfaces to allow an FS to deal with their own mount
options.  This includes a new string parsing function exported from the
LSM that an FS can use to get a security data blob and a new security
data blob.  This is particularly useful for an FS which uses binary
mount data, like NFS, which does not pass strings into the vfs to be
handled by the loaded LSM.  Also fix a BUG() in both SELinux and SMACK
when dealing with binary mount data.  If the binary mount data is less
than one page the copy_page() in security_sb_copy_data() can cause an
illegal page fault and boom.  Remove all NFSisms from the SELinux code
since they were broken by past NFS changes.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: James Morris <jmorris@namei.org>

[SCSI] mpt fusion: don't oops if NumPhys==0

Don't oops if NumPhys==0, instead return -ENODEV.
This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=9909

Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Acked-by: Eric Moore <Eric.Moore@lsi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

[CPUFREQ] fix section mismatch warnings

Fix the following warnings:
WARNING: vmlinux.o(.text+0xfe6711): Section mismatch in reference from the function cpufreq_unregister_driver() to the variable .cpuinit.data:cpufreq_cpu_notifier
WARNING: vmlinux.o(.text+0xfe68af): Section mismatch in reference from the function cpufreq_register_driver() to the variable .cpuinit.data:cpufreq_cpu_notifier
WARNING: vmlinux.o(.exit.text+0xc4fa): Section mismatch in reference from the function cpufreq_stats_exit() to the variable .cpuinit.data:cpufreq_stat_cpu_notifier

The warnings were caused by references to unregister_hotcpu_notifier()
from normal functions or exit functions.
This is flagged by modpost as a potential error because
it does not know that for the non HOTPLUG_CPU
scenario the unregister_hotcpu_notifier() is a nop.
Silence the warning by replacing the __initdata
annotation with a __refdata annotation.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>

[CPUFREQ] Remove debugging message from e_powersaver

We don't need to printk a message every time we transition.
Leave the code there, but ifdef'd out, as it's useful when
adding support for new processors.

Reported-by: Petr Titěra <P.Titera@century.cz>
Signed-off-by: Dave Jones <davej@redhat.com>

[CPUFREQ] Fix missing cpufreq_cpu_put() call in ->store

refactor to use gotos instead of explicit exit paths
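
The goto-based exit pattern can be sketched as follows. This is a generic
userspace illustration with a hypothetical refcounted object; it does not
reproduce the cpufreq code, and demo_get()/demo_put() merely stand in for a
get/put pair.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical refcounted object. */
struct demo_policy {
	int refcount;
	int value;
};

struct demo_policy *demo_get(struct demo_policy *p)
{
	if (p)
		p->refcount++;
	return p;
}

void demo_put(struct demo_policy *p)
{
	p->refcount--;
}

/* Every path after a successful get leaves through the label that does the put. */
int demo_store(struct demo_policy *p, int new_value)
{
	int ret = -EINVAL;
	struct demo_policy *ref = demo_get(p);

	if (!ref)
		goto no_policy;
	if (new_value < 0)
		goto fail;

	ref->value = new_value;
	ret = 0;
fail:
	demo_put(ref);
no_policy:
	return ret;
}
```

Because the error path and the success path share one label, adding a new
early exit later cannot silently skip the put.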

Signed-off-by: Dave Jones <davej@redhat.com>

[CPUFREQ] Fix missing cpufreq_cpu_put() call in ->show

refactor to use gotos instead of explicit exit paths

Signed-off-by: Dave Jones <davej@redhat.com>

[SCSI] iscsi class: regression - fix races with state manipulation and blocking/unblocking

For qla4xxx, we could be starting a session, but some error (network,
target, IO from a device that got started, etc.) could cause the session
to fail, and the block/unblock and state manipulation could then race
with each other. This patch just has those operations done in the
single-threaded iscsi eh work queue, so that they are serialized.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

[SCSI] qla4xxx: regression - add start scan callout

We are seeing EXIST errors from sysfs during device addition.
We need a start scan callout so we do not start scanning sessions
found during hba setup, before the async scsi scan code is ready.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: David C Somayajulu <david.somayajulu@qlogic.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

[SCSI] qla4xxx: fix host reset dpc race

The host reset callout could be starting to reset the hba at the same
time the dpc thread is. This creates lots of problems because they both
want to do weird things with the firmware and interrupts, etc.

This patch just has the host reset function fully shutdown the dpc
thread before resetting the hba.

This patch also moves the setting of the session online bit to fix
a potential race with the dpc thread and iscsi recovery thread.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: David C Somayajulu <david.somayajulu@qlogic.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

ahci: work around ATI SB600 h/w quirk

This addresses the recent ATI SB600 errata, where the hardware does
not like 256-length PRD entries during FPDMA (aka NCQ).

It hurts performance on SB600, but it is more important to get a
correct patch eliminating the data corruption/lockups, and then later
on tune for performance.

We simply limit each command to a maximum of 255 sectors, on SB600.

Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

pata_hpt*, pata_serverworks: fix UDMA masking

When masking, mask out the modes that are unsupported, not the ones
that are supported. This makes life happier.
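
A minimal sketch of the difference, in userspace C. The bit layout (one bit
per UDMA mode, bit 0 for UDMA0) and the function name are assumptions for
illustration and do not reproduce the driver code.

```c
/* Clear the given modes from a mode bitmask and keep everything else. */
unsigned int demo_udma_filter(unsigned int modes, unsigned int to_clear)
{
	return modes & ~to_clear;
}
```

With modes 0x7f (UDMA0-6 offered) and a controller limited to UDMA0-2,
passing the unsupported set 0x78 leaves 0x07 as intended. Passing the
supported set 0x07 by mistake leaves 0x78, which is exactly the set of
modes that must not be used.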

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[S390] incorrect reipl nss name.

/sys/firmware/reipl/nss/name contains the nss name when defsys or
savesys command has been executed. If the defsys or savesys command
fails the kernel_nss_name has to be cleared since a reipl on that
nss name won't be possible.

Signed-off-by: Hongjie Yang <hongjie@us.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] Load disabled wait psw if reipl fails.

Normally this should not happen, but it's cleaner to do it that way.

Signed-off-by: Michael Holzheu <holzheu@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] Fix IPL from NSS.

IPL from NSS didn't work because the memory detection routine omits any
memory sections with a size lower than what MAX_ORDER defines.
This causes the detection routine to skip the first memory segment, which
has a size of 1MB, and the kernel later concludes that there
is no memory available at all.
Since, in addition, the z/VM memory increment size is 1MB, force MAX_ORDER
to be 9 so we can support 1MB segments.
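
The arithmetic behind the value 9 can be checked with a short sketch. It
assumes 4KB pages and the convention of kernels of that era, where the
largest allocation order is MAX_ORDER - 1; the function name is
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of the largest block: one page shifted by (max_order - 1). */
uint64_t demo_max_block_bytes(unsigned int page_shift, unsigned int max_order)
{
	assert(max_order >= 1);
	return (uint64_t)1 << (page_shift + max_order - 1);
}
```

Under these assumptions, a MAX_ORDER of 9 gives 4KB << 8 = 1MB, matching
the segment size, while a MAX_ORDER of 11 would give 4MB, so a 1MB segment
would fall below the threshold.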

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] zcrypt: fix ap_device_list handling

In ap_device_probe() we can add the new ap device to the internal
device list only if the device probe function successfully returns.
Otherwise we might end up with an invalid device in the internal ap
device list.

Signed-off-by: Ralph Wuerthner <rwuerthn@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] sclp_vt220: speed up console output for interactive work

Currently an output buffer can wait up to HZ/2 until the buffer is
flushed. The wait time is noticeable in interactive tools like mc.

Change the value to HZ/20, which seems enough for interactive work.
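
As a rough check of what the two values mean in wall-clock time, the sketch below converts a tick count into milliseconds. The helper names are hypothetical, and the HZ value depends on the kernel configuration.

```python
def ticks_to_ms(ticks: int, hz: int) -> float:
    """Convert a number of timer ticks to milliseconds for a given HZ."""
    return ticks * 1000 / hz


def flush_delay_ms(hz: int, divisor: int) -> float:
    """Worst-case buffer wait for a timeout of HZ/divisor ticks."""
    return ticks_to_ms(hz // divisor, hz)
```

For HZ values that divide evenly, HZ/2 corresponds to 500 ms and HZ/20 to 50 ms, independent of the actual tick rate.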

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] dasd: fix reference counting in display method for proc/dasd/devices

Using the /proc/dasd/devices interface leaves the reference counter
of alias devices in an inconsistent state. A process that tries to set
such a device offline afterwards will hang.
The dasd_devices_show function returns immediately for alias devices
and this code path was missing a dasd_put_device call.
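
The pattern at issue is that every reference taken must be released on every exit path, including early returns. The following is a minimal model with hypothetical names; it is not the dasd code.

```python
class Device:
    def __init__(self, is_alias: bool):
        self.is_alias = is_alias
        self.refcount = 0


def get_device(dev: Device) -> Device:
    dev.refcount += 1
    return dev


def put_device(dev: Device) -> None:
    dev.refcount -= 1


def show_device(dev: Device) -> str:
    get_device(dev)
    if dev.is_alias:
        # Early return: the reference must be dropped here as well,
        # otherwise the count stays elevated forever.
        put_device(dev)
        return ""
    line = "base device"
    put_device(dev)
    return line
```

If the put on the alias path is left out, the count never returns to zero, which matches the symptom of a later "set offline" waiting indefinitely.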

Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] dasd: let dasd erp matching recognize alias recovery

When a request fails that was started on an alias device then the
first recovery step is to retry it on the base device. If the
recovery request fails again with the same symptoms, the next step
should not be a simple retry, but should be a proper recovery based
on sense data, etc. To do so, the dasd recovery functions need to
recognize the alias recovery step in the erp chain by comparing
the start devices.

Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] Get rid of memcpy gcc warning workaround.

Compile smp.o with -Wno-nonnull so gcc stops warning about memcpy
being used with a null parameter. Also remove the workaround code
and use a char * cast instead of a void * cast to do computations.

Cc: Bastian Blank <bastian@waldi.eu.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] idle: Fix machine check handling in idle loop.

If a machine check is pending when the idle loop is entered,
default_idle will be left with timer ticks and virtual timer disabled.
Fix this by "calling" the idle_chain. Also a BUG_ON(!in_interrupt) in
start_hz_timer must be removed since the function now gets called from
non interrupt context as well.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

[S390] Update default configuration.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

[CRYPTO] digest: Include internal.h for prototypes

Every file should include the headers containing the externs for its
global code (in this case for struct crypto_{init,exit}_digest_ops()).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Linux 2.6.25-rc4

module: allow ndiswrapper to use GPL-only symbols

A change after 2.6.24 broke ndiswrapper by accidentally removing its
access to GPL-only symbols. Revert that change and add comments about
the reasons why ndiswrapper and driverloader are treated in a special
way.

Signed-off-by: Pavel Roskin <proski@gnu.org>
Acked-by: Greg KH <gregkh@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jon Masters <jonathan@jonmasters.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (22 commits)
  [IPCONFIG]: The kernel gets no IP from some DHCP servers
  b43legacy: Fix module init message
  rndis_wlan: fix broken data copy
  libertas: compare the current command with response
  libertas: fix sanity check on sequence number in command response
  p54: fix eeprom parser length sanity checks
  p54: fix EEPROM structure endianness
  ssb: Add pcibios_enable_device() return value check
  rc80211-pid: fix rate adjustment
  [ESP]: Add select on AUTHENC
  [TCP]: Improve ipv4 established hash function.
  [NETPOLL]: Revert two bogus cleanups that broke netconsole.
  [PPPOL2TP]: Add missing sock_put() in pppol2tp_tunnel_closeall()
  Subject: [PPPOL2TP] add missing sock_put() in pppol2tp_recv_dequeue()
  [BLUETOOTH]: l2cap info_timer delete fix in hci_conn_del
  [NET]: Fix race in generic address resolution.
  iucv: fix build error on !SMP
  [TCP]: Must count fack_count also when skipping
  [TUN]: Fix RTNL-locking in tun/tap driver
  [SCTP]: Use proc_create to setup de->proc_fops.
  ...

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6:
  [SPARC]: Fix link errors with gcc-4.3
  sparc64: replace remaining __FUNCTION__ occurances
  sparc: replace remaining __FUNCTION__ occurances
  [SPARC]: Add reboot_command[] extern decl to asm/system.h
  [SPARC]: Mark linux_sparc_{fpu,chips} static.

[IPCONFIG]: The kernel gets no IP from some DHCP servers

From: Stephen Hemminger <shemminger@linux-foundation.org>

Based upon a patch by Marcel Wappler:

   This patch fixes a DHCP issue of the kernel: some DHCP servers
   (e.g. in the Linksys WRT54Gv5) are very strict about the contents
   of the DHCPDISCOVER packet they receive from clients.

   Table 5 in RFC2131 page 36 requests the fields 'ciaddr' and
   'siaddr' MUST be set to '0'.  These DHCP servers ignore Linux
   kernel's DHCP discovery packets with these two fields set to
   '255.255.255.255' (in contrast to popular DHCP clients, such as
   'dhclient' or 'udhcpc').  This leads to a system that does not boot.
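
For illustration, the fixed part of a BOOTP/DHCP request with 'ciaddr' and 'siaddr' set to zero could be packed as below. This is a sketch of the RFC 2131 field layout only; the function name is hypothetical, and the magic cookie and DHCP options that follow the header are omitted.

```python
import socket
import struct


def build_discover_header(xid: int, mac: bytes) -> bytes:
    """Pack the 236-byte fixed BOOTP header for a DHCPDISCOVER."""
    zero = socket.inet_aton("0.0.0.0")
    return struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,      # op: BOOTREQUEST
        1,      # htype: Ethernet
        6,      # hlen: MAC address length
        0,      # hops
        xid,    # transaction id
        0,      # secs
        0,      # flags
        zero,   # ciaddr: zero in DHCPDISCOVER
        zero,   # yiaddr
        zero,   # siaddr: zero in DHCPDISCOVER
        zero,   # giaddr
        mac,    # chaddr, padded to 16 bytes by struct
        b"",    # sname
        b"",    # file
    )
```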

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/linville/wireless-2.6

Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] fix ia64 kprobes compilation
  [IA64] move gcc_intrin.h from header-y to unifdef-y
  [IA64] workaround tiger ia64_sal_get_physical_id_info hang
  [IA64] move defconfig to arch/ia64/configs/
  [IA64] Fix irq migration in multiple vector domain
  [IA64] signal(ia64_ia32): add a signal stack overflow check
  [IA64] signal(ia64): add a signal stack overflow check
  [IA64] CONFIG_SGI_SN2 - auto select NUMA and ACPI_NUMA

Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6:
  debugfs: fix sparse warnings
  Driver core: Fix cleanup when failing device_add().
  driver core: Remove dpm_sysfs_remove() from error path of device_add()
  PM: fix new mutex-locking bug in the PM core
  PM: Do not acquire device semaphores upfront during suspend
  kobject: properly initialize ksets
  sysfs: CONFIG_SYSFS_DEPRECATED fix
  driver core: fix up Kconfig text for CONFIG_SYSFS_DEPRECATED

Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/pci-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/pci-2.6:
  pci: hotplug: pciehp: fix error code path in hpc_power_off_slot
  PCI: Add DECLARE_PCI_DEVICE_TABLE macro
  PCI: fix up error messages for pci_bus registering
  PCI: fix section mismatch warning in pci_scan_child_bus
  PCI: consolidate duplicated MSI enable functions
  PCI: use dev_printk in quirk messages

Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6:
  USB: ftdi_sio - really enable EM1010PC
  USB: remove incorrect struct class_device from the printer gadget
  USB: pxa2xx_udc: fix misuse of clock enable/disable calls
  USB: ftdi_sio: Workaround for broken Matrix Orbital serial port
  USB: Add support for AXESSTEL MV110H CDMA modem
  usb-storage: update earlier scatter-gather bug fix
  USB: isp116x: fix enumeration on boot
  USB: ehci: handle large bulk URBs correctly (again)
  USB: spruce up the device blacklist
  USB: fix comment of struct usb_interface
  USB: update Kconfig entry for USB_SUSPEND
  usb: Add support for the mos7820/7840-based B&B USB/RS485 converter to mos7840.c

kprobes: fix a null pointer bug in register_kretprobe()

Fix a bug in register_kretprobe(), which does not check whether
rp->kp.symbol_name is NULL before calling kprobe_lookup_name.

For maintainability, this introduces kprobe_addr helper function which
resolves addr field. It is used by register_kprobe and register_kretprobe.
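
The idea of a single address-resolving helper can be sketched as follows. The symbol table, the function names, and the choice to reject a probe that sets both an address and a symbol name are assumptions for illustration, not a description of the kernel helper.

```python
from typing import Optional

SYMBOLS = {"do_fork": 0x1000}  # hypothetical symbol table


def lookup_name(name: str) -> Optional[int]:
    return SYMBOLS.get(name)


def probe_addr(addr: Optional[int],
               symbol_name: Optional[str],
               offset: int = 0) -> Optional[int]:
    """Resolve a probe address, looking up the symbol only if one is given."""
    if symbol_name is not None:
        if addr is not None:
            return None  # assumed policy: both set is ambiguous
        addr = lookup_name(symbol_name)
    if addr is None:
        return None  # nothing to resolve, or unknown symbol
    return addr + offset
```

Because the lookup is guarded by the symbol name check, a caller that passes only a raw address never reaches the name lookup.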

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

input: add I2C to config since the driver makes several i2c*() calls

Add to help text that the Intel I2C ICH (i801) driver is also needed
for this kernel.

Add LEDS_CLASS to config since the driver makes led_classdev_*() calls:
ERROR: "led_classdev_register" [drivers/input/misc/apanel.ko] undefined!
ERROR: "__led_classdev_unregister" [drivers/input/misc/apanel.ko] undefined!

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ext3: fix mount option parsing

The "resize" option won't be noticed as it comes after the NULL option, so if
you try to mount (or in this case remount) with that option it won't be
recognized.
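
Why an entry placed after the terminating entry is never seen can be shown with a simplified matcher. The table layout, the token values, and the exact-match comparison are assumptions for illustration; the real option table uses patterns with arguments.

```python
OPT_ERR = -1
OPT_RO = 1
OPT_RESIZE = 2


def match_token(option: str, table) -> int:
    """Scan (token, pattern) pairs; a None pattern terminates the table."""
    for token, pattern in table:
        if pattern is None:
            return token  # sentinel reached: nothing matched
        if option == pattern:
            return token
    return OPT_ERR


# Entry after the sentinel: unreachable.
broken_table = [(OPT_RO, "ro"), (OPT_ERR, None), (OPT_RESIZE, "resize")]
# Sentinel moved to the end: every entry is reachable.
fixed_table = [(OPT_RO, "ro"), (OPT_RESIZE, "resize"), (OPT_ERR, None)]
```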

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

hugetlb: fix pool shrinking while in restricted cpuset

Adam Litke noticed that currently we grow the hugepage pool independent of any
cpuset the running process may be in, but when shrinking the pool, the cpuset
is checked.  This leads to inconsistency when shrinking the pool in a
restricted cpuset -- an administrator may have been able to grow the pool on a
node restricted by a containing cpuset, but they cannot shrink it there.

There are two options: either prevent growing of the pool outside of the
cpuset or allow shrinking outside of the cpuset.  From previous discussions
on linux-mm, /proc/sys/vm/nr_hugepages is an administrative interface that
should not be restricted by cpusets.  So allow shrinking the pool by removing
pages from nodes outside of current's cpuset.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Irwin <wli@holomorphy.com>
Cc: Lee Schermerhorn <Lee.Schermerhonr@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

hugetlb: close a difficult to trigger reservation race

A hugetlb reservation may be inadequately backed in the event of racing
allocations and frees when utilizing surplus huge pages.  Consider the
following series of events in processes A and B:

A) Allocates some surplus pages to satisfy a reservation
B) Frees some huge pages
A) A notices the extra free pages and drops hugetlb_lock to free some of
    its surplus pages back to the buddy allocator.
B) Allocates some huge pages
A) Reacquires hugetlb_lock and returns from gather_surplus_huge_pages()

Avoid this by committing the reservation after pages have been allocated but
before dropping the lock to free excess pages.  For parity, release the
reservation in return_unused_surplus_pages().

This patch also corrects the cpuset_mems_nr() error path in
hugetlb_acct_memory().  If the cpuset check fails, uncommit the
reservation, but also be sure to return any surplus huge pages that may
have been allocated to back the failed reservation.

Thanks to Andy Whitcroft for discovering this.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

md: the md RAID10 resync thread could cause a md RAID10 array deadlock

This message describes another md RAID10 issue found by testing the
2.6.24 md RAID10 using the new scsi fault injection framework.

Abstract:

When a scsi error results in disabling a disk during RAID10 recovery, the
resync threads of md RAID10 could stall.

In this case, the raid array has already been broken, so it may not matter.
But I think a stall is not preferable: if it occurs, even shutdown or reboot
will fail because resources are busy.

The deadlock mechanism:

The r10bio_s structure has a "remaining" member to keep track of BIOs yet to
be handled when recovering.  The "remaining" counter is incremented when
building a BIO in sync_request() and is decremented when finishing a BIO in
end_sync_write().

If building a BIO fails for some reason in sync_request(), the "remaining"
should be decremented if it has already been incremented.  I found a case
where this decrement is forgotten.  This causes an md_do_sync() deadlock
because md_do_sync() waits for md_done_sync() called by end_sync_write(), but
end_sync_write() never calls md_done_sync() because of the "remaining" counter
mismatch.

For example, this problem can be reproduced in the following case:

Personalities : [raid10]
md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F)
      3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_]
      [>....................]  recovery =  2.2% (45376/1959808) finish=0.7min speed=45376K/sec

In this case, sdf1 is recovering, and sdb1 and sde1 are disabled.
An additional error that detaches sdd will cause a deadlock.

md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F)
      3919616 blocks 64K chunks 2 near-copies [4/1] [_U__]
      [=>...................]  recovery =  5.0% (99520/1959808) finish=5.9min speed=5237K/sec

2739 ?        S<     0:17 [md0_raid10]
28608 ?        D<     0:00 [md0_resync]
28629 pts/1    Ss     0:00 bash
28830 pts/1    R+     0:00 ps ax
31819 ?        D<     0:00 [kjournald]

The resync thread appears to keep working, but it is actually deadlocked.

Patch:
With this patch, the remaining counter is decremented when needed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

md: fix possible raid1/raid10 deadlock on read error during resync

Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for
another possible deadlock in raid1/raid10 error handling.

If a read request returns an error while a resync is happening and a resync
request is pending, the attempt to fix the error will block until the resync
progresses, and the resync will block until the read request completes. Thus
a deadlock.

This patch fixes the problem.

Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

md: don't attempt read-balancing for raid10 'far' layouts

This patch changes the disk to be read for layout "far > 1" to always be the
disk with the lowest block address.

Thus the chunks to be read will always be (for a fully functioning array) from
the first band of stripes, and the raid will then work as a raid0 consisting
of the first band of stripes.

Some advantages:

The fastest part, which is the outer sectors of the disks involved, will be
used. The outer blocks of a disk may be as much as 100% faster than the
inner blocks.

Average seek time will be smaller, as seeks will always be confined to the
first part of the disks.

Mixed disks with different performance characteristics will work better:
since they work as a raid0, the sequential read rate will be the number of
disks involved times the IO rate of the slowest disk.

If a disk is malfunctioning, the working disk with the lowest block address
for the logical block will be used.
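
The selection rule can be sketched as a small function. The tuple layout and the function name are hypothetical; this only models the choice described above, not the raid10 read path.

```python
from typing import List, Optional, Tuple

# (disk name, block address of this copy on that disk, disk is working)
Copy = Tuple[str, int, bool]


def pick_read_disk(copies: List[Copy]) -> Optional[str]:
    """Pick the working copy with the lowest block address, if any."""
    working = [c for c in copies if c[2]]
    if not working:
        return None
    return min(working, key=lambda c: c[1])[0]
```

With all disks healthy this always selects the copy in the first band of stripes; when that disk has failed, the next lowest address among the working disks is used.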

Signed-off-by: Keld Simonsen <keld@dkuug.dk>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

md: lock access to rdev attributes properly

When we access attributes of an rdev (component device on an md array) through
sysfs, we really need to lock the array against concurrent changes.  We
currently do that when we change an attribute, but not when we read an
attribute.  We need to lock when reading as well else rdev->mddev could become
NULL while we are accessing it.

So add appropriate locking (mddev_lock) to rdev_attr_show.

rdev_size_store requires some extra care as well as it needs to unlock the
mddev while scanning other mddevs for overlapping regions.  We currently
assume that rdev->mddev will still be unchanged after the scan, but that
cannot be certain.  So take a copy of rdev->mddev for use at the end of the
function.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

md: make sure a reshape is started when device switches to read-write

A resync/reshape/recovery thread will refuse to progress when the array is
marked read-only.  So whenever we mark it not read-only, it is important to
wake up the resync thread.  There is one place we didn't do this.

The problem manifests if the start_ro module parameter is set, and a raid5
array that is in the middle of a reshape (restripe) is started.  The array
will initially be semi-read-only (meaning it acts like it is readonly until
the first write).  So the reshape will not proceed.

On the first write, the array will become read-write, but the reshape will not
be started, and there is no event which will ever restart that thread.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

md: clean up irregularity with raid autodetect

When a raid1 array is stopped, all components currently get added to the list
for auto-detection. However we should really only add components that were
found by autodetection in the first place. So add a flag to record that
information, and use it.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

md: guard against possible bad array geometry in v1 metadata

Make sure the data doesn't start before the end of the superblock when the
superblock is at the start of the device.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

md: reduce CPU wastage on idle md array with a write-intent bitmap

On an md array with a write-intent bitmap, a thread wakes up every few seconds
and scans the bitmap looking for work to do. If the array is idle, there will
be no work to do, but a lot of scanning is done to discover this.

So cache the fact that the bitmap is completely clean, and avoid scanning the
whole bitmap when the cache is known to be clean.
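
The caching idea can be modelled with a flag that is cleared whenever a bit is dirtied and set once a scan has left the bitmap clean. The class and its members are hypothetical, and the scan is heavily simplified compared with the real bitmap code.

```python
class WriteIntentBitmap:
    def __init__(self, nbits: int):
        self.bits = [False] * nbits
        self.all_clean = False  # cached "nothing to do" state
        self.scans = 0          # number of full scans performed

    def mark_dirty(self, i: int) -> None:
        self.bits[i] = True
        self.all_clean = False  # invalidate the cache

    def periodic_work(self) -> None:
        if self.all_clean:
            return  # idle array: skip the scan entirely
        self.scans += 1
        for i in range(len(self.bits)):
            self.bits[i] = False  # simplification: pretend the bit was flushed
        self.all_clean = True
```

On an idle array only the first wakeup pays for a scan; later wakeups return immediately until a write dirties a bit again.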

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

md: fix deadlock in md/raid1 and md/raid10 when handling a read error

When handling a read error, we freeze the array to stop any other IO while
attempting to over-write with correct data.

This is done in the raid1d(raid10d) thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the retry
queue - these are counted in ->nr_queue and will stay there during a freeze).

However write requests need attention from raid1d as bitmap updates might be
required. This can cause a deadlock, as raid1d is waiting for requests to
finish that themselves need attention from raid1d.

So we create a new function 'flush_pending_writes' to give that attention, and
call it in freeze_array to be sure that we aren't waiting on raid1d.

Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting this
problem.

Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

iommu: parisc: make the IOMMUs respect the segment boundary limits

Make PARISC's two IOMMU implementations not allocate a memory area spanning
LLD's segment boundary.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Grant Grundler <grundler@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

iommu: parisc: pass struct device to iommu_alloc_range

This adds a struct device argument to sba_alloc_range and ccio_alloc_range, as
preparation for modifications to fix the IOMMU segment boundary problem. This
change enables ccio_alloc_range to access the LLD's segment boundary limits.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Grant Grundler <grundler@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

iommu: export iommu_is_span_boundary helper function

iommu_is_span_boundary is a primitive function, used internally in the IOMMU
helper (lib/iommu-helper.c), that judges whether a memory area spans the
LLD's segment boundary or not.

It's difficult to convert some IOMMUs to use the IOMMU helper but
iommu_is_span_boundary is still useful for them. So this patch exports it.
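
A boundary check of this kind can be sketched as below. The parameter meanings (all in units of IOMMU pages: index is the first page of the candidate area, nr its length, shift the offset of the map's base, boundary_size a power of two) are assumptions for illustration, and the function is a model rather than the helper itself.

```python
def is_span_boundary(index: int, nr: int, shift: int, boundary_size: int) -> bool:
    """Return True if pages [index, index + nr) cross a boundary multiple."""
    # The masking below only works for power-of-two boundaries.
    assert boundary_size > 0 and boundary_size & (boundary_size - 1) == 0
    start = (shift + index) & (boundary_size - 1)  # offset within the segment
    return start + nr > boundary_size
```

An allocator that cannot use the common helper could call such a check on each candidate range and skip ahead when it returns true.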

This is needed for the parisc iommu fixes.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Grant Grundler <grundler@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>