[PATCH] i386: allow disabling X86_FEATURE_SEP at boot

Allow the x86 "sep" feature to be disabled at bootup. This forces use of the
int80 vsyscall. Mainly for testing or benchmarking the int80 vsyscall code.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] i386: __devinit should be __cpuinit

Several places in arch/i386/kernel/cpu and kernel/cpu were using __devinit
when they should have been __cpuinit. Fixing that saves ~4K when
CONFIG_HOTPLUG && !CONFIG_HOTPLUG_CPU.

Noticed by Andrew Morton.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] x86: SMP alternatives

Implement SMP alternatives, i.e.  switching at runtime between different
code versions for UP and SMP.  The code can patch both SMP->UP and UP->SMP.
The UP->SMP case is useful for CPU hotplug.

With CONFIG_HOTPLUG_CPU enabled the code switches to UP at boot time and
when the number of CPUs goes down to 1, and switches to SMP when the number
of CPUs goes up to 2.

Without CONFIG_HOTPLUG_CPU or on non-SMP-capable systems the code is
patched once at boot time (if needed) and the tables are released
afterwards.

The changes in detail:

  * The current alternatives bits are moved to a separate file,
    the SMP alternatives code is added there.

  * The patch adds some new elf sections to the kernel:

    .smp_altinstructions
        like .altinstructions, also contains a list
        of alt_instr structs.
    .smp_altinstr_replacement
        like .altinstr_replacement, but also has some space to
        save the original instruction before replacing it.
    .smp_locks
        list of pointers to lock prefixes which can be nop'ed
        out on UP.

    The first two are used to replace more complex instruction
    sequences such as spinlocks and semaphores.  It would be possible
    to deal with the lock prefixes with that as well, but by handling
    them as a special case the table sizes become much smaller.

  * The sections are page-aligned and padded up to page size, so they
    can be freed if they are not needed.

  * Split the code to release init pages into a separate function and
    use it to release the elf sections if they are unused.
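
As a rough illustration of the .smp_locks idea, the sketch below patches a
plain byte buffer rather than live kernel text. The names patch_locks_up()
and patch_locks_smp() and the opcode constants are assumptions made for this
example; they are not taken from the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_PREFIX 0xf0 /* x86 "lock" prefix byte */
#define NOP_OPCODE  0x90 /* x86 one-byte nop */

/* SMP -> UP: overwrite every recorded lock prefix with a nop. */
void patch_locks_up(uint8_t **locks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		/* Each table entry must point at a prefix or an earlier nop. */
		assert(*locks[i] == LOCK_PREFIX || *locks[i] == NOP_OPCODE);
		*locks[i] = NOP_OPCODE;
	}
}

/* UP -> SMP: put the lock prefixes back, e.g. when a second CPU comes up. */
void patch_locks_smp(uint8_t **locks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		assert(*locks[i] == LOCK_PREFIX || *locks[i] == NOP_OPCODE);
		*locks[i] = LOCK_PREFIX;
	}
}
```

Because the table holds only one pointer per lock prefix, it stays much
smaller than a full alt_instr entry per instruction would be.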

Signed-off-by: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] i386: multi-column stack backtraces

Print stack backtraces in multiple columns, saving screen space.  Number of
columns is configurable and defaults to one so behavior is
backwards-compatible.

Also removes the brackets around addresses when printing more
than one entry per line so they print as:
    <address>
instead of:
    [<address>]
This helps multiple entries fit better on one line.
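
A minimal sketch of the layout rule described above, written as a userspace
formatter. The function name format_trace() and the exact separators are
assumptions for illustration, not the kernel's printing code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format n addresses into buf, `cols` entries per line.
 * One column keeps the bracketed form; more columns drop the brackets.
 * Returns the number of characters written (excluding the NUL).
 */
size_t format_trace(char *buf, size_t size, const unsigned long *addr,
		    size_t n, unsigned int cols)
{
	size_t len = 0;

	assert(cols >= 1 && size > 0);
	buf[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		int end_of_row = ((i + 1) % cols == 0) || (i + 1 == n);
		int w = snprintf(buf + len, size - len,
				 cols == 1 ? "[<%08lx>]%s" : "<%08lx>%s",
				 addr[i], end_of_row ? "\n" : " ");

		if (w < 0 || (size_t)w >= size - len) {
			buf[len] = '\0';	/* out of space: drop the partial entry */
			break;
		}
		len += (size_t)w;
	}
	return len;
}
```

With cols set to 1 the output matches the old bracketed, one-per-line form,
which is what keeps the default backwards-compatible.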

Original idea by Dave Jones, taken from x86_64.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] Make CONFIG_REGPARM enabled by default

Make CONFIG_REGPARM enabled by default. It's a noticeable win both for size
and for performance, and gcc[34] handles it correctly.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] i386: let REGPARM no longer depend on EXPERIMENTAL

REGPARM has already gotten much testing, what about removing the
dependency on EXPERIMENTAL?

Additionally, this patch does:
- remove the useless "default n"
- remove note regarding binary only modules (nowadays, there are even
some binary only modules compiled with REGPARM=y available)

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] Bug fixes and cleanup for the BSD Secure Levels LSM

This patch addresses several issues in the current BSD Secure Levels code:

o plaintext_to_sha1: Missing check for a NULL return from __get_free_page

o passwd_write_file: A page is leaked if the password is wrong.

o fix securityfs registration order

o seclvl_init is a mess and can't properly tolerate failures, failure
path is upside down (deldif and delf should be switched)

Cleanups:

o plaintext_to_sha1: Use buffers passed in
o passwd_write_file: Use kmalloc() instead of get_zeroed_page()
o passwd_write_file: hashedPassword comparison is just memcmp
o s/ENOSYS/EINVAL/
o misc

(akpm: after some discussion it appears that the BSD secure levels feature
should be scheduled for removal. But for now, let's fix these problems up).

Signed-off-by: Davi Arnaut <davi.arnaut@gmail.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@epoch.ncsc.mil>
Cc: James Morris <jmorris@namei.org>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] cciss: fix use-after-free in cciss_init_one

free_hba() sets hba[i] to NULL, the dereference afterwards results in this
crash.  Setting busy_initializing to 0 actually looks unnecessary, but I'm
not entirely sure, which is why I left it in.

cciss: controller appears to be disabled
Unable to handle kernel NULL pointer dereference at virtual address 00000370
printing eip:
c1114d53
*pde = 00000000
Oops: 0002 [#1]
Modules linked in:
CPU:    0
EIP:    0060:[<c1114d53>]    Not tainted VLI
EFLAGS: 00010286   (2.6.16 #1)
EIP is at cciss_init_one+0x4e9/0x4fe
eax: 00000000   ebx: c132cd60   ecx: c13154e4   edx: c27d3c00
esi: 00000000   edi: c2748800   ebp: c2536ee4   esp: c2536eb8
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 1, threadinfo=c2536000 task=c2535a30)
Stack: <0>00000000 00000000 00000000 c13fdba0 c2536ee8 c13159c0 c2536f38 f7c74740
       c132cd60 c132cd60 ffffffed c2536ef0 c10c1d51 c2748800 c2536f04 c10c1d85
       c132cd60 c2748800 c132cd8c c2536f14 c10c1db8 c2748848 00000000 c2536f28
Call Trace:
[<c10031d5>] show_stack_log_lvl+0xa8/0xb0
[<c1003305>] show_registers+0x102/0x16a
[<c10034a2>] die+0xc1/0x13c
[<c1288160>] do_page_fault+0x38a/0x525
[<c1002e9b>] error_code+0x4f/0x54
[<c10c1d51>] pci_call_probe+0xd/0x10
[<c10c1d85>] __pci_device_probe+0x31/0x43
[<c10c1db8>] pci_device_probe+0x21/0x34
[<c110a654>] driver_probe_device+0x44/0x99
[<c110a73f>] __driver_attach+0x39/0x5d
[<c1109e1c>] bus_for_each_dev+0x35/0x5a
[<c110a777>] driver_attach+0x14/0x16
[<c110a220>] bus_add_driver+0x5c/0x8f
[<c110ab22>] driver_register+0x73/0x78
[<c10c1f6d>] __pci_register_driver+0x5f/0x71
[<c13bf935>] cciss_init+0x1a/0x1c
[<c13aa718>] do_initcalls+0x4c/0x96
[<c13aa77e>] do_basic_setup+0x1c/0x1e
[<c10002b1>] init+0x35/0x118
[<c1000cf5>] kernel_thread_helper+0x5/0xb
Code: 04 b5 e0 de 40 c1 8d 50 04 8b 40 34 e8 3f b7 f9 ff 8b 04 b5 e0 de
40 c1 e8 aa f3 ff ff 89 f0 e8 e8 fa ff ff 8b 04 b5 e0 de 40 c1 <c7> 80
70 03 00 00 00 00 00 00 83 c8 ff 8d 65 f4 5b 5e 5f 5d c3
<0>Kernel panic - not syncing: Attempted to kill init!

Signed-off-by: Patrick McHardy <kaber@trash.net>
Cc: <mike.miller@hp.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] DM: Fix bug: BIO_RW_BARRIER requests to md/raid1 hang.

Both R1BIO_Barrier and R1BIO_Returned are 4 !!!!

This means that barrier requests don't get returned (i.e. b_endio called)
because it looks like they already have been.
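
The effect of two state flags sharing one bit number can be sketched as
follows. All DEMO_* names and the replacement bit number are hypothetical;
the message does not say which of the two flags was renumbered.

```c
#include <assert.h>

/* Hypothetical bit numbers: the bug is two names mapping to the same bit. */
enum {
	DEMO_BARRIER_BUGGY = 4,
	DEMO_RETURNED      = 4,
	DEMO_BARRIER_FIXED = 5,	/* any unused bit would do */
};

unsigned long demo_set_bit(int nr, unsigned long state)
{
	assert(nr >= 0 && nr < 32);
	return state | (1UL << nr);
}

int demo_test_bit(int nr, unsigned long state)
{
	assert(nr >= 0 && nr < 32);
	return (int)((state >> nr) & 1UL);
}

/* Returns 1 if the request still needs its completion callback. */
int demo_needs_end_io(unsigned long state)
{
	return !demo_test_bit(DEMO_RETURNED, state);
}
```

Marking a request as a barrier with the colliding bit makes it look as if
it had already been returned, so the completion path is skipped.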

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] fix scheduler deadlock

We have noticed lockups during boot when stress testing kexec on ppc64.
Two cpus would deadlock in scheduler code trying to grab already taken
spinlocks.

The double_rq_lock code uses the address of the runqueue to order the
taking of multiple locks. This address is a per cpu variable:

        if (rq1 < rq2) {
                spin_lock(&rq1->lock);
                spin_lock(&rq2->lock);
        } else {
                spin_lock(&rq2->lock);
                spin_lock(&rq1->lock);
        }

On the other hand, the code in wake_sleeping_dependent uses the cpu id
order to grab locks:

        for_each_cpu_mask(i, sibling_map)
                spin_lock(&cpu_rq(i)->lock);

This means we rely on the address of per cpu data increasing as cpu ids
increase. While this will be true for the generic percpu implementation it
may not be true for arch specific implementations.

One way to solve this is to always take runqueues in cpu id order. To do
this we add a cpu variable to the runqueue and check it in the
double runqueue locking functions.
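
A userspace sketch of ordering by cpu id, assuming a toy runqueue type.
struct demo_rq, its lock_seq field and demo_double_rq_lock() are invented
for this example, and a plain flag stands in for the real spinlock.

```c
#include <assert.h>
#include <stddef.h>

struct demo_rq {
	int cpu;	/* id of the CPU that owns this runqueue */
	int locked;	/* stand-in for a real spinlock */
	int lock_seq;	/* records the order in which locks were taken */
};

static int demo_seq;

static void demo_lock(struct demo_rq *rq)
{
	assert(!rq->locked);
	rq->locked = 1;
	rq->lock_seq = ++demo_seq;
}

/* Lock two runqueues, always taking the lower cpu id first. */
void demo_double_rq_lock(struct demo_rq *rq1, struct demo_rq *rq2)
{
	if (rq1 == rq2) {
		demo_lock(rq1);
	} else if (rq1->cpu < rq2->cpu) {
		demo_lock(rq1);
		demo_lock(rq2);
	} else {
		demo_lock(rq2);
		demo_lock(rq1);
	}
}
```

Comparing cpu ids instead of addresses gives the same order as a loop over
a cpu mask, regardless of where the per-cpu data happens to be placed.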

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] proc: fix duplicate line in /proc/devices

Fix a duplicate block device line printed after the "Block device" header
in /proc/devices.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] sparc64: fix set_page_count merge clash

Merge clash will have broken sparc64. Synch up its online_page
implementation with powerpc, which was identical until the
set_page_count removal.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc

* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (78 commits)
  [PATCH] powerpc: Add FSL SEC node to documentation
  [PATCH] macintosh: tidy-up driver_register() return values
  [PATCH] powerpc: tidy-up of_register_driver()/driver_register() return values
  [PATCH] powerpc: via-pmu warning fix
  [PATCH] macintosh: cleanup the use of i2c headers
  [PATCH] powerpc: dont allow old RTC to be selected
  [PATCH] powerpc: make powerbook_sleep_grackle static
  [PATCH] powerpc: Fix warning in add_memory
  [PATCH] powerpc: update mailing list addresses
  [PATCH] powerpc: Remove calculation of io hole
  [PATCH] powerpc: iseries: Add bootargs to /chosen
  [PATCH] powerpc: iseries: Add /system-id, /model and /compatible
  [PATCH] powerpc: Add strne2a() to convert a string from EBCDIC to ASCII
  [PATCH] powerpc: iseries: Make more stuff static in platforms/iseries/mf.c
  [PATCH] powerpc: iseries: Remove pointless iSeries_(restart|power_off|halt)
  [PATCH] powerpc: iseries: mf related cleanups
  [PATCH] powerpc: Replace platform_is_lpar() with a firmware feature
  [PATCH] powerpc: trivial: Cleanup whitespace in cputable.h
  [PATCH] powerpc: Remove unused iommu_off logic from pSeries_init_early()
  [PATCH] powerpc: Unconfuse htab_bolt_mapping() callers
  ...

[PATCH] powerpc: Add FSL SEC node to documentation

Documentation: Added FSL SOC SEC node definition

Updated the documentation to include the definition of the SEC device
node format for Freescale SOC devices.

Signed-off-by: Kim Phillips <kim.phillips@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>

[PATCH] macintosh: tidy-up driver_register() return values

Remove the assumption that driver_register() returns the number of devices
bound to the driver. In fact, it returns zero for success or a negative
error value.

All callers of macio_register_driver() either ignore the return value or
return it as the return value of a module_init() function.
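
The convention can be sketched like this. demo_driver_register() is a
hypothetical stand-in for driver_register(), and the "buggy" wrapper is only
one way a caller might misread the result as a device count.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in: 0 on success, negative errno on failure. */
int demo_driver_register(int fail)
{
	return fail ? -ENOMEM : 0;
}

/* Misreads the result as "number of devices bound": success looks like none. */
int demo_register_buggy(int fail)
{
	int count = demo_driver_register(fail);

	return count > 0 ? 0 : -ENODEV;
}

/* Passes the result through, suitable as a module_init() return value. */
int demo_register_fixed(int fail)
{
	int ret = demo_driver_register(fail);

	assert(ret <= 0);	/* never a positive device count */
	return ret;
}
```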

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>

[PATCH] powerpc: tidy-up of_register_driver()/driver_register() return values

Remove the assumption that driver_register() returns the number of devices
bound to the driver. In fact, it returns zero for success or a negative
error value.

Nobody uses the return value of of_register_driver() anyway.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>

[PATCH] powerpc: via-pmu warning fix

drivers/macintosh/via-pmu.c:164: warning: `sleep_in_progress' defined but not used

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>

[PATCH] macintosh: cleanup the use of i2c headers

Cleanup the use of i2c headers in macintosh drivers.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>

[PATCH] powerpc: dont allow old RTC to be selected

Now that powerpc uses the generic RTC stuff, we should not enable the old
RTC. Doing so will result in hangs at boot.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>

[PATCH] powerpc: make powerbook_sleep_grackle static

powerbook_sleep_grackle is only called inside via-pmu, from pmu_ioctl()

Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>

[PATCH] powerpc: Fix warning in add_memory

arch/powerpc/mm/mem.c: In function `add_memory':
arch/powerpc/mm/mem.c:128: warning: assignment makes integer from pointer without a cast

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>

Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6

* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6:
  [PATCH] Use of uninitialized variable in drivers/net/depca.c
  [PATCH] Use after free in net/tulip/de2104x.c
  [PATCH] sis900 adm7001 PHY support
  [PATCH] sky2: more ethtool stats
  [PATCH] s390: qeth :allow setting of attribute "route6" to "no_router".
  [PATCH] s390: qeth driver cleanups
  [PATCH] s390: qeth driver statistics fixes
  [PATCH] AMD Au1xx0: fix Ethernet TX stats
  [PATCH] fix spidernet build issue

scsi: link in the debug driver last

If the debug driver is built-in, link it in last, so that any real
drivers will probe first, rather than having the debug driver pick the
first scsi slots..

Signed-off-by: Douglas Gilbert <dougg@torque.net>
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: James Bottomley <James.Bottomley@SteelEye.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [TCP]: Do not use inet->id of global tcp_socket when sending RST.
  [NETFILTER]: Fix undefined references to get_h225_addr
  [NETFILTER]: futher {ip,ip6,arp}_tables unification
  [NETFILTER]: Fix xt_policy address matching
  [NETFILTER]: nf_conntrack: support for layer 3 protocol load on demand
  [NETFILTER]: x_tables: set the protocol family in x_tables targets/matches
  [NETFILTER]: conntrack: cleanup the conntrack ID initialization
  [NETFILTER]: nfnetlink_queue: fix nfnetlink message size
  [NETFILTER]: ctnetlink: Fix expectaction mask dumping
  [NETFILTER]: Fix Kconfig typos
  [NETFILTER]: Fix ip6tables breakage from {get,set}sockopt compat layer

Merge master.kernel.org:/home/rmk/linux-2.6-serial

* master.kernel.org:/home/rmk/linux-2.6-serial:
  [SERIAL] Merge avlab serial board entries in parport_serial
  [SERIAL] kernel console should send CRLF not LFCR

Merge master.kernel.org:/home/rmk/linux-2.6-arm

* master.kernel.org:/home/rmk/linux-2.6-arm: (45 commits)
  [ARM] 3389/1: typo and grammar fix
  [ARM] 3386/1: AT91RM9200 Clock update
  [ARM] 3384/1: AT91RM9200: Timer
  [ARM] 3382/1: ixp2000: unify defconfigs
  [ARM] 3381/1: ixp2000: fix slowport write timing control register fields
  [ARM] 3380/1: ixp2000: simplify ixdp2x00_master_npu() check
  [ARM] 3379/1: ixp2000: use generic 8250 debug macros
  [ARM] 3378/1: ixp2000: fix gpio interrupt handling
  [ARM] Quieten spurious IRQ detection
  [ARM] Use kcalloc to allocate counter_config array rather than kmalloc
  [ARM] Oprofile: dynamically allocate counter_config
  [ARM] Oprofile: Convert semaphore to mutex
  [ARM] 3376/2: S3C2410 - update defconfig
  [ARM] 3375/1: S3C2440 - fix osiris machine build
  [ARM] 3374/1: ep93xx: gpio interrupt support
  [ARM] 3361/1: S3C24XX - add USB bus clock source
  [ARM] 3360/1: S3C2440 - add set rate methods and camera clock
  [ARM] 3359/1: S3C24XX - add support for clk_set_rate
  [ARM] Convert kmalloc+memset to kzalloc
  [ARM] 3373/1: move uengine loader to arch/arm/common
  ...

[PATCH] Use of uninitialized variable in drivers/net/depca.c

hi,

this fixes coverity bug #888, where the variable
dev is used uninitialized. I assume the programmer
meant to use mdev, which is initialized.
Compile tested only.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] Use after free in net/tulip/de2104x.c

hi,

this fixes coverity bug #912, where skb is freed first,
and dereferenced a few lines later with skb->len.
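
The general shape of such a fix is to read the field before releasing the
buffer. This is a userspace sketch; struct demo_skb, struct demo_stats and
the assumption that the length feeds byte accounting are illustrative only.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_skb { unsigned int len; };
struct demo_stats { unsigned long tx_packets, tx_bytes; };

struct demo_skb *demo_alloc_skb(unsigned int len)
{
	struct demo_skb *skb = malloc(sizeof(*skb));

	if (skb)
		skb->len = len;
	return skb;
}

/* Account first, then release: skb->len is read before the free. */
void demo_tx_done(struct demo_stats *stats, struct demo_skb *skb)
{
	assert(stats != NULL && skb != NULL);
	stats->tx_packets++;
	stats->tx_bytes += skb->len;	/* last use of skb */
	free(skb);			/* skb must not be touched after this */
}
```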

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] sis900 adm7001 PHY support

This patch is required to get Ethernet working on a SIS964-based motherboard
(FSC D1875). (Picking the #1 transceiver instead of the last one when no known
ones are found might be a better default, and would have worked in this case too.)

Signed-off-by: Artur Skawina <art_k@o2.pl>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] sky2: more ethtool stats

Expose all the available hardware statistics via ethtool.
And cleanup some of the statistics definitions.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] s390: qeth :allow setting of attribute "route6" to "no_router".

[patch 4/6] s390: qeth :allow setting of attribute "route6" to "no_router".

From: Ursula Braun <braunu@de.ibm.com>
When setting the route6 attribute back to no_router, qeth does not
issue an IP ASSIST command to reset the router value to no_router.
Once primary_router is set, the device stays in this mode.
Issue an IP ASSIST command when no_router is set in route6.
The device will be reset and thus will no longer run as a primary
router.

Signed-off-by: Frank Pavlic <fpavlic@de.ibm.com>
diffstat:
qeth_main.c | 5 -----
1 files changed, 5 deletions(-)
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] s390: qeth driver cleanups

[patch 3/6] s390: qeth driver cleanups

From: Ursula Braun <braunu@de.ibm.com>
- the code analysis tool BEAM has found some unreachable
  and unnecessary statements and also conditions
  which are always true.
- removed some useless MII code since the OSA card never
  allows such values to be set.

Signed-off-by: Frank Pavlic <fpavlic@de.ibm.com>
diffstat:
qeth_main.c |   49 ++++---------------------------------------------
qeth_proc.c |   18 +++++++++---------
qeth_sys.c  |    2 +-
3 files changed, 14 insertions(+), 55 deletions(-)
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] s390: qeth driver statistics fixes

[patch 2/6] s390: qeth driver statistics fixes

From: Ursula Braun <braunu@de.ibm.com>
- display "unsigned int" values in /proc/qeth_perf with %u instead of %i
- omit qdio header length when increasing card->stats.tx_bytes
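
Both points can be sketched in a few lines. The helper names are
hypothetical; the first shows why %u matters for counters above INT_MAX
(with %i the same value would typically be shown as negative), the second
counts only the payload, assuming the header length is known.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format an unsigned counter; %u keeps large values positive. */
int demo_format_counter(char *buf, size_t size, unsigned int value)
{
	assert(size > 0);
	return snprintf(buf, size, "%u", value);
}

/* Add a transmitted frame to the byte total, leaving out the header. */
unsigned long demo_tx_bytes(unsigned long total, unsigned int frame_len,
			    unsigned int hdr_len)
{
	assert(frame_len >= hdr_len);
	return total + (frame_len - hdr_len);
}
```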

Signed-off-by: Frank Pavlic <fpavlic@de.ibm.com>
diffstat:
qeth_main.c | 3 ++-
qeth_proc.c | 38 +++++++++++++++++++-------------------
2 files changed, 21 insertions(+), 20 deletions(-)
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] AMD Au1xx0: fix Ethernet TX stats

With Au1xx0 Ethernet driver, TX bytes/packets always remain zero. The
problem seems to be that when packet has been transmitted, the length word
in DMA buffer is zero.

The patch updates the TX stats when a buffer is fed to DMA. The initial
2.4 patch was posted to linux-mips@linux-mips.org by Thomas Lange 21 Jan
2005.

Signed-off-by: Thomas Lange <thomas@corelatus.se>
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Jordan Crouse <jordan.crouse@amd.com>
Cc: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] fix spidernet build issue

<unchangelogged>

Signed-off-by: Jens Osterkamp <Jens.Osterkamp@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] ahci: add softreset

Now that libata is smart enough to handle both soft and hard resets,
add softreset method.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] libata: do not ignore PIO-only devices

As libata now can do PIO, don't ignore PIO-only devices.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] libata: Symbol exports

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] Update libata DMA blacklist to cover versions, and resync with IDE layer

Not much to say here except that some drives have fixed and bad firmware

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] libata: Fix a drive detection problem

The current code follows the spec but uses an overlong delay. This would
be great if the hardware did. Several vendors, however, forget the D7
pulldown. Fortunately 0xFF isn't a sane reset state, so we can use it to
skip detection as is done in drivers/ide (i.e. this solution has been
tested over a long time).

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] libata: note missing posting in mmio cmd write

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

Merge branch 'master'

[TCP]: Do not use inet->id of global tcp_socket when sending RST.

The problem is in ip_push_pending_frames(), which uses:

        if (!df) {
                __ip_select_ident(iph, &rt->u.dst, 0);
        } else {
                iph->id = htons(inet->id++);
        }

instead of ip_select_ident().

Right now I think the code is nonsense. Most likely, I copied it from
the old ip_build_xmit(), where it was really special: we had to decide
whether to generate a unique ID when generating the first (well, the last)
fragment.

In ip_push_pending_frames() it does not make sense, it should use plain
ip_select_ident() instead.

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

[NETFILTER]: Fix undefined references to get_h225_addr

get_h225_addr is exported, but declared static, which fails when
linking statically.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[NETFILTER]: futher {ip,ip6,arp}_tables unification

This patch moves {ip,ip6,arp}t_entry_{match,target} definitions to
x_tables.h. This move simplifies code and future compatibility fixes.

Signed-off-by: Dmitry Mishin <dim@openvz.org>
Acked-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[NETFILTER]: Fix xt_policy address matching

Fix missing inversion in address matching, it was broken during the
conversion to x_tables.
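
Inverted matching is commonly written as "comparison XOR invert flag". The
sketch below shows that pattern with a hypothetical demo_addr_match()
helper on IPv4-sized values; it is not the xt_policy code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 if the rule matches. With invert == 0 the masked addresses
 * must be equal; with invert == 1 they must differ.
 */
int demo_addr_match(uint32_t addr, uint32_t want, uint32_t mask, int invert)
{
	int equal = ((addr ^ want) & mask) == 0;

	assert(invert == 0 || invert == 1);
	return equal ^ invert;
}
```

Dropping the XOR makes an inverted rule behave like a plain one, which is
the kind of breakage a missing inversion produces.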

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[NETFILTER]: nf_conntrack: support for layer 3 protocol load on demand

x_tables matches and targets that require nf_conntrack_ipv[4|6] to work
don't have enough information to load these modules on demand. This
patch introduces the following changes to solve this issue:

o nf_ct_l3proto_try_module_get: try to load the layer 3 connection
tracker module and increase the refcount.
o nf_ct_l3proto_module_put: drop the refcount of the module.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[NETFILTER]: x_tables: set the protocol family in x_tables targets/matches

Set the family field in xt_[matches|targets] registered.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[NETFILTER]: conntrack: cleanup the conntrack ID initialization

Currently the first conntrack ID assigned is 2, use 1 instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[NETFILTER]: nfnetlink_queue: fix nfnetlink message size

Fix oversized message, use NLMSG_SPACE just once since it reserves space
for the netlink header and NFA_SPACE for every attribute.
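
The size arithmetic can be sketched with simplified stand-ins for the
macros. The DEMO_* macros, the 16-byte message header, the 4-byte attribute
header and the 4-byte alignment are assumptions chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ALIGN(len)      (((size_t)(len) + 3) & ~(size_t)3)
#define DEMO_MSG_HDRLEN      ((size_t)16)	/* message header, counted once */
#define DEMO_ATTR_HDRLEN     ((size_t)4)	/* per-attribute header */
#define DEMO_MSG_SPACE(len)  DEMO_ALIGN(DEMO_MSG_HDRLEN + (len))
#define DEMO_ATTR_SPACE(len) DEMO_ALIGN(DEMO_ATTR_HDRLEN + (len))

/* Correct: one message header, then one attribute header per attribute. */
size_t demo_msg_size(size_t payload, const size_t *attr_len, size_t n)
{
	size_t size = DEMO_MSG_SPACE(payload);

	assert(attr_len != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		size += DEMO_ATTR_SPACE(attr_len[i]);
	return size;
}

/* Oversized: reserves a full message header for every attribute too. */
size_t demo_msg_size_oversized(size_t payload, const size_t *attr_len, size_t n)
{
	size_t size = DEMO_MSG_SPACE(payload);

	assert(attr_len != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		size += DEMO_MSG_SPACE(attr_len[i]);
	return size;
}
```

With a 4-byte payload and attributes of 8 and 4 bytes, the first function
yields 40 bytes and the second 64 under these assumed header sizes.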

Thanks to Harald Welte for the feedback

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[NETFILTER]: ctnetlink: Fix expectaction mask dumping

The expectation mask has some particularities that require different
handling. The protocol number fields can be set to non-valid protocols,
i.e. l3num is set to 0xFFFF. Since that protocol does not exist, the mask
tuple will not be dumped. Moreover, this results in a kernel panic when
nf_conntrack accesses the array of protocol handlers, which is PF_MAX (0x1F)
long.

This patch introduces the function ctnetlink_exp_dump_mask, which correctly
dumps the expectation mask. This function uses the l3num value from the
expectation tuple, which is a valid layer 3 protocol number. The value of the
l3num mask isn't dumped since it is meaningless from the userspace side.
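
A toy model of the lookup, assuming a handler table indexed by l3num. The
demo_* types and functions are invented for illustration; only the idea of
taking l3num from the tuple rather than from the mask comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PF_MAX 0x1f	/* table length, as quoted in the description */

struct demo_tuple { uint16_t l3num; };
struct demo_l3proto { const char *name; };

static const struct demo_l3proto demo_ipv4 = { "ipv4" };
static const struct demo_l3proto *demo_l3protos[DEMO_PF_MAX] = {
	[2] = &demo_ipv4,	/* slot 2 chosen to mirror AF_INET */
};

/* Bounds-checked lookup: an l3num such as 0xFFFF must never index the table. */
const struct demo_l3proto *demo_find_l3proto(uint16_t l3num)
{
	if (l3num >= DEMO_PF_MAX)
		return NULL;
	return demo_l3protos[l3num];
}

/* Pick the handler for dumping the mask from the tuple's l3num. */
const struct demo_l3proto *demo_exp_mask_proto(const struct demo_tuple *tuple,
					       const struct demo_tuple *mask)
{
	assert(tuple != NULL);
	(void)mask;	/* mask->l3num may be 0xFFFF and is deliberately ignored */
	return demo_find_l3proto(tuple->l3num);
}
```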

Thanks to Yasuyuki Kozakai and Patrick McHardy for the feedback.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[NETFILTER]: Fix Kconfig typos

Signed-off-by: Thomas Vögtle <tv@lio96.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[NETFILTER]: Fix ip6tables breakage from {get,set}sockopt compat layer

do_ipv6_getsockopt returns -EINVAL for unknown options, not
-ENOPROTOOPT as do_ipv6_setsockopt.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

[ARM] 3389/1: typo and grammar fix

Patch from Erik Hovland

I found a typo and what seems to be a run-on sentence in
arch/arm/common/dmabounce.c

This patch corrects both.

Signed-off-by: Erik Hovland <erik@hovland.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

[ARM] 3386/1: AT91RM9200 Clock update

Patch from Andrew Victor

This patch includes a few changes to the clock support on the
AT91RM9200.

1. Added definitions for Ethernet, MMC, TWI, USARTs, and SPI peripheral
clocks.

2. Replaced some hard-coded hex values with the text definitions in
at91rm9200_sys.h.

3. If the USB96M bit is set for PLLB, then the rate of PLLB itself is not
affected, only that of the USB Host/Device clocks which are derived from it.
Issue reported by Sergei Sharonov.

Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

[ARM] 3384/1: AT91RM9200: Timer

Patch from Andrew Victor

If the timer interrupt is ever significantly delayed (or after the
system was suspended), the system could spin incrementing the time for
too long.
The fix is to replace the "do {} while" with a "while {}".
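
The difference between the two loop forms can be sketched with a wrapping
counter. DEMO_MASK, DEMO_LATCH and both functions are hypothetical, and this
shows only one plausible way a do/while catch-up loop can overrun; it is not
the driver's actual code.

```c
#include <assert.h>

#define DEMO_MASK  0xfffffu	/* assumed 20-bit free-running counter */
#define DEMO_LATCH 100u		/* assumed counter steps per timer tick */

/* do {} while form: always accounts at least one tick before checking. */
unsigned int demo_ticks_do_while(unsigned int now, unsigned int last)
{
	unsigned int ticks = 0;

	assert(now <= DEMO_MASK && last <= DEMO_MASK);
	do {
		ticks++;
		last = (last + DEMO_LATCH) & DEMO_MASK;
	} while (((now - last) & DEMO_MASK) >= DEMO_LATCH);
	return ticks;
}

/* while {} form: checks the elapsed time before accounting any tick. */
unsigned int demo_ticks_while(unsigned int now, unsigned int last)
{
	unsigned int ticks = 0;

	assert(now <= DEMO_MASK && last <= DEMO_MASK);
	while (((now - last) & DEMO_MASK) >= DEMO_LATCH) {
		ticks++;
		last = (last + DEMO_LATCH) & DEMO_MASK;
	}
	return ticks;
}
```

When less than one period has elapsed, the do/while form still advances
"last" past "now", the masked difference wraps to a large value, and the
loop keeps running; the while form returns immediately with zero ticks.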

Original patch by Savin Zlobec and Peter Menzebach.

Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

[ARM] 3382/1: ixp2000: unify defconfigs

Patch from Lennert Buytenhek

Unify the five existing ixp2000 defconfigs into one defconfig.

Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

[ARM] 3381/1: ixp2000: fix slowport write timing control register fields

Patch from Lennert Buytenhek

The original version of the chip docs had the PW and SU fields in
the slowport write timing control register accidentally reversed.
This is mentioned in the errata (documentation change #4) and fixed
in newer docs.

Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

[ARM] 3380/1: ixp2000: simplify ixdp2x00_master_npu() check

Patch from Lennert Buytenhek

On the IXDP2x00s, the NPU that is PCI master is always the egress
(i.e. 'master') NPU. At least on the IXDP2800, both NPUs have flash,
so the ixp2000_has_flash() check in ixdp2x00_master_npu() is useless.

Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

[ARM] 3379/1: ixp2000: use generic 8250 debug macros

Patch from Lennert Buytenhek

The xscale UART in the ixp2000 is basically just an 8250 UART (with
some extra bits and pieces), so we can use the generic 8250 debug
macros on the ixp2000.

Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

[ARM] 3378/1: ixp2000: fix gpio interrupt handling

Patch from Lennert Buytenhek

ixp2000 used to initially mark GPIO interrupts as invalid, and not
mark them valid until set_irq_type() was called, but this doesn't
work if you want to use request_irq() with the SA_TRIGGER_* flags.

So, just mark the GPIO interrupts valid from the beginning. We
configure GPIOs as inputs when set_irq_type() is called anyway, so
this shouldn't be a problem.

Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/perex/alsa

* git://git.kernel.org/pub/scm/linux/kernel/git/perex/alsa: (124 commits)
  [ALSA] version 1.0.11rc4
  [PATCH] Intruduce DMA_28BIT_MASK
  [ALSA] hda-codec - Add support for ASUS P4GPL-X
  [ALSA] hda-codec - Add support for HP nx9420 laptop
  [ALSA] Fix memory leaks in error path of control.c
  [ALSA] AMD Au1x00: AC'97 controller is memory mapped
  [ALSA] AMD Au1x00: fix DMA init/cleanup
  [ALSA] hda-codec - Fix generic auto-configurator
  [ALSA] hda-codec - Fix BIOS auto-configuration
  [ALSA] Fixes typos in Audiophile-USB.txt
  [ALSA] ice1712 - typo fixes for dxr_enable module option
  [ALSA] AMD Au1x00: make driver build after cleanup
  [ALSA] ice1712 - Fix wrong value types for enum items
  [ALSA] fix resource leak in usbmixer
  [ALSA] Fix gus_pcm dereference before NULL
  [ALSA] Fix seq_clientmgr dereferences before NULL check
  [ALSA] hda-codec - Fix for Samsung R65 and ASUS A6J
  [ALSA] hda-codec - Add support for VAIO FE550G and SZ110
  [ALSA] usb-audio: add Maya44 mixer control names
  [ALSA] usb-audio: add Casio PL-40R support
  ...

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial

* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
  fixed path to moved file in include/linux/device.h
  Fix spelling in E1000_DISABLE_PACKET_SPLIT Kconfig description
  Documentation/dvb/get_dvb_firmware: fix firmware URL
  Documentation: Update to BUG-HUNTING
  Remove superfluous NOTIFY_COOKIE_LEN define
  add "tags" to .gitignore
  Fix "frist", "fisrt", typos
  fix rwlock usage example
  It's UTF-8

Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6

* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
  [SPARC64]: Add a secondary TSB for hugepage mappings.
  [SPARC]: Respect vm_page_prot in io_remap_page_range().

Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [TG3]: Bump driver version and reldate.
  [TG3]: Skip phy power down on some devices
  [TG3]: Fix SRAM access during tg3_init_one()
  [X25]: dte facilities 32 64 ioctl conversion
  [X25]: allow ITU-T DTE facilities for x25
  [X25]: fix kernel error message 64 bit kernel
  [X25]: ioctl conversion 32 bit user to 64 bit kernel
  [NET]: socket timestamp 32 bit handler for 64 bit kernel
  [NET]: allow 32 bit socket ioctl in 64 bit kernel
  [BLUETOOTH]: Return negative error constant

Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6

* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (138 commits)
  [SCSI] libata: implement minimal transport template for ->eh_timed_out
  [SCSI] eliminate rphy allocation in favour of expander/end device allocation
  [SCSI] convert mptsas over to end_device/expander allocations
  [SCSI] allow displaying and setting of cache type via sysfs
  [SCSI] add scsi_mode_select to scsi_lib.c
  [SCSI] 3ware 9000 add big endian support
  [SCSI] qla2xxx: update MAINTAINERS
  [SCSI] scsi: move target_destroy call
  [SCSI] fusion - bump version
  [SCSI] fusion - expander hotplug suport in mptsas module
  [SCSI] fusion - exposing raid components in mptsas
  [SCSI] fusion - memory leak, and initializing fields
  [SCSI] fusion - exclosure misspelled
  [SCSI] fusion - cleanup mptsas event handling functions
  [SCSI] fusion - removing target_id/bus_id from the VirtDevice structure
  [SCSI] fusion - static fix's
  [SCSI] fusion - move some debug firmware event debug msgs to verbose level
  [SCSI] fusion - loginfo header update
  [SCSI] add scsi_reprobe_device
  [SCSI] megaraid_sas: fix extended timeout handling
  ...

[PATCH] SELinux: add slab cache for inode security struct

Add a slab cache for the SELinux inode security struct, one of which is
allocated for every inode instantiated by the system.

The memory savings are considerable.

On 64-bit, instead of the size-128 cache, we have a slab object of 96
bytes, saving 32 bytes per object.  After booting, I see about 4000 of
these and then about 17,000 after a kernel compile.  With this patch, we
save around 530KB of kernel memory in the latter case.  On 32-bit, the
savings are about half of this.
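
As a rough check of the 64-bit figure above, the saving is just the object
count times the per-object difference.  The helper below is a made-up
illustration (not kernel code); the counts are the ones quoted in this
message.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: bytes saved when nr_objects
 * move from a slab object of old_size bytes to one of new_size bytes.
 */
static size_t slab_bytes_saved(size_t nr_objects, size_t old_size,
			       size_t new_size)
{
	assert(old_size >= new_size);
	return nr_objects * (old_size - new_size);
}
```

With 17,000 objects going from 128 to 96 bytes this gives 544,000 bytes,
i.e. about 531KB, which matches the "around 530KB" quoted above.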

Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] SELinux: cleanup stray variable in selinux_inode_init_security()

Remove an unneeded pointer variable in selinux_inode_init_security().

Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] SELinux: fix hard link count for selinuxfs root directory

A further fix is needed for selinuxfs link count management, to ensure that
the count is correct for the parent directory when a subdirectory is
created. This is only required for the root directory currently, but the
code has been updated for the general case.

Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] selinuxfs cleanups: sel_make_avc_files

Fix copy & paste error in sel_make_avc_files(), removing a spurious call to
d_genocide() in the error path. All of this will be cleaned up by
kill_litter_super().

Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] selinuxfs cleanups: sel_make_bools

Remove the call to sel_make_bools() from sel_fill_super(), as policy needs to
be loaded before the boolean files can be created. Policy will never be
loaded during sel_fill_super() as selinuxfs is kernel mounted during init and
the only means to load policy is via selinuxfs.

Also, the call to d_genocide() on the error path of sel_make_bools() is
incorrect and replaced with sel_remove_bools().

Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] selinuxfs cleanups: sel_fill_super exit path

Unify the error path of sel_fill_super() so that all errors pass through the
same point and generate an error message. Also, remove a spurious dput() in
the error path which breaks the refcounting for the filesystem
(kill_litter_super() will correctly clean things up itself on error).

Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] selinuxfs cleanups: use sel_make_dir()

Use existing sel_make_dir() helper to create booleans directory rather than
duplicating the logic.

Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] selinuxfs cleanups: fix hard link count

Fix the hard link count for selinuxfs directories, which are currently one
short.

Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] selinux: simplify sel_read_bool

Simplify sel_read_bool to use the simple_read_from_buffer helper, like the
other selinuxfs functions.
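
For readers unfamiliar with the helper, the following is a user-space model
of the semantics it provides as I understand them: clamp the read to what is
left in the buffer, copy, and advance the file position.  The function name
and the use of memcpy() in place of a copy to user space are assumptions
made for this sketch; it is not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * User-space model only (hypothetical name): copy up to 'count' bytes of
 * 'from' (which holds 'available' bytes), starting at *ppos, into 'to',
 * then advance *ppos by the number of bytes copied.
 */
static long model_read_from_buffer(char *to, size_t count, long *ppos,
				   const char *from, size_t available)
{
	long pos = *ppos;

	assert(to && from);
	if (pos < 0)
		return -EINVAL;
	if ((size_t)pos >= available)
		return 0;	/* at or past the end: EOF */
	if (count > available - (size_t)pos)
		count = available - (size_t)pos;	/* clamp to the rest */
	memcpy(to, from + pos, count);
	*ppos = pos + (long)count;
	return (long)count;
}
```

A read function built on such a helper only has to format its value into a
buffer and hand over the buffer and its length.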

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] sem2mutex: security/

Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Stephen Smalley <sds@epoch.ncsc.mil>
Cc: James Morris <jmorris@namei.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] selinux: Disable automatic labeling of new inodes when no policy is loaded

This patch disables the automatic labeling of new inodes on disk
when no policy is loaded.

Discussion is here:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=180296

In short, we're changing the behavior so that when no policy is loaded,
SELinux does not label files at all.  Currently it does add an 'unlabeled'
label in this case, which we've found causes problems later.

SELinux always maintains a safe internal label if there is none, so with this
patch, we just stick with that and wait until a policy is loaded before adding
a persistent label on disk.

The effect is simply that if you boot with SELinux enabled but no policy
loaded and create a file in that state, SELinux won't try to set a security
extended attribute on the new inode on the disk.  This is the only sane
behavior for SELinux in that state, as it cannot determine the right label to
assign in the absence of a policy.  That state usually doesn't occur, but the
rawhide installer seemed to be misbehaving temporarily so it happened to show
up on a test install.

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] page migration reorg

Centralize the page migration functions in anticipation of additional
tinkering. Creates a new file mm/migrate.c

1. Extract buffer_migrate_page() from fs/buffer.c

2. Extract central migration code from vmscan.c

3. Extract some components from mempolicy.c

4. Export pageout() and remove_from_swap() from vmscan.c

5. Make it possible to configure NUMA systems without page migration
and non-NUMA systems with page migration.

I had to do some #ifdeffing in mempolicy.c that may need a cleanup.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] mm: slab cache interleave rotor fix

The alien cache rotor in mm/slab.c assumes that the first online node is
node 0. Eventually for some archs, especially with hotplug, this will no
longer be true.

Fix the interleave rotor to handle the general case of node numbering.
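
The general case can be sketched in user space as "advance to the next
online node, wrapping around to the first online one".  The bitmask
representation, MAX_NODES and the function name below are assumptions made
for illustration; the kernel uses its own nodemask helpers.

```c
#include <assert.h>

#define MAX_NODES 32	/* illustrative limit, not the kernel's value */

/*
 * Hypothetical helper: 'online' has bit n set when node n is online.
 * Return the next online node after 'node', wrapping around.  Nothing
 * here assumes that node 0 is online or that node ids are contiguous.
 */
static int next_online_node_wrap(unsigned int online, int node)
{
	int i;

	assert(online != 0);
	assert(node >= 0 && node < MAX_NODES);
	for (i = 1; i <= MAX_NODES; i++) {
		int cand = (node + i) % MAX_NODES;

		if (online & (1u << cand))
			return cand;
	}
	return -1;	/* not reached while at least one node is online */
}
```

With only nodes 2 and 5 online, the rotor alternates between 2 and 5 and
never lands on node 0.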

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] mm: hugetlb alloc_fresh_huge_page bogus node loop fix

Fix bogus node loop in hugetlb.c alloc_fresh_huge_page(), which was
assuming that nodes are numbered contiguously from 0 to num_online_nodes().
Once the hotplug folks get this far, that will be false.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] fix swap cluster offset

When we've allocated SWAPFILE_CLUSTER pages, ->cluster_next should be the
first index of the swap cluster. But the current code probably sets it to
the wrong offset.

Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] drain_node_pages: interrupt latency reduction / optimization

1. Only disable interrupts if there is actually something to free

2. Only dirty the pcp cacheline if we actually freed something.

3. Disable interrupts for each single pcp and not for cleaning
all the pcps in all zones of a node.

drain_node_pages is called every 2 seconds from cache_reap. This
fix should avoid most disabling of interrupts.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] slab: fix drain_array() so that it works correctly with the shared_array

The list_lock also protects the shared array and we call drain_array() with
the shared array. Therefore we cannot go as far as I wanted to but have to
take the lock in a way so that it also protects the array_cache in
drain_pages.

(Note: maybe we should make the array_cache locking more consistent? I.e.
always take the array cache lock for shared arrays and disable interrupts
for the per cpu arrays?)

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] slab: remove drain_array_locked

Remove drain_array_locked and use that opportunity to limit the time the l3
lock is taken further.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] slab: make drain_array more universal by adding more parameters

Add a parameter to drain_array() to control the freeing of all objects, and
then use drain_array() to replace instances of drain_array_locked().  Doing
so will avoid taking locks in those locations if the arrays are empty.
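
The shape of such a function can be modelled in user space as follows.  The
struct, the lock counter that stands in for taking the list lock, and the
fraction freed in the non-forced case are all assumptions made for this
sketch rather than the actual slab code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for an array cache; not the kernel structure. */
struct array_cache_model {
	int avail;	/* objects currently cached */
	int limit;	/* capacity of the array */
	int touched;	/* set when the array was used recently */
	int lock_taken;	/* counts how often the "lock" was needed */
};

/* Returns the number of objects freed. */
static int drain_array_model(struct array_cache_model *ac, int force)
{
	int tofree;

	if (ac == NULL || ac->avail == 0)
		return 0;		/* empty: no lock needed at all */
	if (ac->touched && !force) {
		ac->touched = 0;	/* recently used: leave it alone */
		return 0;
	}
	ac->lock_taken++;		/* stands in for taking the lock */
	tofree = force ? ac->avail : (ac->limit + 4) / 5;
	if (tofree > ac->avail)
		tofree = (ac->avail + 1) / 2;
	ac->avail -= tofree;
	return tofree;
}
```

The point of the early returns is that an empty or recently touched array
never costs a lock round trip; only callers passing force free everything.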

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] slab: cache_reap(): further reduction in interrupt holdoff

cache_reap takes the l3->list_lock (disabling interrupts) unconditionally
and then does a few checks and maybe does some cleanup.  This patch makes
cache_reap() only take the lock if there is work to do and then the lock is
taken and released for each cleaning action.

The checking of when to do the next reaping is done without any locking and
becomes racy.  Should not matter since reaping can also be skipped if the
slab mutex cannot be acquired.

The same is true for the touched processing.  If we get this wrong once in
a while then we will mistakenly clean or not clean the shared cache.  This
will impact performance slightly.

Note that the additional drain_array() function introduced here will fall
out in a subsequent patch since array cleaning will now be very similar
from all callers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] mm: make shrink_all_memory try harder

Make shrink_all_memory() repeat the attempts to free more memory if there
seem to be no pages to free.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] optimize follow_hugetlb_page

follow_hugetlb_page() walks a range of user virtual addresses and fills in
a list of struct page * in an array passed in by the caller.  It also takes
a reference on each page via get_page().  For a compound page, get_page()
actually traverses back to the head page via the page_private() macro and
then adds a reference count to the head page.  Since we are doing a virt to
pte lookup, the kernel already has a struct page pointer to the head page.
So instead of descending to the small unit page struct and then following a
link back to the head page, optimize that by incrementing the reference
count directly on the head page.

The benefit is that we don't take a cache miss on accessing the page struct
for the corresponding user address and, more importantly, we don't pollute
the cache with a "not very useful" round trip of pointer chasing.  This adds
a moderate performance gain on an I/O intensive database transaction
workload.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] convert hugetlbfs_counter to atomic

Implementation of hugetlbfs_counter() is functionally equivalent to
atomic_inc_return(). Use the simpler atomic form.
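
As a user-space analogue, C11 atomics express the same "increment and
return the new value" operation in one call.  The names below are
hypothetical and this is a sketch of the idea, not the hugetlbfs code.

```c
#include <stdatomic.h>

/* Hypothetical counter; zero-initialised because it has static storage. */
static atomic_int counter_model;

/*
 * Increment the counter and return the new value.  atomic_fetch_add()
 * returns the old value, hence the "+ 1".
 */
static int counter_next(void)
{
	return atomic_fetch_add(&counter_model, 1) + 1;
}
```

No separate lock is needed around the counter, which is what makes the
atomic form the simpler one.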

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] hugepage: is_aligned_hugepage_range() cleanup

Quite a long time back, prepare_hugepage_range() replaced
is_aligned_hugepage_range() as the callback from mm/mmap.c to arch code to
verify if an address range is suitable for a hugepage mapping.
is_aligned_hugepage_range() stuck around, but only to implement
prepare_hugepage_range() on archs which didn't implement their own.

Most archs (everything except ia64 and powerpc) used the same
implementation of is_aligned_hugepage_range().  On powerpc, which
implements its own prepare_hugepage_range(), the custom version was never
used.

In addition, "is_aligned_hugepage_range()" was a bad name, because it
suggests it returns true iff the given range is a good hugepage range,
whereas in fact it returns 0-or-error (so the sense is reversed).

This patch cleans up by abolishing is_aligned_hugepage_range().  Instead
prepare_hugepage_range() is defined directly.  Most archs use the default
version, which simply checks the given region is aligned to the size of a
hugepage.  ia64 and powerpc define custom versions.  The ia64 one simply
checks that the range is in the correct address space region in addition to
being suitably aligned.  The powerpc version (just as previously) checks
for suitable addresses, and if necessary performs low-level MMU frobbing to
set up new areas for use by hugepages.
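
A default check of this kind, together with the 0-or-error convention
discussed above, can be sketched as below.  The 2MB hugepage size and the
"model_" name are assumptions made for illustration; the real size is
architecture dependent.

```c
#include <assert.h>
#include <errno.h>

#define HPAGE_SHIFT	21			/* assumed: 2MB hugepages */
#define HPAGE_SIZE	(1UL << HPAGE_SHIFT)
#define HPAGE_MASK	(~(HPAGE_SIZE - 1))

/*
 * Model of a default alignment check: returns 0 when the range is usable
 * for a hugepage mapping and a negative errno otherwise, so "0" means
 * success, not "false".
 */
static int model_prepare_hugepage_range(unsigned long addr, unsigned long len)
{
	assert(len != 0);
	if (len & ~HPAGE_MASK)
		return -EINVAL;		/* length not a hugepage multiple */
	if (addr & ~HPAGE_MASK)
		return -EINVAL;		/* start not hugepage aligned */
	return 0;
}
```

A caller therefore tests for failure with "if (ret)", which is the reverse
of what a predicate named is_aligned_...() would suggest.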

No libhugetlbfs testsuite regressions on ppc64 (POWER5 LPAR).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] hugepage: Move hugetlb_free_pgd_range() prototype to hugetlb.h

The optional hugepage callback, hugetlb_free_pgd_range(), is presently
implemented non-trivially only on ia64 (but I plan to add one for powerpc
shortly). It has its own prototype for the function in asm-ia64/pgtable.h.
However, since the function is called from generic code, it makes sense for
its prototype to be in the generic hugetlb.h header file, as the prototypes
of other arch callbacks already are (prepare_hugepage_range(),
set_huge_pte_at(), etc.). This patch makes it so.

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] hugepage: Fix hugepage logic in free_pgtables() harder

Turns out the hugepage logic in free_pgtables() was doubly broken.  The
loop coalescing multiple normal page VMAs into one call to free_pgd_range()
had an off by one error, which could mean it would coalesce one hugepage
VMA into the same bundle (checking 'vma' not 'next' in the loop).  I
transferred this bug into the new is_vm_hugetlb_page() based version.
Here's the fix.

This one didn't bite on powerpc previously for the same reason the
is_hugepage_only_range() problem didn't: powerpc's hugetlb_free_pgd_range()
is identical to free_pgd_range().  It didn't bite on ia64 because the
hugepage region is distant enough from any other region that the separated
PMD_SIZE distance test would always prevent coalescing the two together.

No libhugetlbfs testsuite regressions (ppc64, POWER5).

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] hugepage: Fix hugepage logic in free_pgtables()

free_pgtables() has special logic to call hugetlb_free_pgd_range() instead
of the normal free_pgd_range() on hugepage VMAs.  However, the test it uses
to do so is incorrect: it calls is_hugepage_only_range on a hugepage sized
range at the start of the vma.  is_hugepage_only_range() will return true
if the given range has any intersection with a hugepage address region, and
in this case the given region need not be hugepage aligned.  So, for
example, this test can return true if called on, say, a 4k VMA immediately
preceding a (nicely aligned) hugepage VMA.
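
The false positive is easy to reproduce with a plain interval-intersection
test.  The region layout, the 2MB hugepage size and the names below are
made up for illustration; this is not the arch code.

```c
#include <assert.h>

#define HPAGE_SIZE_MODEL (1UL << 21)	/* assumed: 2MB hugepages */

/* Half-open address range [start, end). */
struct addr_region {
	unsigned long start;
	unsigned long end;
};

/* True if [addr, addr + len) has any overlap with the region. */
static int range_intersects(unsigned long addr, unsigned long len,
			    const struct addr_region *r)
{
	assert(len != 0);
	return addr < r->end && addr + len > r->start;
}
```

A 4k VMA that ends exactly where the hugepage region starts lies wholly
outside that region, yet a hugepage-sized probe from its start address
reaches into the region, so the intersection test reports true.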

At present we get away with this because the powerpc version of
hugetlb_free_pgd_range() is just a call to free_pgd_range().  On ia64 (the
only other arch with a non-trivial is_hugepage_only_range()) we get away
with it for a different reason; the hugepage area is not contiguous with
the rest of the user address space, and VMAs are not permitted in between,
so the test can't return a false positive there.

Nonetheless this should be fixed.  We do that in the patch below by
replacing the is_hugepage_only_range() test with an explicit test of the
VMA using is_vm_hugetlb_page().

This in turn changes behaviour for platforms where is_hugepage_only_range()
returns false always (everything except powerpc and ia64).  We address this
by ensuring that hugetlb_free_pgd_range() is defined to be identical to
free_pgd_range() (instead of a no-op) on everything except ia64.  Even so,
it will prevent some otherwise possible coalescing of calls down to
free_pgd_range().  Since this only happens for hugepage VMAs, removing this
small optimization seems unlikely to cause any trouble.

This patch causes no regressions on the libhugetlbfs testsuite - ppc64
POWER5 (8-way), ppc64 G5 (2-way) and i386 Pentium M (UP).

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] hugepage: Make {alloc,free}_huge_page() local

Originally, mm/hugetlb.c just handled the hugepage physical allocation path
and its {alloc,free}_huge_page() functions were used from the arch specific
hugepage code.  These days those functions are only used within mm/hugetlb.c
itself.  Therefore, this patch makes them static and removes their
prototypes from hugetlb.h.  This requires a small rearrangement of code in
mm/hugetlb.c to avoid a forward declaration.

This patch causes no regressions on the libhugetlbfs testsuite (ppc64,
POWER5).

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] hugepage: Strict page reservation for hugepage inodes

These days, hugepages are demand-allocated at first fault time.  There's a
somewhat dubious (and racy) heuristic when making a new mmap() to check if
there are enough available hugepages to fully satisfy that mapping.

A particularly obvious case where the heuristic breaks down is where a
process maps its hugepages not as a single chunk, but as a bunch of
individually mmap()ed (or shmat()ed) blocks without touching and
instantiating the pages in between allocations.  In this case the size of
each block is compared against the total number of available hugepages.
It's thus easy for the process to become overcommitted, because each block
mapping will succeed, although the total number of hugepages required by
all blocks exceeds the number available.  In particular, this defeats such
a program which will detect a mapping failure and adjust its hugepage usage
downward accordingly.

The patch below addresses this problem, by strictly reserving a number of
physical hugepages for hugepage inodes which have been mapped, but not
instantiated.  MAP_SHARED mappings are thus "safe" - they will fail on
mmap(), not later with an OOM SIGKILL.  MAP_PRIVATE mappings can still
trigger an OOM.  (Actually SHARED mappings can technically still OOM, but
only if the sysadmin explicitly reduces the hugepage pool between mapping
and instantiation)
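
The difference between the old heuristic and strict reservation can be
modelled with two small functions.  The pool structure and function names
are hypothetical and the accounting is deliberately simplified; this is a
sketch of the idea, not the patch itself.

```c
#include <assert.h>
#include <errno.h>

/* Simplified hugepage pool; not the kernel's data structure. */
struct hpool_model {
	long free_pages;	/* hugepages currently free */
	long reserved;		/* promised to mappings, not yet faulted in */
};

/* Old-style check: each mapping is compared against the whole pool. */
static int map_heuristic(const struct hpool_model *p, long pages)
{
	return pages <= p->free_pages ? 0 : -ENOMEM;
}

/* Strict reservation: a mapping must fit in what is not yet promised. */
static int map_reserve(struct hpool_model *p, long pages)
{
	assert(pages >= 0);
	if (pages > p->free_pages - p->reserved)
		return -ENOMEM;
	p->reserved += pages;
	return 0;
}
```

With 10 free hugepages, two 6-page mappings both pass the heuristic and the
process is overcommitted; with reservation the second mapping fails at map
time, where the program can still react to it.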

This patch appears to address the problem at hand - it allows DB2 to start
correctly, for instance, which previously suffered the failure described
above.

This patch causes no regressions on the libhugetlbfs testsuite, and makes a
test (designed to catch this problem) pass which previously failed (ppc64,
POWER5).

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] hugepage: serialize hugepage allocation and instantiation

Currently, no lock or mutex is held between allocating a hugepage and
inserting it into the pagetables / page cache.  When we do go to insert the
page into pagetables or page cache, we recheck and may free the newly
allocated hugepage.  However, since the number of hugepages in the system
is strictly limited, and it's usual to want to use all of them, this can
still lead to spurious allocation failures.

For example, suppose two processes are both mapping (MAP_SHARED) the same
hugepage file, large enough to consume the entire available hugepage pool.
If they race instantiating the last page in the mapping, they will both
attempt to allocate the last available hugepage.  One will fail, of course,
returning OOM from the fault and thus causing the process to be killed,
despite the fact that the entire mapping can, in fact, be instantiated.

The patch fixes this race by the simple method of adding a (sleeping) mutex
to serialize the hugepage fault path between allocation and insertion into
pagetables and/or page cache.  It would be possible to avoid the
serialization by catching the allocation failures, waiting on some
condition, then rechecking to see if someone else has instantiated the page
for us.  Given the likely frequency of hugepage instantiations, it seems
very doubtful it's worth the extra complexity.

This patch causes no regression on the libhugetlbfs testsuite, and one
test, which can trigger this race, now passes where it previously failed.

Actually, the test still sometimes fails, though less often and only as a
shmat() failure, rather than processes getting OOM killed by the VM.  The
dodgy heuristic tests in fs/hugetlbfs/inode.c for whether there's enough
hugepage space aren't protected by the new mutex, and it would be ugly to
make them so, so there's still a race there.  Another patch to replace those
tests with something saner, for this reason as well as others, is coming...

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] hugepage: Small fixes to hugepage clear/copy path

Move the loops used in mm/hugetlb.c to clear and copy hugepages to their
own functions for clarity.  As we do so, we add some checks of need_resched
- we are, after all, copying megabytes of memory here.  We also add
might_sleep() accordingly.  We generally dropped locks around the clear and
copy already, but not everyone has PREEMPT enabled, so we should still be
checking explicitly.

For this to work, we need to remove the clear_huge_page() from
alloc_huge_page(), which is called with the page_table_lock held in the COW
path.  We move the clear_huge_page() to just after the alloc_huge_page() in
the hugepage no-page path.  In the COW path, the new page is about to be
copied over, so clearing it was just a waste of time anyway.  So as a side
effect we also fix the fact that we held the page_table_lock for far too
long in this path by calling alloc_huge_page() under it.

It causes no regressions on the libhugetlbfs testsuite (ppc64, POWER5).

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] Enable mprotect on huge pages

2.6.16-rc3 uses hugetlb on-demand paging, but it doesn't support hugetlb
mprotect.

From: David Gibson <david@gibson.dropbear.id.au>

  Remove a test from the mprotect() path which checks that the mprotect()ed
  range on a hugepage VMA is hugepage aligned (yes, really, the sense of
  is_aligned_hugepage_range() is the opposite of what you'd guess :-/).

  In fact, we don't need this test.  If the given addresses match the
  beginning/end of a hugepage VMA they must already be suitably aligned.  If
  they don't, then mprotect_fixup() will attempt to split the VMA.  The very
  first test in split_vma() will check for a badly aligned address on a
  hugepage VMA and return -EINVAL if necessary.

From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>

  On i386 and x86-64, pte flag _PAGE_PSE collides with _PAGE_PROTNONE.  The
  identity of a hugetlb pte is lost when changing page protection via
  mprotect.  A page fault that occurs later will trigger a bug check in
  huge_pte_alloc().

  The fix is to always make the new pte a hugetlb pte and also to clean up
  legacy code where _PAGE_PRESENT is forced on in the pre-faulting days.

Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] readahead: fix initial window size calculation

The current get_init_ra_size is not optimal across different IO
sizes and max_readahead values.  Here is a quick summary of sizes computed
under current design and under the attached patch.  All of these assume 1st
IO at offset 0, or 1st detected sequential IO.

32k max, 4k request

old         new
-----------------
8k          8k
16k         16k
32k         32k

128k max, 4k request

old         new
-----------------
32k         16k
64k         32k
128k        64k
128k        128k

128k max, 32k request

old         new
-----------------
32k         64k    <-----
64k         128k
128k        128k

512k max, 4k request

old         new
-----------------
4k          32k    <-----
16k         64k
64k         128k
128k        256k
512k        512k

Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] readahead: ->prev_page can overrun the ahead window

If get_next_ra_size() does not grow fast enough, ->prev_page can overrun
the ahead window. This means the caller will read the pages from
->ahead_start + ->ahead_size to ->prev_page synchronously.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>