Lockd and the NFSv4 server both exercise a race condition where
posix_test_lock() is called either before or after posix_lock_file() to
deal with a denied lock request due to a conflicting lock.
Remove the race condition for the NFSv4 server by adding a new conflicting
lock parameter to __posix_lock_file() , changing the name to
__posix_lock_file_conf().
Keep posix_lock_file() interface, add posix_lock_conf() interface, both
call __posix_lock_file_conf().
[akpm@osdl.org: Put the EXPORT_SYMBOL() where it belongs] Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Eric Dumazet [Sun, 26 Mar 2006 09:37:24 +0000 (01:37 -0800)]
[PATCH] Use __read_mostly on some hot fs variables
I discovered on oprofile hunting on a SMP platform that dentry lookups were
slowed down because d_hash_mask, d_hash_shift and dentry_hashtable were in
a cache line that contained inodes_stat. So each time inodes_stats is
changed by a cpu, other cpus have to refill their cache line.
This patch moves some variables to the __read_mostly section, in order to
avoid false sharing. RCU dentry lookups can go full speed.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Corey Minyard [Sun, 26 Mar 2006 09:37:22 +0000 (01:37 -0800)]
[PATCH] ipmi: Increment driver version to v39.0
Need to increment the version number because of the new PCI and sysfs
capabilities of the driver. People maintaining things for distros have
asked that I do this after interface or major functional changes.
Signed-off-by: Corey Minyard <minyard@acm.org> Acked-by: Matt Domsch <Matt_Domsch@dell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Corey Minyard [Sun, 26 Mar 2006 09:37:21 +0000 (01:37 -0800)]
[PATCH] ipmi: add full sysfs support
Add full driver model support for the IPMI driver. It links in the proper
bus and device support.
It adds an "ipmi" driver interface that has each BMC discovered by the
driver (as a device). These BMCs appear in the devices/platform directory.
If there are multiple interfaces to the same BMC, the driver should
discover this and will only have one BMC entry. The BMC entry will have
pointers to each interface device that connects to it.
The device information (statistics and config information) has not yet been
ported over to the driver model from proc, that will come later.
This work was based on work by Yani Ioannou. I basically rewrote it using
that code as a guide, but he still deserves credit :).
[bunk@stusta.de: make ipmi_find_bmc_guid() static] Signed-off-by: Corey Minyard <minyard@acm.org> Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Corey Minyard [Sun, 26 Mar 2006 09:37:20 +0000 (01:37 -0800)]
[PATCH] ipmi: add generic PCI handling
Modify the PCI hanling code for the IPMI driver to use the new method of
tables and registering, and adds more generic PCI handling for IPMI.
Unfortunately, this required a rather large rework of the way the driver
did detection so it would be more event-driven.
[bunk@stusta.de: make a struct static] Signed-off-by: Corey Minyard <minyard@acm.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
NeilBrown [Sun, 26 Mar 2006 09:37:18 +0000 (01:37 -0800)]
[PATCH] Make address_space_operations->invalidatepage return void
The return value of this function is never used, so let's be honest and
declare it as void.
Some places where invalidatepage returned 0, I have inserted comments
suggesting a BUG_ON.
[akpm@osdl.org: JBD BUG fix]
[akpm@osdl.org: rework for git-nfs]
[akpm@osdl.org: don't go BUG in block_invalidate_page()] Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Dave Kleikamp <shaggy@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ingo Molnar [Sun, 26 Mar 2006 09:37:12 +0000 (01:37 -0800)]
[PATCH] sem2mutex: fs/
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Eric Van Hensbergen <ericvh@ericvh.myip.org> Cc: Robert Love <rml@tech9.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Neil Brown <neilb@cse.unsw.edu.au> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Bjorn Helgaas [Sun, 26 Mar 2006 09:37:08 +0000 (01:37 -0800)]
[PATCH] EFI: keep physical table addresses in efi structure
Almost all users of the table addresses from the EFI system table want
physical addresses. So rather than doing the pa->va->pa conversion, just keep
physical addresses in struct efi.
This fixes a DMI bug: the efi structure contained the physical SMBIOS address
on x86 but the virtual address on ia64, so dmi_scan_machine() used ioremap()
on a virtual address on ia64.
This is essentially the same as an earlier patch by Matt Tolentino:
http://marc.theaimsgroup.com/?l=linux-kernel&m=112130292316281&w=2
except that this changes all table addresses, not just ACPI addresses.
Matt's original patch was backed out because it caused MCAs on HP sx1000
systems. That problem is resolved by the ioremap() attribute checking added
for ia64.
Bjorn Helgaas [Sun, 26 Mar 2006 09:37:07 +0000 (01:37 -0800)]
[PATCH] DMI: only ioremap stuff we actually need
dmi_scan_machine() tries to ioremap 0x10000 (64K) bytes, even though it only
looks at the first 32 bytes or so. If the SMBIOS table is near the end of a
memory region, the ioremap() may fail when it shouldn't.
This is in the efi_enabled path, so it really only affects ia64 at the moment.
Bjorn Helgaas [Sun, 26 Mar 2006 09:37:06 +0000 (01:37 -0800)]
[PATCH] ia64: ioremap: check EFI for valid memory attributes
Check the EFI memory map so we can use the correct memory attributes for
ioremap(). Previously, we always used uncacheable access, which blows up on
some machines for regular system memory.
Pass the size, not a pointer to the size, to efi_mem_attribute_range().
This function validates memory regions for the /dev/mem read/write/mmap paths.
The pointer allows arches to reduce the size of the range, but I think that's
unnecessary complexity. Simplifying it will let me use
efi_mem_attribute_range() to improve the ia64 ioremap() implementation.
Matt Domsch [Sun, 26 Mar 2006 09:37:03 +0000 (01:37 -0800)]
[PATCH] ia64: use i386 dmi_scan.c
Enable DMI table parsing on ia64.
Andi Kleen has a patch in his x86_64 tree which enables the use of i386
dmi_scan.c on x86_64. dmi_scan.c functions are being used by the
drivers/char/ipmi/ipmi_si_intf.c driver for autodetecting the ports or
memory spaces where the IPMI controllers may be found.
This patch adds equivalent changes for ia64 as to what is in the x86_64
tree. In addition, I reworked the DMI detection, such that on EFI-capable
systems, it uses the efi.smbios pointer to find the table, rather than
brute-force searching from 0xF0000. On non-EFI systems, it continues the
brute-force search.
My test system, an Intel S870BN4 'Tiger4', aka Dell PowerEdge 7250, with
latest BIOS, does not list the IPMI controller in the ACPI namespace, nor
does it have an ACPI SPMI table. Also note, currently shipping Dell x8xx
EM64T servers don't have these either, so DMI is the only method for
obtaining the address of the IPMI controller.
Signed-off-by: Matt Domsch <Matt_Domsch@dell.com> Acked-by: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Vivek Goyal [Sun, 26 Mar 2006 09:37:02 +0000 (01:37 -0800)]
[PATCH] i386: export: memory more than 4G through /proc/iomem
Currently /proc/iomem exports physical memory also apart from io device
memory. But on i386, it truncates any memory more than 4GB. This leads to
problems for kexec/kdump.
Kexec reads /proc/iomem to determine the system memory layout and prepares a
memory map based on that and passes it to the kernel being kexeced. Given the
fact that memory more than 4GB has been truncated, new kernel never gets to
see and use that memory.
Kdump also reads /proc/iomem to determine the physical memory layout of the
system and encodes this informaiton in ELF headers. After a crash new kernel
parses these ELF headers being used by previous kernel and vmcore is prepared
accordingly. As memory more than 4GB has been truncated, kdump never sees
that memory and never prepares ELF headers for it. Hence vmcore is truncated
and limited to 4GB even if there is more physical memory in the system.
This patch exports memory more than 4GB through /proc/iomem on i386.
Jan Beulich [Sun, 26 Mar 2006 09:37:01 +0000 (01:37 -0800)]
[PATCH] i386: pass proper trap numbers to die chain handlers
Pass the trap number causing the call to notify_die() to the die
notification handler chain in a number of instances. Also, honor the
return value from the handler chain invocation in die() as, through a
debugger, the fault may have been fixed.
Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-By: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
H. Peter Anvin [Sun, 26 Mar 2006 09:36:59 +0000 (01:36 -0800)]
[PATCH] x86: "make isoimage" support; FDINITRD= support; minor cleanups
Add a "make isoimage" to i386 and x86-64, which allows the automatic
creation of a bootable CD image. It also adds an option FDINITRD= to
include an initrd of the user's choice in generated floppy- or CD boot
images. Finally, some minor cleanups of the image generation code.
Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Andi Kleen <ak@muc.de> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
James Bottomley [Sun, 26 Mar 2006 09:36:59 +0000 (01:36 -0800)]
[PATCH] Add flush_kernel_dcache_page() API
We have a problem in a lot of emulated storage in that it takes a page from
get_user_pages() and does something like
kmap_atomic(page)
modify page
kunmap_atomic(page)
However, nothing has flushed the kernel cache view of the page before the
kunmap. We need a lightweight API to do this, so this new API would
specifically be for flushing the kernel cache view of a user page which the
kernel has modified. The driver would need to add
flush_kernel_dcache_page(page) before the final kunmap.
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
James Bottomley [Sun, 26 Mar 2006 09:36:57 +0000 (01:36 -0800)]
[PATCH] Add API for flushing Anon pages
Currently, get_user_pages() returns fully coherent pages to the kernel for
anything other than anonymous pages. This is a problem for things like
fuse and the SCSI generic ioctl SG_IO which can potentially wish to do DMA
to anonymous pages passed in by users.
The fix is to add a new memory management API: flush_anon_page() which
is used in get_user_pages() to make anonymous pages coherent.
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Steven Rostedt [Sun, 26 Mar 2006 09:36:55 +0000 (01:36 -0800)]
[PATCH] protect remove_proc_entry
It has been discovered that the remove_proc_entry has a race in the removing
of entries in the proc file system that are siblings. There's no protection
around the traversing and removing of elements that belong in the same
subdirectory.
This subdirectory list is protected in other areas by the BKL. So the BKL was
at first used to protect this area too, but unfortunately, remove_proc_entry
may be called with spinlocks held. The BKL may schedule, so this was not a
solution.
The final solution was to add a new global spin lock to protect this list,
called proc_subdir_lock. This lock now protects the list in
remove_proc_entry, and I also went around looking for other areas that this
list is modified and added this protection there too. Care must be taken
since these locations call several functions that may also schedule.
Since I don't see any location that these functions that modify the
subdirectory list are called by interrupts, the irqsave/restore versions of
the spin lock was _not_ used.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Linus Torvalds [Sun, 26 Mar 2006 04:29:54 +0000 (20:29 -0800)]
Merge master.kernel.org:/home/rmk/linux-2.6-arm
* master.kernel.org:/home/rmk/linux-2.6-arm:
[ARM] 3030/2: fix permission check in the obscur cmpxchg syscall
[ARM] nommu: rename compressed/head.S symbols to a new style
[ARM] select TLS_REG_EMUL and NEEDS_SYSCALL_FOR_CMPXCHG
[ARM] nommu: Move hardware page table definitions to pgtable-hwdef.h
[ARM] Move read of processor ID out of lookup_processor_type()
[ARM] Fix typo in tlbflush.h
[ARM] noMMU: removes TLB codes in nommu mode
[ARM] noMMU: block sys_fork in nommu mode
[ARM] 3399/1: Fix link problem when CONFIG_PRINTK is disabled
[ARM] 3398/1: Fix the VFP registers loading/storing base address
[ARM] 3397/1: AT91RM9200 Header update
[ARM] 3385/1: Battery support for sharp zaurus sl-5500 (collie)
[ARM] SMP: don't set cpu_*_map in smp_prepare_boot_cpu
include/linux/clk.h is betraying its ARM origins
[ARM] Move enable_irq and disable_irq to assembler.h
[ARM] 3391/1: use PLAT8250_DEV_PLATFORM{,1} for platform device id instead of 0/1
[ARM] 3383/3: ixp2000: ixdp2x01 platform serial conversion
Patch from Lennert Buytenhek
Add a PLAT8250_DEV_PLATFORM2, and convert the two ixdp2x01 CPLD serial
ports to use platform serial devices with ids PLAT8250_DEV_PLATFORM[12].
(The on-chip xscale UART is PLAT8250_DEV_PLATFORM, id #0.)
Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Nicolas Pitre [Sat, 25 Mar 2006 22:44:05 +0000 (22:44 +0000)]
[ARM] 3030/2: fix permission check in the obscur cmpxchg syscall
Patch from Nicolas Pitre
Quoting RMK:
|pte_write() just says that the page _may_ be writable. It doesn't say
|that the MMU is programmed to allow writes. If pte_dirty() doesn't
|return true, that means that the page is _not_ writable from userspace.
|If you write to it from kernel mode (without using put_user) you'll
|bypass the MMU read-only protection and may end up writing to a page
|owned by two separate processes.
Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Malcolm Parsons [Sat, 25 Mar 2006 21:58:03 +0000 (21:58 +0000)]
[ARM] 3399/1: Fix link problem when CONFIG_PRINTK is disabled
Patch from Malcolm Parsons
Printking a backtrace requires printk, so disable backtrace code
when printk is disabled.
Without this patch, a kernel with CONFIG_PRINTK disabled does not link:
arch/arm/lib/lib.a(backtrace.o): In function `c_backtrace':
arch/arm/lib/backtrace.S:(.text+0x108): undefined reference to `printk'
arch/arm/lib/backtrace.S:(.text+0x11c): undefined reference to `printk'
arch/arm/lib/lib.a(backtrace.o):(.fixup+0x8): undefined reference to `printk'
Signed-off-by: Malcolm Parsons <malcolm.parsons@gmail.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Catalin Marinas [Sat, 25 Mar 2006 21:58:00 +0000 (21:58 +0000)]
[ARM] 3398/1: Fix the VFP registers loading/storing base address
Patch from Catalin Marinas
The current VFP code corrupts the VFP registers (including the control
ones) if more than one floating point application is executed at the same
time. This patch fixes the updating of the load/store base addresses for
the VFP registers.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Russell King [Sat, 25 Mar 2006 21:37:29 +0000 (21:37 +0000)]
[ARM] SMP: don't set cpu_*_map in smp_prepare_boot_cpu
The recent addition of boot_cpu_init() implements the initialisation
of the online, present and possible cpu maps for the boot CPU, so
there is no reason to duplicate this in the architecture
smp_prepare_boot_cpu() hook.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Linus Torvalds [Sat, 25 Mar 2006 17:24:53 +0000 (09:24 -0800)]
Merge branch 'audit.b3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: (22 commits)
[PATCH] fix audit_init failure path
[PATCH] EXPORT_SYMBOL patch for audit_log, audit_log_start, audit_log_end and audit_format
[PATCH] sem2mutex: audit_netlink_sem
[PATCH] simplify audit_free() locking
[PATCH] Fix audit operators
[PATCH] promiscuous mode
[PATCH] Add tty to syscall audit records
[PATCH] add/remove rule update
[PATCH] audit string fields interface + consumer
[PATCH] SE Linux audit events
[PATCH] Minor cosmetic cleanups to the code moved into auditfilter.c
[PATCH] Fix audit record filtering with !CONFIG_AUDITSYSCALL
[PATCH] Fix IA64 success/failure indication in syscall auditing.
[PATCH] Miscellaneous bug and warning fixes
[PATCH] Capture selinux subject/object context information.
[PATCH] Exclude messages by message type
[PATCH] Collect more inode information during syscall processing.
[PATCH] Pass dentry, not just name, in fsnotify creation hooks.
[PATCH] Define new range of userspace messages.
[PATCH] Filter rule comparators
...
Fixed trivial conflict in security/selinux/hooks.c
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/aoe-2.6:
[PATCH] aoe [3/3]: update version to 22
[PATCH] aoe [2/3]: don't request ATA device ID on ATA error
[PATCH] aoe [1/3]: support multiple AoE listeners
[PATCH] aoe: do not stop retransmit timer when device goes down
[PATCH] aoe [8/8]: update driver version number
[PATCH] aoe [7/8]: update driver compatibility string
[PATCH] aoe [6/8]: update device information on last close
[PATCH] aoe [5/8]: allow network interface migration on packet retransmit
[PATCH] aoe [4/8]: use less confusing driver name
[PATCH] aoe [3/8]: increase allowed outstanding packets
[PATCH] aoe [2/8]: support dynamic resizing of AoE devices
[PATCH] aoe [1/8]: zero packet data after skb allocation
Linus Torvalds [Sat, 25 Mar 2006 17:18:27 +0000 (09:18 -0800)]
Merge git://git.linux-nfs.org/pub/linux/nfs-2.6
* git://git.linux-nfs.org/pub/linux/nfs-2.6: (103 commits)
SUNRPC,RPCSEC_GSS: spkm3--fix config dependencies
SUNRPC,RPCSEC_GSS: spkm3: import contexts using NID_cast5_cbc
LOCKD: Make nlmsvc_traverse_shares return void
LOCKD: nlmsvc_traverse_blocks return is unused
SUNRPC,RPCSEC_GSS: fix krb5 sequence numbers.
NFSv4: Dont list system.nfs4_acl for filesystems that don't support it.
SUNRPC,RPCSEC_GSS: remove unnecessary kmalloc of a checksum
SUNRPC: Ensure rpc_call_async() always calls tk_ops->rpc_release()
SUNRPC: Fix memory barriers for req->rq_received
NFS: Fix a race in nfs_sync_inode()
NFS: Clean up nfs_flush_list()
NFS: Fix a race with PG_private and nfs_release_page()
NFSv4: Ensure the callback daemon flushes signals
SUNRPC: Fix a 'Busy inodes' error in rpc_pipefs
NFS, NLM: Allow blocking locks to respect signals
NFS: Make nfs_fhget() return appropriate error values
NFSv4: Fix an oops in nfs4_fill_super
lockd: blocks should hold a reference to the nlm_file
NFSv4: SETCLIENTID_CONFIRM should handle NFS4ERR_DELAY/NFS4ERR_RESOURCE
NFSv4: Send the delegation stateid for SETATTR calls
...
* master.kernel.org:/pub/scm/linux/kernel/git/davej/agpgart:
[AGPGART] x86_64: Enable VIA AGP driver on x86-64 for VIA P4 chipsets
[AGPGART] x86_64: Fix wrong PCI ID for ALI M1695 AGP bridge
[AGPGART] ATI RS350 support.
[AGPGART] Lots of CodingStyle/whitespace cleanups.
Andi Kleen [Sat, 25 Mar 2006 15:31:52 +0000 (16:31 +0100)]
[PATCH] x86_64: Initialize powernow_data[] for all siblings
I got an oops on a dual core system because the lost tick handler
called cpufreq_get() on core 1 and powernow tried to follow
a NULL powernow_data[] pointer there.
Initialize powernow_data for all cores of a CPU.
Cc: Jacob Shin <jacob.shin@amd.com> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Eric Dumazet [Sat, 25 Mar 2006 15:31:46 +0000 (16:31 +0100)]
[PATCH] x86_64: group memnodemap and memnodeshift in a memnode structure
pfn_to_page() and others need to access both memnode_shift and the very
first bytes of memnodemap[]. If we force memnode_shift to be just before the
memnodemap array, we can reduce the memory footprint to one cache line
instead of two for most setups. This patch introduce a 'memnode' structure
where shift and map[] are carefully placed.
Navin Boppuri [Sat, 25 Mar 2006 15:31:40 +0000 (16:31 +0100)]
[PATCH] x86_64: Search K8 devices on more devices.
arch/x86_64/kernel/aperture.c: The search for the AGP bridge has been
extended to search for all the 256 buses instead of the first 32. This
is required since on a some systems, the bridge may be located on a bus
much farther than the first 32. By searching all 256 buses, we guarantee
that the search succeeds on such systems.
arch/x86_64/kernel/pci-gart.c: The search for the Northbridge is not
limited to just bus 0 anymore. This is required because on certain
systems, we may not find one on bus 0.
Jon Mason [Sat, 25 Mar 2006 15:31:34 +0000 (16:31 +0100)]
[PATCH] x86_64: Make GART_IOMMU kconfig help text more specific (trivial)
Have the GART_IOMMU help text specify that this is the hardware IOMMU in
amd64 processors. This will be significant if/when other IOMMUs are
added to the x86-64 architecture. :-)
Also, note that the previous help text stated that IOMMU was needed for
>3GB memory instead of >4GB. This is fixed in the newer version.
Andi Kleen [Sat, 25 Mar 2006 15:31:31 +0000 (16:31 +0100)]
[PATCH] x86_64: Remove CONFIG_UNORDERED_IO
It was a failed experiment - all benchmarks done with it on both AMD
and Intel showed it was a loss. That was probably because the store
buffers of the CPUs for write combining traffic weren't large enough.
Jon Mason [Sat, 25 Mar 2006 15:31:19 +0000 (16:31 +0100)]
[PATCH] x86_64: free_bootmem_node needs __pa in allocate_aperture
free_bootmem_node expects a physical address to be passed in, but
__alloc_bootmem_node returns a virtual one. That address needs to be
translated to physical.
Vivek Goyal [Sat, 25 Mar 2006 15:31:16 +0000 (16:31 +0100)]
[PATCH] x86_64: timer interrupt lockup due to pending interrupt
o check_timer() routine fails while second kernel is booting after a crash
on an opetron box. Problem happens because timer vector (0x31) seems to be
locked.
o After a system crash, it is not safe to service interrupts any more, hence
interrupts are disabled. This leads to pending interrupts at LAPIC. LAPIC
sends these interrupts to the CPU during early boot of second kernel. Other
pending interrupts are discarded saying unexpected trap but timer interrupt
is serviced and CPU does not issue an LAPIC EOI because it think this
interrupt came from i8259 and sends ack to 8259. This leads to vector 0x31
locking as LAPIC does not clear respective ISR and keeps on waiting for
EOI.
o This patch issues extra EOI for the pending interrupts who have ISR set.
o Though today only timer seems to be the special case because in early
boot it thinks interrupts are coming from i8259 and uses
mask_and_ack_8259A() as ack handler and does not issue LAPIC EOI. But
probably doing it in generic manner for all vectors makes sense.
Andi Kleen [Sat, 25 Mar 2006 15:31:10 +0000 (16:31 +0100)]
[PATCH] x86_64: Try to allocate node memmap near the end of node
This fixes problems with very large nodes (over 128GB) filling up all of
the first 4GB with their mem_map and not leaving enough space for the
swiotlb.
Andi Kleen [Sat, 25 Mar 2006 15:31:04 +0000 (16:31 +0100)]
[PATCH] x86_64: Change default setting for noexec32 to match i386 kernel
This means i386 processes compiled with a recent compiler will get non
executable heap by default now. This is the same default as a 32bit PAE
kernel would use on a NX enabled CPU.
Chuck Ebbert [Sat, 25 Mar 2006 15:30:55 +0000 (16:30 +0100)]
[PATCH] x86_64: fix orphaned bits of timer init messages
When x86_64 timer init messages were changed to use apic verbosity
levels, two messages were missed and one got the wrong level. This
causes the last word of a suppressed message to print on a line by
itself. Fix that so either the entire message prints or none of it
does.
Arjan van de Ven [Sat, 25 Mar 2006 15:30:49 +0000 (16:30 +0100)]
[PATCH] x86_64: Basic reorder infrastructure
This patch puts the infrastructure in place to allow for a reordering of
functions based inside the vmlinux. The general idea is that it is possible
to put all "common" functions into the first 2Mb of the code, so that they
are covered by one TLB entry. This as opposed to the current situation where
a typical vmlinux covers about 3.5Mb (on x86-64) and thus 2 TLB entries.
This is done by enabling the -ffunction-sections flag in gcc, which puts
each function in its own ELF section, so that the linker can then order them
in a way defined by the linker script.
As per previous discussions, Linus said he wanted a "static" list for this,
eg a list provided by the kernel tarbal, so that most people have the same
ordering at least. A script is provided to create this list based on
readprofile(1) output. The included list is provisional, and entirely biased
on my own testbox and me running a few kernel compiles and some other
things.
I think that to get to a better list we need to invite people to submit
their own profiles, and somehow add those all up and base the final list on
that. I'm willing to do that effort if this is ends up being the prefered
approach. Such an effort probably needs to be repeated like once a year or
so to adopt to the changing nature of the kernel.
Made it a CONFIG with default n because it increases link times
dramatically.
Andi Kleen [Sat, 25 Mar 2006 15:30:31 +0000 (16:30 +0100)]
[PATCH] x86_64: Handle years beyond 2100
ACPIv2 has an official but optional way to get a date >2100. Use it.
But all the platforms I tested didn't seem to support it. But anyways
the x86-64 kernel should be ready for the 22nd century now. Actually i
shouldn't care about this because I will be dead by then @)
Arjan van de Ven [Sat, 25 Mar 2006 15:30:28 +0000 (16:30 +0100)]
[PATCH] x86_64: Patch to make the head.S-must-be-first-in-vmlinux order explicit
This patch puts the code from head.S in a special .bootstrap.text
section.
I'm working on a patch to reorder the functions in the kernel (I'll post
that later), but for x86-64 at least the kernel bootstrap requires that
the head.S functions are on the very first page/pages of the kernel
text. This is understandable since the bootstrap is complex enough
already and not a problem at all, it just means they aren't allowed to
be reordered. This patch puts these special functions into a separate
section to document this, and to guarantee this in the light of possibly
reordering the rest later.
(So this patch doesn't fix a bug per se, but makes things more robust by
making the order of these functions explicit)
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andi Kleen [Sat, 25 Mar 2006 15:30:22 +0000 (16:30 +0100)]
[PATCH] x86_64: Implement early DMI scanning
There are more and more cases where we need to know DMI information
early to work around bugs. i386 already had early DMI scanning, but
x86-64 didn't. Implement this now.
Andi Kleen [Sat, 25 Mar 2006 15:30:19 +0000 (16:30 +0100)]
[PATCH] x86_64: Clean up and tweak ACPI blacklist year code
- Move the core parser into dmi_scan.c. It can be useful for other
subsystems too.
- Differentiate between field doesn't exist and field is 0 or
unparseable. The first case is likely an old BIOS with broken ACPI,
the later is likely a slightly buggy BIOS where someone forget to
edit the date. Don't blacklist in the later case.
Arjan van de Ven [Sat, 25 Mar 2006 15:30:10 +0000 (16:30 +0100)]
[PATCH] x86_64: prefetch the mmap_sem in the fault path
In a micro-benchmark that stresses the pagefault path, the down_read_trylock
on the mmap_sem showed up quite high on the profile. Turns out this lock is
bouncing between cpus quite a bit and thus is cache-cold a lot. This patch
prefetches the lock (for write) as early as possible (and before some other
somewhat expensive operations). With this patch, the down_read_trylock
basically fell out of the top of profile.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] x86_64: to use lapic ids instead of initial apic ids
phys_proc_id[] on AMD boxes is right now populated with the initial
apic id, obtained by the cpuid instruction. But, the initial apic id
need not be the local apic id on clustered APIC systems (see comment at
x86_64/kernel/genapic_cluster.c, line 110). On vSMPowered with AMD
CPUs the cpu_to_node will turn out to be incorrect (as apicid_to_node[] is
indexed by the initial apic id rather than the local apic id).
On vSMPowered boxes with Intel CPUs this is working correctly as
phys_proc_id[] is initialized correctly in detect_ht().
This fixes AMD boot path according to specification, to use the correct
routines for local apic id and socket ids. We use
hard_smp_processor_id() to read the local apic id, and phys_pkg_id() to
determine socket id for phys_proc_id[]
Patch tested on Tyan multicore boxes as well as vSMPowered boxes.
Jan Beulich [Sat, 25 Mar 2006 15:30:01 +0000 (16:30 +0100)]
[PATCH] x86_64: miscellaneous cleanup
- adjust limits of GDT/IDT pseudo-descriptors (some were off by one)
- move empty_zero_page into .bss.page_aligned
- move cpu_gdt_table into .data.page_aligned
- move idt_table into .bss
- align inital_code and init_rsp
- eliminate pointless (re-)declaration of idt_table in traps.c
Roberto Nibali [Sat, 25 Mar 2006 15:29:55 +0000 (16:29 +0100)]
[PATCH] x86_64: Clean up white space in traps.c
Attached is a small code style cleanup patch that resulted from my
skimming through the arch/x86_64/kernel/traps.c code to figure out what
went haywire.
Andi Kleen [Sat, 25 Mar 2006 15:29:49 +0000 (16:29 +0100)]
[PATCH] x86_64: Don't define string functions to builtin
gcc should handle this anyways, and it causes problems when
sprintf is turned into strcpy by gcc behind our backs and
the C fallback version of strcpy is actually defining __builtin_strcpy
Then drop -ffreestanding from the main Makefile because it isn't
needed anymore and implies -fno-builtin, which is wrong now.
(it was only added for x86-64, so dropping it should be safe)
Andi Kleen [Sat, 25 Mar 2006 15:29:46 +0000 (16:29 +0100)]
[PATCH] x86_64: Check that early arguments are words on their own
We've always had the problem that arguments only did a prefix match,
which resulted e.g. in noapic and noapictimer getting confused.
Fix the early argument parsing code to always check that arguments are
whole words (except for those that take additional arguments of course)
I factored out the checking code for that while also makes the code
easier to maintain.
Jan Beulich [Sat, 25 Mar 2006 15:29:40 +0000 (16:29 +0100)]
[PATCH] x86_64: actively synchronize vmalloc area when registering certain callbacks
While the modular aspect of the respective i386 patch doesn't apply to
x86-64 (as the top level page directory entry is shared between modules
and the base kernel), handlers registered with register_die_notifier()
are still under similar constraints for touching ioremap()ed or
vmalloc()ed memory. The likelihood of this problem becoming visible is
of course significantly lower, as the assigned virtual addresses would
have to cross a 2**39 byte boundary. This is because the callback gets
invoked
(a) in the page fault path before the top level page table propagation
gets carried out (hence a fault to propagate the top level page table
entry/entries mapping to module's code/data would nest infinitly) and
(b) in the NMI path, where nested faults must absolutely not happen,
since otherwise the IRET from the nested fault re-enables NMIs,
potentially resulting in nested NMI occurences.
Andi Kleen [Sat, 25 Mar 2006 15:29:31 +0000 (16:29 +0100)]
[PATCH] x86_64: Don't need to read PIT in timer handler when PM timer is used
The PM timer path through main_timer_handler doesn't need
the delay variable because it figures it out in a different way.
Don't try to read it from the PIT. With stopped PIT timer
it is even useless.