Paul Mundt [Thu, 14 Oct 2010 17:09:00 +0000 (02:09 +0900)]
sh: Provide a generic SRAM pool for tiny memories.
This sets up a generic SRAM pool for CPUs and platform code to insert
their otherwise unused memories into. A simple alloc/free interface is
provided (lifed from avr32) for generic code.
This only applies to tiny SRAMs that are otherwise unmanaged, and does
not take in to account the more complex SRAMs sitting behind transfer
engines, or that employ an I/D split.
Paul Mundt [Wed, 13 Oct 2010 23:44:55 +0000 (08:44 +0900)]
sh: pci: Support secondary FPGA-driven PCIe clocks on SDK7786.
The SDK7786 FPGA has secondary control over the PCIe clocks, specifically
relating to the slots and oscillator. This ties the FPGA clocks in to the
clock framework and balances the refcounting similar to how the primary
on-chip clocks are managed. While the on-chip clocks are per-port, the
FPGA clock enable/disable is global for the entire block.
Paul Mundt [Wed, 13 Oct 2010 22:37:01 +0000 (07:37 +0900)]
sh: pci: Support slot 4 routing on SDK7786.
SDK7786 supports connecting either slot3 or 4 to the same PCIe port by
way of FPGA muxing. By default the vertical slot 3 on the baseboard is
enabled, so this adds in a command line option for forcibly enabling the
slot 4 edge connector.
If nothing has been specified on the command line, we fall back to
reading the resistor values for card presence to figure out where to
route the port to.
Paul Mundt [Wed, 13 Oct 2010 18:04:44 +0000 (03:04 +0900)]
sh: mach-sdk7786: Add support for fpga gpios.
The sdk7786 FPGA supports a number of user settable input switches that
are otherwise unused. This wires up a dummy gpio chip for the switch bank
to simply expose them to userspace.
Paul Mundt [Wed, 13 Oct 2010 08:52:14 +0000 (17:52 +0900)]
sh: perf: Set up perf_max_events.
Presently this is uninitialized in the architecture code, so it's
artificlally capped to the default initialization value. Set it up at
registration time.
Paul Mundt [Tue, 12 Oct 2010 22:17:03 +0000 (07:17 +0900)]
sh: perf: Support SH-X3 hardware counters.
The PMCAT location has conveniently moved on newer SH-X3 parts, special
case this for now with a note. This will probably want to be redone in a
less visually offensive way when/if more information becomes available.
Paul Mundt [Wed, 6 Oct 2010 17:57:39 +0000 (02:57 +0900)]
sh: Fix up the SH-3 build.
SH-3 lacks an MMUCR_TI definition for global TLB flushes. As SH-3 parts
lack a split TLB, the same global flush behaviour is accomplished
through the flush bit, which just happens to be the same as on SH-4.
Akinobu Mita [Tue, 5 Oct 2010 15:54:00 +0000 (15:54 +0000)]
sh: fix uninitialized spinlock
The spinlock in traps_64.c is used without initialization.
This fixes it by declaring DEFINE_SPINLOCK() and makes the spinlock static
variable.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: linux-sh@vger.kernel.org Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Paul Mundt [Tue, 5 Oct 2010 14:26:42 +0000 (23:26 +0900)]
sh: Wire up INTC subgroup splitting for SH7786 SCIF1.
SH7786 is the big user for subgroup splitting, mostly for the PCIe block,
but those will follow later. For now we simply split up SCIF1, as used by
the serial console on SDK7786 and others.
Paul Mundt [Tue, 5 Oct 2010 13:10:30 +0000 (22:10 +0900)]
sh: intc: Split up the INTC code.
This splits up the sh intc core in to something more vaguely resembling
a subsystem. Most of the functionality was alread fairly well
compartmentalized, and there were only a handful of interdependencies
that needed to be resolved in the process.
This also serves as future-proofing for the genirq and sparseirq rework,
which will make some of the split out functionality wholly generic,
allowing things to be killed off in place with minimal migration pain.
Paul Mundt [Tue, 5 Oct 2010 09:13:23 +0000 (18:13 +0900)]
sh: intc: Handle early lookups of subgroup IRQs.
If lookups happen while the radix node still points to a subgroup
mapping, an IRQ hasn't yet been made available for the specified id, so
error out accordingly. Once the slot is replaced with an IRQ mapping and
the tag is discarded, lookup can commence as normal.
Paul Mundt [Mon, 4 Oct 2010 19:47:03 +0000 (04:47 +0900)]
sh: intc: Support virtual mappings for IRQ subgroups.
Many interrupts that share a single mask source but are on different
hardware vectors will have an associated register tied to an INTEVT that
denotes the precise cause for the interrupt exception being triggered.
This introduces the concept of IRQ subgroups in the intc core, where
a virtual IRQ map is constructed for each of the pre-defined cause bits,
and a higher level chained handler takes control of the parent INTEVT.
This enables CPUs with heavily muxed IRQ vectors (especially across
disjoint blocks) to break things out in to a series of managed chained
handlers while being able to dynamically lookup and adopt the IRQs
created for them.
This is largely an opt-in interface, requiring CPUs to manually submit
IRQs for subgroup splitting, in addition to providing identifiers in
their enum maps that can be used for lazy lookup via the radix tree.
Paul Mundt [Thu, 23 Sep 2010 11:09:38 +0000 (20:09 +0900)]
sh: intc: Implement reverse mapping for IRQs to per-controller IDs.
This implements a scheme roughly analogous to the PowerPC virtual to
hardware IRQ mapping, which we use for IRQ to per-controller ID mapping.
This makes it possible for drivers to use the IDs directly for lookup
instead of hardcoding the vector.
The main motivation for this work is as a building block for dynamically
allocating virtual IRQs for demuxing INTC events sharing a single INTEVT
in addition to a common masking source.
Paul Mundt [Sun, 3 Oct 2010 20:15:20 +0000 (05:15 +0900)]
sh: pfc: Fix up BUG() triggered by gpiolib debugfs lookups.
The gpiolib debugfs entry takes a hammer approach and iterates over all
of the potential GPIOs, regardless of their type. The SH PFC code on the
other hand contains a variable mismash of input/output/function types
spread out sparsely, leading to situations where the debug code can
trigger an out of range enum for the type. Since we already have an error
path for out of range enums, we can just hand that up to the higher level
instead of the current BUG() behaviour.
Paul Mundt [Sun, 3 Oct 2010 18:54:56 +0000 (03:54 +0900)]
sh: pfc: support pinmux deregistration.
Presently the pinmux code is a one-way thing, but there's nothing
preventing an unregistration if no one has grabbed any of the pins.
This will permit us to save a bit of memory on systems that require pin
demux for certain peripherals in the case where registration of those
peripherals fails, or they are otherwise not attached to the system.
Paul Mundt [Sun, 3 Oct 2010 18:07:49 +0000 (03:07 +0900)]
sh: mach-x3proto: Improve ILSEL debugging.
At the moment ILSEL blows up with a BUG when aliased sets are handed in,
but as the enable call is able to hand back errors we opt for that path
instead. None of the ILSEL peripherals are vital to the board's
operation, so trapping a BUG is a bit excessive.
Paul Mundt [Sat, 2 Oct 2010 13:02:07 +0000 (22:02 +0900)]
sh: Support early IRQ vector map reservation for delayed controllers.
Some controllers will need to be initialized lazily due to pinmux
constraints, while others may simply have no need to be brought online if
there are no backing devices for them attached. In this case it's still
necessary to be able to reserve their hardware vector map before dynamic
IRQs get a hold of them.
Paul Mundt [Sat, 2 Oct 2010 10:43:40 +0000 (19:43 +0900)]
sh: Handle pinmux for SH-X3 proto IRQ/IRL modes.
The SH-X3 proto CPU has all of the external IRQ and IRL pins muxed, make
sure that we're able to grab them before attempting to register their
respective IRQ controllers.
Paul Mundt [Fri, 1 Oct 2010 15:43:43 +0000 (00:43 +0900)]
sh: Support userimask for all SH-X3 interrupt controllers.
This shuffles some of the shared bits out of the 7786 code and in to a
shared SH-X3 support file. Presently just for userimask, but also a good
place for the IRQ balancing wrappers.
Magnus Damm [Fri, 24 Sep 2010 09:05:38 +0000 (09:05 +0000)]
sh: boot kernel with SR.BL set
Update the SH kernel to keep SR.BL set until the VBR
register has been initialized. Useful to allow boot
of the kernel even though exceptions are pending.
Without this patch there is a window of time when
exceptions such as NMI are enabled but no exception
handlers are installed.
This patch modifies both the zImage loader and the
actual kernel to boot with BL=1, but the zImage
loader is modfied in such a way that the init_sr
value is unchanged to not break the zImage loader
provided by kexec.
Tested on sh7724 Ecovec and on the SH4AL-DSP core
included in sh7372.
Signed-off-by: Magnus Damm <damm@opensource.se> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
kfree() in clkdev_drop() function should actually be called with an address of
a struct clk_lookup_alloc object, and not struct clk_lookup, as presently done.
This just happens to work, because "struct clk_lookup cl" is the first
member in struct clk_lookup_alloc.
Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Paul Mundt [Thu, 23 Sep 2010 19:04:26 +0000 (04:04 +0900)]
sh: provide generic arch_debugfs_dir.
While sh previously had its own debugfs root, there now exists a
common arch_debugfs_dir prototype, so we switch everything over to
that. Presumably once more architectures start making use of this
we'll be able to just kill off the stub kdebugfs wrapper.
Paul Mundt [Mon, 20 Sep 2010 09:56:13 +0000 (18:56 +0900)]
sh: pci: Use a generic raw spinlock for PCI config access locking.
This copies the pci_config_lock idea from x86 over, allowing us to kill
off a couple of existing private locks. At the same time, these need to
be converted to raw spinlocks for -rt kernels, so we make that change at
the same time. This should make it easier for future parts to get the
locking right instead of inevitable ending up with lock type mismatches.
Paul Mundt [Mon, 20 Sep 2010 08:10:02 +0000 (17:10 +0900)]
sh: pci: Use I/O accessors consistently in SH7786 PCIe init code.
Some of the existing code is flipping between __raw_xxx() and
pci_{read,write}_reg(). As the latter are just wrappers for the former,
flip over to using them consistently.
Paul Mundt [Mon, 20 Sep 2010 07:12:58 +0000 (16:12 +0900)]
sh: pci: Support ports with disabled links on SH7786 PCIe.
Presently we error out if a link is disabled and simply drop the port
registration outright. This follows the PPC changes and simply reports on
the link state on boot, leaving the port registered, in order to more
easily deal with hotplug on future parts.
Paul Mundt [Mon, 20 Sep 2010 06:39:54 +0000 (15:39 +0900)]
sh: pci: Support root complex config accesses on SH7786 PCIe.
The SH7786 PCIe is presently unable to enumerate itself in root complex
mode, and has no visibility through either type 0 or type 1 accesses,
despite having a mostly sensible extended config space for each port.
Attempts to generate type 0 or type 1 config cycles result in completer
aborts, so we're ultimately forced to use SuperHyway transactions
instead.
As each port has a single port <-> device mapping that resolves for any
PCI_SLOT definition, we simply hijack devfn 0 for the SuperHyway
transaction and bump up the devfn limit.
With enumeration of the root complex now possible, we also need to insert
an early fixup to hide the BARs from the kernel. With all of that done,
it's now possible to use the pcieport services with all of the PCIe
ports, which is the first step to power management support.
Paul Mundt [Mon, 20 Sep 2010 06:37:25 +0000 (15:37 +0900)]
sh: pci: Move Renesas PCI IDs to a better place.
Previously these IDs were only used by one driver, so there was not much
need for having them generically defined. Now that this will no longer
hold true, move them over.
Paul Mundt [Sun, 19 Sep 2010 04:57:51 +0000 (13:57 +0900)]
sh: pci: Give SH7786 PHY some time to settle.
The spec suggests waiting up to 500ms for the PHY to settle before
testing link state, but practice shows that 100ms is sufficient (this is
the delay value we also use on the other SH-4A PCI controllers, too).
This makes device detection much more reliable, although in the future it
should be a bit faster to simply serialize with a TLP IRQ.
Paul Mundt [Thu, 16 Sep 2010 08:16:31 +0000 (17:16 +0900)]
usb: Fix up r8a66597-hcd section mismatches.
The _remove() routine is flagged __init_or_module, despite only being
used in a __devexit context. As the rest of the driver is already
balanced out with __devinit, switch to __devexit and __devexit_p()
wrapping.
Paul Mundt [Tue, 14 Sep 2010 08:43:11 +0000 (17:43 +0900)]
sh: Provide a non-multiplexed sys_recvmmsg path.
Now that the rest of the socket calls are provided through their own
paths, do the same for sys_recvmmsg. It's unlikely we'll ever be able to
kill off the socketcall path, but this at least permits userspace to
gradually begin migrating.
sh: Add syscall entries for non multiplexed socket calls
Linux kernel already has socket syscalls that can be invoked
without the multiplexing sys_socketcall wrapper.
C library wrappers are ready to use them directly. It needs just
to define the missing syscall numbers and provide the related entries
into the syscalls table, like sh64 aleady does.
Signed-off-by: Francesco Rundo <francesco.rundo@st.com> Signed-off-by: Carmelo Amoroso <carmelo.amoroso@st.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Matt Fleming [Sat, 21 Aug 2010 14:49:38 +0000 (14:49 +0000)]
sh: Set CONFIG_SYSFS_DEPRECATED_V2=n
As the help for the config option suggests, this option really shouldn't
be set by default for any recent distribution as it changes the layout
of sysfs. I spotted this while running debian when udev got very
confused by the sysfs layout and failed to create some device nodes.
Signed-off-by: Matt Fleming <matt@console-pimps.org> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Paul Mundt [Tue, 7 Sep 2010 08:07:05 +0000 (17:07 +0900)]
sh: Hook up 3rd memory window for all SH7786 PCIe channels.
Now that the resource assignment issues are resolved, we can finally wire
up the small third memory window -- in the future we may reclaim this for
MSI.
Paul Mundt [Tue, 7 Sep 2010 08:05:08 +0000 (17:05 +0900)]
sh: Properly wire up channel 2's I/O window on SH7786 PCIe.
An IORESOURCE_IO was missing here, which meant that we weren't properly
establishing the I/O window for this particular slot. With this
corrected, cards with I/O BARs have them actually assigned and
accessible.
Paul Mundt [Tue, 7 Sep 2010 08:03:10 +0000 (17:03 +0900)]
sh: Ignore 32-bit windows in 29-bit mode for SH7786 PCIe.
Certain memory windows are only available for 32-bit space, so skip over
these in 29-bit mode. This will severely restrict the amount of memory
that can be mapped, but since a boot loader bug makes booting in 29-bit
mode close to impossible anyways, everything is ok.
Paul Mundt [Tue, 7 Sep 2010 07:12:26 +0000 (16:12 +0900)]
sh: Establish a SuperHyway<->PCIe window mapping on SH7786 PCIe.
This bumps up the low address to match the physical memory windows for
SHway<->PCIe transfers. The previous implementation was banking on a 1:1
virt<->phys SHway mapping, which doesn't apply here.
Paul Mundt [Fri, 20 Aug 2010 11:26:41 +0000 (20:26 +0900)]
sh: Relax devfn constraints for SH7786 PCIe.
SH7786 PCIe has 1 slot per port, but no specific restriction on function.
Relax the devfn restriction and look to the slot number instead when
configured as a root complex.
Paul Mundt [Fri, 20 Aug 2010 10:10:38 +0000 (19:10 +0900)]
sh: reinstate clock framework rate rounding.
This was killed off by a simplification patch previously that failed to
take the cpufreq use case in to account, so reinstate the old bounding
logic. The lowest rate bounding on the other hand was broken in that it
never actually got assigned a rate and the best fit rate was instead just
getting lucky based on the ordering of the rate table, fix this up so the
code actually does what it was intended to do originally.
Paul Mundt [Fri, 20 Aug 2010 06:59:40 +0000 (15:59 +0900)]
sh: Support type 1 accesses for SH7786 PCI.
This enables support for type 1 config space accesses on the SH7786
PCI controller. At the same time, add in some extra sanity checks for
controller asserted errors.
Linus Torvalds [Wed, 18 Aug 2010 22:45:23 +0000 (15:45 -0700)]
Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Fix an Oops in the NFSv4 atomic open code
NFS: Fix the selection of security flavours in Kconfig
NFS: fix the return value of nfs_file_fsync()
rpcrdma: Fix SQ size calculation when memreg is FRMR
xprtrdma: Do not truncate iova_start values in frmr registrations.
nfs: Remove redundant NULL check upon kfree()
nfs: Add "lookupcache" to displayed mount options
NFS: allow close-to-open cache semantics to apply to root of NFS filesystem
SUNRPC: fix NFS client over TCP hangs due to packet loss (Bug 16494)
Linus Torvalds [Wed, 18 Aug 2010 22:29:38 +0000 (15:29 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
USB HID: Add ID for eGalax Multitouch used in JooJoo tablet
HID: hiddev: fix memory corruption due to invalid intfdata
HID: hiddev: protect against disconnect/NULL-dereference race
HID: picolcd: correct ordering of framebuffer freeing
HID: picolcd: testing the wrong variable
Missed the declaration of sys_execve in the ia64 asm/unistd.h (perhaps
because there is no reason for it to be there ... it might be a left over
from the COMPAT code?). Just delete the conflicting version.
Linus Torvalds [Wed, 18 Aug 2010 16:35:08 +0000 (09:35 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
fs: brlock vfsmount_lock
fs: scale files_lock
lglock: introduce special lglock and brlock spin locks
tty: fix fu_list abuse
fs: cleanup files_lock locking
fs: remove extra lookup in __lookup_hash
fs: fs_struct rwlock to spinlock
apparmor: use task path helpers
fs: dentry allocation consolidation
fs: fix do_lookup false negative
mbcache: Limit the maximum number of cache entries
hostfs ->follow_link() braino
hostfs: dumb (and usually harmless) tpyo - strncpy instead of strlcpy
remove SWRITE* I/O types
kill BH_Ordered flag
vfs: update ctime when changing the file's permission by setfacl
cramfs: only unlock new inodes
fix reiserfs_evict_inode end_writeback second call
Linus Torvalds [Wed, 18 Aug 2010 16:32:13 +0000 (09:32 -0700)]
Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf tools: Fix build on POSIX shells
latencytop: Fix kconfig dependency warnings
perf annotate tui: Fix exit and RIGHT keys handling
tracing: Sanitize value returned from write(trace_marker, "...", len)
tracing/events: Convert format output to seq_file
tracing: Extend recordmcount to better support Blackfin mcount
tracing: Fix ring_buffer_read_page reading out of page boundary
tracing: Fix an unallocated memory access in function_graph
Linus Torvalds [Wed, 18 Aug 2010 16:27:10 +0000 (09:27 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
m68knommu: include sched.h in ColdFire/SPI driver
m68knommu: formatting of pointers in printk()
m68knommu: arch/m68k/include/asm/ide.h fix for nommu
Linus Torvalds [Wed, 18 Aug 2010 16:26:42 +0000 (09:26 -0700)]
Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
md raid-1/10 Fix bio_rw bit manipulations again
md: provide appropriate return value for spare_active functions.
md: Notify sysfs when RAID1/5/10 disk is In_sync.
Update recovery_offset even when external metadata is used.
Linus Torvalds [Wed, 18 Aug 2010 16:26:17 +0000 (09:26 -0700)]
Merge branch 'merge-devicetree' of git://git.secretlab.ca/git/linux-2.6
* 'merge-devicetree' of git://git.secretlab.ca/git/linux-2.6:
spi.h: missing kernel-doc notation, please fix
of: fix missing headers for of_address_to_resource() in MTD and SysACE drivers
of: Fix missing includes
ata: update for of_device to platform_device replacement
microblaze: Fix of: eliminate of_device->node and dev_archdata->{of,prom}_node
microblaze: Fix of/address: Merge all of the bus translation code
booting-without-of: Remove nonexistent chapters from TOC, fix numbering
Jaroslav Kysela [Wed, 18 Aug 2010 12:08:17 +0000 (14:08 +0200)]
ALSA: emu10k1 - delay the PCM interrupts (add pcm_irq_delay parameter)
With some hardware combinations, the PCM interrupts are acknowledged
before the period boundary from the emu10k1 chip. The midlevel PCM code
gets confused and the playback stream is interrupted.
It seems that the interrupt processing shift by 2 samples is enough
to fix this issue. This default value does not harm other,
non-affected hardware.
More information: Kernel bugzilla bug#16300
[A copmile warning fixed by tiwai]
Signed-off-by: Jaroslav Kysela <perex@perex.cz> Cc: <stable@kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Nick Piggin [Tue, 17 Aug 2010 18:37:39 +0000 (04:37 +1000)]
fs: brlock vfsmount_lock
fs: brlock vfsmount_lock
Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.
A new lock is added for the mnt-id allocator, so it doesn't need to take
the heavy vfsmount write-lock.
The number of atomics should remain the same for fastpath rlock cases, though
code would be slightly slower due to per-cpu access. Scalability is not not be
much improved in common cases yet, due to other locks (ie. dcache_lock) getting
in the way. However path lookups crossing mountpoints should be one case where
scalability is improved (currently requiring the global lock).
The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
Altix system (high latency to remote nodes), a simple umount microbenchmark
(mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it
took 6.8s, afterwards took 7.1s, about 5% slower.
Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Nick Piggin [Tue, 17 Aug 2010 18:37:38 +0000 (04:37 +1000)]
fs: scale files_lock
fs: scale files_lock
Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).
One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on with a new
variale in the file struct (packed into a hole on 64-bit archs). Scalability
could suffer if files are frequently removed from different cpu's list.
However loads with frequent removal of files imply short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.
A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
degenerates to contending on a single lock, which is no worse than before. When
more than one CPU are allocating files, even if they are always freed by
different CPUs, there will be more parallelism than the single-lock case.
Testing results:
On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
to remove the file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.
So a file is removed from the same CPU it was added by over 90% of the time.
It remains within the same node 95% of the time.
Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.
throughput
2.6.34-rc2 24.5
+patch 24.9
us sys idle IO wait (in %)
2.6.34-rc2 51.25 28.25 17.25 3.25
+patch 53.75 18.5 19 8.75
So significantly less CPU time spent in kernel code, higher idle time and
slightly higher throughput.
Single threaded performance difference was within the noise of microbenchmarks.
That is not to say penalty does not exist, the code is larger and more memory
accesses required so it will be slightly slower.
Cc: linux-kernel@vger.kernel.org Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Nick Piggin [Tue, 17 Aug 2010 18:37:37 +0000 (04:37 +1000)]
lglock: introduce special lglock and brlock spin locks
lglock: introduce special lglock and brlock spin locks
This patch introduces "local-global" locks (lglocks). These can be used to:
- Provide fast exclusive access to per-CPU data, with exclusive access to
another CPU's data allowed but possibly subject to contention, and to provide
very slow exclusive access to all per-CPU data.
- Or to provide very fast and scalable read serialisation, and to provide
very slow exclusive serialisation of data (not necessarily per-CPU data).
Brlocks are also implemented as a short-hand notation for the latter use
case.
Thanks to Paul for local/global naming convention.
Cc: linux-kernel@vger.kernel.org Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Nick Piggin [Tue, 17 Aug 2010 18:37:36 +0000 (04:37 +1000)]
tty: fix fu_list abuse
tty: fix fu_list abuse
tty code abuses fu_list, which causes a bug in remount,ro handling.
If a tty device node is opened on a filesystem, then the last link to the inode
removed, the filesystem will be allowed to be remounted readonly. This is
because fs_may_remount_ro does not find the 0 link tty inode on the file sb
list (because the tty code incorrectly removed it to use for its own purpose).
This can result in a filesystem with errors after it is marked "clean".
Taking idea from Christoph's initial patch, allocate a tty private struct
at file->private_data and put our required list fields in there, linking
file and tty. This makes tty nodes behave the same way as other device nodes
and avoid meddling with the vfs, and avoids this bug.
The error handling is not trivial in the tty code, so for this bugfix, I take
the simple approach of using __GFP_NOFAIL and don't worry about memory errors.
This is not a problem because our allocator doesn't fail small allocs as a rule
anyway. So proper error handling is left as an exercise for tty hackers.
[ Arguably filesystem's device inode would ideally be divorced from the
driver's pseudo inode when it is opened, but in practice it's not clear whether
that will ever be worth implementing. ]
Cc: linux-kernel@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Nick Piggin [Tue, 17 Aug 2010 18:37:34 +0000 (04:37 +1000)]
fs: remove extra lookup in __lookup_hash
fs: remove extra lookup in __lookup_hash
Optimize lookup for create operations, where no dentry should often be
common-case. In cases where it is not, such as unlink, the added overhead
is much smaller than the removed.
Also, move comments about __d_lookup racyness to the __d_lookup call site.
d_lookup is intuitive; __d_lookup is what needs commenting. So in that same
vein, add kerneldoc comments to __d_lookup and clean up some of the comments:
- We are interested in how the RCU lookup works here, particularly with
renames. Make that explicit, and point to the document where it is explained
in more detail.
- RCU is pretty standard now, and macros make implementations pretty mindless.
If we want to know about RCU barrier details, we look in RCU code.
- Delete some boring legacy comments because we don't care much about how the
code used to work, more about the interesting parts of how it works now. So
comments about lazy LRU may be interesting, but would better be done in the
LRU or refcount management code.
Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Nick Piggin [Tue, 17 Aug 2010 18:37:33 +0000 (04:37 +1000)]
fs: fs_struct rwlock to spinlock
fs: fs_struct rwlock to spinlock
struct fs_struct.lock is an rwlock with the read-side used to protect root and
pwd members while taking references to them. Taking a reference to a path
typically requires just 2 atomic ops, so the critical section is very small.
Parallel read-side operations would have cacheline contention on the lock, the
dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a
real parallelism increase.
Replace it with a spinlock to avoid one or two atomic operations in typical
path lookup fastpath.
Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Nick Piggin [Tue, 17 Aug 2010 18:37:30 +0000 (04:37 +1000)]
fs: fix do_lookup false negative
fs: fix do_lookup false negative
In do_lookup, if we initially find no dentry, we take the directory i_mutex and
re-check the lookup. If we find a dentry there, then we revalidate it if
needed. However if that revalidate asks for the dentry to be invalidated, we
return -ENOENT from do_lookup. What should happen instead is an attempt to
allocate and lookup a new dentry.
This is probably not noticed because it is rare. It is only reached if a
concurrent create races in first (in which case, the dentry probably won't be
invalidated anyway), or if the racy __d_lookup has failed due to a
false-negative (which is very rare).
Fix this by removing code and have it use the normal reval path.
Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
mbcache: Limit the maximum number of cache entries
Limit the maximum number of mb_cache entries depending on the number of
hash buckets: if the only limit to the number of cache entries is the
available memory the hash chains can grow very long, taking a long time
to search.
At least partially solves https://bugzilla.lustre.org/show_bug.cgi?id=22771.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>