David Chinner [Sat, 10 Feb 2007 07:37:46 +0000 (18:37 +1100)]
[XFS] Don't use kmap in xfs_iozero.
kmap() is inefficient and does not scale well. kmap_atomic() is a better
choice. Use the generic wrapper function instead of open coding the
kmap-memset-dcache flush-kunmap stuff.
Eric Sandeen [Sat, 10 Feb 2007 07:37:33 +0000 (18:37 +1100)]
[XFS] Remove unused arguments from the XFS_BTREE_*_ADDR macros.
It makes it incrementally clearer to read the code when the top of a macro
spaghetti-pile only receives the 3 arguments it uses, rather than 2 extra
ones which are not used. Also when you start pulling this thread out of
the sweater (i.e. remove unused args from XFS_BTREE_*_ADDR), a couple
other third arms etc fall off too. If they're not used in the macro, then
they sometimes don't need to be passed to the function calling the macro
either, etc....
Patch provided by Eric Sandeen (sandeen@sandeen.net).
Eric Sandeen [Sat, 10 Feb 2007 07:37:28 +0000 (18:37 +1100)]
[XFS] Remove unused header files for MAC and CAP checking functionality.
xfs_mac.h and xfs_cap.h provide definitions and macros that aren't used
anywhere in XFS at all. They are left-overs from "to be implement at some
point in the future" functionality that Irix XFS has. If this
functionality ever goes into Linux, it will be provided at a different
layer, most likely through the security hooks in the kernel so we will
never need this functionality in XFS.
Patch provided by Eric Sandeen (sandeen@sandeen.net).
Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
Lachlan McIlroy [Sat, 10 Feb 2007 07:36:47 +0000 (18:36 +1100)]
[XFS] Fix callers of xfs_iozero() to zero the correct range.
The problem is the two callers of xfs_iozero() are rounding out the range
to be zeroed to the end of a fsb and in some cases this extends past the
new eof. The call to commit_write() in xfs_iozero() will cause the Linux
inode's file size to be set too high.
David Chinner [Sat, 10 Feb 2007 07:36:40 +0000 (18:36 +1100)]
[XFS] Ensure a frozen filesystem has a clean log before writing the dummy
record.
The current Linux XFS freeze code is a mess. We flush the metadata buffers
out while we are still allowing new transactions to start and then fail to
flush the dirty buffers back out before writing the unmount and dummy
records to the log.
This leads to problems when the frozen filesystem is used for snapshots -
we do log recovery on a readonly image and often it appears that the log
image in the snapshot is not correct. Hence we end up with hangs, oops and
mount failures when trying to mount a snapshot image that has been created
when the filesystem has not been correctly frozen.
To fix this, we need to move th metadata flush to after we wait for all
current transactions to complete in teh second stage of the freeze. This
means that when we write the final log records, the log should be clean
and recovery should never occur on a snapshot image created from a frozen
filesystem.
David Chinner [Sat, 10 Feb 2007 07:36:35 +0000 (18:36 +1100)]
[XFS] Fix sub-block zeroing for buffered writes into unwritten extents.
When writing less than a filesystem block of data into an unwritten extent
via buffered I/O, __xfs_get_blocks fails to set the buffer new flag. As a
result, the generic code will not zero either edge of the block resulting
in garbage being written to disk either side of the real data. Set the
buffer new state on bufferd writes to unwritten extents to ensure that
zeroing occurs.
Lachlan McIlroy [Sat, 10 Feb 2007 07:36:29 +0000 (18:36 +1100)]
[XFS] Re-initialize the per-cpu superblock counters after recovery.
After filesystem recovery the superblock is re-read to bring in any
changes. If the per-cpu superblock counters are not re-initialized from
the superblock then the next time the per-cpu counters are disabled they
might overwrite the global counter with a bogus value.
Signed-off-by: Kevin Jamieson <kjamieson@bycast.com> Signed-off-by: David Chatterton <chatz@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
David Chinner [Sat, 10 Feb 2007 07:36:17 +0000 (18:36 +1100)]
[XFS] Fix block reservation mechanism.
The block reservation mechanism has been broken since the per-cpu
superblock counters were introduced. Make the block reservation code work
with the per-cpu counters by syncing the counters, snapshotting the amount
of available space and then doing a modifcation of the counter state
according to the result. Continue in a loop until we either have no space
available or we reserve some space.
David Chinner [Sat, 10 Feb 2007 07:36:10 +0000 (18:36 +1100)]
[XFS] Make growfs work for amounts greater than 2TB
The free block modification code has a 32bit interface, limiting the size
the filesystem can be grown even on 64 bit machines. On 32 bit machines,
there are other 32bit variables in transaction structures and interfaces
that need to be expanded to allow this to work.
David Chinner [Sat, 10 Feb 2007 07:35:09 +0000 (18:35 +1100)]
[XFS] Reduction global superblock lock contention near ENOSPC.
The existing per-cpu superblock counter code uses the global superblock
spin lock when we approach ENOSPC for global synchronisation. On larger
machines than this code was originally tested on this can still get
catastrophic spinlock contention due increasing rebalance frequency near
ENOSPC.
By introducing a sleeping lock that is used to serialise balances and
modifications near ENOSPC we prevent contention from needlessly from
wasting the CPU time of potentially hundreds of CPUs.
To reduce the number of balances occuring, we separate the need rebalance
case from the slow allocate case. Now, a counter running dry will trigger
a rebalance during which counters are disabled. Any thread that sees a
disabled counter enters a different path where it waits on the new mutex.
When it gets the new mutex, it checks if the counter is disabled. If the
counter is disabled, then we _know_ that we have to use the global counter
and lock and it is safe to do so immediately. Otherwise, we drop the mutex
and go back to trying the per-cpu counters which we know were re-enabled.
David Chinner [Sat, 10 Feb 2007 07:34:56 +0000 (18:34 +1100)]
[XFS] Keep stack usage down for 4k stacks by using noinline.
gcc-4.1 and more recent aggressively inline static functions which
increases XFS stack usage by ~15% in critical paths. Prevent this from
occurring by adding noinline to the STATIC definition.
Also uninline some functions that are too large to be inlined and were
causing problems with CONFIG_FORCED_INLINING=y.
Finally, clean up all the different users of inline, __inline and
__inline__ and put them under one STATIC_INLINE macro. For debug kernels
the STATIC_INLINE macro uninlines those functions.
David Chinner [Sat, 10 Feb 2007 07:34:49 +0000 (18:34 +1100)]
[XFS] Current usage of buftarg flags is incorrect.
The {test,set,clear}_bit() operations take a bit index for the bit to
operate on. The XBT_* flags are defined as bit fields which is incorrect,
not to mention the way the bit fields are enumerated is broken too. This
was only working by chance.
Fix the definitions of the flags and make the code using them use the
{test,set,clear}_bit() operations correctly.
Lachlan McIlroy [Sat, 10 Feb 2007 07:34:38 +0000 (18:34 +1100)]
[XFS] Prevent buffer overrun in cmn_err().
The message buffer used by cmn_err() is only 256 bytes and some CXFS
messages were exceeding this length. Since we were using vsprintf() and
not checking for buffer overruns we were clobbering memory beyond the
buffer. The size of the buffer has been increased to 1024 bytes so we can
capture these larger messages and we are now using vsnprintf() to prevent
overrunning the buffer size.
David Chinner [Sat, 10 Feb 2007 07:32:29 +0000 (18:32 +1100)]
[XFS] Fix a synchronous buftarg flush deadlock when freezing.
At the last stage of a freeze, we flush the buftarg synchronously over and
over again until it succeeds twice without skipping any buffers.
The delwri list flush skips pinned buffers, but tries to flush all others.
It removes the buffers from the delwri list, then tries to lock them one
at a time as it traverses the list to issue the I/O. It holds them locked
until we issue all of the I/O and then unlocks them once we've waited for
it to complete.
The problem is that during a freeze, the filesystem may still be doing
stuff - like flushing delalloc data buffers - in the background and hence
we can be trying to lock buffers that were on the delwri list at the same
time. Hence we can get ABBA deadlocks between threads doing allocation and
the buftarg flush (freeze) thread.
Fix it by skipping locked (and pinned) buffers as we traverse the delwri
buffer list.
Greg Ungerer [Wed, 7 Feb 2007 02:03:08 +0000 (12:03 +1000)]
[PATCH] uclinux: correctly remap bin_fmtflat exe allocated mem regions
remap() the region we get from mmap() to mark the fact that we are
using all of the available slack space. Any slack space is used
to form a simple brk region, and potentially more stack space than
requested at load time.
Any searches of the vma chain may well fail looking for
stack (and especially arg) addresses if the remaping is not done.
The simplest example is /proc/<pid>/cmdline, since the args
are pretty much always at the top of the data/bss/stack region.
Linus Torvalds [Fri, 9 Feb 2007 18:25:38 +0000 (10:25 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6:
ieee1394: fix host device registering when nodemgr disabled
ieee1394: video1394: DMA fix
ieee1394: raw1394: prevent unloading of low-level driver
ieee1394: dv1394: tidy up card removal
ieee1394: dv1394: fix CardBus card ejection
ieee1394: sbp2: lower block queue alignment requirement
ieee1394: sbp2: remove bogus "emulated" host flag
ieee1394: save one word in struct hpsb_host
ieee1394: restore config ROM when resuming
ieee1394: ohci1394: drop pcmcia-cs compatibility code
ieee1394: nodemgr: check info_length in ROM header earlier
the scheduled IEEE1394_OUI_DB removal
the scheduled IEEE1394_EXPORT_FULL_API removal
ieee1394: sbp2: use a better wildcard for blacklist
Add PCI class ID for firewire OHCI controllers.
ieee1394: modified csr1212_key_id_type_map to support lisight
Roman Zippel [Thu, 8 Feb 2007 21:48:51 +0000 (22:48 +0100)]
[PATCH] kbuild: more Makefile cleanups
This adds the remaining changes which should have been part of the
review process.
- the define command is inappropriate (it's primarily for rule
definitions)
- execute commands in the current dir as all other commands
- .*.tmp (but not .*.null) files are also removed up by "make clean"
- printf has other side effects, just use "echo -e"
- proper quoting
- proper indentation
Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 9 Feb 2007 17:44:28 +0000 (09:44 -0800)]
Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-apm
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-apm:
[APM] SH: Convert to use shared APM emulation.
[APM] MIPS: Convert to use shared APM emulation.
[APM] ARM: Convert to use shared APM emulation.
[APM] Add shared version of APM emulation
Roland McGrath [Thu, 8 Feb 2007 22:20:41 +0000 (14:20 -0800)]
[PATCH] Add install_special_mapping
This patch adds a utility function install_special_mapping, for creating a
special vma using a fixed set of preallocated pages as backing, such as for a
vDSO. This consolidates some nearly identical code used for vDSO mapping
reimplemented for different architectures.
Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[PATCH] enable mouse button 2+3 emulation for x86 macs
As macbook/macbook pro's also have to live with a single mouse button the
following patch just enables the Macintosh device drivers menu in Kconfig +
adds the macintosh dir to the obj-* to make macbook* users happy (who use
exactly that since months....
Signed-off-by: Soeren Sonnenburg <kernel@nn7.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Dmitry Torokhov <dtor@mail.ru> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Neil Brown [Thu, 8 Feb 2007 22:20:37 +0000 (14:20 -0800)]
[PATCH] md: avoid possible BUG_ON in md bitmap handling
md/bitmap tracks how many active write requests are pending on blocks
associated with each bit in the bitmap, so that it knows when it can clear
the bit (when count hits zero).
The counter has 14 bits of space, so if there are ever more than 16383, we
cannot cope.
Currently the code just calles BUG_ON as "all" drivers have request queue
limits much smaller than this.
However is seems that some don't. Apparently some multipath configurations
can allow more than 16383 concurrent write requests.
So, in this unlikely situation, instead of calling BUG_ON we now wait
for the count to drop down a bit. This requires a new wait_queue_head,
some waiting code, and a wakeup call.
Tested by limiting the counter to 20 instead of 16383 (writes go a lot slower
in that case...).
Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NeilBrown [Thu, 8 Feb 2007 22:20:30 +0000 (14:20 -0800)]
[PATCH] knfsd: fix a race in closing NFSd connections
If you lose this race, it can iput a socket inode twice and you get a BUG
in fs/inode.c
When I added the option for user-space to close a socket, I added some
cruft to svc_delete_socket so that I could call that function when closing
a socket per user-space request.
This was the wrong thing to do. I should have just set SK_CLOSE and let
normal mechanisms do the work.
Not only wrong, but buggy. The locking is all wrong and it openned up a
race where-by a socket could be closed twice.
So this patch:
Introduces svc_close_socket which sets SK_CLOSE then either leave
the close up to a thread, or calls svc_delete_socket if it can
get SK_BUSY.
Adds a bias to sk_busy which is removed when SK_DEAD is set,
This avoid races around shutting down the socket.
Changes several 'spin_lock' to 'spin_lock_bh' where the _bh
was missing.
Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Neil Brown [Thu, 8 Feb 2007 22:20:29 +0000 (14:20 -0800)]
[PATCH] md: fix various bugs with aligned reads in RAID5
It is possible for raid5 to be sent a bio that is too big for an underlying
device. So if it is a READ that we pass stright down to a device, it will
fail and confuse RAID5.
So in 'chunk_aligned_read' we check that the bio fits within the parameters
for the target device and if it doesn't fit, fall back on reading through
the stripe cache and making lots of one-page requests.
Note that this is the earliest time we can check against the device because
earlier we don't have a lock on the device, so it could change underneath
us.
Also, the code for handling a retry through the cache when a read fails has
not been tested and was badly broken. This patch fixes that code.
Signed-off-by: Neil Brown <neilb@suse.de> Cc: "Kai" <epimetreus@fastmail.fm> Cc: <stable@suse.de> Cc: <org@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ken Chen [Thu, 8 Feb 2007 22:20:27 +0000 (14:20 -0800)]
[PATCH] hugetlb: preserve hugetlb pte dirty state
__unmap_hugepage_range() is buggy that it does not preserve dirty state of
huge_pte when unmapping hugepage range. It causes data corruption in the
event of dop_caches being used by sys admin. For example, an application
creates a hugetlb file, modify pages, then unmap it. While leaving the
hugetlb file alive, comes along sys admin doing a "echo 3 >
/proc/sys/vm/drop_caches".
drop_pagecache_sb() will happily free all pages that aren't marked dirty if
there are no active mapping. Later when application remaps the hugetlb
file back and all data are gone, triggering catastrophic flip over on
application.
Not only that, the internal resv_huge_pages count will also get all messed
up. Fix it up by marking page dirty appropriately.
Signed-off-by: Ken Chen <kenchen@google.com> Cc: "Nish Aravamudan" <nish.aravamudan@gmail.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: <stable@kernel.org> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a fix of regression, which triggered by ~2.6.16.
Patch with name ufs-directory-and-page-cache-from-blocks-to-pages.patch: in
additional to conversation from block to page cache mechanism added new
checks of directory integrity, one of them that directory entry do not
across directory chunks.
But some kinds of UFS: OpenStep UFS and Apple UFS (looks like these are the
same filesystems) have different directory chunk size, then common
UFSes(BSD and Solaris UFS).
So this patch adds ability to works with variable size of directory chunks,
and set it for ufstype=openstep to right size.
Atsushi Nemoto [Thu, 8 Feb 2007 22:20:24 +0000 (14:20 -0800)]
[PATCH] rtc-pcf8563: detect polarity of century bit automatically
The usage of the century bit was inverted on 2.6.19 following to PCF8563's
description, but it was not match to usage suggested by RTC8564's
datasheet. Anyway what MO_C=1 means can vary on each platform. This patch
is to detect its polarity in get_datetime routine. The default value of
c_polarity is 0 (MO_C=1 means 19xx) so that this patch does not change
current behavior even if get_datetime was not called before set_datetime.
Linus Torvalds [Fri, 9 Feb 2007 17:22:36 +0000 (09:22 -0800)]
Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-tc
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-tc:
[EISA] EISA registration with !CONFIG_EISA
[TC] pmagb-b-fb: Convert to the driver model
[TC] dec_esp: Driver model for the PMAZ-A
[TC] mips: pmag-ba-fb: Convert to the driver model
[TC] defxx: TURBOchannel support
[TC] TURBOchannel support for the DECstation
[TC] MIPS: TURBOchannel resources off-by-one fix
[TC] MIPS: TURBOchannel update to the driver model
Al Viro [Fri, 9 Feb 2007 16:38:30 +0000 (16:38 +0000)]
[PATCH] kill eth_io_copy_and_sum()
On all targets that sucker boils down to memcpy_fromio(sbk->data, from, len).
The function name is highly misguiding (it _never_ does any checksums), the
last argument is just a noise and simply expanding the call to memcpy_fromio()
gives shorter and more readable source. For a lot of reasons it has almost
no remaining users, so it's better to just outright kill it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ralf Baechle [Fri, 9 Feb 2007 17:08:57 +0000 (17:08 +0000)]
[APM] Add shared version of APM emulation
Currently ARM and MIPS both have nearly identical copies of the APM
emulation code in their arch code. Add yet another copy of it to
drivers char and make it selectable through SYS_SUPPORTS_APM_EMULATION.
This is a change for the EISA bus support to permit drivers to call
un/registration functions even if EISA support has not been enabled. This is
similar to what PCI (and now TC) does and reduces the need for #ifdef clutter.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>