David Gibson [Sun, 28 Oct 2007 21:20:34 +0000 (22:20 +0100)]
hugetlb: check for brk() entering a hugepage region
Unlike mmap(), the codepath for brk() creates a vma without first checking
that it doesn't touch a region exclusively reserved for hugepages. On
powerpc, this can allow it to create a normal page vma in a hugepage
region, causing oopses and other badness.
Add a test to prevent this. With this patch, brk() will simply fail if it
attempts to move the break into a hugepage reserved region.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Ken Chen [Sun, 28 Oct 2007 20:40:41 +0000 (21:40 +0100)]
[IA64] fix ia64 is_hugepage_only_range
fix is_hugepage_only_range() definition to be "overlaps"
instead of "within architectural restricted hugetlb address
range". Simplify the ia64 specific code that used to use
is_hugepage_only_range() to just check which region the
address is in.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Adam Litke [Fri, 19 Oct 2007 17:05:10 +0000 (19:05 +0200)]
Don't allow the stack to grow into hugetlb reserved regions (CVE-2007-3739)
When expanding the stack, we don't currently check if the VMA will cross
into an area of the address space that is reserved for hugetlb pages.
Subsequent faults on the expanded portion of such a VMA will confuse the
low-level MMU code, resulting in an OOPS. Check for this.
Signed-off-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Hugh Dickins [Fri, 19 Oct 2007 12:30:18 +0000 (14:30 +0200)]
hugetlb: fix prio_tree unit (CVE-2007-4133)
hugetlb_vmtruncate_list was misconverted to prio_tree: its prio_tree is in
units of PAGE_SIZE (PAGE_CACHE_SIZE) like any other, not HPAGE_SIZE (whereas
its radix_tree is kept in units of HPAGE_SIZE, otherwise slots would be
absurdly sparse).
At first I thought the error benign, just calling __unmap_hugepage_range on
more vmas than necessary; but on 32-bit machines, when the prio_tree is
searched correctly, it happens to ensure the v_offset calculation won't
overflow. As it stood, when truncating at or beyond 4GB, it was liable to
discard pages COWed from lower offsets; or even to clear pmd entries of
preceding vmas, triggering exit_mmap's BUG_ON(nr_ptes).
Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Arthur Othieno [Fri, 19 Oct 2007 00:04:58 +0000 (02:04 +0200)]
hugetlbfs: add Kconfig help text
In kernel bugzilla #6248 (http://bugzilla.kernel.org/show_bug.cgi?id=6248),
Adrian Bunk <bunk@stusta.de> notes that CONFIG_HUGETLBFS is missing Kconfig
help text.
Signed-off-by: Arthur Othieno <apgo@patchbomb.org> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Ken Chen [Thu, 18 Oct 2007 23:59:17 +0000 (01:59 +0200)]
x86: HUGETLBFS and DEBUG_PAGEALLOC are incompatible
DEBUG_PAGEALLOC is not compatible with hugetlb page support. That debug
option turns off PSE. Once it is turned off in CR4, the cpu will ignore
pse bit in the pmd and causing infinite page-not- present faults.
So disable DEBUG_PAGEALLOC if the user selected hugetlbfs.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Stephen Smalley [Thu, 18 Oct 2007 23:27:51 +0000 (01:27 +0200)]
SELinux: clear parent death signal on SID transitions
Clear parent death signal on SID transitions to prevent unauthorized
signaling between SIDs.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Eric Paris <eparis@parisplace.org> Signed-off-by: James Morris <jmorris@localhost.localdomain> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Ulrich Drepper [Thu, 18 Oct 2007 21:46:58 +0000 (23:46 +0200)]
make UML compile (FC6/x86-64)
I need this patch to get a UML kernel to compile. This is with the
kernel headers in FC6 which are automatically generated from the kernel
tree. Some headers are missing but those files don't need them. At
least it appears so since the resuling kernel works fine.
Tested on x86-64.
Signed-off-by: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Ilpo Järvinen [Thu, 18 Oct 2007 16:55:43 +0000 (18:55 +0200)]
[TCP]: Fix fastpath_cnt_hint when GSO skb is partially ACKed
When only GSO skb was partially ACKed, no hints are reset,
therefore fastpath_cnt_hint must be tweaked too or else it can
corrupt fackets_out. The corruption to occur, one must have
non-trivial ACK/SACK sequence, so this bug is not very often
that harmful. There's a fackets_out state reset in TCP because
fackets_out is known to be inaccurate and that fixes the issue
eventually anyway.
In case there was also at least one skb that got fully ACKed,
the fastpath_skb_hint is set to NULL which causes a recount for
fastpath_cnt_hint (the old value won't be accessed anymore),
thus it can safely be decremented without additional checking.
Reported by Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@kernel.org>
David S. Miller [Thu, 18 Oct 2007 16:48:42 +0000 (18:48 +0200)]
[SPARC64]: Fix bugs in SYSV IPC handling in 64-bit processes.
Thanks to Tom Callaway for the excellent bug report and
test case.
sys_ipc() has several problems, most to due with semaphore
call handling:
1) 'err' return should be a 'long'
2) "union semun" is passed in a register on 64-bit compared
to 32-bit which provides it on the stack and therefore
by reference
3) Second and third arguments to SEMCTL are swapped compared
to 32-bit.
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@kernel.org>
David S. Miller [Thu, 18 Oct 2007 16:47:05 +0000 (18:47 +0200)]
[NET]: Zero length write() on socket should not simply return 0.
This fixes kernel bugzilla #5731
It should generate an empty packet for datagram protocols when the
socket is connected, for one.
The check is doubly-wrong because all that a write() can be is a
sendmsg() call with a NULL msg_control and a single entry iovec. No
special semantics should be assigned to it, therefore the zero length
check should be removed entirely.
This matches the behavior of BSD and several other systems.
Alan Cox notes that SuSv3 says the behavior of a zero length write on
non-files is "unspecified", but that's kind of useless since BSD has
defined this behavior for a quarter century and BSD is essentially
what application folks code to.
Based upon a patch from Stephen Hemminger.
Adrian Bunk:
Backported to 2.6.16.
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@kernel.org>
[PKT_SCHED] cls_u32: error code isn't been propogated properly
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Ilpo Järvinen [Thu, 18 Oct 2007 15:56:27 +0000 (17:56 +0200)]
[PKT_SCHED] RED: Fix overflow in calculation of queue average
Overflow can occur very easily with 32 bits, e.g., with 1 second
us_idle is approx. 2^20, which leaves only 11-Wlog bits for queue
length. Since the EWMA exponent is typically around 9, queue
lengths larger than 2^2 cause overflow. Whether the affected
branch is taken when us_idle is as high as 1 second, depends on
Scell_log, but with rather reasonable configuration Scell_log is
large enough to cause p->Stab to have zero index, which always
results zero shift (typically also few other small indices result
in zero shift).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Eric Sandeen [Sat, 6 Oct 2007 22:52:10 +0000 (00:52 +0200)]
sysfs: store sysfs inode nrs in s_ino to avoid readdir oopses (CVE-2007-3104)
Backport of
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc1/2.6.22-rc1-mm1/broken-out/gregkh-driver-sysfs-allocate-inode-number-using-ida.patch
For regular files in sysfs, sysfs_readdir wants to traverse
sysfs_dirent->s_dentry->d_inode->i_ino to get to the inode number.
But, the dentry can be reclaimed under memory pressure, and there is
no synchronization with readdir. This patch follows Tejun's scheme of
allocating and storing an inode number in the new s_ino member of a
sysfs_dirent, when dirents are created, and retrieving it from there
for readdir, so that the pointer chain doesn't have to be traversed.
Tejun's upstream patch uses a new-ish "ida" allocator which brings
along some extra complexity; this -stable patch has a brain-dead
incrementing counter which does not guarantee uniqueness, but because
sysfs doesn't hash inodes as iunique expects, uniqueness wasn't
guaranteed today anyway.
Adrian Bunk:
Backported to 2.6.16.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Matt Mackall [Sat, 6 Oct 2007 22:27:53 +0000 (00:27 +0200)]
random: fix bound check ordering (CVE-2007-3105)
If root raised the default wakeup threshold over the size of the
output pool, the pool transfer function could overflow the stack with
RNG bytes, causing a DoS or potential privilege escalation.
(Bug reported by the PaX Team <pageexec@freemail.hu>)
Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Marcel Holtmann [Sat, 6 Oct 2007 22:03:26 +0000 (00:03 +0200)]
Reset current->pdeath_signal on SUID binary execution (CVE-2007-3848)
This fixes a vulnerability in the "parent process death signal"
implementation discoverd by Wojciech Purczynski of COSEINC PTE Ltd.
and iSEC Security Research.
Kumar Gala [Sat, 6 Oct 2007 21:36:26 +0000 (23:36 +0200)]
[POWERPC] Flush registers to proper task context
When we flush register state for FP, Altivec, or SPE in flush_*_to_thread
we need to respect the task_struct that the caller has passed to us.
Most cases we are called with current, however sometimes (ptrace) we may
be passed a different task_struct.
This showed up when using gdbserver debugging a simple program that used
floating point. When gdb tried to show the FP regs they all showed up as 0,
because the child's FP registers were never properly flushed to memory.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Adrian Bunk <bunk@kernel.org>
gcc gets unhappy when the MSTK_SET macro's u8 __val variable
is updated with &= ~0xff (MSTK_YEAR_MASK). Making the constant
unsigned fixes the problem.
[ I fixed up the sparc32 side as well -DaveM ]
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Nick Bowler [Sat, 6 Oct 2007 18:34:18 +0000 (20:34 +0200)]
[IPSEC] AH4: Update IPv4 options handling to conform to RFC 4302.
In testing our ESP/AH offload hardware, I discovered an issue with how
AH handles mutable fields in IPv4. RFC 4302 (AH) states the following
on the subject:
For IPv4, the entire option is viewed as a unit; so even
though the type and length fields within most options are immutable
in transit, if an option is classified as mutable, the entire option
is zeroed for ICV computation purposes.
The current implementation does not zero the type and length fields,
resulting in authentication failures when communicating with hosts
that do (i.e. FreeBSD).
I have tested record route and timestamp options (ping -R and ping -T)
on a small network involving Windows XP, FreeBSD 6.2, and Linux hosts,
with one router. In the presence of these options, the FreeBSD and
Linux hosts (with the patch or with the hardware) can communicate.
The Windows XP host simply fails to accept these packets with or
without the patch.
I have also been trying to test source routing options (using
traceroute -g), but haven't had much luck getting this option to work
*without* AH, let alone with.
Signed-off-by: Nick Bowler <nbowler@ellipticsemi.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Ilpo Järvinen [Thu, 30 Aug 2007 04:21:05 +0000 (06:21 +0200)]
TCP: Fix TCP handling of SACK in bidirectional flows
It's possible that new SACK blocks that should trigger new LOST
markings arrive with new data (which previously made is_dupack
false). In addition, I think this fixes a case where we get
a cumulative ACK with enough SACK blocks to trigger the fast
recovery (is_dupack would be false there too).
I'm not completely pleased with this solution because readability
of the code is somewhat questionable as 'is_dupack' in SACK case
is no longer about dupacks only but would mean something like
'lost_marker_work_todo' too... But because of Eifel stuff done
in CA_Recovery, the FLAG_DATA_SACKED check cannot be placed to
the if statement which seems attractive solution. Nevertheless,
I didn't like adding another variable just for that either... :-)
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@kernel.org>
[PPP]: Fix output buffer size in ppp_decompress_frame().
This patch addresses the issue with "osize too small" errors in mppe
encryption. The patch fixes the issue with wrong output buffer size
being passed to ppp decompression routine.
--------------------
As pointed out by Suresh Mahalingam, the issue addressed by
ppp-fix-osize-too-small-errors-when-decoding patch is not fully resolved yet.
The size of allocated output buffer is correct, however it size passed to
ppp->rcomp->decompress in ppp_generic.c if wrong. The patch fixes that.
--------------------
Signed-off-by: Konstantin Sharlaimov <konstantin.sharlaimov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@kernel.org>
[PPP]: Fix osize too small errors when decoding mppe.
The mppe_decompress() function required a buffer that is 1 byte too
small when receiving a message of mru size. This fixes buffer
allocation to prevent this from occurring.
Signed-off-by: Konstantin Sharlaimov <konstantin.sharlaimov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@kernel.org>
David S. Miller [Thu, 30 Aug 2007 04:14:37 +0000 (06:14 +0200)]
[MATH-EMU]: Fix underflow exception reporting.
The underflow exception cases were wrong.
This is one weird area of ieee1754 handling in that the underflow
behavior changes based upon whether underflow is enabled in the trap
enable mask of the FPU control register. As a specific case the Sparc
V9 manual gives us the following description:
--------------------
If UFM = 0: Underflow occurs if a nonzero result is tiny and a
loss of accuracy occurs. Tininess may be detected
before or after rounding. Loss of accuracy may be
either a denormalization loss or an inexact result.
If UFM = 1: Underflow occurs if a nonzero result is tiny.
Tininess may be detected before or after rounding.
--------------------
What this amounts to in the packing case is if we go subnormal,
we set underflow if any of the following are true:
1) rounding sets inexact
2) we ended up rounding back up to normal (this is the case where
we set the exponent to 1 and set the fraction to zero), this
should set inexact too
3) underflow is set in FPU control register trap-enable mask
The initially discovered example was "DBL_MIN / 16.0" which
incorrectly generated an underflow. It should not, unless underflow
is set in the trap-enable mask of the FPU csr.
Another example, "0x0.0000000000001p-1022 / 16.0", should signal both
inexact and underflow. The cpu implementations and ieee1754
literature is very clear about this. This is case #2 above.
However, if underflow is set in the trap enable mask, only underflow
should be set and reported as a trap. That is handled properly by the
prioritization logic in
Mark Fortescue [Thu, 30 Aug 2007 04:07:11 +0000 (06:07 +0200)]
[SPARC32]: Fix rounding errors in ndelay/udelay implementation.
__ndelay and __udelay have not been delayung >= specified time.
The problem with __ndelay has been tacked down to the rounding of the
multiplier constant. By changing this, delays > app 18us are correctly
calculated.
The problem with __udelay has also been tracked down to rounding issues.
Changing the multiplier constant (to match that used in sparc64) corrects
for large delays and adding in a rounding constant corrects for trunctaion
errors in the claculations.
Many short delays will return without looping. This is not an error as there
is the fixed delay of doing all the maths to calculate the loop count.
Signed-off-by: Mark Fortescue <mark@mtfhpc.demon.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Sparc optimized memset (arch/sparc/lib/memset.S) does not fill last
byte of the memory area, if area size is less than 8 bytes and start
address is not word (4-bytes) aligned.
Here is code chunk where bug located:
/* %o0 - memory address, %o1 - size, %g3 - value */
8:
add %o0, 1, %o0
subcc %o1, 1, %o1
bne,a 8b
stb %g3, [%o0 - 1]
This code should write byte every loop iteration, but last time delay
instruction stb is not executed because branch instruction sets
"annul" bit.
Signed-off-by: Alexander Shmelev <ashmelev@task.sun.mcst.ru> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Neil Brown [Thu, 23 Aug 2007 00:13:17 +0000 (02:13 +0200)]
md: avoid possible BUG_ON in md bitmap handling
md/bitmap tracks how many active write requests are pending on blocks
associated with each bit in the bitmap, so that it knows when it can clear
the bit (when count hits zero).
The counter has 14 bits of space, so if there are ever more than 16383, we
cannot cope.
Currently the code just calles BUG_ON as "all" drivers have request queue
limits much smaller than this.
However is seems that some don't. Apparently some multipath configurations
can allow more than 16383 concurrent write requests.
So, in this unlikely situation, instead of calling BUG_ON we now wait
for the count to drop down a bit. This requires a new wait_queue_head,
some waiting code, and a wakeup call.
Tested by limiting the counter to 20 instead of 16383 (writes go a lot slower
in that case...).
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Neil Brown [Wed, 22 Aug 2007 23:39:24 +0000 (01:39 +0200)]
md: fix a few problems with the interface (sysfs and ioctl) to md
While developing more functionality in mdadm I found some bugs in md...
- When we remove a device from an inactive array (write 'remove' to
the 'state' sysfs file - see 'state_store') would should not
update the superblock information - as we may not have
read and processed it all properly yet.
- initialise all raid_disk entries to '-1' else the 'slot sysfs file
will claim '0' for all devices in an array before the array is
started.
- all '\n' not to be present at the end of words written to
sysfs files
- when we use SET_ARRAY_INFO to set the md metadata version,
set the flag to say that there is persistant metadata.
- allow GET_BITMAP_FILE to be called on an array that hasn't
been started yet.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Neil Brown [Wed, 22 Aug 2007 23:38:42 +0000 (01:38 +0200)]
md: assorted md and raid1 one-liners
Fix few bugs that meant that:
- superblocks weren't alway written at exactly the right time (this
could show up if the array was not written to - writting to the array
causes lots of superblock updates and so hides these errors).
- restarting device recovery after a clean shutdown (version-1 metadata
only) didn't work as intended (or at all).
1/ Ensure superblock is updated when a new device is added.
2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync.
The body of this if takes one of two branches depending on whether
MD_RECOVERY_SYNC is set, so testing it in the clause of the if
is wrong.
3/ Flag superblock for updating after a resync/recovery finishes.
4/ If we find the neeed to restart a recovery in the middle (version-1
metadata only) make sure a full recovery (not just as guided by
bitmaps) does get done.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Neil Brown [Wed, 22 Aug 2007 23:06:28 +0000 (01:06 +0200)]
md: fix some small races in bitmap plugging in raid5
The comment gives more details, but I didn't quite have the sequencing write,
so there was room for races to leave bits unset in the on-disk bitmap for
short periods of time.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Neil Brown [Wed, 22 Aug 2007 23:02:57 +0000 (01:02 +0200)]
md: fix a plug/unplug race in raid5
When a device is unplugged, requests are moved from one or two (depending on
whether a bitmap is in use) queues to the main request queue.
So whenever requests are put on either of those queues, we should make sure
the raid5 array is 'plugged'. However we don't. We currently plug the raid5
queue just before putting requests on queues, so there is room for a race. If
something unplugs the queue at just the wrong time, requests will be left on
the queue and nothing will want to unplug them. Normally something else will
plug and unplug the queue fairly soon, but there is a risk that nothing will.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Neil Brown [Wed, 22 Aug 2007 22:57:45 +0000 (00:57 +0200)]
md: fix resync speed calculation for restarted resyncs
We introduced 'io_sectors' recently so we could count the sectors that causes
io during resync separate from sectors which didn't cause IO - there can be a
difference if a bitmap is being used to accelerate resync.
However when a speed is reported, we find the number of sectors processed
recently by subtracting an oldish io_sectors count from a current
'curr_resync' count. This is wrong because curr_resync counts all sectors,
not just io sectors.
So, add a field to mddev to store the curren io_sectors separately from
curr_resync, and use that in the calculations.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Neil Brown [Wed, 22 Aug 2007 22:56:48 +0000 (00:56 +0200)]
md: Allow re-add to work on array without bitmaps
When an array has a bitmap, a device can be removed and re-added and only
blocks changes since the removal (as recorded in the bitmap) will be resynced.
It should be possible to do a similar thing to arrays without bitmaps. i.e.
if a device is removed and re-added and *no* changes have been made in the
interim, then the add should not require a resync.
This patch allows that option. This means that when assembling an array one
device at a time (e.g. during device discovery) the array can be enabled
read-only as soon as enough devices are available, but extra devices can still
be added without causing a resync.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Neil Brown [Sat, 11 Aug 2007 23:09:29 +0000 (01:09 +0200)]
md/bitmap: tidy up i_writecount handling in md/bitmap
md/bitmap modifies i_writecount of a bitmap file to make sure that no-one else
writes to it. The reverting of the change is sometimes done twice, and there
is one error path where it is omitted.
This patch tidies that up.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Neil Brown [Sat, 11 Aug 2007 23:05:36 +0000 (01:05 +0200)]
md/bitmap: remove unnecessary page reference manipulations from md/bitmap code
md/bitmap gets a collection of pages representing the bitmap when it
initialises the bitmap, and puts all the references when discarding the
bitmap.
It also occasionally takes extra references without any good reason, and
sometimes drops them ... though it doesn't always drop them, which can result
in a memory leak.
This patch removes the unnecessary 'get_page' calls, and the corresponding
'put_page' calls.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Neil Brown [Sat, 11 Aug 2007 23:03:48 +0000 (01:03 +0200)]
md/bitmap: cleaner separation of page attribute handlers in md/bitmap
md/bitmap has some attributes per-page. Handling of these attributes in
largely abstracted in set_page_attr and clear_page_attr. However
get_page_attr exposes the format used to store them. So prior to changing
that format, introduce test_page_attr instead of get_page_attr, and make
appropriate usage changes.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Neil Brown [Sat, 11 Aug 2007 22:18:08 +0000 (00:18 +0200)]
md/bitmap: fix online removal of file-backed bitmaps
When "mdadm --grow /dev/mdX --bitmap=none" is used to remove a filebacked
bitmap, the bitmap was disconnected from the array, but the file wasn't closed
(until the array was stopped).
The file also wasn't closed if adding the bitmap file failed.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
Neil Brown [Sat, 11 Aug 2007 22:17:07 +0000 (00:17 +0200)]
md: Don't clear bits in bitmap when writing to one device fails during recovery
Currently a device failure during recovery leaves bits set in the bitmap.
This normally isn't a problem as the offending device will be rejected because
of errors. However if device re-adding is being used with non-persistent
bitmaps, this can be a problem.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>
James Bottomley [Sat, 11 Aug 2007 22:03:39 +0000 (00:03 +0200)]
[MCA] fix bus matching
There's a bug in the MCA bus matching algorithm in that it promotes from
signed short to int before comparing with the actual id and does sign
extension on anything > 0x7fff (which means that pos ids > 0x7fff never
get correctly matched).
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
[IPV6]: MSG_ERRQUEUE messages do not pass to connected raw sockets
Taken from http://bugzilla.kernel.org/show_bug.cgi?id=8747
Problem Description:
It is related to the possibility to obtain MSG_ERRQUEUE messages from the udp
and raw sockets, both connected and unconnected.
There is a little typo in net/ipv6/icmp.c code, which prevents such messages
to be delivered to the errqueue of the correspond raw socket, when the socket
is CONNECTED. The typo is due to swap of local/remote addresses.
Consider __raw_v6_lookup() function from net/ipv6/raw.c. When a raw socket is
looked up usual way, it is something like:
sk = __raw_v6_lookup(sk, nexthdr, daddr, saddr, IP6CB(skb)->iif);
where "daddr" is a destination address of the incoming packet (IOW our local
address), "saddr" is a source address of the incoming packet (the remote end).
But when the raw socket is looked up for some icmp error report, in
net/ipv6/icmp.c:icmpv6_notify() , daddr/saddr are obtained from the echoed
fragment of the "bad" packet, i.e. "daddr" is the original destination
address of that packet, "saddr" is our local address. Hence, for
icmpv6_notify() must use "saddr, daddr" in its arguments, not "daddr, saddr"
...
Steps to reproduce:
Create some raw socket, connect it to an address, and cause some error
situation: f.e. set ttl=1 where the remote address is more than 1 hop to reach.
Set IPV6_RECVERR .
Then send something and wait for the error (f.e. poll() with POLLERR|POLLIN).
You should receive "time exceeded" icmp message (because of "ttl=1"), but the
socket do not receive it.
If you do not connect your raw socket, you will receive MSG_ERRQUEUE
successfully. (The reason is that for unconnected socket there are no actual
checks for local/remote addresses).
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
Patrick McHardy [Sun, 22 Jul 2007 16:26:20 +0000 (18:26 +0200)]
[NET]: Fix gen_estimator timer removal race
As noticed by Jarek Poplawski <jarkao2@o2.pl>, the timer removal in
gen_kill_estimator races with the timer function rearming the timer.
Check whether the timer list is empty before rearming the timer
in the timer function to fix this.
Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Jarek Poplawski <jarkao2@o2.pl> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
SCTP: Add scope_id validation for link-local binds
SCTP currently permits users to bind to link-local addresses,
but doesn't verify that the scope id specified at bind matches
the interface that the address is configured on. It was report
that this can hang a system.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: Adrian Bunk <bunk@stusta.de>
Johannes Berg [Sun, 22 Jul 2007 16:11:42 +0000 (18:11 +0200)]
[NET] skbuff: remove export of static symbol
skb_clone_fraglist is static so it shouldn't be exported.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
[NETFILTER]: nf_conntrack: don't track locally generated special ICMP error
The conntrack assigned to locally generated ICMP error is usually the one
assigned to the original packet which has caused the error. But if
the original packet is handled as invalid by nf_conntrack, no conntrack
is assigned to the original packet. Then nf_ct_attach() cannot assign
any conntrack to the ICMP error packet. In that case the current
nf_conntrack_icmp assigns appropriate conntrack to it. But the current
code mistakes the direction of the packet. As a result, NAT code mistakes
the address to be mangled.
To fix the bug, this changes nf_conntrack_icmp not to assign conntrack
to such ICMP error. Actually no address is necessary to be mangled
in this case.
Albert Lee [Sun, 22 Jul 2007 16:05:44 +0000 (18:05 +0200)]
ide: clear bmdma status in ide_intr() for ICHx controllers (revised #4)
patch 1/2 (revised):
- Fix drive->waiting_for_dma to work with CDB-intr devices.
- Do the dma status clearing in ide_intr() and add a new
hwif->ide_dma_clear_irq for Intel ICHx controllers.
Revised per Alan, Sergei and Bart's advice.
Patch against 2.6.20-rc6. Tested ok on my ICH4 and pdc20275 adapters.
Please review/apply, thanks.
Signed-off-by: Albert Lee <albertcc@tw.ibm.com> Signed-off-by: Adrian Bunk <bunk@stusta.de>
fix deadlock in the 8139too driver: poll handlers should never forcibly
enable local interrupts, because they might be used by netpoll/printk
from IRQ context.
=================================
[ INFO: inconsistent lock state ]
2.6.19 #11
---------------------------------
inconsistent {softirq-on-W} -> {in-softirq-W} usage.
swapper/1 [HC0[0]:SC1[1]:HE1:SE0] takes:
(&npinfo->poll_lock){-+..}, at: [<c0350a41>] net_rx_action+0x64/0x1de
{softirq-on-W} state was registered at:
[<c0134c86>] mark_lock+0x5b/0x39c
[<c0135012>] mark_held_locks+0x4b/0x68
[<c01351e9>] trace_hardirqs_on+0x115/0x139
[<c02879e6>] rtl8139_poll+0x3d7/0x3f4
[<c035c85d>] netpoll_poll+0x82/0x32f
[<c035c775>] netpoll_send_skb+0xc9/0x12f
[<c035cdcc>] netpoll_send_udp+0x253/0x25b
[<c0288463>] write_msg+0x40/0x65
[<c011cead>] __call_console_drivers+0x45/0x51
[<c011cf16>] _call_console_drivers+0x5d/0x61
[<c011d4fb>] release_console_sem+0x11f/0x1d8
[<c011d7d7>] register_console+0x1ac/0x1b3
[<c02883f8>] init_netconsole+0x55/0x67
[<c010040c>] init+0x9a/0x24e
[<c01049cf>] kernel_thread_helper+0x7/0x10
[<ffffffff>] 0xffffffff
irq event stamp: 819992
hardirqs last enabled at (819992): [<c0350a16>] net_rx_action+0x39/0x1de
hardirqs last disabled at (819991): [<c0350b1e>] net_rx_action+0x141/0x1de
softirqs last enabled at (817552): [<c01214e4>] __do_softirq+0xa3/0xa8
softirqs last disabled at (819987): [<c0106051>] do_softirq+0x5b/0xc9
other info that might help us debug this:
no locks held by swapper/1.
Mark Glines [Sun, 22 Jul 2007 15:51:59 +0000 (17:51 +0200)]
[TCP]: Use default 32768-61000 outgoing port range in all cases.
This diff changes the default port range used for outgoing connections,
from "use 32768-61000 in most cases, but use N-4999 on small boxes
(where N is a multiple of 1024, depending on just *how* small the box
is)" to just "use 32768-61000 in all cases".
I don't believe there are any drawbacks to this change, and it keeps
outgoing connection ports farther away from the mess of
IANA-registered ports.
Signed-off-by: Mark Glines <mark@glines.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
sys_setsockopt() do not check properly timeout values for
SO_RCVTIMEO/SO_SNDTIMEO, for example it's possible to set negative timeout
values. POSIX do not defines behaviour for sys_setsockopt in case negative
timeouts, but requires that setsockopt() shall fail with -EDOM if the send and
receive timeout values are too big to fit into the timeout fields in the socket
structure.
In current implementation negative timeout can lead to error messages like
"schedule_timeout: wrong timeout value".
Proposed patch:
- checks tv_usec and returns -EDOM if it is wrong
- do not allows to set negative timeout values (sets 0 instead) and outputs
ratelimited information message about such attempts.
Signed-off-By: Vasily Averin <vvs@sw.ru> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
Jan Engelhardt [Sun, 22 Jul 2007 15:44:18 +0000 (17:44 +0200)]
[SPARC]: Linux always started with 9600 8N1
The Linux kernel ignored the PROM's serial settings (115200,n,8,1 in
my case). This was because mode_prop remained "ttyX-mode" (expected:
"ttya-mode") due to the constness of string literals when used with
"char *". Since there is no "ttyX-mode" property in the PROM, Linux
always used the default 9600.
[ Investigation of the suncore.s assembler reveals that gcc optimizied
away the stores, yet did not emit a warning, which is a pretty
anti-social thing to do and is the only reason this bug lived for
so long -DaveM ]
Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
Dave Jones [Sun, 22 Jul 2007 15:42:38 +0000 (17:42 +0200)]
[IPV4]: Correct rp_filter help text.
As mentioned in http://bugzilla.kernel.org/show_bug.cgi?id=5015
The helptext implies that this is on by default.
This may be true on some distros (Fedora/RHEL have it enabled
in /etc/sysctl.conf), but the kernel defaults to it off.
Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
When creating a new connection by sending an unknown chunk type, we don't
transition to a valid state, causing a NULL pointer dereference in
sctp_packet when accessing sctp_timeouts[SCTP_CONNTRACK_NONE].
Fix by don't creating new conntrack entry if initial state is invalid.
Noticed by Vilmos Nebehaj <vilmos.nebehaj@ramsys.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
Corey Mutter [Tue, 22 May 2007 23:01:53 +0000 (01:01 +0200)]
[IPV6]: Reverse sense of promisc tests in ip6_mc_input
Reverse the sense of the promiscuous-mode tests in ip6_mc_input().
Signed-off-by: Corey Mutter <crm-netdev@mutternet.com> Signed-off-by: David L Stevens <dlstevens@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
David L Stevens [Tue, 22 May 2007 22:55:49 +0000 (00:55 +0200)]
[IPV6]: Send ICMPv6 error on scope violations.
When an IPv6 router is forwarding a packet with a link-local scope source
address off-link, RFC 4007 requires it to send an ICMPv6 destination
unreachable with code 2 ("not neighbor"), but Linux doesn't. Fix below.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
Srinivas Aji [Tue, 22 May 2007 22:54:10 +0000 (00:54 +0200)]
[TCP]: zero out rx_opt in tcp_disconnect()
When the server drops its connection, NFS client reconnects using the
same socket after disconnecting. If the new connection's SYN,ACK
doesn't contain the TCP timestamp option and the old connection's did,
tp->tcp_header_len is recomputed assuming no timestamp header but
tp->rx_opt.tstamp_ok remains set. Then tcp_build_and_update_options()
adds in a timestamp option past the end of the allocated TCP header,
overwriting TCP data, or when the data is in skb_shinfo(skb)->frags[],
overwriting skb_shinfo(skb) causing a crash soon after. (The issue was
debugged from such a crash.)
Similarly, wscale_ok and sack_ok also get set based on the SYN,ACK
packet but not reset on disconnect, since they are zeroed out at
initialization. The patch zeroes out the entire tp->rx_opt struct in
tcp_disconnect() to avoid this sort of problem.
Signed-off-by: Srinivas Aji <Aji_Srinivas@emc.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
Sergei Shtylyov [Tue, 22 May 2007 22:43:37 +0000 (00:43 +0200)]
[NETPOLL]: Remove CONFIG_NETPOLL_RX
Get rid of the CONFIG_NETPOLL_RX option completely since all the
dependencies have been removed long ago...
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Acked-by: Jeff Garzik <jgarzik@pobox.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
Sergei Shtylyov [Tue, 22 May 2007 22:41:22 +0000 (00:41 +0200)]
[NETPOLL]: Fix TX queue overflow in trapped mode.
CONFIG_NETPOLL_TRAP causes the TX queue controls to be completely bypassed in
the netpoll's "trapped" mode which easily causes overflows in the drivers with
short TX queues (most notably, in 8139too with its 4-deep queue). So, make
this option more sensible by making it only bypass the TX softirq wakeup.
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Acked-by: Jeff Garzik <jgarzik@pobox.com> Acked-by: Tom Rini <trini@kernel.crashing.org> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
When network device's are renamed, the IPV6 snmp6 code
gets confused. It doesn't track name changes so it will OOPS
when network device's are removed.
The fix is trivial, just unregister/re-register in notify handler.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
Eric Sesterhenn [Tue, 22 May 2007 22:38:17 +0000 (00:38 +0200)]
[IPV6]: Fix slab corruption running ip6sic
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
Andrew Morton [Tue, 22 May 2007 21:50:21 +0000 (23:50 +0200)]
gcc-4.1.0 is bust
Keith says
Compiling 2.6.19-rc6 with gcc version 4.1.0 (SUSE Linux), wait_hpet_tick is
optimized away to a never ending loop and the kernel hangs on boot in timer
setup.
Andrew Hendry [Fri, 4 May 2007 22:00:25 +0000 (00:00 +0200)]
[X.25]: Add missing sock_put in x25_receive_data
__x25_find_socket does a sock_hold.
This adds a missing sock_put in x25_receive_data.
Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>