I'm using ide on 2.6.30.1 with xfs filesystem. I noticed a kernel memory
leak after writing lots of data, the kmalloc-96 slab cache keeps
growing. It seems the struct ide_cmd kmalloced by idedisk_prepare_flush
is never kfreed.
Signed-off-by: Maxime Bizon <mbizon@freebox.fr> Acked-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Simon Kirby <sim@netnation.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We can't call nfs_readdata_release()/nfs_writedata_release() without
first initialising and referencing args.context. Doing so inside
nfs_direct_read_schedule_segment()/nfs_direct_write_schedule_segment()
causes an Oops.
We should rather be calling nfs_readdata_free()/nfs_writedata_free() in
those cases.
Looking at the O_DIRECT code, the "struct nfs_direct_req" is already
referencing the nfs_open_context for us. Since the readdata and writedata
structures carry a reference to that, we can simplify things by getting rid
of the extra nfs_open_context references, so that we can replace all
instances of nfs_readdata_release()/nfs_writedata_release().
Robert Richter [Wed, 12 Aug 2009 15:59:52 +0000 (17:59 +0200)]
ring-buffer: Fix advance of reader in rb_buffer_peek()
Backport for 2.6.30-stable of:
469535a ring-buffer: Fix advance of reader in rb_buffer_peek()
When calling rb_buffer_peek() from ring_buffer_consume() and a
padding event is returned, the function rb_advance_reader() is
called twice. This may lead to missing samples or under high
workloads to the warning below. This patch fixes this. If a padding
event is returned by rb_buffer_peek() it will be consumed by the
calling function now.
Also, I simplified some code in ring_buffer_consume().
kernel_sendpage() does the proper default case handling for when the
socket doesn't have a native sendpage implementation.
Now, arguably this might be something that we could instead solve by
just specifying that all protocols should do it themselves at the
protocol level, but we really only care about the common protocols.
Does anybody really care about sendpage on something like Appletalk? Not
likely.
Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Julien TINNES <julien@cr0.org> Acked-by: Tavis Ormandy <taviso@sdf.lonestar.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The problem is minor, but without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.
Now we do not need to re-check task->mm after ptrace_may_access(), it
can't be changed to the new mm under us.
Strictly speaking, this also fixes another very minor problem. Unless
security check fails or the task exits mm_for_maps() should never
return NULL, the caller should get either old or new ->mm.
mm_for_maps() takes ->mmap_sem after security checks, this looks
strange and obfuscates the locking rules. Move this lock to its
single caller, m_start().
It would be nice to kill __ptrace_may_access(). It requires task_lock(),
but this lock is only needed to read mm->flags in the middle.
Convert mm_for_maps() to use ptrace_may_access(), this also simplifies
the code a little bit.
Also, we do not need to take ->mmap_sem in advance. In fact I think
mm_for_maps() should not play with ->mmap_sem at all, the caller should
take this lock.
With or without this patch, without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.
With CONFIG_STACK_PROTECTOR turned on, VMI doesn't boot with
more than one processor. The problem is with the gs value not
being initialized correctly when registering the secondary
processor for VMI's case.
The patch below initializes the gs value for the AP to
__KERNEL_STACK_CANARY. Without this the secondary processor
keeps on taking a GP on every gs access.
access_ok() checks must be done on every part of the userspace structure
that is accessed. If access_ok() on one part of the struct succeeded, it
does not imply it will succeed on other parts of the struct. (Does
depend on the architecture implementation of access_ok()).
This changes the __get_user() users to first check access_ok() on the
data structure.
Signed-off-by: Michael Buesch <mb@bu3sch.de> Cc: Pete Zaitcev <zaitcev@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch (as1272) changes the error code returned when an open call
for a USB device node fails to locate the corresponding device. The
appropriate error code is -ENODEV, not -ENOENT.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> CC: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Attached patch adds USB vendor and product IDs for Bayer's USB to serial
converter cable used by Bayer blood glucose meters. It seems to be a
FT232RL based device and works without any problem with ftdi_sio driver
when this patch is applied. See: http://winglucofacts.com/cables/
Signed-off-by: Marko Hänninen <bugitus@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The FIEMAP_IOC_FIEMAP mapping ioctl was missing a 32-bit compat handler,
which means that 32-bit suerspace on 64-bit kernels cannot use this ioctl
command.
The structure is nicely aligned, padded, and sized, so it is just this
simple.
Tested w/ 32-bit ioctl tester (from Josef) on a 64-bit kernel on ext4.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Cc: <linux-ext4@vger.kernel.org> Cc: Mark Lord <lkml@rtr.ca> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Josef Bacik <josef@redhat.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
While looking at Jens Rosenboom bug report
(http://lkml.org/lkml/2009/7/27/35) about strange sys_futex call done from
a dying "ps" program, we found following problem.
clone() syscall has special support for TID of created threads. This
support includes two features.
One (CLONE_CHILD_SETTID) is to set an integer into user memory with the
TID value.
One (CLONE_CHILD_CLEARTID) is to clear this same integer once the created
thread dies.
The integer location is a user provided pointer, provided at clone()
time.
kernel keeps this pointer value into current->clear_child_tid.
At execve() time, we should make sure kernel doesnt keep this user
provided pointer, as full user memory is replaced by a new one.
As glibc fork() actually uses clone() syscall with CLONE_CHILD_SETTID and
CLONE_CHILD_CLEARTID set, chances are high that we might corrupt user
memory in forked processes.
Following sequence could happen:
1) bash (or any program) starts a new process, by a fork() call that
glibc maps to a clone( ... CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID
...) syscall
2) When new process starts, its current->clear_child_tid is set to a
location that has a meaning only in bash (or initial program) context
(&THREAD_SELF->tid)
3) This new process does the execve() syscall to start a new program.
current->clear_child_tid is left unchanged (a non NULL value)
4) If this new program creates some threads, and initial thread exits,
kernel will attempt to clear the integer pointed by
current->clear_child_tid from mm_release() :
/*
* We don't check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
<< here >> put_user(0, tidptr);
sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL, 0);
}
5) OR : if new program is not multi-threaded, but spied by /proc/pid
users (ps command for example), mm_users > 1, and the exiting program
could corrupt 4 bytes in a persistent memory area (shm or memory mapped
file)
If current->clear_child_tid points to a writeable portion of memory of the
new program, kernel happily and silently corrupts 4 bytes of memory, with
unexpected effects.
Fix is straightforward and should not break any sane program.
The v1.x metadata does not have a fixed size and can grow
when devices are added.
If it grows enough to require an extra sector of storage,
we need to update the 'sb_size' to match.
Without this, md can write out an incomplete superblock with a
bad checksum, which will be rejected when trying to re-assemble
the array.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Parentheses are required or the comparison occurs before the bitand.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The WAKE_MCAST bit is tested twice, the first should be WAKE_UCAST.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Cc: Jie Yang <jie.yang@atheros.com> Cc: Jay Cliburn <jcliburn@gmail.com> Cc: Chris Snook <csnook@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Increase the command ORB data structure to transport up to 16 bytes long
CDBs (instead of 12 bytes), and tell the SCSI mid layer about it. This
is notably necessary for READ CAPACITY(16) and friends, i.e. support of
large disks.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Increase the command ORB data structure to transport up to 16 bytes long
CDBs (instead of 12 bytes), and tell the SCSI mid layer about it. This
is notably necessary for READ CAPACITY(16) and friends, i.e. support of
large disks.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
I've tested TSL2550 driver and I've found a bug: when light is off,
returned value from tsl2550_calculate_lux function is -1 when it should
be 0 (sensor correctly read that light was off).
I think the bug is that a zero c0 value (approximated value of ch0) is
misinterpreted as an error.
Signed-off-by: Michele Jr De Candia <michele.decandia@valueteam.com> Acked-by: Rodolfo Giometti <giometti@linux.it> Signed-off-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There are some broken devices that report multiple DMA xfer modes
enabled at once (ATA spec doesn't allow it) but otherwise work fine
with DMA so just delete ide_id_dma_bug().
[ As discovered by detective work by Frans and Bart, due to how
handling of the ID block was handled before commit c419993
("ide-iops: only clear DMA words on setting DMA mode") this
check was always seeing zeros in the fields or other similar
garbage. Therefore this check wasn't actually checking anything.
Now that the tests actually check the real bits, all we see are
devices that trigger the check yet work perfectly fine, therefore
killing this useless check is the best thing to do. -DaveM ]
Reported-by: Frans Pop <elendil@planet.nl> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add ide_host_enable_irqs() helper and use it in ide_host_register()
before registering ports. Then remove no longer needed IRQ unmasking
from in init_irq().
This should fix the problem with "screaming" shared IRQ on the first
port (after request_irq() call while we have the unexpected IRQ pending
on the second port) which was uncovered by my rework of the serialized
interfaces support.
Reported-by: Frans Pop <elendil@planet.nl> Tested-by: Frans Pop <elendil@planet.nl> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When an array is changed from RAID6 to RAID5, fewer drives are
needed. So any device that is made superfluous by the level
conversion must be marked as not-active.
For the RAID6->RAID5 conversion, this will be a drive which only
has 'Q' blocks on it.
Changeset 3869c4aa18835c8c61b44bd0f3ace36e9d3b5bd0
that went in after 2.6.30-rc1 was a seemingly small change to _set_memory_wc()
to make it complaint with SDM requirements. But, introduced a nasty bug, which
can result in crash and/or strange corruptions when set_memory_wc is used.
One such crash reported here
http://lkml.org/lkml/2009/7/30/94
Actually, that changeset introduced two bugs.
* change_page_attr_set() takes &addr as first argument and can the addr value
might have changed on return, even for single page change_page_attr_set()
call. That will make the second change_page_attr_set() in this routine
operate on unrelated addr, that can eventually cause strange corruptions
and bad page state crash.
* The second change_page_attr_set() call, before setting _PAGE_CACHE_WC, should
clear the earlier _PAGE_CACHE_UC_MINUS, as otherwise cache attribute will not
be WC (will be UC instead).
The patch below fixes both these problems. Sending a single patch to fix both
the problems, as the change is to the same line of code. The change to have a
addr_copy is not very clean. But, it is simpler than making more changes
through various routines in pageattr.c.
A huge thanks to Jerome for reporting this problem and providing a simple test
case that helped us root cause the problem.
Reported-by: Jerome Glisse <glisse@freedesktop.org> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20090730214319.GA1889@linux-os.sc.intel.com> Acked-by: Dave Airlie <airlied@redhat.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
11static inline unsigned long native_save_fl(void)
12{
13 unsigned long flags;
14
15 asm volatile("# __raw_save_flags\n\t"
16 "pushf ; pop %0"
17 : "=g" (flags)
18 : /* no input */
19 : "memory");
20
21 return flags;
22}
If gcc chooses to put flags on the stack, for instance because this is
inlined into a larger function with more register pressure, the offset
of the flags variable from the stack pointer will change when the
pushf is performed. gcc doesn't attempt to understand that fact, and
address used for pop will still be the same. It will write to
somewhere near flags on the stack but not actually into it and
overwrite some other value.
I saw this happen in the ide_device_add_all function when running in a
simulator I work on. I'm assuming that some quirk of how the simulated
hardware is set up caused the code path this is on to be executed when
it normally wouldn't.
A simple fix might be to change "=g" to "=r".
Reported-by: Gabe Black <spamforgabe@umich.edu> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The code was incorrectly reserving memtypes using the page
virtual address instead of the physical address. Furthermore,
the code was not ignoring highmem pages as it ought to.
( upstream does not pass in highmem pages yet - but upcoming
graphics code will do it and there's no reason to not handle
this properly in the CPA APIs.)
Narayanan reports "The regression is around 15%. There is no disk controller
as our setup is based on Samsung OneNAND used as a memory mapped device on a
OMAP2430 based board."
The page allocator tries to preserve contiguous PFN ordering when returning
pages such that repeated callers to the allocator have a strong chance of
getting physically contiguous pages, particularly when external fragmentation
is low. However, of the bulk of the allocations have __GFP_COLD set as they
are due to aio_read() for example, then the PFNs are in reverse PFN order.
This can cause performance degration when used with IO controllers that could
have merged the requests.
This patch attempts to preserve the contiguous ordering of PFNs for users of
__GFP_COLD.
This is because hugetlb_unreserve_pages() is unconditionally removing
blocks_per_huge_page(h) on each call rather than using the freed amount.
If there were 0 blocks, it goes negative, resulting in the above.
usb0 and usb1 mux settings in the sicrl register were swapped (twice!)
in mpc834x_usb_cfg(), leading to various strange issues with fsl-ehci
and full speed devices.
The USB port config on mpc834x is done using 2 muxes: Port 0 is always
used for MPH port 0, and port 1 can either be used for MPH port 1 or DR
(unless DR uses UTMI phy or OTG, then it uses both ports) - See 8349 RM
figure 1-4..
mpc8349_usb_cfg() had this inverted for the DR, and it also had the bit
positions of the usb0 / usb1 mux settings swapped. It would basically
work if you specified port1 instead of port0 for the MPH controller (and
happened to use ULPI phys), which is what all the 834x dts have done,
even though that configuration is physically invalid.
Instead fix mpc8349_usb_cfg() and adjust the dts files to match reality.
Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk> Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This fixes regression (battery "vanishing" on resume) introduced by
commit d0c71fe7ebc180f1b7bc7da1d39a07fc19eec768 ("ACPI Suspend: Enable
ACPI during resume if SCI_EN is not set") and also the issue with
the "screaming" IRQ 9.
Reported-and-tested-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Acked-by: Len Brown <lenb@kernel.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
These pointers can be NULL, the is_mesh() case isn't
ever hit in the current kernel, but cmp_ies() can be
hit under certain conditions.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
loff_t is a signed type. If userspace passes a negative ppos, the "count"
range check is weakened. "count"s bigger than HPEE_MAX_LENGTH will pass the check.
Also, if ppos is negative, the readb(eisa_eeprom_addr + *ppos) will poke in random
memory.
About a half events are missing when we splice_read
from trace_pipe. They are unexpectedly consumed because we ignore
the TRACE_TYPE_NO_CONSUME return value used by the function graph
tracer when it needs to consume the events by itself to walk on
the ring buffer.
The same problem appears with ftrace_dump()
Example of an output before this patch:
1) | ktime_get_real() {
1) 2.846 us | read_hpet();
1) 4.558 us | }
1) 6.195 us | }
After this patch:
0) | ktime_get_real() {
0) | getnstimeofday() {
0) 1.960 us | read_hpet();
0) 3.597 us | }
0) 5.196 us | }
The fix also applies on 2.6.30
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4A6EEC52.90704@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When print_graph_entry() computes a function call entry event, it needs
to also check the next entry to guess if it matches the return event of
the current function entry.
In order to look at this next event, it needs to consume the current
entry before going ahead in the ring buffer.
However, if the current event that gets consumed is the last one in the
ring buffer head page, the ring_buffer may reuse the page for writers.
The consumed entry will then become invalid because of possible
racy overwriting.
Me must then handle this entry by making a copy of it.
The fix also applies on 2.6.30
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4A6EEAEC.3050508@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Andrea Gelmini gave me a report that a kernel oops hit on a nilfs
filesystem with a 1KB block size when doing rsync.
This turned out to be caused by an inconsistency of dirty state
between a page and its buffers storing b-tree node blocks.
If the page had multiple buffers split over multiple logs, and if the
logs were written at a time, a dirty flag remained in the page even
every dirty flag in the buffers was cleared.
This will fix the failure by dropping the dirty flag properly for
pages with the discrete multiple b-tree nodes.
Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Andrea Gelmini <andrea.gelmini@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Currently, the ThinkPad-ACPI bay and dock drivers are completely
broken, and cause a NULL pointer derreference in kernel mode (and,
therefore, an OOPS) when they try to issue events (i.e. on dock,
undock, bay ejection, etc).
OTOH, the standard ACPI dock driver can handle the hotplug bays and
docks of the ThinkPads just fine (including batteries) as of 2.6.27.
In fact, it does a much better job of it than thinkpad-acpi ever did.
It is just not worth the hassle to find a way to fix this crap without
breaking the (deprecated) thinkpad-acpi dock/bay ABI. This is old,
deprecated code that sees little testing or use.
As a quick fix suitable for -stable backports, mark the thinkpad-acpi
bay and dock subdrivers as BROKEN in Kconfig. The dead code will be
removed by a later patch.
This fixes bugzilla #13669, and should be applied to 2.6.27 and later.
Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Reported-by: Joerg Platte <jplatte@naasa.net> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If the referral is malformed or the hostname can't be resolved, then
the current code generates an oops. Fix it to handle these errors
gracefully.
Reported-by: Sandro Mathys <sm@sandro-mathys.ch> Acked-by: Igor Mammedov <niallain@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There's a hotplug problem in the way libsas allocates ports: it loops over the
available ports first trying to add to an existing for a wide port and
otherwise allocating the next free port. This scheme only works if the port
array is packed from zero, which fails if a port gets hot unplugged and the
array becomes sparse. In that case, a new port is formed even if there's a
wide port it should be part of. Fix this by creating two loops over all the
ports: the first to see if the phy should be part of a wide port and the
second to form a new port in an empty port slot.
Signed-off-by: Tom Peng <tom_peng@usish.com> Signed-off-by: Jack Wang <jack_wang@usish.com> Signed-off-by: Lindar Liu <lindar_liu@usish.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Requests to get max LUN, for certain USB storage devices, require a
longer timeout before a correct reply is returned. This happens for a
Realtek USB Card Reader (0bda:0152), which has a max LUN of 3 but is set
to 0, thus losing functionality, because of the timeout occurring too
quickly.
Raising the timeout value fixes the issue and might help other devices
to return a correct max LUN value as well.
Update directory hardlink count when moving kobjects to a new parent.
Fixes the following problem which occurs when several devices are
moved to the same parent and then unregistered:
Unitialized fence register could leads to corrupted display. Problem
encountered on MacBooks (revision 1 and 2), directly booting from EFI
or through BIOS emulation.
(bug #21710 at freedestop.org)
Signed-off-by: Grégoire Henry <henry@pps.jussieu.fr> Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
With the DRM-driven DPMS code, encoders are considered idle unless a
connector is hooked to them, so mode setting is skipped. This makes load
detection fail as none of the hardware is enabled.
Signed-off-by: Keith Packard <keithp@keithp.com> Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix a FIXME in the intel LVDS bring-up code, adding the appropriate
blacklist entry for the AOpen Mini PC, courtesy of a dmidecode
dump from Florian Demmer.
Signed-off-by: Jarod Wilson <jarod@redhat.com> CC: Florian Demmer <florian@demmer.org> Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
As transparent proxying looks up the socket early and assigns
it to the skb for later processing, we must drop any existing
socket ownership prior to that in order to distinguish between
the case where tproxy is active and where it is not.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
In order to get the tun driver to account packets, we need to be
able to receive packets with destructors set. To be on the safe
side, I added an skb_orphan call for all protocols by default since
some of them (IP in particular) cannot handle receiving packets
destructors properly.
Now it seems that at least one protocol (CAN) expects to be able
to pass skb->sk through the rx path without getting clobbered.
So this patch attempts to fix this properly by moving the skb_orphan
call to where it's actually needed. In particular, I've added it
to skb_set_owner_[rw] which is what most users of skb->destructor
call.
This is actually an improvement for tun too since it means that
we only give back the amount charged to the socket when the skb
is passed to another socket that will also be charged accordingly.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Oliver Hartkopp <olver@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
As it stands skb fraglists can get past the check in dev_queue_xmit
if the skb is marked as GSO. In particular, if the packet doesn't
have the proper checksums for GSO, but can otherwise be handled by
the underlying device, we will not perform the fraglist check on it
at all.
If the underlying device cannot handle fraglists, then this will
break.
The fix is as simple as moving the fraglist check from the device
check into skb_gso_ok.
This has caused crashes with Xen when used together with GRO which
can generate GSO packets with fraglists.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When NAPI is disabled while we're in net_rx_action, we end up
calling __napi_complete without flushing GRO packets. This is
a bug as it would cause the GRO packets to linger, of course it
also literally BUGs to catch error like this :)
This patch changes it to napi_complete, with the obligatory IRQ
reenabling. This should be safe because we've only just disabled
IRQs and it does not materially affect the test conditions in
between.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit e912b1142be8f1e2c71c71001dc992c6e5eb2ec1
(net: sk_prot_alloc() should not blindly overwrite memory)
took care of not zeroing whole new socket at allocation time.
sock_copy() is another spot where we should be very careful.
We should not set refcnt to a non null value, until
we are sure other fields are correctly setup, or
a lockless reader could catch this socket by mistake,
while not fully (re)initialized.
This patch puts sk_node & sk_refcnt to the very beginning
of struct sock to ease sock_copy() & sk_prot_alloc() job.
We add appropriate smp_wmb() before sk_refcnt initializations
to match our RCU requirements (changes to sock keys should
be committed to memory before sk_refcnt setting)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code correctness
depends on sk->sk_nulls_node.next being always valid. A NULL
value is not allowed as it might fault a lockless reader.
Current sk_prot_alloc() implementation doesnt respect this hypothesis,
calling kmem_cache_alloc() with __GFP_ZERO. Just call memset() around
the forbidden field.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The host-side CDC subset driver is binding more specifically
than it should ... only to PXA 210/25x/26x Linux-USB gadgets.
Loosen that restriction to match the gadget driver driver.
This will various PXA 27x and PXA 3xx devices happier when
talking to Linux hosts, potentially others.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Tested-by: Aric D. Blumer <aric@sdgsystems.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Converting checksum field from le16 to CPU byte order fixes the issue.
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Our CAST algorithm is called cast5, not cast128. Clearly nobody
has ever used it :)
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
E100 places it's RX packet descriptors inside skb->data and uses them
with bidirectional streaming DMA mapping. Unfortunately it fails to
transfer skb->data ownership to the device after it reads the
descriptor's status, breaking on non-coherent (e.g., ARM) platforms.
This have to be converted to use coherent memory for the descriptors.
Signed-off-by: Krzysztof Halasa <khc@pm.waw.pl> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
While testing the driver on PPC, we ran into a crash with LRO, Jumbo frames.
With CONFIG_PPC_64K_PAGES configured (a default in PPC), MAX_SKB_FRAGS drops to 3 and we were crossing the array limits on skb_shinfo(skb)->frags[].
Now we coalesce the frags from the same physical page into one slot in
skb_shinfo(skb)->frags[] and go to the next index when the frag is from
different physical page.
This patch is against the net-2.6 tree.
Signed-off-by: Ajit Khaparde <ajitk@serverengines.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This changes the power_level file to adhere to the "one value
per file" sysfs rule. The user will know which power level was
requested as it will be the number just written to this file. It
is thus not necessary to create a new sysfs file for this value.
In addition it fixes a problem where powertop's parsing expects
this value to be the first value in this file without any descriptions.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Tag 11 packets are stored in the metadata section of an eCryptfs file to
store the key signature(s) used to encrypt the file encryption key.
After extracting the packet length field to determine the key signature
length, a check is not performed to see if the length would exceed the
key signature buffer size that was passed into parse_tag_11_packet().
Thanks to Ramon de Carvalho Valle for finding this bug using fsfuzzer.
With the "security: use mmap_min_addr indepedently of security models"
change, mmap_min_addr is used in common areas, which susbsequently blows
up the nommu build. This stubs in the definition in the nommu case as
well.
Signed-off-by: Paul Mundt <lethal@linux-sh.org> Cc: Mike Frysinger <vapier.adi@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: James Morris <jmorris@namei.org>
ata_eh_reset() was missing error return handling after follow-up SRST
allowing EH to continue the normal probing path after reset failure.
This was discovered while testing new WD 2TB drives which take longer
than 10 secs to spin up and cause the first follow-up SRST to time
out.
This patch adds DMI information to automatically load the correct
layout for the Maxdata Pro 7000X/DX notebook models. Such notebooks
are clones of Fujitsu Amilo V2000, the hook for the v2000 is being
used and I have tested that perfectly works.
The immediate result of integrating this patch is that the five
special buttons will work on these specific notebook models and that
the RF killswitch will not be activated after suspend. This patch
definitively obsoletes the fsam7400 module which I was still needing
to enable wifi and to fix the RF killswitch suspend problem; in the
current 2.6.30 kernel it is necessary to load the wistron_btns module
with options 'force=1 keymap=1557/MS2141', which was not anyway a
complete workaround.
alloc_etherdev() used to install a default implementation of this
operation, but it must now be explicitly installed in struct
net_device_ops.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
alloc_etherdev() used to install default implementations of these
operations, but they must now be explicitly installed in struct
net_device_ops.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When a slab cache uses SLAB_DESTROY_BY_RCU, we must be careful when allocating
objects, since slab allocator could give a freed object still used by lockless
readers.
In particular, nf_conntrack RCU lookups rely on ct->tuplehash[xxx].hnnode.next
being always valid (ie containing a valid 'nulls' value, or a valid pointer to next
object in hash chain.)
kmem_cache_zalloc() setups object with NULL values, but a NULL value is not valid
for ct->tuplehash[xxx].hnnode.next.
Fix is to call kmem_cache_alloc() and do the zeroing ourself.
As spotted by Patrick, we also need to make sure lookup keys are committed to
memory before setting refcount to 1, or a lockless reader could get a reference
on the old version of the object. Its key re-check could then pass the barrier.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When NAT helpers change the TCP packet size, the highest seen sequence
number needs to be corrected. This is currently only done upwards, when
the packet size is reduced the sequence number is unchanged. This causes
TCP conntrack to falsely detect unacknowledged data and decrease the
timeout.
Fix by updating the highest seen sequence number in both directions after
packet mangling.
Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Compiling the kernel with clang has shown this warning:
net/netfilter/xt_rateest.c:69:16: warning: self-comparison always results in a
constant value
ret &= pps2 == pps2;
^
Looking at the code:
if (info->flags & XT_RATEEST_MATCH_BPS)
ret &= bps1 == bps2;
if (info->flags & XT_RATEEST_MATCH_PPS)
ret &= pps2 == pps2;
Judging from the MATCH_BPS case it seems to be a typo, with the intention of
comparing pps1 with pps2.
http://bugzilla.kernel.org/show_bug.cgi?id=13535
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit v2.6.29-rc5-872-gacc738f ("xtables: avoid pointer to self")
forgot to copy the initial quota value supplied by iptables into the
private structure, thus counting from whatever was in the memory
kmalloc returned.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The RCU protected conntrack hash lookup only checks whether the entry
has a refcount of zero to decide whether it is stale. This is not
sufficient, entries are explicitly removed while there is at least
one reference left, possibly more. Explicitly check whether the entry
has been marked as dying to fix this.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
New connection tracking entries are inserted into the hash before they
are fully set up, namely the CONFIRMED bit is not set and the timer not
started yet. This can theoretically lead to a race with timer, which
would set the timeout value to a relative value, most likely already in
the past.
Perform hash insertion as the final step to fix this.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit 31207dab7d2e63795eb15823947bd2f7025b08e2
"Fix incorrect allocation of interrupt rev-map"
introduced a regression crashing on boot on machines using
a "DCR" based MPIC, such as the Cell blades.
The reason is that the irq host data structure is initialized
much later as a result of that patch, causing our calls to
mpic_map() do be done before we have a host setup.
Unfortunately, this breaks _mpic_map_dcr() which uses the
mpic->irqhost to get to the device node.
This fixes it by, instead, passing the device node explicitely
to mpic_map().
In testing a backport of the write_begin/write_end AOPs, a 10% re-read
regression was noticed when running iozone. This regression was
introduced because the old AOPs would always do a mark_page_accessed(page)
after the commit_write, but when the new AOPs where introduced, the only
place this was kept was in pagecache_write_end().
This patch does the same thing in the generic case as what is done in
pagecache_write_end(), which is just to mark the page accessed before we
do write_end().
Signed-off-by: Josef Bacik <jbacik@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It's really not right to use 'access_ok()', since that is meant for the
normal "get_user()" and "copy_from/to_user()" accesses, which are done
through the TLB, rather than through the page tables.
Why? access_ok() does both too few, and too many checks. Too many,
because it is meant for regular kernel accesses that will not honor the
'user' bit in the page tables, and because it honors the USER_DS vs
KERNEL_DS distinction that we shouldn't care about in GUP. And too few,
because it doesn't do the 'canonical' check on the address on x86-64,
since the TLB will do that for us.
So instead of using a function that isn't meant for this, and does
something else and much more complicated, just do the real rules: we
don't want the range to overflow, and on x86-64, we want it to be a
canonical low address (on 32-bit, all addresses are canonical).
Acked-by: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
vmscan: do not unconditionally treat zones that fail zone_reclaim() as full
On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targetted form of direct reclaim. On machines with large NUMA
distances for example, a zone_reclaim_mode defaults to 1 meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met. The problem is that zone_reclaim() failing at all means the
zone gets marked full.
This can cause situations where a zone is usable, but is being skipped
because it has been considered full. Take a situation where a large tmpfs
mount is occuping a large percentage of memory overall. The pages do not
get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
and the zonelist cache considers them not worth trying in the future.
This patch makes zone_reclaim() return more fine-grained information about
what occured when zone_reclaim() failued. The zone only gets marked full
if it really is unreclaimable. If it's a case that the scan did not occur
or if enough pages were not reclaimed with the limited reclaim_mode, then
the zone is simply skipped.
There is a side-effect to this patch. Currently, if zone_reclaim()
successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go
ahead. With this patch applied, zone watermarks are rechecked after
zone_reclaim() does some work.
This bug was introduced by commit 9276b1bc96a132f4068fdee00983c532f43d3a26
("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the
zonelist_cache was introduced. It was not intended that zone_reclaim()
aggressively consider the zone to be full when it failed as full direct
reclaim can still be an option. Due to the age of the bug, it should be
considered a -stable candidate.
x86, setup (2.6.30-stable) fix 80x34 and 80x60 console modes
Note: this is not in upstream since upstream is not affected due to the
new "BIOS glovebox" subsystem.
As coded, most INT10 calls in video-vga.c allow the compiler to assume
EAX remains unchanged across them, which is not always the case. This
triggers an optimisation issue that causes vga_set_vertical_end() to be
called with an incorrect number of scanlines. Fix this by beefing up
the asm constraints on these calls.
Reported-by: Marc Aurele La France <tsi@xfree86.org> Signed-off-by: Marc Aurele La France <tsi@xfree86.org> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There are two reasons to expose the memory *a in the asm:
1) To prevent the compiler from discarding a preceeding write to *a, and
2) to prevent it from caching *a in a register over the asm.
The change has had a few days testing with a SMP build of 2.6.22.19
running on a rp3440.
This patch is about the correctness of the __ldcw() macro itself.
The use of the macro should be confined to small inline functions
to try to limit the effect of clobbering memory on GCC's optimization
of loads and stores.
The TLB flushing functions on hppa, which causes PxTLB broadcasts on the system
bus, needs to be protected by irq-safe spinlocks to avoid irq handlers to deadlock
the kernel. The deadlocks only happened during I/O intensive loads and triggered
pretty seldom, which is why this bug went so long unnoticed.
Signed-off-by: Helge Deller <deller@gmx.de>
[edited to use spin_lock_irqsave on UP as well since we'd been locking there
all this time anyway, --kyle] Signed-off-by: Kyle McMartin <kyle@mcmartin.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>