With NFSv4, if we create a file then open it we explicit avoid checking
the permissions on the file during the open because the fact that we
created it ensures we should be allow to open it (the create and the
open should appear to be a single operation).
However if the reply to an EXCLUSIVE create gets lots and the client
resends the create, the current code will perform the permission check -
because it doesn't realise that it did the open already..
This patch should fix this.
Note that I haven't actually seen this cause a problem. I was just
looking at the code trying to figure out a different EXCLUSIVE open
related issue, and this looked wrong.
(Fix confirmed with pynfs 4.0 test OPEN4--bfields)
Signed-off-by: NeilBrown <neilb@suse.de>
[bfields: use OWNER_OVERRIDE and update for 4.1] Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Very embarassing: 1091006c5eb15cba56785bd5b498a8d0b9546903 "nfsd: turn
on reply cache for NFSv4" missed a line, effectively leaving the reply
cache off in the v4 case. I thought I'd tested that, but I guess not.
This time, wrote a pynfs test to confirm it works.
Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The writeback code is already capable of passing errors back to user space
by means of the open_context->error. In the case of ENOSPC, Neil Brown
is reporting seeing 2 errors being returned.
Neil writes:
"e.g. if /mnt2/ if an nfs mounted filesystem that has no space then
reported Input/output error and the relevant parts of the strace output are:
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
fsync(1) = -1 EIO (Input/output error)
close(1) = -1 ENOSPC (No space left on device)"
Neil then shows that the duplication of error messages appears to be due to
the use of the PageError() mechanism, which causes filemap_fdatawait_range
to return the extra EIO. The regression was introduced by
commit 7b281ee026552f10862b617a2a51acf49c829554 (NFS: fsync() must exit
with an error if page writeback failed).
Fix this by removing the call to SetPageError(), and just relying on
open_context->error reporting the ENOSPC back to fsync().
Reported-by: Neil Brown <neilb@suse.de> Tested-by: Neil Brown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It is almost always wrong for NFS to call drop_nlink() after removing a
file. What we really want is to mark the inode's attributes for
revalidation, and we want to ensure that the VFS drops it if we're
reasonably sure that this is the final unlink().
Do the former using the usual cache validity flags, and the latter
by testing if inode->i_nlink == 1, and clearing it in that case.
This also fixes the following warning reported by Neil Brown and
Jeff Layton (among others).
In rare circumstances, nfs_clone_server() of a v2 or v3 server can get
an error between setting server->destory (to nfs_destroy_server), and
calling nfs_start_lockd (which will set server->nlm_host).
If this happens, nfs_clone_server will call nfs_free_server which
will call nfs_destroy_server and thence nlmclnt_done(NULL). This
causes the NULL to be dereferenced.
So add a guard to only call nlmclnt_done() if ->nlm_host is not NULL.
The other guards there are irrelevant as nlm_host can only be non-NULL
if one of these flags are set - so remove those tests. (Thanks to Trond
for this suggestion).
This is suitable for any stable kernel since 2.6.25.
Eryu provided a test program that would segfault when attempting to read
past the EOF on file that was opened O_DIRECT. The buffer given to the
read() call was on the stack, and when he attempted to read past it it
would scribble over the rest of the stack page.
If we hit the end of the file on a DIO READ request, then we don't want
to zero out the rest of the buffer. These aren't pagecache pages after
all, and there's no guarantee that the buffers that were passed in
represent entire pages.
Reported-by: Eryu Guan <eguan@redhat.com> Cc: Fred Isaman <iisaman@netapp.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 1f1ea6c "NFSv4: Fix buffer overflow checking in
__nfs4_get_acl_uncached" accidently dropped the checking for too small
result buffer length.
If someone uses getxattr on "system.nfs4_acl" on an NFSv4 mount
supporting ACLs, the ACL has not been cached and the buffer suplied is
too short, we still copy the complete ACL, resulting in kernel and user
space memory corruption.
While there's no locking involved here, the operations are serialized,
so CTO should prevent corruption.
The first write to the file is fine and writes 4 bytes. The file is then
extended on the server. When it's reopened a GETATTR is issued and the
size change is noticed. This causes NFS_INO_INVALID_DATA to be set on
the file. Because the file is opened for write only,
nfs_want_read_modify_write() returns 0 to nfs_write_begin().
nfs_updatepage then calls nfs_write_pageuptodate() to see if it should
extend the nfs_page to cover the whole page. NFS_INO_INVALID_DATA is
still set on the file at that point, but that flag is ignored and
nfs_pageuptodate erroneously extends the write to cover the whole page,
with the write done on the server side filled in with zeroes.
This patch just has that function check for NFS_INO_INVALID_DATA in
addition to NFS_INO_REVAL_PAGECACHE. This fixes the bug, but looking
over the code, I wonder if we might have a similar bug in
nfs_revalidate_size(). The difference between those two flags is very
subtle, so it seems like we ought to be checking for
NFS_INO_INVALID_DATA in most of the places that we look for
NFS_INO_REVAL_PAGECACHE.
I believe this is regression introduced by commit 8d197a568. The code
did check for NFS_INO_INVALID_DATA prior to that patch.
If I mount an NFS v4.1 server to a single client multiple times and then
run xfstests over each mountpoint I usually get the client into a state
where recovery deadlocks. The server informs the client of a
cb_path_down sequence error, the client then does a
bind_connection_to_session and checks the status of the lease.
I found that bind_connection_to_session sets the NFS4_SESSION_DRAINING
flag on the client, but this flag is never unset before
nfs4_check_lease() reaches nfs4_proc_sequence(). This causes the client
to deadlock, halting all NFS activity to the server. nfs4_proc_sequence()
is only called by the state manager, so I can change it to run in privileged
mode to bypass the NFS4_SESSION_DRAINING check and avoid the deadlock.
At one point acpi_device_set_id() checks if acpi_device_hid(device)
returns NULL, but that never happens, so system bus devices with an
empty list of PNP IDs are given the dummy HID ("device") instead of
the "system bus HID" ("LNXSYBUS"). Fix the code to use the right
check.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 66fa7f215 "libata-acpi: improve ACPI disabling" introdcued the
behaviour of disabling ATA ACPI if ata_acpi_on_devcfg failed the 2nd
time, but commit 30dcf76ac dropped this behaviour and this caused
problem for Dimitris Damigos, where his laptop can not resume correctly.
The bugzilla page for it is:
https://bugzilla.kernel.org/show_bug.cgi?id=49331
The problem is, ata_dev_push_id will fail the 2nd time it is invoked,
and due to disabling ACPI code is dropped, ata_acpi_on_devcfg which
calls ata_dev_push_id will keep failing and eventually made the device
disabled.
This patch restores the original behaviour, if acpi failed the 2nd time,
disable acpi functionality for the device(and we do not event need to
add a debug message for this as it is still there ;-).
Reported-by: Dimitris Damigos <damigos@freemail.gr> Signed-off-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The current acpisleep DMI checks only run when CONFIG_SUSPEND is set.
And this may break hibernation on some platforms when CONFIG_SUSPEND
is cleared.
Move acpisleep DMI check into #ifdef CONFIG_ACPI_SLEEP instead.
[rjw: Added acpi_sleep_dmi_check() and rebased on top of earlier
patches adding entries to acpisleep_dmi_table[].]
References: https://bugzilla.kernel.org/show_bug.cgi?id=45921 Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I think this is wrong since 72c973dd ("usb: gadget: add
usb_endpoint_descriptor to struct usb_ep"). If we fail to allocate an ep
or bail out early we shouldn't check for the descriptor which is
assigned at ep_enable() time.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tatyana Brokhman <tlinder@codeaurora.org> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The "video->minor = -1" assigment is done in V4L2 by
video_register_device() so it is removed here.
Now. uvc_function_bind() calls in error case uvc_function_unbind() for
cleanup. The problem is that uvc_function_unbind() frees the uvc struct
and uvc_bind_config() does as well in error case of usb_add_function().
Removing kfree() in usb_add_function() would make the patch smaller but
it would look odd because the new allocated memory is not cleaned up.
However it is not guaranteed that if we call usb_add_function() we also
get to the bind function.
Therefore the patch extracts the conditional cleanup from
uvc_function_unbind() applies to uvc_function_bind().
uvc_function_unbind() now contains only the complete cleanup which is
required once everything has been registrated.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Cc: Bhupesh Sharma <bhupesh.sharma@st.com> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The descriptor list for FS speed was not NULL terminated. This patch
fixes this.
While here one of the twe two bAlternateSetting assignments for the BOT
interface. Both assign 0, one is enough.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The HS descriptors are only created if HS is supported by the UDC but we
never free them.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It also adds a note about Gemtek WUBI-100GW
and SparkLAN WL-682 USBID conflict [WUBI-100GW
is a ISL3886+NET2280 (LM86 firmare) solution,
whereas WL-682 is a ISL3887 (LM87 firmware)]
device.
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tomasz Guszkowski <tsg@o2.pl> Acked-by: Christian Lamparter <chunkeey@googlemail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Incorrect use of usb_alloc_coherent memory as input buffer to usb_control_msg
can cause problems in arch DMA code, for example kernel BUG at
'arch/arm/include/asm/dma-mapping.h:321' on ARM (linux-3.4).
Change _usb_writeN_sync use kmalloc'd buffer instead.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Acked-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Dan Williams <dcbw@redhat.com> Acked-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The mute LED is in this case connected to the Mic1 VREF.
The machine also exposes the following string in BIOS:
"HP_Mute_LED_0_A", so if more machines are coming, it probably
makes sense to try to do something more generic, like for the
IDT codec.
The workaround to force VREF50 for dallas/hp model with ALC861VD
was introduced in commit 8fdcb6fe4204bdb4c6991652717ab5063751414e,
but it contained wrong pincap override bits.
This patch fixes to exclude VREF80 pincap bit correctly.
We've seen the broken HDMI *video* output on some machines with GM965,
and the debugging session pointed that the culprit is the disabled
audio output pins. Toggling these pins dynamically on demand caused
flickering of HDMI TV.
This patch changes the behavior to keep the pin ON constantly.
The runtime_idle callback is the right place to check the suspend
capability, but currently we do it wrongly in the runtime_suspend
callback. This leads to a kernel error message like:
pci_pm_runtime_suspend(): azx_runtime_suspend+0x0/0x50 [snd_hda_intel] returns -11
and the runtime PM core would even repeat the attempts.
The commit [88a8516a: ALSA: usbaudio: implement USB autosuspend] added
the support of autopm for USB MIDI output, but it didn't take the MIDI
input into account.
This patch adds the following for fixing the autopm:
- Manage the URB start at the first MIDI input stream open, instead of
the time of instance creation
- Move autopm code to the common substream_open()
- Make snd_usbmidi_input_start/_stop() more robust and add the running
state check
Reviewd-by: Clemens Ladisch <clemens@ladisch.de> Tested-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add a similar protection against the disconnection race and the
invalid use of usb instance after disconnection, as well as we've done
for the USB audio PCM.
Recently I suggested using "mount -o remount,mpol=local /tmp" in NUMA
mempolicy testing. Very nasty. Reading /proc/mounts, /proc/pid/mounts
or /proc/pid/mountinfo may then corrupt one bit of kernel memory, often
in a page table (causing "Bad swap" or "Bad page map" warning or "Bad
pagetable" oops), sometimes in a vm_area_struct or rbnode or somewhere
worse. "mpol=prefer" and "mpol=prefer:Node" are equally toxic.
Recent NUMA enhancements are not to blame: this dates back to 2.6.35,
when commit e17f74af351c "mempolicy: don't call mpol_set_nodemask() when
no_context" skipped mpol_parse_str()'s call to mpol_set_nodemask(),
which used to initialize v.preferred_node, or set MPOL_F_LOCAL in flags.
With slab poisoning, you can then rely on mpol_to_str() to set the bit
for node 0x6b6b, probably in the next page above the caller's stack.
mpol_parse_str() is only called from shmem_parse_options(): no_context
is always true, so call it unused for now, and remove !no_context code.
Set v.nodes or v.preferred_node or MPOL_F_LOCAL as mpol_to_str() might
expect. Then mpol_to_str() can ignore its no_context argument also,
the mpol being appropriately initialized whether contextualized or not.
Rename its no_context unused too, and let subsequent patch remove them
(that's not needed for stable backporting, which would involve rejects).
I don't understand why MPOL_LOCAL is described as a pseudo-policy:
it's a reasonable policy which suffers from a confusing implementation
in terms of MPOL_PREFERRED with MPOL_F_LOCAL. I believe this would be
much more robust if MPOL_LOCAL were recognized in switch statements
throughout, MPOL_F_LOCAL deleted, and MPOL_PREFERRED use the (possibly
empty) nodes mask like everyone else, instead of its preferred_node
variant (I presume an optimization from the days before MPOL_LOCAL).
But that would take me too long to get right and fully tested.
Unfortunately with !CONFIG_PAGEFLAGS_EXTENDED, (!PageHead) is false, and
(PageHead) is true, for tail pages. If this is indeed the intended
behavior, which I doubt because it breaks cache cleaning on some ARM
systems, then the nomenclature is highly problematic.
This patch makes sure PageHead is only true for head pages and PageTail
is only true for tail pages, and neither is true for non-compound pages.
[ This buglet seems ancient - seems to have been introduced back in Apr
2008 in commit 6a1e7f777f61: "pageflags: convert to the use of new
macros". And the reason nobody noticed is because the PageHead()
tests are almost all about just sanity-checking, and only used on
pages that are actual page heads. The fact that the old code returned
true for tail pages too was thus not really noticeable. - Linus ]
Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Will Deacon <Will.Deacon@arm.com> Cc: Steve Capper <Steve.Capper@arm.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The system uses global_dirtyable_memory() to calculate number of
dirtyable pages/pages that can be allocated to the page cache. A bug
causes an underflow thus making the page count look like a big unsigned
number. This in turn confuses the dirty writeback throttling to
aggressively write back pages as they become dirty (usually 1 page at a
time). This generally only affects systems with highmem because the
underflowed count gets subtracted from the global count of dirtyable
memory.
Virtio devices may attempt to add descriptors to a virtqueue from atomic
context using GFP_ATOMIC allocation. This is problematic because such
allocations can fall outside of the lowmem mapping, causing virt_to_phys
to report bogus physical addresses which are subsequently passed to
userspace via the buffers for the virtual device.
This patch masks out __GFP_HIGH and __GFP_HIGHMEM from the requested
flags when allocating descriptors for a virtqueue. If an atomic
allocation is requested and later fails, we will return -ENOSPC which
will be handled by the driver.
Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When using a virtio transport, the 9p net device may pass the physical
address of a kernel buffer to userspace via a scatterlist inside a
virtqueue. If the kernel buffer is mapped outside of the linear mapping
(e.g. highmem), then virt_to_page will return a bogus value and we will
populate the scatterlist with junk.
This patch uses kmap_to_page when populating the page array for a kernel
buffer.
Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some virtio device drivers (9p) need to translate high virtual addresses
to physical addresses, which are inserted into the virtqueue for
processing by userspace.
This patch exports the kmap_to_page symbol, so that the affected drivers
can be compiled as modules.
Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some MSI laptop BIOSes are broken - INT 15h code uses port 92h to enable A20
line but resume code assumes that KBC was used.
The laptop will not resume from S3 otherwise but powers off after a while
and then powers on again stuck with a blank screen.
Fix it by enabling A20 using KBC in i8042_platform_init for x86.
Signed-off-by: Ondrej Zary <linux@rainbow-software.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Rafael J. Wysocki <rjw@sisk.pl> Link: http://lkml.kernel.org/r/201212112218.06551.linux@rainbow-software.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To allow debuggers to unwind through signal frames, we create a fake
stack unwinding prologue containing the link register and frame pointer
of the interrupted context. The signal frame is then offset by 16 bytes
to make room for the two saved registers which are pushed onto the frame
of the *interrupted* context, rather than placed directly above the
signal stack.
This doesn't work when an alternative signal stack is set up for a SEGV
handler, which is raised in response to RLIMIT_STACK being reached. In
this case, we try to push the unwinding prologue onto the full stack and
subsequently take a fault which we fail to resolve, causing setup_return
to return -EFAULT and handle_signal to force_sigsegv on the current task.
This patch fixes the problem by including the unwinding prologue as part
of the rt_sigframe definition, which is populated during setup_sigframe,
ensuring that it always ends up on the signal stack.
The AArch64 Linux port relies on the mm code to wrprotect clean ptes.
This however is not the case with newly created ptes and
PAGE_SHARED(_EXEC) is writable but !dirty.
If a series of scripts are executed, each triggering module loading via
unprintable bytes in the script header, kernel stack contents can leak
into the command line.
Normally execution of binfmt_script and binfmt_misc happens recursively.
However, when modules are enabled, and unprintable bytes exist in the
bprm->buf, execution will restart after attempting to load matching
binfmt modules. Unfortunately, the logic in binfmt_script and
binfmt_misc does not expect to get restarted. They leave bprm->interp
pointing to their local stack. This means on restart bprm->interp is
left pointing into unused stack memory which can then be copied into the
userspace argv areas.
After additional study, it seems that both recursion and restart remains
the desirable way to handle exec with scripts, misc, and modules. As
such, we need to protect the changes to interp.
This changes the logic to require allocation for any changes to the
bprm->interp. To avoid adding a new kmalloc to every exec, the default
value is left as-is. Only when passing through binfmt_script or
binfmt_misc does an allocation take place.
We found a user code which was raising a divide-by-zero trap. That trap
would lead to XPC connections between system-partitions being torn down
due to the die_chain notifier callouts it received.
This also revealed a different issue where multiple callers into
xpc_die_deactivate() would all attempt to do the disconnect in parallel
which would sometimes lock up but often overwhelm the console on very
large machines as each would print at least one line of output at the
end of the deactivate.
I reviewed all the users of the die_chain notifier and changed the code
to ignore the notifier callouts for reasons which will not actually lead
to a system to continue on to call die().
[akpm@linux-foundation.org: fix ia64] Signed-off-by: Robin Holt <holt@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[akpm@linux-foundation.org: return false for '@' as well, per Bjorn] Signed-off-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ieee80211_free_txskb() needs to be used instead of dev_kfree_skb_any for
tx packets passed to the driver from mac80211
Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Recent versions of udev cause synchronous firmware loading from the
probe routine to fail because the request to user space times out.
The original fix for b43legacy (commit a3ea2c7) moved the firmware
load from the probe routine to a work queue, but it still used synchronous
firmware loading. This method is OK when b43legacy is built as a module;
however, it fails when the driver is compiled into the kernel.
This version changes the code to load the initial firmware file
using request_firmware_nowait(). A completion event is used to
hold the work queue until that file is available. The remaining
firmware files are read synchronously.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is one race that both request_firmware() with the same
firmware name.
The race scenerio is as below:
CPU1 CPU2
request_firmware() -->
_request_firmware_load() return err another request_firmware() is coming -->
_request_firmware_cleanup is called --> _request_firmware_prepare -->
release_firmware ---> fw_lookup_and_allocate_buf -->
spin_lock(&fwc->lock)
... __fw_lookup_buf() return true
fw_free_buf() will be called --> ...
kref_put -->
decrease the refcount to 0
kref_get(&tmp->ref) ==> it will trigger warning
due to refcount == 0
__fw_free_buf() -->
... spin_unlock(&fwc->lock)
spin_lock(&fwc->lock)
list_del(&buf->list)
spin_unlock(&fwc->lock)
kfree(buf)
After that, the freed buf will be used.
The key race is decreasing refcount to 0 and list_del is not protected together by
fwc->lock, and it is possible another thread try to get it between refcount==0
and list_del.
Fix it here to protect it together.
Acked-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: liu chuansheng <chuansheng.liu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is a race as below when calling request_firmware():
CPU1 CPU2
write 0 > loading
mutex_lock(&fw_lock)
...
set_bit FW_STATUS_DONE class_timeout is coming
set_bit FW_STATUS_ABORT
complete_all &completion
...
mutex_unlock(&fw_lock)
In this time, the bit FW_STATUS_DONE and FW_STATUS_ABORT are set,
and request_firmware() will return failure due to condition in
_request_firmware_load():
if (!buf->size || test_bit(FW_STATUS_ABORT, &buf->status))
retval = -ENOENT;
But from the above scenerio, it should be a successful requesting.
So we need judge if the bit FW_STATUS_DONE is already set before
calling fw_load_abort() in timeout function.
As Ming's proposal, we need change the timer into sched_work to
benefit from using &fw_lock mutex also.
Signed-off-by: liu chuansheng <chuansheng.liu@intel.com> Acked-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch adds two new products and modifies
the device id table to include them. In addition,
product of 0xbccd - BCM_USB_PRODUCT_ID_SM250 is
removed because Beceem, ZTE, Sprint use this id
for block devices.
Reported-by: Muhammad Minhazul Haque <mdminhazulhaque@gmail.com> Signed-off-by: Kevin McKinney <klmckinney1@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 29c00b4a1d9e27 (rcu: Add event-tracing for RCU callback
invocation) added a regression in rcu_do_batch()
Under stress, RCU is supposed to allow to process all items in queue,
instead of a batch of 10 items (blimit), but an integer overflow makes
the effective limit being 1. So, unless there is frequent idle periods
(during which RCU ignores batch limits), RCU can be forced into a
state where it cannot keep up with the callback-generation rate,
eventually resulting in OOM.
This commit therefore converts a few variables in rcu_do_batch() from
int to long to fix this problem, along with the module parameters
controlling the batch limits.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch (as1632b) fixes a bug in ehci-hcd. The USB core uses
urb->hcpriv to determine whether or not an URB is active; host
controller drivers are supposed to set this pointer to a non-NULL
value when an URB is queued. However ehci-hcd sets it to NULL for
isochronous URBs, which defeats the check in usbcore.
In itself this isn't a big deal. But people have recently found that
certain sequences of actions will cause the snd-usb-audio driver to
reuse URBs without waiting for them to complete. In the absence of
proper checking by usbcore, the URBs get added to their endpoint list
twice. This leads to list corruption and a system freeze.
The patch makes ehci-hcd assign a meaningful value to urb->hcpriv for
isochronous URBs. Improving robustness always helps.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Artem S. Tashkinov <t.artem@lycos.com> Reported-by: Christof Meerwald <cmeerw@cmeerw.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Recently I build perf and get a build error on builtin-test.c. The error is as
following:
$ make
CC perf.o
CC builtin-test.o
cc1: warnings being treated as errors
builtin-test.c: In function ‘sched__get_first_possible_cpu’:
builtin-test.c:977: warning: implicit declaration of function ‘CPU_ALLOC’
builtin-test.c:977: warning: nested extern declaration of ‘CPU_ALLOC’
builtin-test.c:977: warning: assignment makes pointer from integer without a cast
builtin-test.c:978: warning: implicit declaration of function ‘CPU_ALLOC_SIZE’
builtin-test.c:978: warning: nested extern declaration of ‘CPU_ALLOC_SIZE’
builtin-test.c:979: warning: implicit declaration of function ‘CPU_ZERO_S’
builtin-test.c:979: warning: nested extern declaration of ‘CPU_ZERO_S’
builtin-test.c:982: warning: implicit declaration of function ‘CPU_FREE’
builtin-test.c:982: warning: nested extern declaration of ‘CPU_FREE’
builtin-test.c:992: warning: implicit declaration of function ‘CPU_ISSET_S’
builtin-test.c:992: warning: nested extern declaration of ‘CPU_ISSET_S’
builtin-test.c:998: warning: implicit declaration of function ‘CPU_CLR_S’
builtin-test.c:998: warning: nested extern declaration of ‘CPU_CLR_S’
make: *** [builtin-test.o] Error 1
This problem is introduced in 3e7c439a. CPU_ALLOC and related macros are
missing in sched__get_first_possible_cpu function. In 54489c18, commiter
mentioned that CPU_ALLOC has been removed. So CPU_ALLOC calls in this
function are removed to let perf to be built.
Signed-off-by: Vinson Lee <vlee@twitter.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Vinson Lee <vlee@twitter.com> Cc: Zheng Liu <wenqing.lz@taobao.com> Link: http://lkml.kernel.org/r/1352422726-31114-1-git-send-email-vlee@twitter.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some devices (ex Nokia C7) simply don't respond at all when data is sent
to some of their USB interfaces. The data gets stuck in the TTYs queue
and sits there until close(2), which them blocks because closing_wait
defaults to 30 seconds (even though the fd is O_NONBLOCK). This is
rarely desired. Implement the standard mechanism to adjust closing_wait
and let applications handle it how they want to.
Signed-off-by: Dan Williams <dcbw@redhat.com> Acked-by: Oliver Neukum <oneukum@suse.de> Tested-by: Aleksander Morgado <aleksander@gnu.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The function rb_check_pages() was added to make sure the ring buffer's
pages were sane. This check is done when the ring buffer size is modified
as well as when the iterator is released (closing the "trace" file),
as that was considered a non fast path and a good place to do a sanity
check.
The problem is that the check does not have any locks around it.
If one process were to read the trace file, and another were to read
the raw binary file, the check could happen while the reader is reading
the file.
The issues with this is that the check requires to clear the HEAD page
before doing the full check and it restores it afterward. But readers
require the HEAD page to exist before it can read the buffer, otherwise
it gives a nasty warning and disables the buffer.
By adding the reader lock around the check, this keeps the race from
happening.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The function rb_set_head_page() searches the list of ring buffer
pages for a the page that has the HEAD page flag set. If it does
not find it, it will do a WARN_ON(), disable the ring buffer and
return NULL, as this should never happen.
But if this bug happens to happen, not all callers of this function
can handle a NULL pointer being returned from it. That needs to be
fixed.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Ali reports that plugging a device into the Fresco Logic xHCI host with
PCI device ID 1400 produces an IRQ error:
do_IRQ: 3.176 No irq handler for vector (irq -1)
Other early Fresco Logic host revisions don't support MSI, even though
their PCI config space claims they do. Extend the quirk to disabling
MSI to this chipset revision. Also enable the short transfer quirk,
since it's likely this revision also has that quirk, and it should be
harmless to enable.
This patch should be backported to stable kernels as old as 2.6.36, that
contain the commit f5182b4155b9d686c5540a6822486400e34ddd98 "xhci:
Disable MSI for some Fresco Logic hosts."
Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com> Reported-by: A Sh <smr.ash1991@gmail.com> Tested-by: A Sh <smr.ash1991@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch (as1636) is a partial workaround for a hardware bug
affecting OHCI controllers by NVIDIA at least, maybe others too. When
the controller retires a Transfer Descriptor, it is supposed to add
the TD onto the Done Queue. But sometimes this doesn't happen, with
the result that ohci-hcd never realizes the corresponding transfer has
finished. Symptoms can vary; a typical result is that USB audio stops
working after a while.
The patch works around the problem by recognizing that TDs are always
processed in order. Therefore, if a later TD is found on the Done
Queue than all the earlier TDs for the same endpoint must be finished
as well.
Unfortunately this won't solve the problem in cases where the missing
TD is the last one in the endpoint's queue. A complete fix would
require a signficant amount of change to the driver.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Tested-by: Oliver Neukum <oneukum@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The ACPI video driver can't control backlight correctly on
Asus UL30VT. Vendor driver (asus-laptop) can work. This patch is to
add "Asus UL30VT" to ACPI video detect blacklist in order to use
asus-laptop for video control on the "Asus UL30VT" rather than ACPI
video driver.
References: https://bugzilla.kernel.org/show_bug.cgi?id=32592 Reported-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Or else the laptop will boot with a dimmed screen.
References: https://bugzilla.kernel.org/show_bug.cgi?id=51141 Tested-by: Stefan Nagy <public@stefan-nagy.at> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
During resume from system suspend the 'data' field of
struct pnp_dev in pnpacpi_set_resources() may be a stale pointer,
due to removal of the associated ACPI device node object in the
previous suspend-resume cycle. This happens, for example, if a
dockable machine is booted in the docking station and then suspended
and resumed and suspended again. If that happens,
pnpacpi_build_resource_template() called from pnpacpi_set_resources()
attempts to use that pointer and crashes.
However, pnpacpi_set_resources() actually checks the device's ACPI
handle, attempts to find the ACPI device node object attached to it
and returns an error code if that fails, so in fact it knows what the
correct value of dev->data should be. Use this observation to update
dev->data with the correct value if necessary and dump a call trace
if that's the case (once).
We still need to fix the root cause of this issue, but preventing
systems from crashing because of it is an improvement too.
Reported-and-tested-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
References: https://bugzilla.kernel.org/show_bug.cgi?id=51071 Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add a quirk to correctly report battery capacity on 2010 and 2011
Lenovo Thinkpad models.
The affected models that I tested (x201, t410, t410s, and x220)
exhibit a problem where, when battery capacity reporting unit is mAh,
the values being reported are wrong. Pre-2010 and 2012 models appear
to always report in mWh and are thus unaffected. Also, in mid-2012
Lenovo issued a BIOS update for the 2011 models that fixes the issue
(tested on x220 with a post-1.29 BIOS). No such update is available
for the 2010 models, so those still need this patch.
Problem description: for some reason, the affected Thinkpads switch
the reporting unit between mAh and mWh; generally, mAh is used when a
laptop is plugged in and mWh when it's unplugged, although a
suspend/resume or rmmod/modprobe is needed for the switch to take
effect. The values reported in mAh are *always* wrong. This does
not appear to be a kernel regression; I believe that the values were
never reported correctly. I tested back to kernel 2.6.34, with
multiple machines and BIOS versions.
Simply plugging a laptop into mains before turning it on is enough to
reproduce the problem. Here's a sample /proc/acpi/battery/BAT0/info
from Thinkpad x220 (before a BIOS update) with a 4-cell battery:
present: yes
design capacity: 2886 mAh
last full capacity: 2909 mAh
battery technology: rechargeable
design voltage: 14800 mV
design capacity warning: 145 mAh
design capacity low: 13 mAh
cycle count: 0
capacity granularity 1: 1 mAh
capacity granularity 2: 1 mAh
model number: 42T4899
serial number: 21064
battery type: LION
OEM info: SANYO
Once the laptop switches the unit to mWh (unplug from mains, suspend,
resume), the output changes to:
Can you see how the values for "design capacity", etc., differ by a
factor of 10 instead of 14.8 (the design voltage of this battery)?
On the battery itself it says: 14.8V, 1.95Ah, 29Wh, so clearly the
values reported in mWh are correct and the ones in mAh are not.
My guess is that this problem has been around ever since those
machines were released, but because the most common Thinkpad
batteries are rated at 10.8V, the error (8%) is small enough that it
simply hasn't been noticed or at least nobody could be bothered to
look into it.
My patch works around the problem by adjusting the incorrectly
reported mAh values by "10000 / design_voltage". The patch also has
code to figure out if it should be activated or not. It only
activates on Lenovo Thinkpads, only when the unit is mAh, and, as an
extra precaution, only when the battery capacity reported through
ACPI does not match what is reported through DMI (I've never
encountered a machine where the first two conditions would be true
but the last would not, but better safe than sorry).
I've been using this patch for close to a year on several systems
without any problems.
References: https://bugzilla.kernel.org/show_bug.cgi?id=41062 Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As reported https://bugzilla.kernel.org/show_bug.cgi?id=51031, the UAS
driver causes problems and has been asked to be not built into any of
the major distributions. To prevent users from running into problems
with it, and for distros that were not notified, just mark the whole
thing as broken.
Acked-by: Sarah Sharp <sarah.a.sharp@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
BeagleBone A5+ devices ended up getting shipped with the
'BeagleBone/XDS100V2' product string, and not XDS100 like it
was agreed, so adjust the quirk to match.
The Newport AGILIS model AG-UC8 compact piezo motor controller
(http://search.newport.com/?q=*&x2=sku&q2=AG-UC8)
is yet another device using an FTDI USB-to-serial chip. It works
fine with the ftdi_sio driver when adding
options ftdi-sio product=0x3000 vendor=0x104d
to modprobe.d. udevadm reports "Newport" as the manufacturer,
and "Agilis" as the product name.
Signed-off-by: Martin Teichmann <lkb.teichmann@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Huawei E173 will normally appear as 12d1:1436 in Linux. But
the modem has another mode with different device ID and a slightly
different set of descriptors. This is the mode used by Windows like
this:
3Modem: USB\VID_12D1&PID_140C&MI_00\6&3A1D2012&0&0000
Networkcard: USB\VID_12D1&PID_140C&MI_01\6&3A1D2012&0&0001
Appli.Inter: USB\VID_12D1&PID_140C&MI_02\6&3A1D2012&0&0002
PC UI Inter: USB\VID_12D1&PID_140C&MI_03\6&3A1D2012&0&0003
All interfaces have the same ff/ff/ff class codes in this mode.
Blacklisting the network interface to allow it to be picked up by
the network driver.
HPET_TN_FSB is not a proper mask bit; it merely toggles between MSI and
legacy interrupt delivery. The proper mask bit is HPET_TN_ENABLE, so
use both bits when (un)masking the interrupt.
This fixes a bit error in the U8500 clock implementation: the
unused p2_pclk12 registered at bit 12 in periphereral group 6
was defined as using bit 11 rather than bit 12.
When walking over and disabling the unused clocks in the tree
at late init time, p2_pclk12 was disabled, by effectively
clearing the but for p2_pclk11 instead of bit 12 as it should
have, thus disabling gpio block 6 and 7.
Reported-by: Lee Jones <lee.jones@linaro.org> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Cc: Philippe Begnic <philippe.begnic@st.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Mike Turquette <mturquette@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
dmapool always calls dma_alloc_coherent() with GFP_ATOMIC flag,
regardless the flags provided by the caller. This causes excessive
pruning of emergency memory pools without any good reason. Additionaly,
on ARM architecture any driver which is using dmapools will sooner or
later trigger the following error:
"ERROR: 256 KiB atomic DMA coherent pool is too small!
Please increase it with coherent_pool= kernel parameter!".
Increasing the coherent pool size usually doesn't help much and only
delays such error, because all GFP_ATOMIC DMA allocations are always
served from the special, very limited memory pool.
This patch changes the dmapool code to correctly use gfp flags provided
by the dmapool caller.
Reported-by: Soeren Moch <smoch@web.de> Reported-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Soeren Moch <smoch@web.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Florian Fainelli [Mon, 10 Dec 2012 20:25:32 +0000 (12:25 -0800)]
Input: matrix-keymap - provide proper module license
The matrix-keymap module is currently lacking a proper module license,
add one so we don't have this module tainting the entire kernel. This
issue has been present since commit 1932811f426f ("Input: matrix-keymap
- uninline and prepare for device tree support")
1) Netlink socket dumping had several missing verifications and checks.
In particular, address comparisons in the request byte code
interpreter could access past the end of the address in the
inet_request_sock.
Also, address family and address prefix lengths were not validated
properly at all.
This means arbitrary applications can read past the end of certain
kernel data structures.
Fixes from Neal Cardwell.
2) ip_check_defrag() operates in contexts where we're in the process
of, or about to, input the packet into the real protocols
(specifically macvlan and AF_PACKET snooping).
Unfortunately, it does a pskb_may_pull() which can modify the
backing packet data which is not legal if the SKB is shared. It
very much can be shared in this context.
Deal with the possibility that the SKB is segmented by using
skb_copy_bits().
Fix from Johannes Berg based upon a report by Eric Leblond.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
ipv4: ip_check_defrag must not modify skb before unsharing
inet_diag: validate port comparison byte code to prevent unsafe reads
inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()
inet_diag: validate byte code to prevent oops in inet_diag_bc_run()
inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
This is a revert of a revert of a revert. In addition, it reverts the
even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the
original commits in linux-next.
It turns out that the original patch really was bogus, and that the
original revert was the correct thing to do after all. We thought we
had fixed the problem, and then reverted the revert, but the problem
really is fundamental: waking up kswapd simply isn't the right thing to
do, and direct reclaim sometimes simply _is_ the right thing to do.
When certain allocations fail, we simply should try some direct reclaim,
and if that fails, fail the allocation. That's the right thing to do
for THP allocations, which can easily fail, and the GPU allocations want
to do that too.
So starting kswapd is sometimes simply wrong, and removing the flag that
said "don't start kswapd" was a mistake. Let's hope we never revisit
this mistake again - and certainly not this many times ;)
Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Berg [Sun, 9 Dec 2012 23:41:06 +0000 (23:41 +0000)]
ipv4: ip_check_defrag must not modify skb before unsharing
ip_check_defrag() might be called from af_packet within the
RX path where shared SKBs are used, so it must not modify
the input SKB before it has unshared it for defragmentation.
Use skb_copy_bits() to get the IP header and only pull in
everything later.
The same is true for the other caller in macvlan as it is
called from dev->rx_handler which can also get a shared SKB.
Reported-by: Eric Leblond <eric@regit.org> Cc: stable@vger.kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to reinstate the __GFP_NO_KSWAPD flag that has been
removed, the removal reverted, and then removed again. Making this
commit a pointless fixup for a problem that was caused by the removal of
__GFP_NO_KSWAPD flag.
The thing is, we really don't want to wake up kswapd for THP allocations
(because they fail quite commonly under any kind of memory pressure,
including when there is tons of memory free), and these patches were
just trying to fix up the underlying bug: the original removal of
__GFP_NO_KSWAPD in commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD")
was simply bogus.
Neal Cardwell [Sun, 9 Dec 2012 11:09:54 +0000 (11:09 +0000)]
inet_diag: validate port comparison byte code to prevent unsafe reads
Add logic to verify that a port comparison byte code operation
actually has the second inet_diag_bc_op from which we read the port
for such operations.
Previously the code blindly referenced op[1] without first checking
whether a second inet_diag_bc_op struct could fit there. So a
malicious user could make the kernel read 4 bytes beyond the end of
the bytecode array by claiming to have a whole port comparison byte
code (2 inet_diag_bc_op structs) when in fact the bytecode was not
long enough to hold both.
Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Neal Cardwell [Sat, 8 Dec 2012 19:43:23 +0000 (19:43 +0000)]
inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()
Add logic to check the address family of the user-supplied conditional
and the address family of the connection entry. We now do not do
prefix matching of addresses from different address families (AF_INET
vs AF_INET6), except for the previously existing support for having an
IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
maintains as-is).
This change is needed for two reasons:
(1) The addresses are different lengths, so comparing a 128-bit IPv6
prefix match condition to a 32-bit IPv4 connection address can cause
us to unwittingly walk off the end of the IPv4 address and read
garbage or oops.
(2) The IPv4 and IPv6 address spaces are semantically distinct, so a
simple bit-wise comparison of the prefixes is not meaningful, and
would lead to bogus results (except for the IPv4-mapped IPv6 case,
which this commit maintains).
Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Neal Cardwell [Sat, 8 Dec 2012 19:43:22 +0000 (19:43 +0000)]
inet_diag: validate byte code to prevent oops in inet_diag_bc_run()
Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
operations.
Previously we did not validate the inet_diag_hostcond, address family,
address length, and prefix length. So a malicious user could make the
kernel read beyond the end of the bytecode array by claiming to have a
whole inet_diag_hostcond when the bytecode was not long enough to
contain a whole inet_diag_hostcond of the given address family. Or
they could make the kernel read up to about 27 bytes beyond the end of
a connection address by passing a prefix length that exceeded the
length of addresses of the given family.
Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Neal Cardwell [Sat, 8 Dec 2012 19:43:21 +0000 (19:43 +0000)]
inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
instantiated for IPv4 traffic and in the SYN-RECV state were actually
created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
means that for such connections inet6_rsk(req) returns a pointer to a
random spot in memory up to roughly 64KB beyond the end of the
request_sock.
With this bug, for a server using AF_INET6 TCP sockets and serving
IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
inet_diag_fill_req() causing an oops or the export to user space of 16
bytes of kernel memory as a garbage IPv6 address, depending on where
the garbage inet6_rsk(req) pointed.
Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Weiner [Thu, 6 Dec 2012 20:23:25 +0000 (15:23 -0500)]
mm: vmscan: fix inappropriate zone congestion clearing
commit c702418f8a2f ("mm: vmscan: do not keep kswapd looping forever due
to individual uncompactable zones") removed zone watermark checks from
the compaction code in kswapd but left in the zone congestion clearing,
which now happens unconditionally on higher order reclaim.
This messes up the reclaim throttling logic for zones with
dirty/writeback pages, where zones should only lose their congestion
status when their watermarks have been restored.
Remove the clearing from the zone compaction section entirely. The
preliminary zone check and the reclaim loop in kswapd will clear it if
the zone is considered balanced.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 8 Dec 2012 00:48:39 +0000 (16:48 -0800)]
vfs: fix O_DIRECT read past end of block device
The direct-IO write path already had the i_size checks in mm/filemap.c,
but it turns out the read path did not, and removing the block size
checks in fs/block_dev.c (commit bbec0270bdd8: "blkdev_max_block: make
private to fs/buffer.c") removed the magic "shrink IO to past the end of
the device" code there.
Fix it by truncating the IO to the size of the block device, like the
write path already does.
NOTE! I suspect the write path would be *much* better off doing it this
way in fs/block_dev.c, rather than hidden deep in mm/filemap.c. The
mm/filemap.c code is extremely hard to follow, and has various
conditionals on the target being a block device (ie the flag passed in
to 'generic_write_checks()', along with a conditional update of the
inode timestamp etc).
It is also quite possible that we should treat this whole block device
size as a "s_maxbytes" issue, and try to make the logic even more
generic. However, in the meantime this is the fairly minimal targeted
fix.
Noted by Milan Broz thanks to a regression test for the cryptsetup
reencrypt tool.
Reported-and-tested-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
"Two stragglers:
1) The new code that adds new flushing semantics to GRO can cause SKB
pointer list corruption, manage the lists differently to avoid the
OOPS. Fix from Eric Dumazet.
2) When TCP fast open does a retransmit of data in a SYN-ACK or
similar, we update retransmit state that we shouldn't triggering a
WARN_ON later. Fix from Yuchung Cheng."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net: gro: fix possible panic in skb_gro_receive()
tcp: bug fix Fast Open client retransmission
Eric Dumazet [Thu, 6 Dec 2012 13:54:59 +0000 (13:54 +0000)]
net: gro: fix possible panic in skb_gro_receive()
commit 2e71a6f8084e (net: gro: selective flush of packets) added
a bug for skbs using frag_list. This part of the GRO stack is rarely
used, as it needs skb not using a page fragment for their skb->head.
Most drivers do use a page fragment, but some of them use GFP_KERNEL
allocations for the initial fill of their RX ring buffer.
napi_gro_flush() overwrite skb->prev that was used for these skb to
point to the last skb in frag_list.
Fix this using a separate field in struct napi_gro_cb to point to the
last fragment.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Thu, 6 Dec 2012 08:45:32 +0000 (08:45 +0000)]
tcp: bug fix Fast Open client retransmission
If SYN-ACK partially acks SYN-data, the client retransmits the
remaining data by tcp_retransmit_skb(). This increments lost recovery
state variables like tp->retrans_out in Open state. If loss recovery
happens before the retransmission is acked, it triggers the WARN_ON
check in tcp_fastretrans_alert(). For example: the client sends
SYN-data, gets SYN-ACK acking only ISN, retransmits data, sends
another 4 data packets and get 3 dupacks.
Since the retransmission is not caused by network drop it should not
update the recovery state variables. Further the server may return a
smaller MSS than the cached MSS used for SYN-data, so the retranmission
needs a loop. Otherwise some data will not be retransmitted until timeout
or other loss recovery events.
Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mel Gorman [Wed, 5 Dec 2012 22:01:41 +0000 (14:01 -0800)]
tmpfs: fix shared mempolicy leak
This fixes a regression in 3.7-rc, which has since gone into stable.
Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount
imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the
refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went
on expecting alloc_page_vma() to drop the refcount it had acquired.
This deserves a rework: but for now fix the leak in shmem_alloc_page().
Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use
the same refcounting there as in shmem_alloc_page(), delete its onstack
mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() -
those were invented to let swapin_readahead() make an unknown number of
calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e,
alloc_pages_vma() has kept refcount in balance, so now no problem.
Johannes Weiner [Tue, 4 Dec 2012 16:11:31 +0000 (11:11 -0500)]
mm: vmscan: do not keep kswapd looping forever due to individual uncompactable zones
When a zone meets its high watermark and is compactable in case of
higher order allocations, it contributes to the percentage of the node's
memory that is considered balanced.
This requirement, that a node be only partially balanced, came about
when kswapd was desparately trying to balance tiny zones when all bigger
zones in the node had plenty of free memory. Arguably, the same should
apply to compaction: if a significant part of the node is balanced
enough to run compaction, do not get hung up on that tiny zone that
might never get in shape.
When the compaction logic in kswapd is reached, we know that at least
25% of the node's memory is balanced properly for compaction (see
zone_balanced and pgdat_balanced). Remove the individual zone checks
that restart the kswapd cycle.
Otherwise, we may observe more endless looping in kswapd where the
compaction code loops back to reclaim because of a single zone and
reclaim does nothing because the node is considered balanced overall.
Mel Gorman [Thu, 6 Dec 2012 19:01:14 +0000 (19:01 +0000)]
mm: compaction: validate pfn range passed to isolate_freepages_block
Commit 0bf380bc70ec ("mm: compaction: check pfn_valid when entering a
new MAX_ORDER_NR_PAGES block during isolation for migration") added a
check for pfn_valid() when isolating pages for migration as the scanner
does not necessarily start pageblock-aligned.
Since commit c89511ab2f8f ("mm: compaction: Restart compaction from near
where it left off"), the free scanner has the same problem. This patch
makes sure that the pfn range passed to isolate_freepages_block() is
within the same block so that pfn_valid() checks are unnecessary.
In answer to Henrik's wondering why others have not reported this:
reproducing this requires a large enough hole with the right aligment to
have compaction walk into a PFN range with no memmap. Size and
alignment depends in the memory model - 4M for FLATMEM and 128M for
SPARSEMEM on x86. It needs a "lucky" machine.
mmc: sh-mmcif: avoid oops on spurious interrupts (second try)
On some systems, e.g., kzm9g, MMCIF interfaces can produce spurious
interrupts without any active request. To prevent the Oops, that results
in such cases, don't dereference the mmc request pointer until we make
sure, that we are indeed processing such a request.