Annotate struct fs_struct's usage count to indicate the restrictions upon it.
It may not be incremented, except by clone(CLONE_FS), as this affects the
check in check_unsafe_exec() in fs/exec.c.
check_unsafe_exec() also notes whether the fs_struct is being
shared by more threads than will get killed by the exec, and if so
sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
But /proc/<pid>/cwd and /proc/<pid>/root lookups make transient
use of get_fs_struct(), which also raises that sharing count.
This might occasionally cause a setuid program not to change euid,
in the same way as happened with files->count (check_unsafe_exec
also looks at sighand->count, but /proc doesn't raise that one).
We'd prefer exec not to unshare fs_struct: so fix this in procfs,
replacing get_fs_struct() by get_fs_path(), which does path_get
while still holding task_lock, instead of raising fs->count.
Joe Malicki reports that setuid sometimes doesn't: very rarely,
a setuid root program does not get root euid; and, by the way,
they have a health check running lsof every few minutes.
Right, check_unsafe_exec() notes whether the files_struct is being
shared by more threads than will get killed by the exec, and if so
sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
But /proc/<pid>/fd and /proc/<pid>/fdinfo lookups make transient
use of get_files_struct(), which also raises that sharing count.
There's a rather simple fix for this: exec's check on files->count
has been redundant ever since 2.6.1 made it unshare_files() (except
while compat_do_execve() omitted to do so) - just remove that check.
[Note to -stable: this patch will not apply before 2.6.29: earlier
releases should just remove the files->count line from unsafe_exec().]
Reported-by: Joe Malicki <jmalicki@metacarta.com> Narrowed-down-by: Michael Itz <mitz@metacarta.com> Tested-by: Joe Malicki <jmalicki@metacarta.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2.6.26's commit fd8328be874f4190a811c58cd4778ec2c74d2c05
"sanitize handling of shared descriptor tables in failing execve()"
moved the unshare_files() from flush_old_exec() and several binfmts
to the head of do_execve(); but forgot to make the same change to
compat_do_execve(), leaving a CLONE_FILES files_struct shared across
exec from a 32-bit process on a 64-bit kernel.
It's arguable whether the files_struct really ought to be unshared
across exec; but 2.6.1 made that so to stop the loading binary's fd
leaking into other threads, and a 32-bit process on a 64-bit kernel
ought to behave in the same way as 32 on 32 and 64 on 64.
powerpc: Sanitize stack pointer in signal handling code
This has been backported to 2.6.29.x from commit efbda86098 in Linus' tree
On powerpc64 machines running 32-bit userspace, we can get garbage bits in the
stack pointer passed into the kernel. Most places handle this correctly, but
the signal handling code uses the passed value directly for allocating signal
stack frames.
This fixes the issue by introducing a get_clean_sp function that returns a
sanitized stack pointer. For 32-bit tasks on a 64-bit kernel, the stack
pointer is masked correctly. In all other cases, the stack pointer is simply
returned.
Additionally, we pass an 'is_32' parameter to get_sigframe now in order to
get the properly sanitized stack. The callers are know to be 32 or 64-bit
statically.
This patch (as1229-3) fixes a few lifetime and locking problems in the
usb-serial driver. The main symptom is that an invalid kevent is
created when the serial device is unplugged while a connection is
active.
Ports should be unregistered when device is disconnected,
not when the parent usb_serial structure is deallocated.
Each open file should hold a reference to the corresponding
port structure, and the reference should be released when
the file is closed.
serial->disc_mutex should be acquired in serial_open(), to
resolve the classic race between open and disconnect.
serial_close() doesn't need to hold both serial->disc_mutex
and port->mutex at the same time.
Release the subdriver's module reference only after releasing
all the other references, in case one of the release routines
needs to invoke some code in the subdriver module.
Replace a call to flush_scheduled_work() (which is prone to
deadlocks) with cancel_work_sync(). Also, add a call to
cancel_work_sync() in the disconnect routine.
Reduce the scope of serial->disc_mutex in serial_disconnect().
The only place it really needs to protect is where the
"disconnected" flag is set.
Call the shutdown method from within serial_disconnect()
instead of destroy_serial(), because some subdrivers expect
the port data structures still to be in existence when
their shutdown method runs.
This fixes the bug reported in
http://bugs.freedesktop.org/show_bug.cgi?id=20703
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
While building the kernel, we end-up calling modpost with -K and -M
options for the same file (Modules.markers). This is resulting in
modpost's main function calling read_markers() and then write_markers() on
the same file.
We then have read_markers() mmap'ing the file, and writer_markers()
opening that same file for writing.
The issue is that read_markers() exits without munmap'ing the file and is
as a matter holding a reference on Modules.markers. When write_markers()
is opening that very same file for writing, we still have a reference on
it and cygwin (Windows?) is then making fopen() fail with EPERM.
Calling release_file() before exiting read_markers() clears that reference
(and memory leak) and fopen() then succeeds.
Tested on both cygwin (1.3.22) and Linux. Also ran modpost within
valgrind on Linux to make sure that the munmap'ed file was not accessed
after read_markers()
Signed-off-by: Cedric Hombourger <chombourger@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The intention of commit aae8679b0ebcaa92f99c1c3cb0cd651594a43915
("pagemap: fix bug in add_to_pagemap, require aligned-length reads of
/proc/pid/pagemap") was to force reads of /proc/pid/pagemap to be a
multiple of 8 bytes, but now it allows to read 0 bytes, which actually
puts some data to user's buffer. According to POSIX, if count is zero,
read() should return zero and has no other results.
Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Cc: Thomas Tuttle <ttuttle@google.com> Acked-by: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch corrects a pretty big oversight in the KMS code for 965+
chips. The current code is missing tiled surface register programming,
so userland can allocate a tiled surface and use it for mode setting,
resulting in corruption. This patch fixes that, allowing for tiled
front buffers on 965+.
st driver uses blk_rq_map_user() in order to just build a request out
of page frames. In this case, map_data->offset is a non zero value and
iov[0].iov_base is NULL. We need to increase nr_pages for that.
Without this patch, Broadcom BCM5906 Ethernet controllers set up via MSI
cause the machine to hang. Tejun agreed that the best is to blacklist
the whole chipset and after adding it, seeing the other VIA quirks
disabling MSI, this very much looks like the right way.
In 64bit signal delivery path, clear_used_math() was happening before saving
the current active FPU state on to the user stack for signal handling. Between
clear_used_math() and the state store on to the user stack, potentially we
can get a page fault for the user address and can block. Infact, while testing
we were hitting the might_fault() in __clear_user() which can do a schedule().
At a later point in time, we will schedule back into this process and
resume the save state (using "xsave/fxsave" instruction) which can lead
to DNA fault. And as used_math was cleared before, we will reinit the FP state
in the DNA fault and continue. This reinit will result in loosing the
FPU state of the process.
Move clear_used_math() to a point after the FPU state has been stored
onto the user stack.
This issue is present from a long time (even before the xsave changes
and the x86 merge). But it can easily be exposed in 2.6.28.x and 2.6.29.x
series because of the __clear_user() in this path, which has an explicit
__cond_resched() leading to a context switch with CONFIG_PREEMPT_VOLUNTARY.
[ Impact: fix FPU state corruption ]
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This had been delayed for some time due to failure to work on the one piece
of G41 hardware we had, and lack of success reports from anybody else.
Current hardware appears to be OK.
Signed-off-by: Zhenyu Wang <zhenyu.z.wang@intel.com>
[anholt: hand-applied due to conflicts with IGD patches] Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
unreached code in selinux_ip_postroute_iptables_compat() (CVE-2009-1184)
Not upstream in 2.6.30, as the function was removed there, making this a
non-issue.
Node and port send checks can skip in the compat_net=1 case. This bug
was introduced in commit effad8d.
Signed-off-by: Eugene Teo <eugeneteo@kernel.sg> Reported-by: Dan Carpenter <error27@gmail.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Paul Moore <paul.moore@hp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
- keep dma functions away from chained scatterlists.
Use the existing scatterlist iteration inside the driver
to call dma_map_single() for each chunk and avoid dma_map_sg().
Signed-off-by: Christian Hohnstaedt <chohnstaedt@innominate.com> Tested-By: Karl Hiramoto <karl@hiramoto.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
/proc/diskstats used to show stats for all disks whether they're
zero-sized or not and their non-zero partitions. Commit 074a7aca7afa6f230104e8e65eba3420263714a5 accidentally changed the
behavior such that it doesn't print out zero sized disks. This patch
implements DISK_PITER_INCL_EMPTY_PART0 flag to partition iterator and
uses it in diskstats_show() such that empty part0 is shown in
/proc/diskstats.
We must not use the device DMA addresses for the kernel DMA API, because
device DMA addresses have an additional offset added for the SSB translation.
Use the original dma_addr_t for the sync operation.
Cc: stable@kernel.org Signed-off-by: Michael Buesch <mb@bu3sch.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The virtio-rng drivers checks for spurious callbacks. Since
callbacks can be implemented via shared interrupts (e.g. PCI) this
could lead to guest kernel oopses with lots of virtio devices.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Reported by Alessio Treglia on
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/125250
User was getting the following errors in dmesg:
[ 2158.139386] sd 5:0:0:1: ioctl_internal_command return code = 8000002
[ 2158.139390] : Current: sense key: No Sense
[ 2158.139393] Additional sense: No additional sense information
Adds unusual device support.
modified: drivers/usb/storage/unusual_devs.h
Signed-off-by: Chuck Short <zulcss@ubuntu.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
When checking for overlapping slots on registration of a new one, kvm
currently also considers zero-length (ie. deleted) slots and rejects
requests incorrectly. This finally denies user space from joining slots.
Fix the check by skipping deleted slots and advertise this via a
KVM_CAP_JOIN_MEMORY_REGIONS_WORKS.
Cc: stable@kernel.org Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
The large page initialization code concludes there are two large pages spanned
by a slot covering 1 (small) page starting at gfn 1. This is incorrect, and
also results in incorrect write_count initialization in some cases (base = 1,
npages = 513 for example).
Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
"mac80211: fix basic rates setting from association response"
introduced a copy/paste error.
Unfortunately, this not just leads to wrong data being passed
to the driver but is remotely exploitable for some hardware or
driver combinations.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Without patch unplugging of an US122L soundcard didn't reset the
corresponding element of snd_us122l_card_used[] to 0.
The (SNDRV_CARDS + 1)th plugging in did not result in creating the soundcard
device anymore.
Index values supplied with the modprobe command line were not used correctly
anymore after the first unplugging of an US122L.
Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de> Cc: stable@kernel.org Signed-off-by: Takashi Iwai <tiwai@suse.de>
[chrisw: backport to 2.6.29] Signed-off-by: Chris Wright <chrisw@sous-sol.org>
The set_blink hook code in the LED subdriver would never manage to get
a LED to blink, and instead it would just turn it on. The consequence
of this is that the "timer" trigger would not cause the LED to blink
if given default parameters.
This problem exists since 2.6.26-rc1.
To fix it, switch the deferred LED work handling to use the
thinkpad-acpi-specific LED status (off/on/blink) directly.
This also makes the code easier to read, and to extend later.
Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: stable@kernel.org Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
The RX buffer poison needs to be refreshed, if we recycle an RX buffer,
because it might be (partially) overwritten by some DMA operations.
Cc: stable@kernel.org Cc: Francesco Gringoli <francesco.gringoli@ing.unibs.it> Signed-off-by: Michael Buesch <mb@bu3sch.de> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
This patch adds poisoning and sanity checking to the RX DMA buffers.
This is used for protection against buggy hardware/firmware that raises
RX interrupts without doing an actual DMA transfer.
This mechanism protects against rare "bad packets" (due to uninitialized skb data)
and rare kernel crashes due to uninitialized RX headers.
The poison is selected to not match on valid frames and to be cheap for checking.
The poison check mechanism _might_ trigger incorrectly, if we are voluntarily
receiving frames with bad PLCP headers. However, this is nonfatal, because the
chance of such a match is basically zero and in case it happens it just results
in dropping the packet.
Bad-PLCP RX defaults to off, and you should leave it off unless you want to listen
to the latest news broadcasted by your microwave oven.
This patch also moves the initialization of the RX-header "length" field in front of
the mapping of the DMA buffer. The CPU should not touch the buffer after we mapped it.
Cc: stable@kernel.org Reported-by: Francesco Gringoli <francesco.gringoli@ing.unibs.it> Signed-off-by: Michael Buesch <mb@bu3sch.de> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Currently rx status for frames which are completed from reorder buffer
is taken from it's cb area which is not always right, cb is not holding
the rx status when driver uses mac80211's non-irq rx handler to pass it's
received frames. This results in dropping almost all frames from reorder
buffer when security is enabled by doing double decryption (first in hw,
second in sw because of wrong rx status). This patch copies rx status into
cb area before the frame is put into reorder buffer. After this patch,
there is a significant improvement in throughput with ath9k + WPA2(AES).
Signed-off-by: Vasanthakumar Thiagarajan <vasanth@atheros.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Cc: stable@kernel.org Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Reset phy state on resume, fixing a regression caused by powering down
the phy on hibernate.
Signed-off-by: Ed Swierk <eswierk@aristanetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Tvrtko Ursulin <tvrtko.ursulin@sophos.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Fix a zero address hole bug in the bonding arp_ip_target list
that was causing the bond to ignore ARP replies (bugz 13006).
Instead of just setting the array entry to zero, we now
copy any additional entries down one slot, putting the
zero entry at the end. With this change we can now have
all the loops that walk the array stop when they hit a zero
since there will be no addresses after it.
Changes are based in part on code fragment provided in kernel:
bugzilla 13006:
http://bugzilla.kernel.org/show_bug.cgi?id=13006
by Steve Howard <steve@astutenetworks.com>
Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
The BUG_ON(skge->tx_ring.to_use != skge->tx_ring.to_clean) in skge_up()
was sometimes observed when setting MTU.
skge_down() disables the TX queue, but then reenables it by mistake via
skge_tx_clean().
Fix it by moving the waking of the queue from skge_tx_clean() to the
other caller. And to make sure start_xmit is not in progress on another
CPU, skge_down() should call netif_tx_disable().
The bug was reported to me by Jiri Jilek whose Debian system sometimes
failed to boot. He tested the patch and the bug did not happen anymore.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Noticing that every caller of mpt_config has its CONFIGPARMS struct
declared on the stack and thus the &pCfg->timer is always on the stack I
changed init_timer() to init_timer_on_stack() and it seems to have shut
up.....
Cc: "Moore, Eric Dean" <Eric.Moore@lsil.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: "Desai, Kashyap" <Kashyap.Desai@lsi.com> Cc: <stable@kernel.org> [2.6.29.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
# mount -o size=MM -t hugetlbfs none /huge
hugetlbfs: Bad value 'MM' for mount option 'size=MM'
------------[ cut here ]------------
kernel BUG at fs/super.c:996!
Due to
BUG_ON(!mnt->mnt_sb);
in vfs_kern_mount().
Also, remove unused #include <linux/quotaops.h>
Cc: William Irwin <wli@holomorphy.com> Cc: <stable@kernel.org> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Commit ae46141ff08f1965b17c531b571953c39ce8b9e2 (NFSv3: Fix posix ACL code)
introduces a bug in the calculation of the XDR header iovec. In the case
where we are inlining the acls, we need to adjust the length of the iovec
req->rq_svec, in addition to adjusting the total buffer length.
When GRO/frag_list support was added to GSO, I made an error
which broke the support for segmenting linear GSO packets (GSO
packets are normally non-linear in the payload).
These days most of these packets are constructed by the tun
driver, which prefers to allocate linear memory if possible.
This is fixed in the latest kernel, but for 2.6.29 and earlier
it is still the norm.
Therefore this bug causes failures with GSO when used with tun
in 2.6.29.
Reported-by: James Huang <jamesclhuang@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
AGP pages might be mapped into userspace finally, so the pages should be
set to zero before userspace can use it. Otherwise there is potential
information leakage.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Break out of wait_event_interruptible() if freezing has been requested,
in the vballoon thread. Without this change vballoon refuses to stop and
the system can't suspend.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Ingo Brueckl was assuming that reverting to 1:1 mapping for chars >= 128
was not useful, but it happens to be: due to the limitations of the
Linux console, when a blind user wants to read BIG5 on it, he has no
other way than loading a font without SFM and let the 1:1 mapping permit
the screen reader to get the BIG5 encoding.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
The commit 6902c0bead4ce266226fc0c5b3828b850bdc884a that moved
driver registration out of kgameportd thread was incomplete and
did not add the code necessary to actually attach driver to
already registered devices, rectify that.
Signed-off-by: Dmitry Torokhov <dtor@mail.ru> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Remove page level granularity track and untrack of vm_insert_pfn.
memtype tracking at page granularity does not scale and cleaner
approach would be for the driver to request a type for a bigger
IO address range or PCI io memory range for that device, either at
mmap time or driver init time and just use that type during
vm_insert_pfn.
This patch just removes the track/untrack of vm_insert_pfn. That
means we will be in same state as 2.6.28, with respect to these APIs.
Newer APIs for the drivers to request a memtype for a bigger region
is coming soon.
is_long_mode currently checks the LongModeEnable bit in
EFER instead of the LongModeActive bit. This is wrong, but
we survived this till now since it wasn't triggered. This
breaks guests that go from long mode to compatibility mode.
This is noticed on a solaris guest and fixes bug #1842160
Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
setup_msrs() should be called when entering long mode to save the
shadow state for the 64-bit guest state.
Using vmx_set_efer() in enter_lmode() removes some duplicated code
and also ensures we call setup_msrs(). We can safely pass the value
of shadow_efer to vmx_set_efer() as no other bits in the efer change
while enabling long mode (guest first sets EFER.LME, then sets CR0.PG
which causes a vmexit where we activate long mode).
only need to set assigned_dev_id for deassignment, use
match->flags to judge and deassign it.
Acked-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Weidong Han <weidong.han@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
The function kvm_is_mmio_pfn is called before put_page is called on a
page by KVM. This is a problem when when this function is called on some
struct page which is part of a compund page. It does not test the
reserved flag of the compound page but of the struct page within the
compount page. This is a problem when KVM works with hugepages allocated
at boot time. These pages have the reserved bit set in all tail pages.
Only the flag in the compount head is cleared. KVM would not put such a
page which results in a memory leak.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Allow clients to request notifications when the guest masks or unmasks a
particular irq line. This complements irq ack notifications, as the guest
will not ack an irq line that is masked.
Currently implemented for the ioapic only.
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
When kvm emulates an invlpg instruction, it can drop a shadow pte, but
leaves the guest tlbs intact. This can cause memory corruption when
swapping out.
Without this the other cpu can still write to a freed host physical page.
tlb smp flush must happen if rmap_remove is called always before mmu_lock
is released because the VM will take the mmu_lock before it can finally add
the page to the freelist after swapout. mmu notifier makes it safe to flush
the tlb after freeing the page (otherwise it would never be safe) so we can do
a single flush for multiple sptes invalidated.
Cc: stable@kernel.org Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
[mtosatti: backport to 2.6.29] Signed-off-by: Chris Wright <chrisw@sous-sol.org>
The g_ether USB gadget driver currently decides whether or not there's a
link to report back for eth_get_link based on if the USB link speed is
set. The USB gadget speed is however often set even before the device is
enumerated. It seems more sensible to only report a "link" if we're
actually connected to a host that wants to talk to us. The patch below
does this for me - tested with the PXA27x UDC driver.
Signed-off-by: Jonathan McDowell <noodles@earth.li> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Oleg Nesterov found a couple of races in the ptrace-bts code
and fixes are queued up for it but they did not get ready in time
for the merge window. We'll merge them in v2.6.31 - until then
mark the feature as CONFIG_BROKEN. There's no user-space yet
making use of this so it's not a big issue.
sg_rq_end_io() is called via rq->end_io. In some rare cases,
sg_rq_end_io calls blk_put_request/blk_rq_unmap_user (when a program
issuing a command has gone before the command completion; e.g. by
interrupting a program issuing a command before the command
completes).
The problem is that scsi_error_handler() calls rq->end_io too. We
can't call blk_put_request/blk_rq_unmap_user too in this path (we hold
q->queue_lock).
To avoid the above problem, in these rare cases, this patch always
uses schedule_work() instead of execute_in_process_context().
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: Douglas Gilbert <dgilbert@interlog.com> Cc: Stable Tree <stable@kernel.org> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
You can reproduce this bug by interrupting a program before a sg
response completes. This leads to the special sg state (the orphan
state), then sg calls blk_put_request in interrupt (rq->end_io).
The above bug report shows the recursive lock problem because sg calls
blk_put_request in interrupt. We could call __blk_put_request here
instead however we also need to handle blk_rq_unmap_user here, which
can't be called in interrupt too.
In the orphan state, we don't need to care about the data transfer
(the program revoked the command) so adding 'just free the resource'
mode to blk_rq_unmap_user is a possible option.
I prefer to avoid complicating the blk mapping API when possible. I
change the orphan state to call sg_finish_rem_req via
execute_in_process_context. We hold sg_fd->kref so sg_fd doesn't go
away until keventd_wq finishes our work. copy_from_user/to_user fails
so blk_rq_unmap_user just frees the resource without the data
transfer.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
sg_io_owned needs to be set before the command is sent to the midlevel;
otherwise, a quickly-completing command may cause a different CPU
to see "srp->done == 1 && !srp->sg_io_owned", which would lead to
incorrect behavior.
Check srp->done and set srp->orphan while holding rq_list_lock to
prevent races with sg_rq_end_io().
There is no need to check sfp->closed from read/write/ioctl/poll/etc.
since the kernel guarantees that this won't happen.
The usefulness of sg_srp_done() was questionable before; now it is
definitely not needed.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Acked-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
sg has the following problems related to device removal:
* opening a sg fd races with removing a device
* closing a sg fd races with removing a device
* /proc/scsi/sg/* access races with removing a device
* command completion races with removing a device
* command completion races with closing a sg fd
* can rmmod sg with active commands
These problems can cause kernel oopses, memory-use-after-free, or
double-free errors. This patch fixes these problems by using krefs
to manage the lifetime of sg_device and sg_fd.
Each command submitted to the midlevel holds a reference to sg_fd
until the completion callback. This ensures that sg_fd doesn't go
away if the fd is closed with commands still outstanding.
sg_fd gets the reference of sg_device (with scsi_device) and also
makes sure that the sg module doesn't go away.
/proc/scsi/sg/* functions don't play nicely with krefs because they
give information about sg_fds which have been closed but not yet
freed due to still having outstanding commands and sg_devices which
have been removed but not yet freed due to still being referenced
by one or more sg_fds. To deal with this safely without removing
functionality, /proc functions now access sg_device and sg_fd while
holding a lock instead of using kref_get()/kref_put().
Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Acked-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
[chrisw: big for -stable, helps fix real bug, and made it through rc2 upstream] Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Tetsuo Handa reports seeing the WARN_ON(current->mm == NULL) in
security_vm_enough_memory(), when do_execve() is touching the
target mm's stack, to set up its args and environment.
Yes, a UMH_NO_WAIT or UMH_WAIT_PROC call_usermodehelper() spawns
an mm-less kernel thread to do the exec. And in any case, that
vm_enough_memory check when growing stack ought to be done on the
target mm, not on the execer's mm (though apart from the warning,
it only makes a slight tweak to OVERCOMMIT_NEVER behaviour).
The libata driver has copied the code from the IDE driver which caused a post
2.4.18 regression on many HPT370[A] chips -- DMA stopped to work completely,
only causing timeouts. Now remove hpt370_bmdma_start() for good...
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
The big driver change in 2.4.19-rc1 introduced a regression for many HPT370[A]
chips -- DMA stopped to work completely, only causing endless timeouts...
The culprit has been identified (at last!): it turned to be the code resetting
the DMA state machine before each transfer. Stop doing it now as this counter-
measure has clearly caused more harm than good.
This should fix the kernel.org bug #7703.
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Richard Henderson pointed out that the powerpc __futex_atomic_op has a
bug: it will write the wrong value if the stwcx. fails and it has to
retry the lwarx/stwcx. loop, since 'oparg' will have been overwritten
by the result from the first time around the loop. This happens
because it uses the same register for 'oparg' (an input) as it uses
for the result.
This fixes it by using separate registers for 'oparg' and 'ret'.
Cc: stable@kernel.org Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Fix the key value generation for get/set amp verbs. The upper bits of
the parameter have to be combined with the verb value to be unique for
each direction/index of amp access.
This fixes the resume problem on some hardwares like Macbook after
the channel mode is changed.
Tested-by: Johannes Berg <johannes@sipsolutions.net> Cc: <stable@kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
sfc could call netif_napi_add() multiple times for the same
napi_struct, corrupting the list of napi_structs for the associated
device and leading to a busy-loop on device removal. Move the call to
netif_napi_add() and add a call to netif_napi_del() in the obvious
places.
[bhutchings: backport to 2.6.29] Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
If the ti-usb adapter returns an zero data length frame (which happens)
then we leak a kref. Found by Christoph Mair <christoph.mair@gmail.com>
who proposed a patch. The patch here is different as Christoph's patch
didn't work for the case where tty = NULL and data arrived but Christoph
did all the hard work chasing it down.
Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
The "simplify spi_write_then_read()" patch included two regressions from
the 2.6.27 behaviors:
- The data it wrote out during the (full duplex) read side
of the transfer was not zeroed.
- It fails completely on half duplex hardware, such as
Microwire and most "3-wire" SPI variants.
So, revert that patch. A revised version should be submitted at some
point, which can get the speedup on standard hardware (full duplex)
without breaking on less-capable half-duplex stuff.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Cc: <stable@kernel.org> [2.6.28.x, 2.6.29.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
When POSIX capabilities were introduced during the 2.1 Linux
cycle, the fs mask, which represents the capabilities which having
fsuid==0 is supposed to grant, did not include CAP_MKNOD and
CAP_LINUX_IMMUTABLE. However, before capabilities the privilege
to call these did in fact depend upon fsuid==0.
This patch introduces those capabilities into the fsmask,
restoring the old behavior.
See the thread starting at http://lkml.org/lkml/2009/3/11/157 for
reference.
Note that if this fix is deemed valid, then earlier kernel versions (2.4
and 2.2) ought to be fixed too.
Changelog:
[Mar 23] Actually delete old CAP_FS_SET definition...
[Mar 20] Updated against J. Bruce Fields's patch
Reported-by: Igor Zhbanov <izh1979@gmail.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: stable@kernel.org Cc: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
It appears I inadvertly introduced rq->lock recursion to the
hrtimer_start() path when I delegated running already expired
timers to softirq context.
This patch fixes it by introducing a __hrtimer_start_range_ns()
method that will not use raise_softirq_irqoff() but
__raise_softirq_irqoff() which avoids the wakeup.
It then also changes schedule() to check for pending softirqs and
do the wakeup then, I'm not quite sure I like this last bit, nor
am I convinced its really needed.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus@samba.org
LKML-Reference: <20090313112301.096138802@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Mikael Pettersson <mikpe@it.uu.se> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
During irq migration, we send a low priority interrupt to the previous
irq destination. This happens in non interrupt-remapping case after interrupt
starts arriving at new destination and in interrupt-remapping case after
modifying and flushing the interrupt-remapping table entry caches.
This low priority irq cleanup handler can cleanup multiple vectors, as
multiple irq's can be migrated at almost the same time. While
there will be multiple invocations of irq cleanup handler (one cleanup
IPI for each irq migration), first invocation of the cleanup handler
can potentially cleanup more than one vector (as the first invocation can
see the requests for more than vector cleanup). When we cleanup multiple
vectors during the first invocation of the smp_irq_move_cleanup_interrupt(),
other vectors that are to be cleanedup can still be pending in the local
cpu's IRR (as smp_irq_move_cleanup_interrupt() runs with interrupts disabled).
When we are ready to unhook a vector corresponding to an irq, check if that
vector is registered in the local cpu's IRR. If so skip that cleanup and
do a self IPI with the cleanup vector, so that we give a chance to
service the pending vector interrupt and then cleanup that vector
allocation once we execute the lowest priority handler.
This fixes spurious interrupts seen when migrating multiple vectors
at the same time.
[ This is apparently possible even on conventional xapic, although to
the best of our knowledge it has never been seen. The stable
maintainers may wish to consider this one for -stable. ]
[suresh: backport to 2.6.29] Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: stable@kernel.org Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Freezing tasks via the cgroup freezer causes the load average to climb
because the freezer's current implementation puts frozen tasks in
uninterruptible sleep (D state).
Some applications which perform job-scheduling functions consult the
load average when making decisions. If a cgroup is frozen, the load
average does not provide a useful measure of the system's utilization
to such applications. This is especially inconvenient if the job
scheduler employs the cgroup freezer as a mechanism for preempting low
priority jobs. Contrast this with using SIGSTOP for the same purpose:
the stopped tasks do not count toward system load.
Change task_contributes_to_load() to return false if the task is
frozen. This results in /proc/loadavg behavior that better meets
users' expectations.
If the thread calling dm_kcopyd_copy is delayed due to scheduling inside
split_job/segment_complete and the subjobs complete before the loop in
split_job completes, the kcopyd callback could be invoked from the
thread that called dm_kcopyd_copy instead of the kcopyd workqueue.
Snapshots depend on the fact that callbacks are called from the singlethreaded
kcopyd workqueue and expect that there is no racing between individual
callbacks. The racing between callbacks can lead to corruption of exception
store and it can also mean that exception store callbacks are called twice
for the same exception - a likely reason for crashes reported inside
pending_complete() / remove_exception().
This patch fixes two problems:
1. job->fn being called from the thread that submitted the job (see above).
- Fix: hand over the completion callback to the kcopyd thread.
2. job->fn(read_err, write_err, job->context); in segment_complete
reports the error of the last subjob, not the union of all errors.
- Fix: pass job->write_err to the callback to report all error bits
(it is done already in run_complete_job)
Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Use a variable in segment_complete() to point to the dm_kcopyd_client
struct and only release job->pages in run_complete_job() if any are
defined. These changes are needed by the next patch.
Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
update_rlimit_cpu() tries to optimize out set_process_cpu_timer() in case
when we already have CPUCLOCK_PROF timer which should expire first. But it
uses cputime_lt() instead of cputime_gt().
Test case:
int main(void)
{
struct itimerval it = {
.it_value = { .tv_sec = 1000 },
};
See http://bugzilla.kernel.org/show_bug.cgi?id=12911
copy_signal() copies signal->rlim, but RLIMIT_CPU is "lost". Because
posix_cpu_timers_init_group() sets cputime_expires.prof_exp = 0 and thus
fastpath_timer_check() returns false unless we have other expired cpu timers.
Change copy_signal() to set cputime_expires.prof_exp if we have RLIMIT_CPU.
Also, set cputimer.running = 1 in that case. This is not strictly necessary,
but imho makes sense.
Reported-by: Peter Lojkin <ia6432@inbox.ru> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Peter Lojkin <ia6432@inbox.ru> Cc: Roland McGrath <roland@redhat.com> Cc: stable@kernel.org
LKML-Reference: <20090327000607.GA10104@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
This patch re-introduces a couple of functions, task_sched_runtime
and thread_group_sched_runtime, which was once removed at the
time of 2.6.28-rc1.
These functions protect the sampling of thread/process clock with
rq lock. This rq lock is required not to update rq->clock during
the sampling.
i.e.
The clock_gettime() may return
((accounted runtime before update) + (delta after update))
that is less than what it should be.
v2 -> v3:
- Rename static helper function __task_delta_exec()
to do_task_delta_exec() since -tip tree already has
a __task_delta_exec() of different version.
v1 -> v2:
- Revises comments of function and patch description.
- Add note about accuracy of thread group's runtime.
One-liner: capsh --print is broken without this patch.
In certain cases, cap_prctl returns error > 0 for success. However,
the 'no_change' label was always setting error to 0. As a result,
for example, 'prctl(CAP_BSET_READ, N)' would always return 0.
It should return 1 if a process has N in its bounding set (as
by default it does).
I'm keeping the no_change label even though it's now functionally
the same as 'error'.
Signed-off-by: Serge Hallyn <serue@us.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Le lundi 30 mars 2009, Chris Wright a écrit :
> q->queue could be ERR_PTR(-ENOMEM) which will break unwinding
> on error. Make iscsi_pool_free more defensive.
>
Making the freeing of q->queue dependent on q->pool being set looks
really weird (although it is correct at the moment. But this seems
to be fixable in a much simpler way.
With the benefit that only the error case is slowed down. In both
cases we have a problem if q->queue contains an error value but it's
not -ENOMEM. Apparently this can't happen today, but it doesn't feel
right to assume this will always be true. Maybe it's the right time
to fix this as well.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
[chrisw: this is a fixlet to f474a37b, also in -stable] Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Memory freeing in iscsi_pool_free() looks wrong to me. Either q->pool
can be NULL and this should be tested before dereferencing it, or it
can't be NULL and it shouldn't be tested at all. As far as I can see,
the only case where q->pool is NULL is on early error in
iscsi_pool_init(). One possible way to fix the bug is thus to not
call iscsi_pool_free() in this case (nothing needs to be freed anyway)
and then we can get rid of the q->pool check.
Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
David Miller [Wed, 8 Apr 2009 09:51:47 +0000 (02:51 -0700)]
sparc64: Fix bug in ("sparc64: Flush TLB before releasing pages.")
[ No upstream commit, this regression was added only to 2.6.29.1 ]
Unfortunately I merged an earlier version of commit b6816b706138c3870f03115071872cad824f90b4 ("sparc64: Flush TLB before
releasing pages.") than what I actually tested and merged upstream.
Simply diffing asm/tlb_64.h in Linus's tree vs. what ended up in
2.6.29.1 confirms this.
Sync things up to fix BUG() triggers some users are seeing.
Reported-by: Dennis Gilmore <dennis@ausil.us> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
There's a possible deadlock in generic_file_splice_write(),
splice_from_pipe() and ocfs2_file_splice_write():
- task A calls generic_file_splice_write()
- this calls inode_double_lock(), which locks i_mutex on both
pipe->inode and target inode
- ordering depends on inode pointers, can happen that pipe->inode is
locked first
- __splice_from_pipe() needs more data, calls pipe_wait()
- this releases lock on pipe->inode, goes to interruptible sleep
- task B calls generic_file_splice_write(), similarly to the first
- this locks pipe->inode, then tries to lock inode, but that is
already held by task A
- task A is interrupted, it tries to lock pipe->inode, but fails, as
it is already held by task B
- ABBA deadlock
Fix this by explicitly ordering locks: the outer lock must be on
target inode and the inner lock (which is later unlocked and relocked)
must be on pipe->inode. This is OK, pipe inodes and target inodes
form two nonoverlapping sets, generic_file_splice_write() and friends
are not called with a target which is a pipe.
Commit e1b4b9f ([NETFILTER]: {ip,ip6,arp}_tables: fix exponential worst-case
search for loops) introduced a regression in the loop detection algorithm,
causing sporadic incorrectly detected loops.
When a chain has already been visited during the check, it is treated as
having a standard target containing a RETURN verdict directly at the
beginning in order to not check it again. The real target of the first
rule is then incorrectly treated as STANDARD target and checked not to
contain invalid verdicts.
Fix by making sure the rule does actually contain a standard target.
Based on patch by Francis Dupont <Francis_Dupont@isc.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
kthreadd/2 is trying to release lock (&rp->lock) at:
[<c06b3080>] pre_handler_kretprobe+0xea/0xf4
but there are no more locks to release!
other info that might help us debug this:
1 lock held by kthreadd/2:
#0: (rcu_read_lock){..--}, at: [<c06b2b24>] __atomic_notifier_call_chain+0x0/0x5a
The Aspire One's ACPI-WMI interface is a placeholder that does nothing,
and the invalid results that we get from it are now causing userspace
problems as acer-wmi always returns that the rfkill is enabled (i.e. the
radio is off, when it isn't). As it's hardware controlled, acer-wmi
isn't needed on the Aspire One either.
Thanks to Andy Whitcroft at Canonical for tracking down Ubuntu's userspace
issues to this.
Signed-off-by: Carlos Corbacho <carlos@strangeworlds.co.uk> Reported-by: Andy Whitcroft <apw@canonical.com> Cc: stable@kernel.org Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
When the total length is shorter than the calculated number of unaligned bytes, the call to shash->update breaks. For example, calling crc32c on unaligned buffer with length of 1 can result in a system crash.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Patch "af_rose/x25: Sanity check the maximum user frame size"
(commit 83e0bbcbe2145f160fbaa109b0439dae7f4a38a9) from Alan Cox got
locking wrong. If we bail out due to user frame size being too large,
we must unlock the socket beforehand.
Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Otherwise we can wrap the sizes and end up sending garbage.
Closes #10423
Signed-off-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
upgrade_mode() sets bdev to NULL temporarily, and does not have any
locking to exclude anything from seeing that NULL.
In dm_table_any_congested() bdev_get_queue() can dereference that NULL and
cause a reported oops.
Fix this by not changing that field during the mode upgrade.
Cc: stable@kernel.org Cc: Neil Brown <neilb@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Refcounting with non-atomic ops under shared lock will corrupt the counter
in multi-processor system and may trigger BUG_ON().
Use module refcount.
# same approach as dm-target-use-module-refcount-directly.patch here
# https://www.redhat.com/archives/dm-devel/2008-December/msg00075.html
The tt_internal's 'use' field is superfluous: the module's refcount can do
the work properly. An acceptable side-effect is that this increases the
reference counts reported by 'lsmod'.
Remove the superfluous test when removing a target module.
[Crash possible without this on SMP - agk]
Cc: stable@kernel.org Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
We need to check if the exception was completed after dropping the lock.
After regaining the lock, __find_pending_exception checks if the exception
was already placed into &s->pending hash.
But we don't check if the exception was already completed and placed into
&s->complete hash. If the process waiting in alloc_pending_exception was
delayed at this point because of a scheduling latency and the exception
was meanwhile completed, we'd miss that and allocate another pending
exception for already completed chunk.
It would lead to a situation where two records for the same chunk exist
and potential data corruption because multiple snapshot I/Os to the
affected chunk could be redirected to different locations in the
snapshot.
Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>