git.karo-electronics.de Git - karo-tx-linux.git/log

netfilter: nf_conntrack: fix confirmation race condition

commit 5c8ec910e789a92229978d8fd1fce7b62e8ac711 upstream.

New connection tracking entries are inserted into the hash before they
are fully set up, namely the CONFIRMED bit is not set and the timer not
started yet. This can theoretically lead to a race with timer, which
would set the timeout value to a relative value, most likely already in
the past.

Perform hash insertion as the final step to fix this.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

netfilter: nf_log: fix sleeping function called from invalid context

commit 266d07cb1c9a0c345d7d3aea889f92062894059e upstream.

Fix regression introduced by 17625274 "netfilter: sysctl support of
logger choice":

BUG: sleeping function called from invalid context at /mnt/s390test/linux-2.6-tip/arch/s390/include/asm/uaccess.h:234
in_atomic(): 1, irqs_disabled(): 0, pid: 3245, name: sysctl
CPU: 1 Not tainted 2.6.30-rc8-tipjun10-02053-g39ae214 #1
Process sysctl (pid: 3245, task: 000000007f675da0, ksp: 000000007eb17cf0)
0000000000000000 000000007eb17be8 0000000000000002 0000000000000000
       000000007eb17c88 000000007eb17c00 000000007eb17c00 0000000000048156
       00000000003e2de8 000000007f676118 000000007eb17f10 0000000000000000
       0000000000000000 000000007eb17be8 000000000000000d 000000007eb17c58
       00000000003e2050 000000000001635c 000000007eb17be8 000000007eb17c30
Call Trace:
(Ý<00000000000162e6>¨ show_trace+0x13a/0x148)
Ý<00000000000349ea>¨ __might_sleep+0x13a/0x164
Ý<0000000000050300>¨ proc_dostring+0x134/0x22c
Ý<0000000000312b70>¨ nf_log_proc_dostring+0xfc/0x188
Ý<0000000000136f5e>¨ proc_sys_call_handler+0xf6/0x118
Ý<0000000000136fda>¨ proc_sys_read+0x26/0x34
Ý<00000000000d6e9c>¨ vfs_read+0xac/0x158
Ý<00000000000d703e>¨ SyS_read+0x56/0x88
Ý<0000000000027f42>¨ sysc_noemu+0x10/0x16

Use the nf_log_mutex instead of RCU to fix this.

Reported-and-tested-by: Maran Pakkirisamy <maranpsamy@in.ibm.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

powerpc/mpic: Fix mapping of "DCR" based MPIC variants

commit 5a2642f620eb6e40792822fa0eafe23046fbb55e upstream.

Commit 31207dab7d2e63795eb15823947bd2f7025b08e2
"Fix incorrect allocation of interrupt rev-map"
introduced a regression crashing on boot on machines using
a "DCR" based MPIC, such as the Cell blades.

The reason is that the irq host data structure is initialized
much later as a result of that patch, causing our calls to
mpic_map() do be done before we have a host setup.

Unfortunately, this breaks _mpic_map_dcr() which uses the
mpic->irqhost to get to the device node.

This fixes it by, instead, passing the device node explicitely
to mpic_map().

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Akira Tsukamoto <akirat@rd.scei.sony.co.jp>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

hwmon: (max6650) Fix lock imbalance

commit 025dc740d01f99ccba945df1f9ef9e06b1c15d96 upstream.

Add omitted update_lock to one switch/case in set_div.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

elf: fix one check-after-use

commit e2dbe12557d85d81f4527879499f55681c3cca4f upstream.

Check before use it.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

mm: mark page accessed before we write_end()

commit c8236db9cd7aa492dcfcdcca702638e704abed49 upstream.

In testing a backport of the write_begin/write_end AOPs, a 10% re-read
regression was noticed when running iozone. This regression was
introduced because the old AOPs would always do a mark_page_accessed(page)
after the commit_write, but when the new AOPs where introduced, the only
place this was kept was in pagecache_write_end().

This patch does the same thing in the generic case as what is done in
pagecache_write_end(), which is just to mark the page accessed before we
do write_end().

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

x86: don't use 'access_ok()' as a range check in get_user_pages_fast()

[ Upstream commit 7f8189068726492950bf1a2dcfd9b51314560abf - modified
  for stable to not use the sloppy __VIRTUAL_MASK_SHIFT ]

It's really not right to use 'access_ok()', since that is meant for the
normal "get_user()" and "copy_from/to_user()" accesses, which are done
through the TLB, rather than through the page tables.

Why? access_ok() does both too few, and too many checks.  Too many,
because it is meant for regular kernel accesses that will not honor the
'user' bit in the page tables, and because it honors the USER_DS vs
KERNEL_DS distinction that we shouldn't care about in GUP.  And too few,
because it doesn't do the 'canonical' check on the address on x86-64,
since the TLB will do that for us.

So instead of using a function that isn't meant for this, and does
something else and much more complicated, just do the real rules: we
don't want the range to overflow, and on x86-64, we want it to be a
canonical low address (on 32-bit, all addresses are canonical).

Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

vmscan: do not unconditionally treat zones that fail zone_reclaim() as full

commit fa5e084e43eb14c14942027e1e2e894aeed96097 upstream.

vmscan: do not unconditionally treat zones that fail zone_reclaim() as full

On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targetted form of direct reclaim.  On machines with large NUMA
distances for example, a zone_reclaim_mode defaults to 1 meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.  The problem is that zone_reclaim() failing at all means the
zone gets marked full.

This can cause situations where a zone is usable, but is being skipped
because it has been considered full.  Take a situation where a large tmpfs
mount is occuping a large percentage of memory overall.  The pages do not
get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
and the zonelist cache considers them not worth trying in the future.

This patch makes zone_reclaim() return more fine-grained information about
what occured when zone_reclaim() failued.  The zone only gets marked full
if it really is unreclaimable.  If it's a case that the scan did not occur
or if enough pages were not reclaimed with the limited reclaim_mode, then
the zone is simply skipped.

There is a side-effect to this patch.  Currently, if zone_reclaim()
successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go
ahead.  With this patch applied, zone watermarks are rechecked after
zone_reclaim() does some work.

This bug was introduced by commit 9276b1bc96a132f4068fdee00983c532f43d3a26
("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the
zonelist_cache was introduced.  It was not intended that zone_reclaim()
aggressively consider the zone to be full when it failed as full direct
reclaim can still be an option.  Due to the age of the bug, it should be
considered a -stable candidate.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

Staging: rt2870: Add USB ID for Sitecom WL-608

commit 8dfb00571819ce491ce1760523d50e85bcd2185f upstream.

Add the USB id 0x0DF6,0x003F to the rt2870.h file such that the
Sitecom WL-608 device will be recognized by this driver.

Signed-off-by: Jorrit Schippers <jorrit@ncode.nl>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

x86, setup (2.6.30-stable) fix 80x34 and 80x60 console modes

Note: this is not in upstream since upstream is not affected due to the
      new "BIOS glovebox" subsystem.

As coded, most INT10 calls in video-vga.c allow the compiler to assume
EAX remains unchanged across them, which is not always the case.  This
triggers an optimisation issue that causes vga_set_vertical_end() to be
called with an incorrect number of scanlines.  Fix this by beefing up
the asm constraints on these calls.

Reported-by: Marc Aurele La France <tsi@xfree86.org>
Signed-off-by: Marc Aurele La France <tsi@xfree86.org>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

parisc: fix ldcw inline assembler

commit 7d17e2763129ea307702fcdc91f6e9d114b65c2d upstream.

There are two reasons to expose the memory *a in the asm:

1) To prevent the compiler from discarding a preceeding write to *a, and
2) to prevent it from caching *a in a register over the asm.

The change has had a few days testing with a SMP build of 2.6.22.19
running on a rp3440.

This patch is about the correctness of the __ldcw() macro itself.
The use of the macro should be confined to small inline functions
to try to limit the effect of clobbering memory on GCC's optimization
of loads and stores.

Signed-off-by: Dave Anglin <dave.anglin@nrc-cnrc.gc.ca>
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

parisc: ensure broadcast tlb purge runs single threaded

commit e82a3b75127188f20c7780bec580e148beb29da7 upstream.

The TLB flushing functions on hppa, which causes PxTLB broadcasts on the system
bus, needs to be protected by irq-safe spinlocks to avoid irq handlers to deadlock
the kernel. The deadlocks only happened during I/O intensive loads and triggered
pretty seldom, which is why this bug went so long unnoticed.

Signed-off-by: Helge Deller <deller@gmx.de>
[edited to use spin_lock_irqsave on UP as well since we'd been locking there
all this time anyway, --kyle]
Signed-off-by: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

x86-64: Fix bad_srat() to clear all state

commit 429b2b319af3987e808c18f6b81313104caf782c upstream.

Need to clear both nodes and nodes_add state for start/end.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <20090718065657.GA2898@basil.fritz.box>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

x86: Add quirk for Intel DG45ID board to avoid low memory corruption

commit 6aa542a694dc9ea4344a8a590d2628c33d1b9431 upstream.

AMI BIOS with low memory corruption was found on Intel DG45ID
board (Bug 13710). Add this board to the blacklist - in the
(somewhat optimistic) hope of future boards/BIOSes from Intel
not having this bug.

Also see:

http://bugzilla.kernel.org/show_bug.cgi?id=13736

Signed-off-by: Alexey Fisher <bug-track@fisher-privat.net>
Cc: ykzhao <yakui.zhao@intel.com>
Cc: alan@lxorguk.ukuu.org.uk
Cc: <stable@kernel.org>
LKML-Reference: <1247660169-4503-1-git-send-email-bug-track@fisher-privat.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

x86: Fix movq immediate operand constraints in uaccess.h

commit ebe119cd0929df4878f758ebf880cb435e4dcaaf upstream.

The movq instruction, generated by __put_user_asm() when used for
64-bit data, takes a sign-extended immediate ("e") not a zero-extended
immediate ("Z").

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

x86: Fix movq immediate operand constraints in uaccess_64.h

commit 155b73529583c38f30fd394d692b15a893960782 upstream.

arch/x86/include/asm/uaccess_64.h uses wrong asm operand constraint
("ir") for movq insn. Since movq sign-extends its immediate operand,
"er" constraint should be used instead.

Attached patch changes all uses of __put_user_asm in uaccess_64.h to use
"er" when "q" insn suffix is involved.

Patch was compile tested on x86_64 with defconfig.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

x86: geode: Mark mfgpt irq IRQF_TIMER to prevent resume failure

commit d6c585a4342a2ff627a29f9aea77c5ed4cd76023 upstream.

Timer interrupts are excluded from being disabled during suspend. The
clock events code manages the disabling of clock events on its own
because the timer interrupt needs to be functional before the resume
code reenables the device interrupts.

The mfgpt timer request its interrupt without setting the IRQF_TIMER
flag so suspend_device_irqs() disables it as well which results in a
fatal resume failure.

Adding IRQF_TIMER to the interupt flags when requesting the mrgpt
timer interrupt solves the problem.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Cc: Andres Salomon <dilinger@debian.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

dm raid1: wake kmirrord when requeueing delayed bios after remote recovery

commit 69885683d22d8c05910fd808c01fdce1322739b4 upstream.

The recent commit 7513c2a761d69d2a93f17146b3563527d3618ba0 (dm raid1:
add is_remote_recovering hook for clusters) changed do_writes() to
update the ms->writes list but forgot to wake up kmirrord to process it.

The rule is that when anything is being added on ms->reads, ms->writes
or ms->failures and the list was empty before we must call
wakeup_mirrord (for immediate processing) or delayed_wake (for delayed
processing). Otherwise the bios could sit on the list indefinitely.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

sched: fix nr_uninterruptible accounting of frozen tasks really

commit 6301cb95c119ebf324bb96ee226fa9ddffad80a7 upstream.

commit e3c8ca8336 (sched: do not count frozen tasks toward load) broke
the nr_uninterruptible accounting on freeze/thaw. On freeze the task
is excluded from accounting with a check for (task->flags &
PF_FROZEN), but that flag is cleared before the task is thawed. So
while we prevent that the task with state TASK_UNINTERRUPTIBLE
is accounted to nr_uninterruptible on freeze we decrement
nr_uninterruptible on thaw.

Use a separate flag which is handled by the freezing task itself. Set
it before calling the scheduler with TASK_UNINTERRUPTIBLE state and
clear it after we return from frozen state.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

x86/pci: insert ioapic resource before assigning unassigned resources

commit 857fdc53a0a90c3ba7fcf5b1fb4c7a62ae03cf82 upstream.

Stephen reported that his DL585 G2 needed noapic after 2.6.22 (?)

Dann bisected it down to:
  commit 30a18d6c3f1e774de656ebd8ff219d53e2ba4029
  Date:   Tue Feb 19 03:21:20 2008 -0800

      x86: multi pci root bus with different io resource range, on
      64-bit

It turns out that:
  1. that AMD-based systems have two HT chains.
  2. BIOS doesn't allocate resources for BAR 6 of devices under 8132 etc
  3. that multi-peer-root patch will try to split root resources to peer
     root resources according to PCI conf of NB
  4. PCI core assigns unassigned resources, but they overlap with BARs
     that are used by ioapic addr of io4 and 8132.

The reason: at that point ioapic address are not inserted yet.  Solution
is to insert ioapic resources into the tree a bit earlier.

Reported-by: Stephen Frost <sfrost@snowman.net>
Reported-and-Tested-by: dann frazier <dannf@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

tracing/function: Fix the return value of ftrace_trace_onoff_callback()

commit 04aef32d39cc4ef80087c0ce8ed113c6d64f1a6b upstream.

ftrace_trace_onoff_callback() will return an error even if we do the
right operation, for example:

# echo _spin_*:traceon:10 > set_ftrace_filter
-bash: echo: write error: Invalid argument
# cat set_ftrace_filter
#### all functions enabled ####
_spin_trylock_bh:traceon:count=10
_spin_unlock_irq:traceon:count=10
_spin_unlock_bh:traceon:count=10
_spin_lock_irq:traceon:count=10
_spin_unlock:traceon:count=10
_spin_trylock:traceon:count=10
_spin_unlock_irqrestore:traceon:count=10
_spin_lock_irqsave:traceon:count=10
_spin_lock_bh:traceon:count=10
_spin_lock:traceon:count=10

We want to set _spin_*:traceon:10 to set_ftrace_filter, it complains
with "Invalid argument", but the operation is successful.

This is because ftrace_process_regex() returns the number of functions that
matched the pattern. If the number is not 0, this value is returned
by ftrace_regex_write() whereas we want to return the number of bytes
virtually written.
Also the file offset pointer is not updated in this case.

If the number of matched functions is lower than the number of bytes written
by the user, this results to a reprocessing of the string given by the user with
a lower size, leading to a malformed ftrace regex and then a -EINVAL returned.

So, this patch fixes it by returning 0 if no error occured.
The fix also applies on 2.6.30

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

sched_rt: Fix overload bug on rt group scheduling

commit a1ba4d8ba9f06a397e97cbd67a93ee306860b40a upstream.

Fixes an easily triggerable BUG() when setting process affinities.

Make sure to count the number of migratable tasks in the same place:
the root rt_rq. Otherwise the number doesn't make sense and we'll hit
the BUG in set_cpus_allowed_rt().

Also, make sure we only count tasks, not groups (this is probably
already taken care of by the fact that rt_se->nr_cpus_allowed will be 0
for groups, but be more explicit)

Tested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <1247067476.9777.57.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

nilfs2: fix disorder in cp count on error during deleting checkpoints

commit d9a0a345ab7a58a30ec38e5bb7401a28714914d2 upstream.

This fixes a bug that checkpoint count gets wrong on errors when
deleting a series of checkpoints.

The count error is persistent since the checkpoint count is stored on
disk. Some userland programs refer to the count via ioctl, and this
bugfix is needed to prevent malfunction of such programs.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

nilfs2: fix incorrect KERN_CRIT messages in case of write failures

commit 4a52df779700080de4afb0436d9dd9188514a69b upstream.

In case of write-failure retries, the following KERN_CRIT level
messages are mistakenly output by nilfs_dat_commit_start() function:

nilfs_dat_commit_start: vbn = 408463, start = 12506, end = 18446744073709551615, pbn = 530210
nilfs_dat_commit_start: vbn = 408515, start = 12506, end = 18446744073709551615, pbn = 530211
nilfs_dat_commit_start: vbn = 408464, start = 12506, end = 18446744073709551615, pbn = 530212
...

This suppresses these messages.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

nilfs2: fix hang problem of log writer which occurs after write failures

commit 8227b29722fdbac72357aae155d171a5c777670c upstream.

Leandro Lucarella gave me a report that nilfs gets stuck after its
write function fails.

The problem turned out to be caused by bugs which leave writeback flag
on pages. This fixes the problem by ensuring to clear the writeback
flag in error path.

Reported-by: Leandro Lucarella <llucax@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

nilfs2: remove unlikely directive causing mis-conversion of error code

commit 0cfae3d8795f388f9de78adb0171520d19da77e9 upstream.

The following error code handling in nilfs_segctor_write() function
wrongly converted negative error codes to a truth value (i.e. 1):

   err = unlikely(err) ? : res;

which originaly meant to be

   err = err ? : res;

This mis-conversion caused that write or sync functions receive the
unexpected error code.  This fixes the bug by removing the unlikely
directive.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

block: fix sg SG_DXFER_TO_FROM_DEV regression

commit ecb554a846f8e9d2a58f6d6c118168a63ac065aa upstream.

I overlooked SG_DXFER_TO_FROM_DEV support when I converted sg to use
the block layer mapping API (2.6.28).

Douglas Gilbert explained SG_DXFER_TO_FROM_DEV:

http://www.spinics.net/lists/linux-scsi/msg37135.html

=
The semantics of SG_DXFER_TO_FROM_DEV were:
   - copy user space buffer to kernel (LLD) buffer
   - do SCSI command which is assumed to be of the DATA_IN
     (data from device) variety. This would overwrite
     some or all of the kernel buffer
   - copy kernel (LLD) buffer back to the user space.

The idea was to detect short reads by filling the original
user space buffer with some marker bytes ("0xec" it would
seem in this report). The "resid" value is a better way
of detecting short reads but that was only added this century
and requires co-operation from the LLD.
=

This patch changes the block layer mapping API to support this
semantics. This simply adds another field to struct rq_map_data and
enables __bio_copy_iov() to copy data from user space even with READ
requests.

It's better to add the flags field and kills null_mapped and the new
from_user fields in struct rq_map_data but that approach makes it
difficult to send this patch to stable trees because st and osst
drivers use struct rq_map_data (they were converted to use the block
layer in 2.6.29 and 2.6.30). Well, I should clean up the block layer
mapping API.

zhou sf reported this regiression and tested this patch:

http://www.spinics.net/lists/linux-scsi/msg37128.html
http://www.spinics.net/lists/linux-scsi/msg37168.html

Reported-by: zhou sf <sxzzsf@gmail.com>
Tested-by: zhou sf <sxzzsf@gmail.com>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

x86: Fix fixmap page order for FIX_TEXT_POKE0,1

commit 12b9d7ccb841805e347fec8f733f368f43ddba40 upstream.

Masami reported:

> Since the fixmap pages are assigned higher address to lower,
> text_poke() has to use it with inverted order (FIX_TEXT_POKE1
> to FIX_TEXT_POKE0).

I prefer to just invert the order of the fixmap declaration.
It's simpler and more straightforward.

Backward fixmaps seems to be used by both x86 32 and 64.

It's really rare but a nasty bug, because it only hurts when
instructions to patch are crossing a page boundary. If this
happens, the fixmap write accesses will spill on the following
fixmap, which may very well crash the system. And this does not
crash the system, it could leave illegal instructions in place.
Thanks Masami for finding this.

It seems to have crept into the 2.6.30-rc series, so this calls
for a -stable inclusion.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
LKML-Reference: <20090701213722.GH19926@Krystal>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

x86: Fix fixmap ordering

commit 789d03f584484af85dbdc64935270c8e45f36ef7 upstream.

The merge of the 32- and 64-bit fixmap headers made a latent
bug on x86-64 a real one: with the right config settings
it is possible for FIX_OHCI1394_BASE to overlap the FIX_BTMAP_*
range.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4A4A0A8702000078000082E8@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

vc: create vcs(a) devices for consoles

commit c46a7aec556ffdbdb7357db0b05904b176cb3375 upstream.

The buffer for the consoles are unconditionally allocated at con_init()
time, which miss the creation of the vcs(a) devices.

Since 2.6.30 (commit 4995f8ef9d3aac72745e12419d7fbaa8d01b1d81, 'vcs:
hook sysfs devices into object lifetime instead of "binding"' to be
exact) these devices are no longer created at open() and removed on
close(), but controlled by the lifetime of the buffers.

Reported-by: Gerardo Exequiel Pozzi <vmlinuz386@yahoo.com.ar>
Tested-by: Gerardo Exequiel Pozzi <vmlinuz386@yahoo.com.ar>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

partitions: fix broken uevent_suppress conversion

commit f8c73c790c588fd70fda1632c8927a87b3d31dcd upstream.

git commit f67f129e "Driver core: implement uevent suppress in kobject"
contains this chunk for fs/partitions/check.c:

/* suppress uevent if the disk supresses it */
- if (!ddev->uevent_suppress)
+ if (!dev_get_uevent_suppress(pdev))
kobject_uevent(&pdev->kobj, KOBJ_ADD);

However that should have been

- if (!ddev->uevent_suppress)
+ if (!dev_get_uevent_suppress(ddev))

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

ASoC: Fix register cache initialisation for WM8753

commit 1df892cba45f9856d369a6a317ad2d1e44bca423 upstream.

The wrong register cache variable was being used to provide the size for
the memcpy(), resulting in a copy of only a void * of data.

Reported-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

mvsdio: fix handling of partial word at the end of PIO transfer

commit 6cdbf734493d6e8f5afc6f539b82897772809d43 upstream.

Standard data flow for MMC/SD/SDIO cards requires that the mvsdio
controller be set for big endian operation. This is causing problems
with buffers which length is not a multiple of 4 bytes as the last
partial word doesn't get shifted all the way and stored properly in
memory. Let's compensate for this.

Signed-off-by: Nicolas Pitre <nico@marvell.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

HID: hiddev, fix lock imbalance

commit 4859484b0957ddc7fe3e0fa349d98b0f1c7876bd upstream.

Add omitted BKL to one switch/case.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

ALSA: hda - Fix mute control with some ALC262 models

commit 8de56b7deb2534a586839eda52843c1dae680dc5 upstream.

The master mute switch is wrongly implemented as checking the pointer
instead of its value, thus it can be never muted. This patch fixes
the issue.

Reference: Novell bnc#404873
https://bugzilla.novell.com/show_bug.cgi?id=404873

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

ALSA: hda - Add quirk for Gateway T6834c laptop

commit 42b95f0c6b524b5a670dd17533a3522db368f600 upstream.

Gateway T6834c laptops need EAPD always on while the default behavior
for the STAC9205 reference board is to turn it off upon every HP plug.
By using the special "eapd" model, which is first introduced for Gateway
T1616 laptops for this same reason, this peculiarity can be properly
handled.

Signed-off-by: Hao Song <baritono.tux@gmail.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

ALSA: hda - Fix pin-setup for Sony VAIO with STAC9872 codecs

commit b04add956616b6d89ff21da749b46ad2bd58ef32 upstream.

The recent rewrite of the codec parser for STAC9872 caused a regression
for some Sony VAIO models that don't give proper pin default configs
by BIOS. Even using model=vaio doesn't work because the pin definitions
are set after the pin overrides.

This patch fixes the pin definitions in patch_stac9872() to be put
in the right place before the pin overrides. Also the patch adds the
new quirk entry for VAIO F/S to have the correct pin default configs.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

ALSA: ca0106 - Fix the max capture buffer size

commit 34fdeb2d07102e07ecafe79dec170bd6733f2e56 upstream.

The capture buffer size with 64kB seems broken with CA0106.
At least, either the update timing or the DMA position is wrong,
and this screws up pulseaudio badly.

This patch restricts the max buffer size less than that to make life
a bit easier.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

cifs: free nativeFileSystem field before allocating a new one

commit 90a98b2f3f3647fb17667768a348b2b219f2a9f7 upstream.

...otherwise, we'll leak this memory if we have to reconnect (e.g. after
network failure).

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

cifs: fix regression with O_EXCL creates and optimize away lookup

commit 5ddf1e0ff00fd808c048d0b920784828276cc516 upstream.

cifs: fix regression with O_EXCL creates and optimize away lookup

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Shirish Pargaonkar <shirishp@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

USB: EHCI: report actual_length for iso transfers

commit ec6d67e39f5638c792eb7490bf32586ccb9d8005 upstream.

This patch (as1259b) makes ehci-hcd return the total number of bytes
transferred in urb->actual_length for Isochronous transfers.
Until now, the actual_length value was unaccountably left at 0.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

USB: fix LANGID=0 regression

commit 0cce2eda19923e5e5ccc8b042dec5af87b3ffad0 upstream.

commit b7af0bb ("USB: allow malformed LANGID descriptors") broke support
for devices without string descriptor support.

Reporting string descriptors is optional to USB devices, and a device
lets us know it can't deal with strings by responding to the LANGID
request with a STALL token.

The kernel handled that correctly before b7af0bb came in, but failed
hard if the LANGID was reported but broken. More than that, if a device
was not able to provide string descriptors, the LANGID was retrieved
over and over again at each string read request.

This patch changes the behaviour so that

a) the LANGID is only queried once
b) devices which can't handle string requests are not asked again
c) devices with malformed LANGID values have a sane fallback to 0x0409

Signed-off-by: Daniel Mack <daniel@caiaq.de>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

USB: RNDIS gadget, fix issues talking from PXA

commit 4e19f220d4e84f5728cb7edde36352ab425cfba4 upstream.

The reworked Ethernet gadget has an RNDIS interop problem when used
with the CDC subset driver ... e.g. on PXA 2xx and 3xx hardware,
which currently has a hard time talking to MS-Windows hosts.

The issue is that Microsoft requires USB_CLASS_COMM. Fix by tweaking
the CDC subset driver to not switch to USB_CLASS_VENDOR_SPEC if RNDIS
is used in some other device configuration.

[ UPDATED: some "statements" were comma-terminated; fix that. ]

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Aric Blumer <aric@sdgsystems.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

USB: fix memleak in usbfs

commit d794a02111cd3393da69bc7d6dd2b6074bd037cc upstream.

This patch fixes a memory leak in devio.c::processcompl

If writing to user space fails the packet must be discarded, as it
already has been removed from the queue of completed packets.

Signed-off-by: Oliver Neukum <oliver@neukum.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

USB: fix uninitialised variable in ti_do_download

commit 87ea8c887905d8b13ae90b537117592ed027632a upstream.

Signed-off-by: Oliver Neukum <oliver@neukum.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

USB: ti_usb_3410_5052: fix duplicate device ids.

commit 3c43f27bf57b0502df2478253699559ee1d43f6d upstream.

commit 1a1fab513734b3a4fca1bee8229e5ff7e1cb873c accidentally added the
device id to both tables in the driver, which causes problems as this is
only a single port device, not a multiple port device.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

USB: handle zero-length usbfs submissions correctly

commit 9180135bc80ab11199d482b6111e23f74d65af4a upstream.

This patch (as1262) fixes a bug in usbfs: It refuses to accept
zero-length transfers, and it insists that the buffer pointer be valid
even if there is no data being transferred.

The patch also consolidates a bunch of repetitive access_ok() checks
into a single check, which incidentally fixes the lack of such a check
for Isochronous URBs.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

Staging: prevent rtl8187se from crashing dev_ioctl() in SIOCGIWNAME

commit 02c8baecf5d8850dba40b47cdf003ed2e04e66dd upstream.

I repeatedly get __stack_chk_fail panic()s with this driver before
applying the attached fix.

ieee80211_wx_get_name() ignores sizeof(wrqu->name) which is IFNAMSIZ (16), and
on certain conditions, the concatenated string will be larger than IFNAMSIZ
including the terminating zero.

length ("802.11" ++ "b" ++ "/g" ++ " linked" ++ "\x00") == 17

This fix uses strl{cpy,cat} in addition to the reduction of the total
possible length of the output string by a char.

It can be applied to 2.6.30-stable as well.

Signed-off-by: Dan Aloni <dan@aloni.org>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

cfg80211: fix refcount leak

commit 2dce4c2b5f0b43bd25bf9ea6ded06b7f8a54c91f upstream.

The code in cfg80211's cfg80211_bss_update erroneously
grabs a reference to the BSS, which means that it will
never be freed.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

gigaset: accept connection establishment messages in any order

commit bceb0f126f25184eaec3f3c8f00c92b0d899e5de upstream.

ISDN connection setup failed if the "connection active" and
"B channel up" messages from the device arrived in a different
order than expected. Modify the state machine to accept them in
any order.

Impact: bugfix

Signed-off-by: Tilman Schmidt <tilman@imap.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

dsa: fix 88e6xxx statistics counter snapshotting

commit 1ded3f59f35a2642852b3e2a1c0fa8a97777e9af upstream.

The bit that tells us whether a statistics counter snapshot operation
has completed is located in the GLOBAL register block, not in the
GLOBAL2 register block, so fix up mv88e6xxx_stats_wait() to poll the
right register address.

Signed-off-by: Stephane Contri <Stephane.Contri@grassvalley.com>
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

fix RCU-callback-after-kmem_cache_destroy problem in sl[aou]b

commit 7ed9f7e5db58c6e8c2b4b738a75d5dcd8e17aad5 upstream.

Jesper noted that kmem_cache_destroy() invokes synchronize_rcu() rather than
rcu_barrier() in the SLAB_DESTROY_BY_RCU case, which could result in RCU
callbacks accessing a kmem_cache after it had been destroyed.

Acked-by: Matt Mackall <mpm@selenic.com>
Reported-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

sound: usb-audio: add workaround for Blue Microphones devices

commit 8886f33f25083a47d5fa24ad7b57bb708c5c5403 upstream.

Blue Microphones USB devices have an alternate setting that sends two
channels of data to the computer. Unfortunately, the descriptors of
that altsetting have a wrong channel setting, which means that any
recorded data from such a device has twice the sample rate from what
would be expected.

This patch adds a workaround to ignore that altsetting. Since these
devices have only one actual channel, no data is lost.

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

sound: virtuoso: fix Xonar D1/DX silence after resume

commit 826390796d09444b93e1f957582f8970ddfd9b3d upstream.

When resuming, we better take the DACs out of the reset state before
trying to use them.

Reference: kernel bug #13599
http://bugzilla.kernel.org/show_bug.cgi?id=13599

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

NFSD: Don't hold unrefcounted creds over call to nfsd_setuser()

commit 033a666ccb842ab4134fcd0c861d5ba9f5d6bf3a upstream.

nfsd_open() gets an unrefcounted pointer to the current process's effective
credentials at the top of the function, then calls nfsd_setuser() via
fh_verify() - which may replace and destroy the current process's effective
credentials - and then passes the unrefcounted pointer to dentry_open() - but
the credentials may have been destroyed by this point.

Instead, the value from current_cred() should be passed directly to
dentry_open() as one of its arguments, rather than being cached in a variable.

Possibly fh_verify() should return the creds to use.

This is a regression introduced by
745ca2475a6ac596e3d8d37c2759c0fbe2586227 "CRED: Pass credentials through
dentry_open()".

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-and-Verified-By: Steve Dickson <steved@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

SCSI: zalon: fix oops on attach failure

commit d3a263a8168f78874254ea9da9595cfb0f3e96d7 upstream.

I recently discovered on my zalon that if the attachment fails because
of a bus misconfiguration (I scrapped my HVD array, so the card is now
unterminated) then the system oopses. The reason is that if
ncr_attach() returns NULL (signalling failure) that NULL is passed by
the goto failed straight into ncr_detach() which oopses.

The fix is just to return -ENODEV in this case.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

Linux 2.6.30.3

fbmon: work around compiler bug in gcc-4.2.4

commit 3730793d457fed79a6d49bae72996d458c8e4f2d upstream.

There's some odd bug in gcc-4.2 where it miscompiles a simple loop whent
he loop counter is of type 'unsigned char' and it should count to 128.

The compiler will incorrectly decide that a trivial loop like this:

unsigned char i, ...

for (i = 0; i < 128; i++) {
..

is endless, and will compile it to a single instruction that just
branches to itself.

This was triggered by the addition of '-fno-strict-overflow', and we
could play games with compiler versions and go back to '-fwrapv'
instead, but the trivial way to avoid it is to just make the loop
induction variable be an 'int' instead.

Thanks to Krzysztof Oledzki for reporting and testing and to Troy Moure
for digging through assembler differences and finding it.

Reported-and-tested-by: Krzysztof Oledzki <olel@ans.pl>
Found-by: Troy Moure <twmoure@szypr.net>
Gcc-bug-acked-by: Ian Lance Taylor <iant@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

Linux 2.6.30.2

Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x

commit a137802ee839ace40079bebde24cfb416f73208a upstream.

This causes kernel images that don't run init to completion with certain
broken gcc versions.

This fixes kernel bugzilla entry:
http://bugzilla.kernel.org/show_bug.cgi?id=13012

I suspect the gcc problem is this:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28230

Fix the problem by using the -fno-strict-overflow flag instead, which
not only does not exist in the known-to-be-broken versions of gcc (it
was introduced later than fwrapv), but seems to be much less disturbing
to gcc too: the difference in the generated code by -fno-strict-overflow
are smaller (compared to using neither flag) than when using -fwrapv.

Reported-by: Barry K. Nathan <barryn@pobox.com>
Pushed-by: Frans Pop <elendil@planet.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

fuse: fix return value of fuse_dev_write()

commit b4c458b3a23d76936e76678f2074b1528f129f7a upstream.

On 64 bit systems -- where sizeof(ssize_t) > sizeof(int) -- the following test
exposes a bug due to a non-careful return of an int or unsigned value:

implement a FUSE filesystem which sends an unsolicited notification to
the kernel with invalid opcode. The respective write to /dev/fuse
will return (1 << 32) - EINVAL with errno == 0 instead of -1 with
errno == EINVAL.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

fuse: fix bad return value in fuse_file_poll()

commit 201fa69a2849536ef2912e8e971ec0b01c04eff4 upstream.

Fix fuse_file_poll() which returned a -errno value instead of a poll
mask.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

Fix iommu address space allocation

commit a15a519ed6e5e644f5a33c213c00b0c1d3cfe683 upstream.

This fixes kernel.org bug #13584. The IOVA code attempted to optimise
the insertion of new ranges into the rbtree, with the unfortunate result
that some ranges just didn't get inserted into the tree at all. Then
those ranges would be handed out more than once, and things kind of go
downhill from there.

Introduced after 2.6.25 by ddf02886cbe665d67ca750750196ea5bf524b10b
("PCI: iova RB tree setup tweak").

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: mark gross <mgross@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

Fix pci_unmap_addr() et al on i386.

commit 788d84bba47ea3eb377f7a3ae4fd1ee84b84877b upstream.

We can run a 32-bit kernel on boxes with an IOMMU, so we need
pci_unmap_addr() etc. to work -- without it, drivers will leak mappings.

To be honest, this whole thing looks like it's more pain than it's
worth; I'm half inclined to remove the no-op #else case altogether.

But this is the minimal fix, which just does the right thing if
CONFIG_DMAR is set.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

floppy: fix lock imbalance

commit 8516a500029890a72622d245f8ed32c4e30969b7 upstream.

A crappy macro prevents us unlocking on a fail path.

Expand the macro and unlock appropriatelly.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

Revert "ipv4: arp announce, arp_proxy and windows ip conflict verification"

commit f8a68e752bc4e39644843403168137663c984524 upstream.

This reverts commit 73ce7b01b4496a5fbf9caf63033c874be692333f.

After discovering that we don't listen to gratuitious arps in 2.6.30
I tracked the failure down to this commit.

The patch makes absolutely no sense. RFC2131 RFC3927 and RFC5227.
are all in agreement that an arp request with sip == 0 should be used
for the probe (to prevent learning) and an arp request with sip == tip
should be used for the gratitous announcement that people can learn
from.

It appears the author of the broken patch got those two cases confused
and modified the code to drop all gratuitous arp traffic. Ouch!

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

md: avoid dereferencing NULL pointer when accessing suspend_* sysfs attributes.

commit b8d966efd9a46a9a35beac50cbff6e30565125ef upstream.

If we try to modify one of the md/ sysfs files
suspend_lo or suspend_hi
when the array is not active, we dereference a NULL.
Protect against that.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

md: fix error path when duplicate name is found on md device creation.

commit 1ec22eb2b4a2e1a763106bce36b11c02eaa84e61 upstream.

When an md device is created by name (rather than number) we need to
check that the name is not already in use. If this check finds a
duplicate, we return an error without dropping the lock or freeing
the newly create mddev.
This patch fixes that.

Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

md/raid5: suspend shouldn't affect read requests.

commit a5c308d4d1659b1f4833b863394e3e24cdbdfc6e upstream.

md allows write to regions on an array to be suspended temporarily.
This allows user-space to participate is aspects of reshape.
In particular, data can be copied with not risk of a race.
We should not be blocking read requests though, so don't.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

blocK: Restore barrier support for md and probably other virtual devices.

commit db64f680ba4b5c56c4be59f0698000df89ff0281 upstream.

The next_ordered flag is only meaningful for devices that use __make_request.
So move the test against next_ordered out of generic code and in to
__make_request

Since this test was added, barriers have not worked on md or any
devices that don't use __make_request and so don't bother to set
next_ordered. (dm explicitly sets something other than
QUEUE_ORDERED_NONE since
commit 99360b4c18f7675b50d283301d46d755affe75fd
but notes in the comments that it is otherwise meaningless).

Cc: Ken Milmore <ken.milmore@googlemail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

dma-debug: fix off-by-one error in overlap function

commit c79ee4e466dd12347f112e2af306dca35198458f upstream.

This patch fixes a bug in the overlap function which returned true if
one region ends exactly before the second region begins. This is no
overlap but the function returned true in that case.

Reported-by: Andrew Randrianasulu <randrik@mail.ru>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

alpha: fix percpu build breakage

commit b01e8dc34379f4ba2f454390e340a025edbaaa7e upstream.

alpha percpu access requires custom SHIFT_PERCPU_PTR() definition for
modules to work around addressing range limitation.  This is done via
generating inline assembly using C preprocessing which forces the
assembler to generate external reference.  This happens behind the
compiler's back and makes the compiler think that static percpu variables
in modules are unused.

This used to be worked around by using __unused attribute for percpu
variables which prevent the compiler from omitting the variable; however,
recent declare/definition attribute unification change broke this as
__used can't be used for declaration.  Also, in the process,
PER_CPU_ATTRIBUTES definition in alpha percpu.h got broken.

This patch adds PER_CPU_DEF_ATTRIBUTES which is only used for definitions
and make alpha use it to add __used for percpu variables in modules.  This
also fixes the PER_CPU_ATTRIBUTES double definition bug.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: maximilian attems <max@stro.at>
Acked-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

kernel/resource.c: fix sign extension in reserve_setup()

commit 8bc1ad7dd301b7ca7454013519fa92e8c53655ff upstream.

When the 32-bit signed quantities get assigned to the u64 resource_size_t,
they are incorrectly sign-extended.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13253
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9905

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Reported-by: Leann Ogasawara <leann@ubuntu.com>
Cc: Pierre Ossman <drzeus@drzeus.cx>
Reported-by: <pablomme@googlemail.com>
Tested-by: <pablomme@googlemail.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

futexes: Fix infinite loop in get_futex_key() on huge page

commit ce2ae53b750abfaa012ce408e93da131a5b5649b upstream.

get_futex_key() can infinitely loop if it is called on a
virtual address that is within a huge page but not aligned to
the beginning of that page. The call to get_user_pages_fast
will return the struct page for a sub-page within the huge page
and the check for page->mapping will always fail.

The fix is to call compound_head on the page before checking
that it's mapped.

Signed-off-by: Sonny Rao <sonnyrao@us.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: anton@samba.org
Cc: rajamony@us.ibm.com
Cc: speight@us.ibm.com
Cc: mstephen@us.ibm.com
Cc: grimm@us.ibm.com
Cc: mikey@ozlabs.au.ibm.com
LKML-Reference: <20090710231313.GA23572@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

futex: Fix the write access fault problem for real

commit d0725992c8a6fb63a16bc9e8b2a50094cc4db3cd and aa715284b4d28cabde6c25c568d769a6be712bc8 upstream

commit 64d1304a64 (futex: setup writeable mapping for futex ops which
modify user space data) did address only half of the problem of write
access faults.

The patch was made on two wrong assumptions:

1) access_ok(VERIFY_WRITE,...) would actually check write access.

   On x86 it does _NOT_. It's a pure address range check.

2) a RW mapped region can not go away under us.

   That's wrong as well. Nobody can prevent another thread to call
   mprotect(PROT_READ) on that region where the futex resides. If that
   call hits between the get_user_pages_fast() verification and the
   actual write access in the atomic region we are toast again.

The solution is to not rely on access_ok and get_user() for any write
access related fault on private and shared futexes. Instead we need to
fault it in with verification of write access.

There is no generic non destructive write mechanism which would fault
the user page in trough a #PF, but as we already know that we will
fault we can as well call get_user_pages() directly and avoid the #PF
overhead.

If get_user_pages() returns -EFAULT we know that we can not fix it
anymore and need to bail out to user space.

Remove a bunch of confusing comments on this issue as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

Blackfin: fix command line corruption with DEBUG_DOUBLEFAULT

commit 37082511f06108129bd5f96d625a6fae2d5a4ab4 upstream.

Commit 6b3087c6 (which introduced Blackfin SMP) broke command line passing
when the DEBUG_DOUBLEFAULT config option was enabled. Switch the code to
using a scratch register and not R7 which holds the command line.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

Blackfin: fix deadlock in SMP IPI handler

commit 86f2008bf546af9a434f480710e8d33891616bf5 upstream.

When a low priority interrupt (like ethernet) is triggered between 2 high
priority IPI messages, a deadlock in disable_irq() is hit by the second
IPI handler. This is because the second IPI message is queued within the
first IPI handler, but the handler doesn't process all messages, and new
ones are inserted rather than appended. So now we process all the pending
messages, and append new ones to the pending list.

URL: http://blackfin.uclinux.org/gf/tracker/5226

Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

Blackfin: redo handling of bad irqs

commit 26579216f3cdf1ae05f0af8412b444870a167510 upstream.

With the common IRQ code initializing much more of the irq_desc state, we
can't blindly initialize it ourselves to the local bad_irq state. If we
do, we end up wrongly clobbering many fields. So punt most of the bad irq
code as the common layers will handle the default state, and simply call
handle_bad_irq() directly when the IRQ we are processing is invalid.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

Blackfin: fix accidental reset in some boot modes

commit 0de4adfb8c9674fa1572b0ff1371acc94b0be901 upstream.

We read the SWRST (Software Reset) register to get at the last reset
state, and then we may configure the DOUBLE_FAULT bit to control behavior
when a double fault occurs. But if the lower bits of the register is
already set (like UART boot mode on a BF54x), we inadvertently make the
system reset by writing to the SYSTEM_RESET field at the same time. So
make sure the lower 4 bits are always cleared.

Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

personality: fix PER_CLEAR_ON_SETID (CVE-2009-1895)

commit f9fabcb58a6d26d6efde842d1703ac7cfa9427b6 upstream.

We have found that the current PER_CLEAR_ON_SETID mask on Linux doesn't
include neither ADDR_COMPAT_LAYOUT, nor MMAP_PAGE_ZERO.

The current mask is READ_IMPLIES_EXEC|ADDR_NO_RANDOMIZE.

We believe it is important to add MMAP_PAGE_ZERO, because by using this
personality it is possible to have the first page mapped inside a
process running as setuid root.  This could be used in those scenarios:

- Exploiting a NULL pointer dereference issue in a setuid root binary
- Bypassing the mmap_min_addr restrictions of the Linux kernel: by
   running a setuid binary that would drop privileges before giving us
   control back (for instance by loading a user-supplied library), we
   could get the first page mapped in a process we control.  By further
   using mremap and mprotect on this mapping, we can then completely
   bypass the mmap_min_addr restrictions.

Less importantly, we believe ADDR_COMPAT_LAYOUT should also be added
since on x86 32bits it will in practice disable most of the address
space layout randomization (only the stack will remain randomized).

Signed-off-by: Julien Tinnes <jt@cr0.org>
Signed-off-by: Tavis Ormandy <taviso@sdf.lonestar.org>
Acked-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Kees Cook <kees@ubuntu.com>
Acked-by: Eugene Teo <eugene@redhat.com>
[ Shortened lines and fixed whitespace as per Christophs' suggestion ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

tun/tap: Fix crashes if open() /dev/net/tun and then poll() it. (CVE-2009-1897)

commit 3c8a9c63d5fd738c261bd0ceece04d9c8357ca13 upstream.

Fix NULL pointer dereference in tun_chr_pool() introduced by commit
33dccbb050bbe35b88ca8cf1228dcf3e4d4b3554 ("tun: Limit amount of queued
packets per device") and triggered by this code:

int fd;
struct pollfd pfd;
fd = open("/dev/net/tun", O_RDWR);
pfd.fd = fd;
pfd.events = POLLIN | POLLOUT;
poll(&pfd, 1, 0);

Reported-by: Eugene Kapun <abacabadabacaba@gmail.com>
Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

security: use mmap_min_addr indepedently of security models

commit e0a94c2a63f2644826069044649669b5e7ca75d3 upstream.

This patch removes the dependency of mmap_min_addr on CONFIG_SECURITY.
It also sets a default mmap_min_addr of 4096.

mmapping of addresses below 4096 will only be possible for processes
with CAP_SYS_RAWIO.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Eric Paris <eparis@redhat.com>
Looks-ok-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

Add '-fno-delete-null-pointer-checks' to gcc CFLAGS

commit a3ca86aea507904148870946d599e07a340b39bf upstream.

Turning on this flag could prevent the compiler from optimising away
some "useless" checks for null pointers.  Such bugs can sometimes become
exploitable at compile time because of the -O2 optimisation.

See http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Optimize-Options.html

An example that clearly shows this 'problem' is commit 6bf67672.

static void __devexit agnx_pci_remove(struct pci_dev *pdev)
{
     struct ieee80211_hw *dev = pci_get_drvdata(pdev);
-    struct agnx_priv *priv = dev->priv;
+    struct agnx_priv *priv;
     AGNX_TRACE;

     if (!dev)
         return;
+    priv = dev->priv;

By reverting this patch, and compile it with and without
-fno-delete-null-pointer-checks flag, we can see that the check for dev
is compiled away.

    call    printk  #
-   testq   %r12, %r12  # dev
-   je  .L94    #,
    movq    %r12, %rdi  # dev,

Clearly the 'fix' is to stop using dev before it is tested, but building
with -fno-delete-null-pointer-checks flag at least makes it harder to
abuse.

Signed-off-by: Eugene Teo <eugeneteo@kernel.sg>
Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: Wang Cong <amwang@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

Linux 2.6.30.1

bsdacct: fix access to invalid filp in acct_on()

commit df279ca8966c3de83105428e3391ab17690802a9 upstream.

The file opened in acct_on and freshly stored in the ns->bacct struct can
be closed in acct_file_reopen by a concurrent call after we release
acct_lock and before we call mntput(file->f_path.mnt).

Record file->f_path.mnt in a local variable and use this variable only.

Signed-off-by: Renaud Lottiaux <renaud.lottiaux@kerlabs.com>
Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

xfs: fix freeing memory in xfs_getbmap()

commit 7747a0b0af5976ba3828796b4f7a7adc3bb76dbd upstream.

Regression from commit 28e211700a81b0a934b6c7a4b8e7dda843634d2f.
Need to free temporary buffer allocated in xfs_getbmap().

Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Hedi Berriche <hedi@sgi.com>
Reported-by: Justin Piszcz <jpiszcz@lucidpixels.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

KVM: x86: silence preempt warning on kvm_write_guest_time

commit 2dea4c84bc936731668b5a7a9fba5b436a422668 upstream.

This issue just appeared in kvm-84 when running on 2.6.28.7 (x86-64)
with PREEMPT enabled.

We're getting syslog warnings like this many (but not all) times qemu
tells KVM to run the VCPU:

BUG: using smp_processor_id() in preemptible [00000000] code:
qemu-system-x86/28938
caller is kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm]
Pid: 28938, comm: qemu-system-x86 2.6.28.7-mtyrel-64bit
Call Trace:
debug_smp_processor_id+0xf7/0x100
kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm]
? __wake_up+0x4e/0x70
? wake_futex+0x27/0x40
kvm_vcpu_ioctl+0x2e9/0x5a0 [kvm]
enqueue_hrtimer+0x8a/0x110
_spin_unlock_irqrestore+0x27/0x50
vfs_ioctl+0x31/0xa0
do_vfs_ioctl+0x74/0x480
sys_futex+0xb4/0x140
sys_ioctl+0x99/0xa0
system_call_fastpath+0x16/0x1b

As it turns out, the call trace is messed up due to gcc's inlining, but
I isolated the problem anyway: kvm_write_guest_time() is being used in a
non-thread-safe manner on preemptable kernels.

Basically kvm_write_guest_time()'s body needs to be surrounded by
preempt_disable() and preempt_enable(), since the kernel won't let us
query any per-CPU data (indirectly using smp_processor_id()) without
preemption disabled. The attached patch fixes this issue by disabling
preemption inside kvm_write_guest_time().

[marcelo: surround only __get_cpu_var calls since the warning
is harmless]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

drm/i915: correct suspend/resume ordering

commit 9e06dd39f2b6d7e35981e0d7aded618686b32ccb upstream.

We need to save register state *after* idling GEM, clearing the ring,
and uninstalling the IRQ handler, or we might end up saving bogus
fence regs, for one. Our restore ordering should already be correct,
since we do GEM, ring and IRQ init after restoring the last register
state, which prevents us from clobbering things.

I put this together to potentially address a bug, but I haven't heard
back if it fixes it yet. However I think it stands on its own, so I'm
sending it in.

Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Eric Anholt <eric@anholt.net>
Cc: Jie Luo <clotho67@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

ide-cd: prevent null pointer deref via cdrom_newpc_intr

commit 39c58f37a10198054c656c28202fb1e6d22fd505 upstream.

With 2.6.30, the error handling code in cdrom_newpc_intr was changed
to deal with partial request failures by normally completing the 'good'
parts of a request and only 'error' the last (and presumably,
incompletely transferred) bio associated with a particular
request. In order to do this, ide_complete_rq is called over
ide_cd_error_cmd() to partially complete the rq. The block layer
does partial completion only for requests with bio's and if the
rq doesn't have one (eg 'GPCMD_READ_DISC_INFO') the request is
completed as a whole and the drive->hwif->rq pointer set to NULL
afterwards. When calling ide_complete_rq again to report
the error, this null pointer is derefenced, resulting in a kernel
crash.

This fixes http://bugzilla.kernel.org/show_bug.cgi?id=13399.

Signed-off-by: Rainer Weikusat <rweikusat@mssgmbh.com>
Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

ocfs2: Fix ocfs2_osb_dump()

commit c3d38840abaa45c1c5a5fabbb8ffc9a0d1a764d1 upstream.

Skip printing information that is not valid for local mounts.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

serial: bfin_5xx: fix building as module when early printk is enabled

commit 607c268ef9a4675287e77f732071e426e62c2d86 upstream.

Since early printk only makes sense/works when the serial driver is built
into the kernel, disable the option for this driver when it is going to be
built as a module. Otherwise we get build failures due to the ifdef
handling.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

CONFIG_FILE_LOCKING should not depend on CONFIG_BLOCK

commit 69050eee8e08a6234f29fe71a56f8c7c7d4d7186 upstream.

CONFIG_FILE_LOCKING should not depend on CONFIG_BLOCK.

This makes it possible to run complete systems out of a CONFIG_BLOCK=n
initramfs on current kernels again (this last worked on 2.6.27.*).

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

lib/genalloc.c: remove unmatched write_lock() in gen_pool_destroy

commit 8e8a2dea0ca91fe2cb7de7ea212124cfe8c82c35 upstream.

There is a call to write_lock() in gen_pool_destroy which is not balanced
by any corresponding write_unlock(). This causes problems with preemption
because the preemption-disable counter is incremented in the write_lock()
call, but never decremented by any call to write_unlock(). This bug is
gen_pool_destroy, and one of them is non-x86 arch-specific code.

Signed-off-by: Zygo Blaxell <zygo.blaxell@xandros.com>
Cc: Jiri Kosina <trivial@kernel.org>
Cc: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

vmscan: count the number of times zone_reclaim() scans and fails

commit 24cf72518c79cdcda486ed26074ff8151291cf65 upstream.

On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targetted form of direct reclaim.  On machines with large NUMA
distances for example, a zone_reclaim_mode defaults to 1 meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.

There is a heuristic that determines if the scan is worthwhile but it is
possible that the heuristic will fail and the CPU gets tied up scanning
uselessly.  Detecting the situation requires some guesswork and
experimentation so this patch adds a counter "zreclaim_failed" to
/proc/vmstat.  If during high CPU utilisation this counter is increasing
rapidly, then the resolution to the problem may be to set
/proc/sys/vm/zone_reclaim_mode to 0.

[akpm@linux-foundation.org: name things consistently]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim

commit 90afa5de6f3fa89a733861e843377302479fcf7e upstream.

A bug was brought to my attention against a distro kernel but it affects
mainline and I believe problems like this have been reported in various
guises on the mailing lists although I don't have specific examples at the
moment.

The reported problem was that malloc() stalled for a long time (minutes in
some cases) if a large tmpfs mount was occupying a large percentage of
memory overall.  The pages did not get cleaned or reclaimed by
zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
are uselessly scanned frequencly making the CPU spin at near 100%.

This patchset intends to address that bug and bring the behaviour of
zone_reclaim() more in line with expectations which were noticed during
investigation.  It is based on top of mmotm and takes advantage of
Kosaki's work with respect to zone_reclaim().

Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
scan should go ahead. The broken heuristic is what was causing the
malloc() stall as it uselessly scanned the LRU constantly. Currently,
zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
could not deal with tmpfs pages at all. This fixes up the heuristic so
that an unnecessary scan is more likely to be correctly avoided.

Patch 2 notes that zone_reclaim() returning a failure automatically means
the zone is marked full. This is not always true. It could have
failed because the GFP mask or zone_reclaim_mode were unsuitable.

Patch 3 introduces a counter zreclaim_failed that will increment each
time the zone_reclaim scan-avoidance heuristics fail. If that
counter is rapidly increasing, then zone_reclaim_mode should be
set to 0 as a temporarily resolution and a bug reported because
the scan-avoidance heuristic is still broken.

This patch:

On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targetted form of direct reclaim.  On machines with large NUMA
distances for example, a zone_reclaim_mode defaults to 1 meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.

There is a heuristic that determines if the scan is worthwhile but the
problem is that the heuristic is not being properly applied and is
basically assuming zone_reclaim_mode is 1 if it is enabled.  The lack of
proper detection can manfiest as high CPU usage as the LRU list is scanned
uselessly.

Historically, once enabled it was depending on NR_FILE_PAGES which may
include swapcache pages that the reclaim_mode cannot deal with.  Patch
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
pages that were not file-backed such as swapcache and made a calculation
based on the inactive, active and mapped files.  This is far superior when
zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
reasonable starting figure.

This patch alters how zone_reclaim() works out how many pages it might be
able to reclaim given the current reclaim_mode.  If RECLAIM_SWAP is set in
the reclaim_mode it will either consider NR_FILE_PAGES as potential
candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
swapcache and other non-file-backed pages.  If RECLAIM_WRITE is not set,
then NR_FILE_DIRTY number of pages are not candidates.  If RECLAIM_SWAP is
not set, then NR_FILE_MAPPED are not.

[kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
[fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

dm: use i_size_read

commit 5657e8fa45cf230df278040c420fb80e06309d8f upstream.

Use i_size_read() instead of reading i_size.

If someone changes the size of the device simultaneously, i_size_read
is guaranteed to return a valid value (either the old one or the new one).

i_size can return some intermediate invalid value (on 32-bit computers
with 64-bit i_size, the reads to both halves of i_size can be interleaved
with updates to i_size, resulting in garbage being returned).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

dm exception store: really fix type lookup

commit 874d2f61d31e596c36af7732dc1b3aa2dc233824 upstream.

Fix exception store name handling.

We need to reference exception store by zero terminated string.

Fixes regression introduced in commit f6bd4eb73cdf2a5bf954e497972842f39cabb7e3

Cc: Yi Yang <yi.y.yang@intel.com>
Cc: Jonathan Brassow <jbrassow@redhat.com>
Cc: stable@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

dm exception store: fix exstore lookup to be case insensitive

commit f6bd4eb73cdf2a5bf954e497972842f39cabb7e3 upstream.

When snapshots are created using 'p' instead of 'P' as the
exception store type, the device-mapper table loading fails.

This patch makes the code case insensitive as intended and fixes some
regressions reported with device-mapper snapshots.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

dm mpath: flush keventd queue in destructor

commit 53b351f972a882ea8b6cdb19602535f1057c884a upstream.

The commit fe9cf30eb8186ef267d1868dc9f12f2d0f40835a moves dm table event
submission from kmultipath queue to kernel kevent queue to avoid a
deadlock.

There is a possibility of race condition because kevent queue is not flushed
in the multipath destructor. The scenario is:
- some event happens and is queued to keventd
- keventd thread is delayed due to scheuling latency or some other work
- multipath device is destroyed
- keventd now attempts to process work_struct that is residing in already
released memory.

The patch flushes the keventd queue in multipath constructor.
I've already fixed similar bug in dm-raid1.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

dm: sysfs skip output when device is being destroyed

commit 4d89b7b4e4726893453d0fb4ddbb5b3e16353994 upstream.

Do not process sysfs attributes when device is being destroyed.

Otherwise code can cause
BUG_ON(test_bit(DMF_FREEING, &md->flags));
in dm_put() call.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>