git.karo-electronics.de Git - mv-sheeva.git/log

PM / Hibernate: Reduce autotuned default image size

The hibernate image size autotuning mechanism sets the default
image size to 5/2 of the total system RAM, but it is reported
that on some systems device drivers allocate substantial
amounts of memory during suspend and the creation of the image
fails as a result (too little memory is preallocated).

Modify the autotuning mechanism to use 1/3 instead of 2/5 of RAM
as the default image size, which is reported to be sufficient for
the affected systems.

References: https://bugzilla.kernel.org/show_bug.cgi?id=30482
Reported-and-tested-by: Martin Steigerwald <Martin@Lichtvoll.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / Core: Introduce struct syscore_ops for core subsystems PM

Some subsystems need to carry out suspend/resume and shutdown
operations with one CPU on-line and interrupts disabled. The only
way to register such operations is to define a sysdev class and
a sysdev specifically for this purpose which is cumbersome and
inefficient. Moreover, the arguments taken by sysdev suspend,
resume and shutdown callbacks are practically never necessary.

For this reason, introduce a simpler interface allowing subsystems
to register operations to be executed very late during system suspend
and shutdown and very early during resume in the form of
strcut syscore_ops objects.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>

PM QoS: Make pm_qos settings readable

I have a machine where entering deep C-states broke.
pm_qos was a hot candidate, but I couldn't find any way to double
check without the need of recompiling.

While in this case it was a driver bug (ath9k):
https://bugzilla.kernel.org/show_bug.cgi?id=27532

powertop or others may want to read out cpu_dma_latency
restrictions which could be the cause of preventing a machine
entering deeper C-states.

Output with this patch:

# default value of 2000 * USEC_PER_SEC (0x77359400)
cat /dev/network_latency |hexdump
0000000 9400 7735
0000004

# value of 55 us which is the reason for not entering C2
cat /dev/cpu_dma_latency |hexdump
0000000 0037 0000
0000004

There is no reason to hide this info -> make pm_qos files readable.

Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / OPP: opp_find_freq_exact() documentation fix

opp_find_freq_exact() documentation has is_available instead
of available. This also fixes warning with the kernel-doc:
scripts/kernel-doc drivers/base/power/opp.c >/dev/null
Warning(drivers/base/power/opp.c:246): No description found for parameter 'available'
Warning(drivers/base/power/opp.c:246): Excess function parameter 'is_available' description in 'opp_find_freq_exact'

Signed-off-by: Nishanth Menon <nm@ti.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM: Documentation/power/states.txt: fix repetition

Remove repetition of "called swsusp".

Signed-off-by: Alexandre Courbot <gnurou@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM: Make system-wide PM and runtime PM treat subsystems consistently

The code handling system-wide power transitions (eg. suspend-to-RAM)
can in theory execute callbacks provided by the device's bus type,
device type and class in each phase of the power transition.  In
turn, the runtime PM core code only calls one of those callbacks at
a time, preferring bus type callbacks to device type or class
callbacks and device type callbacks to class callbacks.

It seems reasonable to make them both behave in the same way in that
respect.  Moreover, even though a device may belong to two subsystems
(eg. bus type and device class) simultaneously, in practice power
management callbacks for system-wide power transitions are always
provided by only one of them (ie. if the bus type callbacks are
defined, the device class ones are not and vice versa).  Thus it is
possible to modify the code handling system-wide power transitions
so that it follows the core runtime PM code (ie. treats the
subsystem callbacks as mutually exclusive).

On the other hand, the core runtime PM code will choose to execute,
for example, a runtime suspend callback provided by the device type
even if the bus type's struct dev_pm_ops object exists, but the
runtime_suspend pointer in it happens to be NULL.  This is confusing,
because it may lead to the execution of callbacks from different
subsystems during different operations (eg. the bus type suspend
callback may be executed during runtime suspend of the device, while
the device type callback will be executed during system suspend).

Make all of the power management code treat subsystem callbacks in
a consistent way, such that:
(1) If the device's type is defined (eg. dev->type is not NULL)
    and its pm pointer is not NULL, the callbacks from dev->type->pm
    will be used.
(2) If dev->type is NULL or dev->type->pm is NULL, but the device's
    class is defined (eg. dev->class is not NULL) and its pm pointer
    is not NULL, the callbacks from dev->class->pm will be used.
(3) If dev->type is NULL or dev->type->pm is NULL and dev->class is
    NULL or dev->class->pm is NULL, the callbacks from dev->bus->pm
    will be used provided that both dev->bus and dev->bus->pm are
    not NULL.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Kevin Hilman <khilman@ti.com>
Reasoning-sounds-sane-to: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>

PM: Simplify kernel/power/Kconfig

'n' defaults are pretty pointless and actually bogus when used with
prompt-less config options.

The "bool"/"default y" pair with no prompt can be expressed more
compactly using def_bool.

[rjw: Rebased on top of earlier patches modifying this file.]

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM: Add support for device power domains

The platform bus type is often used to handle Systems-on-a-Chip (SoC)
where all devices are represented by objects of type struct
platform_device.  In those cases the same "platform" device driver
may be used with multiple different system configurations, but the
actions needed to put the devices it handles into a low-power state
and back into the full-power state may depend on the design of the
given SoC.  The driver, however, cannot possibly include all the
information necessary for the power management of its device on all
the systems it is used with.  Moreover, the device hierarchy in its
current form also is not suitable for representing this kind of
information.

The patch below attempts to address this problem by introducing
objects of type struct dev_power_domain that can be used for
representing power domains within a SoC.  Every struct
dev_power_domain object provides a sets of device power
management callbacks that can be used to perform what's needed for
device power management in addition to the operations carried out by
the device's driver and subsystem.

Namely, if a struct dev_power_domain object is pointed to by the
pwr_domain field in a struct device, the callbacks provided by its
ops member will be executed in addition to the corresponding
callbacks provided by the device's subsystem and driver during all
power transitions.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Tested-and-acked-by: Kevin Hilman <khilman@ti.com>

PM: Drop pm_flags that is not necessary

The variable pm_flags is used to prevent APM from being enabled
along with ACPI, which would lead to problems.  However, acpi_init()
is always called before apm_init() and after acpi_init() has
returned, it is known whether or not ACPI will be used.  Namely, if
acpi_disabled is not set after acpi_init() has returned, this means
that ACPI is enabled.  Thus, it is sufficient to check acpi_disabled
in apm_init() to prevent APM from being enabled in parallel with
ACPI.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Len Brown <len.brown@intel.com>

PM: Allow pm_runtime_suspend() to succeed during system suspend

The dpm_prepare() function increments the runtime PM reference
counters of all devices to prevent pm_runtime_suspend() from
executing subsystem-level callbacks. However, this was supposed to
guard against a specific race condition that cannot happen, because
the power management workqueue is freezable, so pm_runtime_suspend()
can only be called synchronously during system suspend and we can
rely on subsystems and device drivers to avoid doing that
unnecessarily.

Make dpm_prepare() drop the runtime PM reference to each device
after making sure that runtime resume is not pending for it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Kevin Hilman <khilman@ti.com>

PM: Clean up PM_TRACE dependencies and drop unnecessary Kconfig option

CONFIG_PM_SLEEP_ADVANCED_DEBUG is not used any more, so drop it
and CONFIG_CAN_PM_TRACE need not depend on EXPERIMENTAL, so remove
that dependency.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM: Remove CONFIG_PM_OPS

After redefining CONFIG_PM to depend on (CONFIG_PM_SLEEP ||
CONFIG_PM_RUNTIME) the CONFIG_PM_OPS option is redundant and can be
replaced with CONFIG_PM.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM: Reorder power management Kconfig options

Reorder configuration options in kernel/power/Kconfig so that
the options depended on are at the top of the list.

This patch doesn't introduce any functional changes.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM: Make CONFIG_PM depend on (CONFIG_PM_SLEEP || CONFIG_PM_RUNTIME)

From the users' point of view CONFIG_PM is really only used for
making it possible to set CONFIG_SUSPEND, CONFIG_HIBERNATION,
CONFIG_PM_RUNTIME and (surprisingly enough) CONFIG_XEN_SAVE_RESTORE
(CONFIG_PM_OPP also depends on CONFIG_PM, but quite artificially).
However, both CONFIG_SUSPEND and CONFIG_HIBERNATION require platform
support (independent of CONFIG_PM) and it is not quite obvious that
CONFIG_PM has to be set for CONFIG_XEN_SAVE_RESTORE to be available.
Thus, from the users' point of view, it would be more logical to
automatically select CONFIG_PM if any of the above options depending
on it are set.

Make CONFIG_PM depend on (CONFIG_PM_SLEEP || CONFIG_PM_RUNTIME),
which will cause it to be selected when any of CONFIG_SUSPEND,
CONFIG_HIBERNATION, CONFIG_PM_RUNTIME, CONFIG_XEN_SAVE_RESTORE is
set and will clarify its meaning.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / ACPI: Remove references to pm_flags from bus.c

If direct references to pm_flags are removed from drivers/acpi/bus.c,
CONFIG_ACPI will not need to depend on CONFIG_PM any more. Make that
happen.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Len Brown <len.brown@intel.com>

PM: Do not create wakeup sysfs files for devices that cannot wake up

Currently, wakeup sysfs attributes are created for all devices,
regardless of whether or not they are wakeup-capable. This is
excessive and complicates wakeup device identification from user
space (i.e. to identify wakeup-capable devices user space has to read
/sys/devices/.../power/wakeup for all devices and see if they are not
empty).

Fix this issue by avoiding to create wakeup sysfs files for devices
that cannot wake up the system from sleep states (i.e. whose
power.can_wakeup flags are unset during registration) and modify
device_set_wakeup_capable() so that it adds (or removes) the relevant
sysfs attributes if a device's wakeup capability status is changed.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

USB / Hub: Do not call device_set_wakeup_capable() under spinlock

A subsequent patch will modify device_set_wakeup_capable() in such
a way that it will call functions which may sleep and therefore it
shouldn't be called under spinlocks. In preparation to that, modify
usb_set_device_state() to avoid calling device_set_wakeup_capable()
under device_state_lock.

Tested-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>

PM: Use appropriate printk() priority level in trace.c

printk()s without a priority level default to KERN_WARNING. To reduce
noise at KERN_WARNING, this patch sets the priority level appriopriately
for unleveled printks()s. This should be useful to folks that look at
dmesg warnings closely.

Changed these messages to pr_info().

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / Wakeup: Don't update events_check_enabled in pm_get_wakeup_count()

Since pm_save_wakeup_count() has just been changed to clear
events_check_enabled unconditionally before checking if there are
any new wakeup events registered since the last read from
/sys/power/wakeup_count, the detection of wakeup events during
suspend may be disabled, after it's been enabled, by writing a
"wrong" value back to /sys/power/wakeup_count. For this reason,
it is not necessary to update events_check_enabled in
pm_get_wakeup_count() any more.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / Wakeup: Make pm_save_wakeup_count() work as documented

According to Documentation/ABI/testing/sysfs-power, the
/sys/power/wakeup_count interface should only make the kernel react
to wakeup events during suspend if the last write to it has been
successful. However, if /sys/power/wakeup_count is written to two
times in a row, where the first write is successful and the second
is not, the kernel will still react to wakeup events during suspend
due to a bug in pm_save_wakeup_count().

Fix the bug by making pm_save_wakeup_count() clear
events_check_enabled unconditionally before checking if there are
any new wakeup events registered since the previous read from
/sys/power/wakeup_count.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / Wakeup: Combine atomic counters to avoid reordering issues

The memory barrier in wakeup_source_deactivate() is supposed to
prevent the callers of pm_wakeup_pending() and pm_get_wakeup_count()
from seeing the new value of events_in_progress (0, in particular)
and the old value of event_count at the same time.  However, if
wakeup_source_deactivate() is executed by CPU0 and, for instance,
pm_wakeup_pending() is executed by CPU1, where both processors can
reorder operations, the memory barrier in wakeup_source_deactivate()
doesn't affect CPU1 which can reorder reads.  In that case CPU1 may
very well decide to fetch event_count before it's modified and
events_in_progress after it's been updated, so pm_wakeup_pending()
may fail to detect a wakeup event.  This issue can be addressed by
using a single atomic variable to store both events_in_progress
and event_count, so that they can be updated together in a single
atomic operation.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

oom: oom_kill_process: fix the child_points logic

oom_kill_process() starts with victim_points == 0. This means that
(most likely) any child has more points and can be killed erroneously.

Also, "children has a different mm" doesn't match the reality, we should
check child->mm != t->mm. This check is not exactly correct if t->mm ==
NULL but this doesn't really matter, oom_kill_task() will kill them
anyway.

Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6

* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: NFSROOT should default to "proto=udp"
  nfs4: remove duplicated #include
  NFSv4: nfs4_state_mark_reclaim_nograce() should be static
  NFSv4: Fix the setlk error handler
  NFSv4.1: Fix the handling of the SEQUENCE status bits
  NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses
  NFSv4.1 reclaim complete must wait for completion
  NFSv4: remove duplicate clientid in struct nfs_client
  NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY
  sunrpc: Propagate errors from xs_bind() through xs_create_sock()
  (try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31 or 63 are set in fileid
  nfs: fix compilation warning
  nfs: add kmalloc return value check in decode_and_add_ds
  SUNRPC: Remove resource leak in svc_rdma_send_error()
  nfs: close NFSv4 COMMIT vs. CLOSE race
  SUNRPC: Close a race in __rpc_wait_for_completion_task()

Merge branch 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6

* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm/radeon: fix problem with changing active VRAM size. (v2)

Merge git://git.kernel.org/pub/scm/linux/kernel/git/wim/linux-2.6-watchdog

* git://git.kernel.org/pub/scm/linux/kernel/git/wim/linux-2.6-watchdog:
  watchdog: hpwdt: eliminate section mismatch warning
  watchdog: w83697ug_wdt: Fix set bit 0 to activate GPIO2
  watchdog: sch311x_wdt: fix printk condition
  watchdog: sch311x_wdt: Fix LDN active check
  watchdog: cpwd: Fix buffer-overflow

Fix corrupted OSF partition table parsing

The kernel automatically evaluates partition tables of storage devices.
The code for evaluating OSF partitions contains a bug that leaks data
from kernel heap memory to userspace for certain corrupted OSF
partitions.

In more detail:

for (i = 0 ; i < le16_to_cpu(label->d_npartitions); i++, partition++) {

iterates from 0 to d_npartitions - 1, where d_npartitions is read from
the partition table without validation and partition is a pointer to an
array of at most 8 d_partitions.

Add the proper and obvious validation.

Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: stable@kernel.org
[ Changed the patch trivially to not repeat the whole le16_to_cpu()
thing, and to use an explicit constant for the magic value '8' ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

thp+memcg-numa: fix BUG at include/linux/mm.h:370!

THP's collapse_huge_page() has an understandable but ugly difference
in when its huge page is allocated: inside if NUMA but outside if not.
It's hardly surprising that the memcg failure path forgot that, freeing
the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
(or even worse, using the freed page).

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

watchdog: hpwdt: eliminate section mismatch warning

hpwdt_init_nmi_decoding() is called in hpwdt_init_one error handling,
thus remove the __devexit annotation of hpwdt_exit_nmi_decoding().

This patch fixes below warning:

WARNING: drivers/watchdog/hpwdt.o(.devinit.text+0x36f): Section mismatch in reference from the function hpwdt_init_one() to the function .devexit.text:hpwdt_exit_nmi_decoding()
The function __devinit hpwdt_init_one() references
a function __devexit hpwdt_exit_nmi_decoding().
This is often seen when error handling in the init function
uses functionality in the exit path.
The fix is often to remove the __devexit annotation of
hpwdt_exit_nmi_decoding() so it may be used outside an exit section.

Signed-off-by: Axel Lin <axel.lin@gmail.com>
Acked-by: Thomas Mingarelli <Thomas.Mingarelli@hp.com>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>

watchdog: w83697ug_wdt: Fix set bit 0 to activate GPIO2

outb_p(c || 0x01, WDT_EFDR); -> || should be |

Reported-By: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>

watchdog: sch311x_wdt: fix printk condition

"==" has higher precedence than "&". Since
if (sch311x_sio_inb(sio_config_port, 0x30) & (0x01 == 0)) is always
false the message is never printed.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>

watchdog: sch311x_wdt: Fix LDN active check

if (sch311x_sio_inb(sio_config_port, 0x30) && 0x01 == 0) -> && should be &

Reported-By: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>

watchdog: cpwd: Fix buffer-overflow

cppcheck-1.47 reports:
[drivers/watchdog/cpwd.c:650]: (error) Buffer access out-of-bounds: p.devs

The source code is
for (i = 0; i < 4; i++) {
misc_deregister(&p->devs[i].misc);

where devs is defined as WD_NUMDEVS big and WD_NUMDEVS is equal to 3.
So the 4 should be a 3 or WD_NUMDEVS.

Reported-By: David Binderman
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>

drm/radeon: fix problem with changing active VRAM size. (v2)

So we used to use lpfn directly to restrict VRAM when we couldn't
access the unmappable area, however this was removed in
93225b0d7bc030f4a93165347a65893685822d70 as it also restricted
the gtt placements. However it was only later noticed that this
broke on some hw.

This removes the active_vram_size, and just explicitly sets it
when it changes, TTM/drm_mm will always use the real_vram_size,
and the active vram size will change the TTM size used for lpfn
setting.

We should re-work the fpfn/lpfn to per-placement at some point
I suspect, but that is too late for this kernel.

Hopefully this addresses:
https://bugs.freedesktop.org/show_bug.cgi?id=35254

v2: fix reported useful VRAM size to userspace to be correct.

Signed-off-by: Dave Airlie <airlied@redhat.com>

compat breakage in preadv() and pwritev()

Fix for a dumb preadv()/pwritev() compat bug - unlike the native
variants, the compat_... ones forget to check FMODE_P{READ,WRITE}, so
e.g. on pipe the native preadv() will fail with -ESPIPE and compat one
will act as readv() and succeed.

Not critical, but it's a clear bug with trivial fix, so IMO it's OK for
-final.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging

* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging:
hwmon/f71882fg: Set platform drvdata to NULL later
hwmon/f71882fg: Fix a typo in a comment

Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable

* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: break out of shrink_delalloc earlier
  btrfs: fix not enough reserved space
  btrfs: fix dip leak
  Btrfs: make sure not to return overlapping extents to fiemap
  Btrfs: deal with short returns from copy_from_user
  Btrfs: fix regressions in copy_from_user handling

Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6:
[SCSI] target: Fix t_transport_aborted handling in LUN_RESET + active I/O shutdown

kbuild: Fix computing srcversion for modules

Recent change to fixdep:

    commit b7bd182176960fdd139486cadb9962b39f8a2b50
    Author: Michal Marek <mmarek@suse.cz>
    Date:   Thu Feb 17 15:13:54 2011 +0100

    fixdep: Do not record dependency on the source file itself

changed the format of the *.cmd files without realizing that it is also
used by modpost. Put the path to the source file to the file back, in a
special variable, so that modpost sees all source files when calculating
srcversion for modules.

Reported-and-tested-by: Henrik Rydberg <rydberg@euromail.se>
Signed-off-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge git://git.infradead.org/users/dwmw2/mtd-2.6.38

* git://git.infradead.org/users/dwmw2/mtd-2.6.38:
  mtd: add "platform:" prefix for platform modalias
  mtd: mtd_blkdevs: fix double free on error path
  mtd: amd76xrom: fix oops at boot when resources are not available
  mtd: fix race in cfi_cmdset_0001 driver
  mtd: jedec_probe: initialise make sector erase command variable
  mtd: jedec_probe: Change variable name from cfi_p to cfi

Merge branch 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6

* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm/radeon: fix page flipping hangs on r300/r400
drm/radeon: add pageflip hooks for fusion

Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block

* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: fix mis-synchronisation in blkdev_issue_zeroout()

Merge branch 'fix/asoc' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6

* 'fix/asoc' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
  ASoC: Ensure WM8958 gets all WM8994 late revision widgets
  ASoC: Fix typo in late revision WM8994 DAC2R name
  ASoC: Use the correct DAPM context when cleaning up final widget set
  ASoC: Fix broken bitfield definitions in WM8978
  ASoC: AM3517: Update codec name after multi-component update

gpio: add MODULE_DEVICE_TABLE

The device table is required to load modules based on modaliases.

After adding MODULE_DEVICE_TABLE, below entries will be added to
modules.pcimap:

pch_gpio 0x00008086 0x00008803 0xffffffff 0xffffffff 0x00000000 0x00000000 0x0
ml_ioh_gpio 0x000010db 0x0000802e 0xffffffff 0xffffffff 0x00000000 0x00000000 0x0

Signed-off-by: Axel Lin <axel.lin@gmail.com>
Cc: Tomoya MORINAGA <tomoya-linux@dsn.okisemi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

thp: fix page_referenced to modify mapcount/vm_flags only if page is found

When vmscan.c calls page_referenced(), if an anon page was created
before a process forked, rmap will search for it in both of the
processes, even though one of them might have since broken COW.

If the child process mlocks the vma where the COWed page belongs to,
page_referenced() running on the page mapped by the parent would lead to
*vm_flags getting VM_LOCKED set erroneously (leading to the references
on the parent page being ignored and evicting the parent page too
early).

*mapcount would also be decremented by page_referenced_one even if the
page wasn't found by page_check_address.

This also lets pmdp_clear_flush_young_notify() go ahead on a
pmd_trans_splitting() pmd.

We hold the page_table_lock so __split_huge_page_map() must wait the
pmdp_clear_flush_young_notify() to complete before it can modify the
pmd. The pmd is also still mapped in userland so the young bit may
materialize through a tlb miss before split_huge_page_map runs.

This will provide a more accurate page_referenced() behavior during
split_huge_page().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel<riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

hwmon/f71882fg: Set platform drvdata to NULL later

This avoids a possible race leading to trying to dereference NULL.

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Jean Delvare <khali@linux-fr.org>
Cc: stable@kernel.org
Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com>

hwmon/f71882fg: Fix a typo in a comment

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com>

drm/radeon: fix page flipping hangs on r300/r400

We've been getting reports of complete system lockups with rv3xx hw on
AGP and PCIE when running gnome-shell or kwin with compositing.

It appears the hw really doesn't like setting these registers while
stuff is running, this moves the setting of the registers into the modeset
since they aren't required to be changed anywhere else.

fixes: https://bugs.freedesktop.org/show_bug.cgi?id=35183

Reported-and-tested-by: Álmos <aaalmosss@gmail.com
Signed-off-by: Dave Airlie <airlied@redhat.com>

Btrfs: break out of shrink_delalloc earlier

Josef had changed shrink_delalloc to exit after three shrink
attempts, which wasn't quite enough because new writers could
race in and steal free space.

But it also fixed deadlocks and stalls as we tried to recover
delalloc reservations.  The code was tweaked to loop 1024
times, and would reset the counter any time a small amount
of progress was made.  This was too drastic, and with a
lot of writers we can end up stuck in shrink_delalloc forever.

The shrink_delalloc loop is fairly complex because the caller is looping
too, and the caller will go ahead and force a transaction commit to make
sure we reclaim space.

This reworks things to exit shrink_delalloc when we've forced some
writeback and the delalloc reservations have gone down.  This means
the writeback has not just started but has also finished at
least some of the metadata changes required to reclaim delalloc
space.

If we've got this wrong, we're returning ENOSPC too early, which
is a big improvement over the current behavior of hanging the machine.

Test 224 in xfstests hammers on this nicely, and with 1000 writers
trying to fill a 1GB drive we get our first ENOSPC at 93% full.  The
other writers are able to continue until we get 100%.

This is a worst case test for btrfs because the 1000 writers are doing
small IO, and the small FS size means we don't have a lot of room
for metadata chunks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

NFS: NFSROOT should default to "proto=udp"

There have been a number of recent reports that NFSROOT is no longer
working with default mount options, but fails only with certain NICs.

Brian Downing <bdowning@lavos.net> bisected to commit 56463e50 "NFS:
Use super.c for NFSROOT mount option parsing".  Among other things,
this commit changes the default mount options for NFSROOT to use TCP
instead of UDP as the underlying transport.

TCP seems less able to deal with NICs that are slow to initialize.
The system logs that have accompanied reports of problems all show
that NFSROOT attempts to establish a TCP connection before the NIC is
fully initialized, and thus the TCP connection attempt fails.

When a TCP connection attempt fails during a mount operation, the
NFS stack needs to fail the operation.  Usually user space knows how
and when to retry it.  The network layer does not report a distinct
error code for this particular failure mode.  Thus, there isn't a
clean way for the RPC client to see that it needs to retry in this
case, but not in others.

Because NFSROOT is used in some environments where it is not possible
to update the kernel command line to specify "udp", the proper thing
to do is change NFSROOT to use UDP by default, as it did before commit
56463e50.

To make it easier to see how to change default mount options for
NFSROOT and to distinguish default settings from mandatory settings,
I've adjusted a couple of areas to document the specifics.

root_nfs_cat() is also modified to deal with commas properly when
concatenating strings containing mount option lists.  This keeps
root_nfs_cat() call sites simpler, now that we may be concatenating
multiple mount option strings.

Tested-by: Brian Downing <bdowning@lavos.net>
Tested-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@kernel.org> # 2.6.37
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

nfs4: remove duplicated #include

Remove duplicated #include('s) in
fs/nfs/nfs4proc.c

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: nfs4_state_mark_reclaim_nograce() should be static

There are no more external users of nfs4_state_mark_reclaim_nograce() or
nfs4_state_mark_reclaim_reboot(), so mark them as static.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Fix the setlk error handler

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4.1: Fix the handling of the SEQUENCE status bits

We want SEQUENCE status bits to be handled by the state manager in order
to avoid threading issues.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses

nfs4_schedule_state_recovery() should only be used when we need to force
the state manager to check the lease. If we just want to start the
state manager in order to handle a state recovery situation, we should be
using nfs4_schedule_state_manager().

This patch fixes the abuses of nfs4_schedule_state_recovery() by replacing
its use with a set of helper functions that do the right thing.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

block: fix mis-synchronisation in blkdev_issue_zeroout()

BZ29402
https://bugzilla.kernel.org/show_bug.cgi?id=29402

We can hit serious mis-synchronization in bio completion path of
blkdev_issue_zeroout() leading to a panic.

The problem is that when we are going to wait_for_completion() in
blkdev_issue_zeroout() we check if the bb.done equals issued (number of
submitted bios). If it does, we can skip the wait_for_completition()
and just out of the function since there is nothing to wait for.
However, there is a ordering problem because bio_batch_end_io() is
calling atomic_inc(&bb->done) before complete(), hence it might seem to
blkdev_issue_zeroout() that all bios has been completed and exit. At
this point when bio_batch_end_io() is going to call complete(bb->wait),
bb and wait does not longer exist since it was allocated on stack in
blkdev_issue_zeroout() ==> panic!

(thread 1)                      (thread 2)
bio_batch_end_io()              blkdev_issue_zeroout()
  if(bb) {                      ...
    if (bb->end_io)             ...
      bb->end_io(bio, err);     ...
    atomic_inc(&bb->done);      ...
    ...                         while (issued != atomic_read(&bb.done))
    ...                         (let issued == bb.done)
    ...                         (do the rest of the function)
    ...                         return ret;
    complete(bb->wait);
    ^^^^^^^^
    panic

We can fix this easily by simplifying bio_batch and completion counting.

Also remove bio_end_io_t *end_io since it is not used.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Eric Whitney <eric.whitney@hp.com>
Tested-by: Eric Whitney <eric.whitney@hp.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
CC: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

mtd: add "platform:" prefix for platform modalias

Since 43cc71eed1250755986da4c0f9898f9a635cb3bf (platform: prefix MODALIAS
with "platform:"), the platform modalias is prefixed with "platform:".

Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@kernel.org

mtd: mtd_blkdevs: fix double free on error path

This one liner patch fixes double free that will occur if add_mtd_blktrans_dev
fails. On failure it frees the input argument, but all its users also free it
on error which is natural thing to do. Thus don't free it.

All credit for finding that bug belongs to reporters of the bug in the android bugzilla
http://code.google.com/p/android/issues/detail?id=13761

Commit message tweaked by Artem.

Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@kernel.org

mtd: amd76xrom: fix oops at boot when resources are not available

For some unknown reasons resources needed by amd76xrom driver can be
unavailable. And instead of returning an error, the driver keeps going
and crash the kernel. This patch fixes the problem by making the driver
return -EBUSY if the resources are not available.

Commit messages tweaked by Artem.

Reported-by: Russell Whitaker <russ@ashlandhome.net>
Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@kernel.org

mtd: fix race in cfi_cmdset_0001 driver

As inval_cache_and_wait_for_operation() drop and reclaim the lock
to invalidate the cache, some other thread may suspend the operation
before reaching the for(;;) loop. Therefore the loop must start with
checking the chip->state before reading status from the chip.

Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Acked-by: Michael Cashwell <mboards@prograde.net>
Acked-by: Stefan Bigler <stefan.bigler@keymile.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@kernel.org

mtd: jedec_probe: initialise make sector erase command variable

In the commit 08968041bef437ec363623cd3218c2b083537ada
(mtd: cfi_cmdset_0002: make sector erase command variable)
introdused a field sector_erase_cmd. In the same commit initialisation
of cfi->sector_erase_cmd made in cfi_chip_setup()
(file drivers/mtd/chips/cfi_probe.c), so the CFI chip has no problem:

...
cfi->cfi_mode = CFI_MODE_CFI;
cfi->sector_erase_cmd = CMD(0x30);
...

But for the JEDEC chips this initialisation is not carried out,
so the JEDEC chips have sector_erase_cmd == 0.

This patch adds the missing initialisation.

Signed-off-by: Antony Pavlov <antony@niisi.msk.ru>
Acked-by: Guillaume LECERF <glecerf@gmail.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
CC: stable@kernel.org

mtd: jedec_probe: Change variable name from cfi_p to cfi

In the following commit, we'll need to use the CMD() macro in order to
fix the initialisation of the sector_erase_cmd field. That requires the
local variable to be called 'cfi', so change it first in a simple patch.

Signed-off-by: Antony Pavlov <antony@niisi.msk.ru>
Acked-by: Guillaume LECERF <glecerf@gmail.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
CC: stable@kernel.org

drm/radeon: add pageflip hooks for fusion

Looks like these got passed over with both being merged at the same
time but not quite meeting in the middle.

should fix: https://bugs.freedesktop.org/show_bug.cgi?id=34137
along with Michael's phoronix article.

Reported-by: Chi-Thanh Christopher Nguyen
Article-written-by: Michael Larabel @ phoronix
Signed-off-by: Dave Airlie <airlied@redhat.com>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  ariadne: remove redundant NULL check
  ip6ip6: autoload ip6 tunnel
  net: bridge builtin vs. ipv6 modular
  ipv6: Don't create clones of host routes.
  pktgen: fix errata in show results
  ipv4: Fix erroneous uses of ifa_address.
  vxge: update MAINTAINERS
  r6040: bump to version 0.27 and date 23Feb2011
  r6040: fix multicast operations
  rds: prevent BUG_ON triggering on congestion map updates
  bonding 802.3ad: Rename rx_machine_lock to state_machine_lock
  bonding 802.3ad: Fix the state machine locking v2
  drivers/net/macvtap: fix error check
  net: fix multithreaded signal handling in unix recv routines
  net: Enter net/ipv6/ even if CONFIG_IPV6=n
  net/smsc911x.c: Set the VLAN1 register to fix VLAN MTU problem
  bnx2x: fix MaxBW configuration
  bnx2x: (NPAR) prevent HW access in D3 state
  bnx2x: fix link notification
  bnx2x: fix non-pmf device load flow

Doing my first --no-ff merge here, to get the explicit merge commit.

David did a back-merge in order to get commit 8909c9ad8ff0 ("net: don't
allow CAP_NET_ADMIN to load non-netdev kernel modules") so that we can
add Stephen Hemminger's fix to handle ip6 tunnels as well, which uses
the MODULE_ALIAS_NETDEV() macro created by that change.

ariadne: remove redundant NULL check

Simply remove redundant 'dev' NULL check.

Signed-off-by: Jinqiu Yang <crindy646@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip6ip6: autoload ip6 tunnel

Add necessary alias to autoload ip6ip6 tunnel module.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of /home/davem/src/GIT/linux-2.6/

net: bridge builtin vs. ipv6 modular

When configs BRIDGE=y and IPV6=m, this build error occurs:

br_multicast.c:(.text+0xa3341): undefined reference to `ipv6_dev_get_saddr'

BRIDGE_IGMP_SNOOPING is boolean; if it were tristate, then adding
depends on IPV6 || IPV6=n
to BRIDGE_IGMP_SNOOPING would be a good fix. As it is currently,
making BRIDGE depend on the IPV6 config works.

Reported-by: Patrick Schaaf <netdev@bof.de>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'media_fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6

* 'media_fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6:
  [media] mantis_pci: remove asm/pgtable.h include
  [media] tda829x: fix regression in probe functions
  [media] mceusb: don't claim multifunction device non-IR parts
  [media] nuvoton-cir: fix wake from suspend
  [media] cx18: Add support for Hauppauge HVR-1600 models with s5h1411
  [media] ivtv: Fix corrective action taken upon DMA ERR interrupt to avoid hang
  [media] cx25840: fix probing of cx2583x chips
  [media] cx23885: Remove unused 'err:' labels to quiet compiler warning
  [media] cx23885: Revert "Check for slave nack on all transactions"
  [media] DiB7000M: add pid filtering
  [media] Fix sysfs rc protocol lookup for rc-5-sz
  [media] au0828: fix VBI handling when in V4L2 streaming mode
  [media] ir-raw: Properly initialize the IR event (BZ#27202)
  [media] s2255drv: firmware re-loading changes
  [media] Fix double free of video_device in mem2mem_testdev
  [media] DM04/QQBOX memcpy to const char fix

ipmi: Fix IPMI errors due to timing problems

This patch fixes an issue in OpenIPMI module where sometimes an ABORT command
is sent after sending an IPMI request to BMC causing the IPMI request to fail.

Signed-off-by: YiCheng Doe <yicheng.doe@hp.com>
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Tom Mingarelli <thomas.mingarelli@hp.com>
Tested-by: Andy Cress <andy.cress@us.kontron.com>
Tested-by: Mika Lansirine <Mika.Lansirinne@stonesoft.com>
Tested-by: Brian De Wolf <bldewolf@csupomona.edu>
Cc: Jean Michel Audet <Jean-Michel.Audet@ca.Kontron.com>
Cc: Jozef Sudelsky <jozef.sudolsky@elbiahosting.sk>
Acked-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  fs/dcache: allow d_obtain_alias() to return unhashed dentries
  Check for immutable/append flag in fallocate path
  sysctl: the include of rcupdate.h is only needed in the kernel
  fat: fix d_revalidate oopsen on NFS exports
  jfs: fix d_revalidate oopsen on NFS exports
  ocfs2: fix d_revalidate oopsen on NFS exports
  gfs2: fix d_revalidate oopsen on NFS exports
  fuse: fix d_revalidate oopsen on NFS exports
  ceph: fix d_revalidate oopsen on NFS exports
  reiserfs xattr ->d_revalidate() shouldn't care about RCU
  /proc/self is never going to be invalidated...

Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, UV: Initialize the broadcast assist unit base destination node id properly
  x86, numa: Fix numa_emulation code with memory-less node0
  x86, build: Make sure mkpiggy fails on read error

Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix sched rt group scheduling when hierachy is enabled

Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf symbols: Avoid resolving [kernel.kallsyms] to real path for buildid cache
perf symbols: Fix vmlinux path when not using --symfs

drm/i915: Revive combination mode for backlight control

This reverts commit 951f3512dba5bd44cda3e5ee22b4b522e4bb09fb

drm/i915: Do not handle backlight combination mode specially

since this commit introduced other regressions due to untouched LBPC
register, e.g. the backlight dimmed after resume.

In addition to the revert, this patch includes a fix for the original
issue (weird backlight levels) by removing the wrong bit shift for
computing the current backlight level.
Also, including typo fixes (lpbc -> lbpc).

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=34524
Acked-by: Indan Zupancic <indan@nul.nu>
Reviewed-by: Keith Packard <keithp@keithp.com>
Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: <stable@kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

NFSv4.1 reclaim complete must wait for completion

Signed-off-by: Andy Adamson <andros@netapp.com>
[Trond: fix whitespace errors]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: remove duplicate clientid in struct nfs_client

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY

Fix bug where we currently retry the EXCHANGEID call again, eventhough
we already have a valid clientid. Instead, delay and retry the CREATE_SESSION
call.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

sunrpc: Propagate errors from xs_bind() through xs_create_sock()

xs_create_sock() is supposed to return a pointer or an ERR_PTR-encoded
error, but it currently returns 0 if xs_bind() fails.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: stable@kernel.org [v2.6.37]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

(try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31 or 63 are set in fileid

The problem was use of an int32, which when converted to a uint64
is sign extended resulting in a fileid that doesn't fit in 32 bits
even though the intent of the function is to fit the fileid into
32 bits.

Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
[Trond: Added an include for compat.h]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

nfs: fix compilation warning

this commit fix compilation warning as following:
linux-2.6/fs/nfs/nfs4proc.c:3265: warning: comparison of distinct pointer types lacks a cast

Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

nfs: add kmalloc return value check in decode_and_add_ds

add kmalloc return value check in decode_and_add_ds

Signed-off-by: Stanislav Fomichev <kernel@fomichev.me>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

SUNRPC: Remove resource leak in svc_rdma_send_error()

We leak the memory allocated to 'ctxt' when we return after
'ib_dma_mapping_error()' returns !=0.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

nfs: close NFSv4 COMMIT vs. CLOSE race

I've been adding in more artificial delays in the NFSv4 commit and close
codepaths to uncover races. The kernel I'm testing has the patch to
close the race in __rpc_wait_for_completion_task that's in Trond's
cthon2011 branch. The reproducer I've been using does this in a loop:

mkdir("DIR");
fd = open("DIR/FILE", O_WRONLY|O_CREAT|O_EXCL, 0644);
write(fd, "abcdefg", 7);
close(fd);
unlink("DIR/FILE");
rmdir("DIR");

The above reproducer shouldn't result in any silly-renaming. However,
when I add a "msleep(100)" just after the nfs_commit_clear_lock call in
nfs_commit_release, I can almost always force one to occur. If I can
force it to occur with that, then it can happen without that delay
given the right timing.

nfs_commit_inode waits for the NFS_INO_COMMIT bit to clear when called
with FLUSH_SYNC set. nfs_commit_rpcsetup on the other hand does not wait
for the task to complete before putting its reference to it, so the last
reference get put in rpc_release task and gets queued to a workqueue.

In this situation, the last open context reference may be put by the
COMMIT release instead of the close() syscall. The close() syscall
returns too quickly and the unlink runs while the d_count is still
high since the COMMIT release hasn't put its dentry reference yet.

Fix this by having rpc_commit_rpcsetup wait for the RPC call to complete
before putting the task reference when FLUSH_SYNC is set. With this, the
last reference is put by the process that's initiating the FLUSH_SYNC
commit and the race is closed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

SUNRPC: Close a race in __rpc_wait_for_completion_task()

Although they run as rpciod background tasks, under normal operation
(i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck()
and nfs4_do_close() want to be fully synchronous. This means that when we
exit, we want all references to the rpc_task to be gone, and we want
any dentry references etc. held by that task to be released.

For this reason these functions call __rpc_wait_for_completion_task(),
followed by rpc_put_task() in the expectation that the latter will be
releasing the last reference to the rpc_task, and thus ensuring that the
callback_ops->rpc_release() has been called synchronously.

This patch fixes a race which exists due to the fact that
rpciod calls rpc_complete_task() (in order to wake up the callers of
__rpc_wait_for_completion_task()) and then subsequently calls
rpc_put_task() without ensuring that these two steps are done atomically.

In order to avoid adding new spin locks, the patch uses the existing
waitqueue spin lock to order the rpc_task reference count releases between
the waiting process and rpciod.
The common case where nobody is waiting for completion is optimised for by
checking if the RPC_TASK_ASYNC flag is cleared and/or if the rpc_task
reference count is 1: in those cases we drop trying to grab the spin lock,
and immediately free up the rpc_task.

Those few processes that need to put the rpc_task from inside an
asynchronous context and that do not care about ordering are given a new
helper: rpc_put_task_async().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

btrfs: fix not enough reserved space

btrfs_link() will insert 3 items(inode ref, dir name item and dir index item)
into the b+ tree and update 2 items(its inode, and parent's inode) in the b+
tree. So we should reserve space for these 5 items, not 3 items.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

btrfs: fix dip leak

The btrfs DIO code leaks dip structs when dip->csums allocation
fails; bio->bi_end_io isn't set at the point where the free_ordered
branch is consequently taken, thus bio_endio doesn't call the function
which would free it in the normal case. Fix.

Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Acked-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound-2.6 into fix/asoc

fs/dcache: allow d_obtain_alias() to return unhashed dentries

Without this patch, inodes are not promptly freed on last close of an
unlinked file by an nfs client:

client$ mount -tnfs4 server:/export/ /mnt/
client$ tail -f /mnt/FOO
...
server$ df -i /export
server$ rm /export/FOO
(^C the tail -f)
server$ df -i /export
server$ echo 2 >/proc/sys/vm/drop_caches
server$ df -i /export

the df's will show that the inode is not freed on the filesystem until
the last step, when it could have been freed after killing the client's
tail -f. On-disk data won't be deallocated either, leading to possible
spurious ENOSPC.

This occurs because when the client does the close, it arrives in a
compound with a putfh and a close, processed like:

- putfh: look up the filehandle.  The only alias found for the
  inode will be DCACHE_UNHASHED alias referenced by the filp
  this, so it creates a new DCACHE_DISCONECTED dentry and
  returns that instead.
- close: closes the existing filp, which is destroyed
  immediately by dput() since it's DCACHE_UNHASHED.
- end of the compound: release the reference
  to the current filehandle, and dput() the new
  DCACHE_DISCONECTED dentry, which gets put on the
  unused list instead of being destroyed immediately.

Nick Piggin suggested fixing this by allowing d_obtain_alias to return
the unhashed dentry that is referenced by the filp, instead of making it
create a new dentry.

Leave __d_find_alias() alone to avoid changing behavior of other
callers.

Also nfsd doesn't need all the checks of __d_find_alias(); any dentry,
hashed or unhashed, disconnected or not, should work.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Check for immutable/append flag in fallocate path

In the fallocate path the kernel doesn't check for the immutable/append
flag. It's possible to have a race condition in this scenario: an
application open a file in read/write and it does something, meanwhile
root set the immutable flag on the file, the application at that point
can call fallocate with success. In addition, we don't allow to do any
unreserve operation on an append only file but only the reserve one.

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

sysctl: the include of rcupdate.h is only needed in the kernel

Fixes this built error:

include/linux/sysctl.h:28: included file 'linux/rcupdate.h' is not exported

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fat: fix d_revalidate oopsen on NFS exports

can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

jfs: fix d_revalidate oopsen on NFS exports

can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ocfs2: fix d_revalidate oopsen on NFS exports

can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

gfs2: fix d_revalidate oopsen on NFS exports

can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fuse: fix d_revalidate oopsen on NFS exports

can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ceph: fix d_revalidate oopsen on NFS exports

can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

reiserfs xattr ->d_revalidate() shouldn't care about RCU

... it returns an error unconditionally

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

/proc/self is never going to be invalidated...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ipv6: Don't create clones of host routes.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=29252
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30462

In commit d80bc0fd262ef840ed4e82593ad6416fa1ba3fc4 ("ipv6: Always
clone offlink routes.") we forced the kernel to always clone offlink
routes.

The reason we do that is to make sure we never bind an inetpeer to a
prefixed route.

The logic turned on here has existed in the tree for many years,
but was always off due to a protecting CPP define. So perhaps
it's no surprise that there is a logic bug here.

The problem is that we canot clone a route that is already a
host route (ie. has DST_HOST set). Because if we do, an identical
entry already exists in the routing tree and therefore the
ip6_rt_ins() call is going to fail.

This sets off a series of failures and high cpu usage, because when
ip6_rt_ins() fails we loop retrying this operation a few times in
order to handle a race between two threads trying to clone and insert
the same host route at the same time.

Fix this by simply using the route as-is when DST_HOST is set.

Reported-by: slash@ac.auone-net.jp
Reported-by: Ernst Sjöstrand <ernstp@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/pseries: Disable VPNH feature
powerpc/iseries: Fix early init access to lppaca