git.karo-electronics.de Git - karo-tx-linux.git/log

md/raid1: submit IO from originating thread instead of md thread.

queuing writes to the md thread means that all requests go through the
one processor which may not be able to keep up with very high request
rates.

So use the plugging infrastructure to submit all requests on unplug.
If a 'schedule' is needed, we fall back on the old approach of handing
the requests to the thread for it to handle.

Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE.

'sync' writes set both REQ_SYNC and REQ_NOIDLE.
O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE.

We currently assume that a REQ_SYNC request will not be followed by
more requests and so set STRIPE_PREREAD_ACTIVE to expedite the
request.
This is appropriate for sync requests, but not for O_DIRECT requests.

So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE
rather than REQ_SYNC. This is consistent with the documented meaning
of REQ_NOIDLE:

__REQ_NOIDLE, /* don't anticipate more IO after this one */

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid1: don't abort a resync on the first badblock.

If a resync of a RAID1 array with 2 devices finds a known bad block
one device it will neither read from, or write to, that device for
this block offset.
So there will be one read_target (The other device) and zero write
targets.
This condition causes md/raid1 to abort the resync assuming that it
has finished - without known bad blocks this would be true.

When there are no write targets because of the presence of bad blocks
we should only skip over the area covered by the bad block.
RAID10 already gets this right, raid1 doesn't. Or didn't.

As this can cause a 'sync' to abort early and appear to have succeeded
it could lead to some data corruption, so it suitable for -stable.

Cc: stable@vger.kernel.org
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: remove duplicated test on ->openers when calling do_md_stop()

do_md_stop tests mddev->openers while holding ->open_mutex,
and fails if this count is too high.
So callers do not need to check mddev->openers and doing so isn't
very meaningful as they don't hold ->open_mutex so the number could
change.

So remove the unnecessary tests on mddev->openers.
These are not called often enough for there to be any gain in
an early test on ->open_mutex to avoid the need for a slightly more
costly mutex_lock call.

Signed-off-by: NeilBrown <neilb@suse.de>

raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer

Because bios will merge at block-layer,so bios-error may caused by other
bio which be merged into to the same request.
Using this flag,it will find exactly error-sector and not do redundant
operation like re-write and re-read.

V0->V1:Using REQ_FLUSH instead REQ_NOMERGE avoid bio merging at block
layer.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid1: prevent merging too large request

For SSD, if request size exceeds specific value (optimal io size), request size
isn't important for bandwidth. In such condition, if making request size bigger
will cause some disks idle, the total throughput will actually drop. A good
example is doing a readahead in a two-disk raid1 setup.

So when should we split big requests? We absolutly don't want to split big
request to very small requests. Even in SSD, big request transfer is more
efficient. This patch only considers request with size above optimal io size.

If all disks are busy, is it worth doing a split? Say optimal io size is 16k,
two requests 32k and two disks. We can let each disk run one 32k request, or
split the requests to 4 16k requests and each disk runs two. It's hard to say
which case is better, depending on hardware.

So only consider case where there are idle disks. For readahead, split is
always better in this case. And in my test, below patch can improve > 30%
thoughput. Hmm, not 100%, because disk isn't 100% busy.

Such case can happen not just in readahead, for example, in directio. But I
suppose directio usually will have bigger IO depth and make all disks busy, so
I ignored it.

Note: if the raid uses any hard disk, we don't prevent merging. That will make
performace worse.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid1: read balance chooses idlest disk for SSD

SSD hasn't spindle, distance between requests means nothing. And the original
distance based algorithm sometimes can cause severe performance issue for SSD
raid.

Considering two thread groups, one accesses file A, the other access file B.
The first group will access one disk and the second will access the other disk,
because requests are near from one group and far between groups. In this case,
read balance might keep one disk very busy but the other relative idle. For
SSD, we should try best to distribute requests to as many disks as possible.
There isn't spindle move penality anyway.

With below patch, I can see more than 50% throughput improvement sometimes
depending on workloads.

The only exception is small requests can be merged to a big request which
typically can drive higher throughput for SSD too. Such small requests are
sequential reads. Unlike hard disk, sequential read which can't be merged (for
example direct IO, or read without readahead) can be ignored for SSD. Again
there is no spindle move penality. readahead dispatches small requests and such
requests can be merged.

Last patch can help detect sequential read well, at least if concurrent read
number isn't greater than raid disk number. In that case, distance based
algorithm doesn't work well too.

V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm for
random IO too. This makes the algorithm generic for raid with SSD.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid1: make sequential read detection per disk based

Currently the sequential read detection is global wide. It's natural to make it
per disk based, which can improve the detection for concurrent multiple
sequential reads. And next patch will make SSD read balance not use distance
based algorithm, where this change help detect truly sequential read for SSD.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

MD RAID10: Export md_raid10_congested

md/raid10: Export is_congested test.

In similar fashion to commits
11d8a6e3719519fbc0e2c9d61b6fa931b84bf813
1ed7242e591af7e233234d483f12d33818b189d9
we export the RAID10 congestion checking function so that dm-raid.c can
make use of it and make use of the personality. The 'queue' and 'gendisk'
structures will not be available to the MD code when device-mapper sets
up the device, so we conditionalize access to these fields also.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

MD: Move macros from raid1*.h to raid1*.c

MD RAID1/RAID10: Move some macros from .h file to .c file

There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined
in both raid1.h and raid10.h. They are only used in there respective .c files.
However, if we wish to make RAID10 accessible to the device-mapper RAID
target (dm-raid.c), then we need to move these macros into the .c files where
they are used so that they do not conflict with each other.

The macros from the two files are identical and could be moved into md.h, but
I chose to leave the duplication and have them remain in the personality
files.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

MD RAID1: rename mirror_info structure

MD RAID1: Rename the structure 'mirror_info' to 'raid1_info'

The same structure name ('mirror_info') is used by raid10.  Each of these
structures are defined in there respective header files.  If dm-raid is
to support both RAID1 and RAID10, the header files will be included and
the structure names must not collide.  While only one of these structure
names needs to change, this patch adds consistency to the naming of the
structure.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

MD RAID10: rename mirror_info structure

MD RAID10: Rename the structure 'mirror_info' to 'raid10_info'

The same structure name ('mirror_info') is used by raid1. Each of these
structures are defined in there respective header files. If dm-raid is
to support both RAID1 and RAID10, the header files will be included and
the structure names must not collide.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

MD RAID10: Fix compiler warning.

MD RAID10: Fix compiler warning.

Initialize variable to prevent compiler warning.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

raid5: raid5d handle stripe in batch way

Let raid5d handle stripe in batch way to reduce conf->device_lock locking.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

raid5: make_request use batch stripe release

make_request() does stripe release for every stripe and the stripe usually has
count 1, which makes previous release_stripe() optimization not work. In my
test, this release_stripe() becomes the heaviest pleace to take
conf->device_lock after previous patches applied.

Below patch makes stripe release batch. All the stripes will be released in
unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe
lru.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

blk: pass from_schedule to non-request unplug functions.

This will allow md/raid to know why the unplug was called,
and will be able to act according - if !from_schedule it
is safe to perform tasks which could themselves schedule.

Signed-off-by: NeilBrown <neilb@suse.de>

block: stack unplug

MD raid1 prepares to dispatch request in unplug callback. If make_request in
low level queue also uses unplug callback to dispatch request, the low level
queue's unplug callback will not be called. Recheck the callback list helps
this case.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

blk: centralize non-request unplug handling.

Both md and umem has similar code for getting notified on an
blk_finish_plug event.
Centralize this code in block/ and allow each driver to
provide its distinctive difference.

Signed-off-by: NeilBrown <neilb@suse.de>

md: remove plug_cnt feature of plugging.

This seemed like a good idea at the time, but after further thought I
cannot see it making a difference other than very occasionally and
testing to try to exercise the case it is most likely to help did not
show any performance difference by removing it.

So remove the counting of active plugs and allow 'pending writes' to
be activated at any time, not just when no plugs are active.

This is only relevant when there is a write-intent bitmap, and the
updating of the bitmap will likely introduce enough delay that
the single-threading of bitmap updates will be enough to collect large
numbers of updates together.

Removing this will make it easier to centralise the unplug code, and
will clear the other for other unplug enhancements which have a
measurable effect.

Signed-off-by: NeilBrown <neilb@suse.de>

raid5: add a per-stripe lock

Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.

stripe ->toread, ->towrite are protected by per-stripe lock.  Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.

If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared,  there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe

Let's have an example:
|  stripe1 |  stripe2    |  stripe3  |
...bio1......|bio2|bio3|....bio4.....

stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.

before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

raid5: remove unnecessary bitmap write optimization

Neil pointed out the bitmap write optimization in handle_stripe_clean_event()
is unnecessary, because the chance one stripe gets written twice in the mean
time is rare. We can always do a bitmap_startwrite when a write request is
added to a stripe and bitmap_endwrite after write request is done. Delete the
optimization. With it, we can delete some cases of device_lock.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

raid5: lockless access raid5 overrided bi_phys_segments

Raid5 overrides bio->bi_phys_segments, accessing it is with device_lock hold,
which is unnecessary, We can make it lockless actually.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

raid5: reduce chance release_stripe() taking device_lock

release_stripe() is a place conf->device_lock is heavily contended. We take the
lock even stripe count isn't 1, which isn't required.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid1: close some possible races on write errors during resync

commit 4367af556133723d0f443e14ca8170d9447317cb
   md/raid1: clear bad-block record when write succeeds.

Added a 'reschedule_retry' call possibility at the end of
end_sync_write, but didn't add matching code at the end of
sync_request_write.  So if the writes complete very quickly, or
scheduling makes it seem that way, then we can miss rescheduling
the request and the resync could hang.

Also commit 73d5c38a9536142e062c35997b044e89166e063b
    md: avoid races when stopping resync.

Fix a race condition in this same code in end_sync_write but didn't
make the change in sync_request_write.

This patch updates sync_request_write to fix both of those.
Patch is suitable for 3.1 and later kernels.

Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Original-version-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

md: avoid crash when stopping md array races with closing other open fds.

md will refuse to stop an array if any other fd (or mounted fs) is
using it.
When any fs is unmounted of when the last open fd is closed all
pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put)
so there will be no pending IO to worry about when the array is
stopped.

However in order to send the STOP_ARRAY ioctl to stop the array one
must first get and open fd on the block device.
If some fd is being used to write to the block device and it is closed
after mdadm open the block device, but before mdadm issues the
STOP_ARRAY ioctl, then there will be no last-close on the md device so
__blkdev_put will not call sync_blockdev.

If this happens, then IO can still be in-flight while md tears down
the array and bad things can happen (use-after-free and subsequent
havoc).

So in the case where do_md_stop is being called from an open file
descriptor, call sync_block after taking the mutex to ensure there
will be no new openers.

This is needed when setting a read-write device to read-only too.

Cc: stable@vger.kernel.org
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: fix bug in handling of new_data_offset

commit c6563a8c38fde3c1c7fc925a10bde3ca20799301
md: add possibility to change data-offset for devices.

introduced a 'new_data_offset' attribute which should normally
be the same as 'data_offset', but can be explicitly set to a different
value to allow a reshape operation to move the data.

Unfortunately when the 'data_offset' is explicitly set through
sysfs, the new_data_offset is not also set, so the two would become
out-of-sync incorrectly.

One result of this is that trying to set the 'size' after the
'data_offset' would fail because it is not permitted to set the size
when the 'data_offset' and 'new_data_offset' are different - as that
can be confusing.
Consequently when mdadm tried to do this while assembling an IMSM
array it would fail.

This bug was introduced in 3.5-rc1.

Reported-by: Brian Downing <bdowning@lavos.net>
Bisected-by: Brian Downing <bdowning@lavos.net>
Tested-by: Brian Downing <bdowning@lavos.net>
Signed-off-by: NeilBrown <neilb@suse.de>

Linux 3.5-rc7

blk: fix wrong idr_pre_get() error check in loop.c

The idr_pre_get() function never returns a value < 0. It returns 0 (no
memory) or 1 (OK).

Reported-by: Silva Paulo <psdasilva@yahoo.com>
[ Rewrote Silva's patch, but attributing it to Silva anyway - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'sound-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"Containing the regression fixes for USB-audio due to the transition to
  the new streaming logic, mostly found on Logitech webcams."

* tag 'sound-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: snd-usb: move calls to usb_set_interface
  ALSA: usb-audio: Fix the first PCM interface assignment

Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux

Pull ACPI patch from Len Brown.

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
ACPICA: Fix possible fault in return package object repair code

vsyscall_64: add missing ifdef CONFIG_SECCOMP

vsyscall_seccomp introduced a dependency on __secure_computing. On
configurations with CONFIG_SECCOMP disabled, compilation will fail.

Reported-by: feng xiangjun <fengxj325@gmail.com>
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'cpufreq-for-3.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull cpufreq fix from Rafael Wysocki:
"This fixes a regression preventing the ACPI cpufreq driver from
  loading on some systems where it worked previously without any
  problems."

* tag 'cpufreq-for-3.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq / ACPI: Fix not loading acpi-cpufreq driver regression

Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM Samsung SoC fixes from Arnd Bergmann.

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  ARM: S3C24XX: Correct CAMIF interrupt definitions
  ARM: S3C24XX: Correct AC97 clock control bit for S3C2440
  ARM: SAMSUNG: fix race in s3c_adc_start for ADC
  ARM: SAMSUNG: Update default rate for xusbxti clock
  ARM: EXYNOS: register devices in 'need_restore' state for pm_domains
  ARM: EXYNOS: read initial state of power domain from hw registers

Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU, perf, and scheduler fixes from Ingo Molnar.

The RCU fix is a revert for an optimization that could cause deadlocks.

One of the scheduler commits (164c33c6adee "sched: Fix fork() error path
to not crash") is correct but not complete (some architectures like Tile
are not covered yet) - the resulting additional fixes are still WIP and
Ingo did not want to delay these pending fixes.  See this thread on
lkml:

  [PATCH] fork: fix error handling in dup_task()

The perf fixes are just trivial oneliners.

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf kvm: Fix segfault with report and mixed guestmount use
  perf kvm: Fix regression with guest machine creation
  perf script: Fix format regression due to libtraceevent merge
  ring-buffer: Fix accounting of entries when removing pages
  ring-buffer: Fix crash due to uninitialized new_pages list head

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  MAINTAINERS/sched: Update scheduler file pattern
  sched/nohz: Rewrite and fix load-avg computation -- again
  sched: Fix fork() error path to not crash

ACPICA: Fix possible fault in return package object repair code

Fixes a problem that can occur when a lone package object is
wrapped with an outer package object in order to conform to
the ACPI specification. Can affect these predefined names:
_ALR,_MLS,_PSS,_TRT,_TSS,_PRT,_HPX,_DLM,_CSD,_PSD,_TSD

https://bugzilla.kernel.org/show_bug.cgi?id=44171

This problem was introduced in 3.4-rc1 by commit
6a99b1c94d053b3420eaa4a4bc8b2883dd90a2f9
(ACPICA: Object repair code: Support to add Package wrappers)

Reported-by: Vlastimil Babka <caster@gentoo.org>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Cc: <stable@vger.kernel.org> # 3.4
Signed-off-by: Len Brown <len.brown@intel.com>

Merge branch 'v3.5-samsung-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung into fixes

From Kukjin Kim <kgene.kim@samsung.com>:

* 'v3.5-samsung-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung:
  ARM: S3C24XX: Correct CAMIF interrupt definitions
  ARM: S3C24XX: Correct AC97 clock control bit for S3C2440
  ARM: SAMSUNG: fix race in s3c_adc_start for ADC
  ARM: SAMSUNG: Update default rate for xusbxti clock
  ARM: EXYNOS: register devices in 'need_restore' state for pm_domains
  ARM: EXYNOS: read initial state of power domain from hw registers

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Merge tag 'md-3.5-fixes' of git://neil.brown.name/md

Pull use-after-free RAID1 bugfix from NeilBrown.

* tag 'md-3.5-fixes' of git://neil.brown.name/md:
md/raid1: fix use-after-free bug in RAID1 data-check code.

Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull the leap second fixes from Thomas Gleixner:
"It's a rather large series, but well discussed, refined and reviewed.
  It got a massive testing by John, Prarit and tip.

  In theory we could split it into two parts.  The first two patches

    f55a6faa3843: hrtimer: Provide clock_was_set_delayed()
    4873fa070ae8: timekeeping: Fix leapsecond triggered load spike issue

  are merely preventing the stuff loops forever issues, which people
  have observed.

  But there is no point in delaying the other 4 commits which achieve
  full correctness into 3.6 as they are tagged for stable anyway.  And I
  rather prefer to have the full fixes merged in bulk than a "prevent
  the observable wreckage and deal with the hidden fallout later"
  approach."

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Update hrtimer base offsets each hrtimer_interrupt
  timekeeping: Provide hrtimer update function
  hrtimers: Move lock held region in hrtimer_interrupt()
  timekeeping: Maintain ktime_t based offsets for hrtimers
  timekeeping: Fix leapsecond triggered load spike issue
  hrtimer: Provide clock_was_set_delayed()

x86/vsyscall: allow seccomp filter in vsyscall=emulate

If a seccomp filter program is installed, older static binaries and
distributions with older libc implementations (glibc 2.13 and earlier)
that rely on vsyscall use will be terminated regardless of the filter
program policy when executing time, gettimeofday, or getcpu. This is
only the case when vsyscall emulation is in use (vsyscall=emulate is the
default).

This patch emulates system call entry inside a vsyscall=emulate by
populating regs->ax and regs->orig_ax with the system call number prior
to calling into seccomp such that all seccomp-dependencies function
normally. Additionally, system call return behavior is emulated in line
with other vsyscall entrypoints for the trace/trap cases.

[ v2: fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to luto@mit.edu) ]
Reported-and-tested-by: Owen Kibel <qmewlo@gmail.com>
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging

Please pull one hwmon subsystem fix from Jean Delvare.

* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
hwmon: (it87) Preserve configuration register bits on init

Merge tag 'nfs-for-3.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
- Fix an NFSv4 mount regression
- Fix O_DIRECT list manipulation snafus

* tag 'nfs-for-3.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Fix an NFSv4 mount regression
NFS: Fix list manipulation snafus in fs/nfs/direct.c

Remove easily user-triggerable BUG from generic_setlease

This can be trivially triggered from userspace by passing in something unexpected.

    kernel BUG at fs/locks.c:1468!
    invalid opcode: 0000 [#1] SMP
    RIP: 0010:generic_setlease+0xc2/0x100
    Call Trace:
      __vfs_setlease+0x35/0x40
      fcntl_setlease+0x76/0x150
      sys_fcntl+0x1c6/0x810
      system_call_fastpath+0x1a/0x1f

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: stable@kernel.org # 3.2+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input layer fixes from Dmitry Torokhov:
"The changes are limited to adding new VID/PID combinations to drivers
  to enable support for new versions of hardware, most notably hardware
  found in new MacBook Pro Retina boxes."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: xpad - add Andamiro Pump It Up pad
  Input: xpad - add signature for Razer Onza Tournament Edition
  Input: xpad - handle all variations of Mad Catz Beat Pad
  Input: bcm5974 - Add support for 2012 MacBook Pro Retina
  HID: add support for 2012 MacBook Pro Retina

Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:
- Some regression fixes at the audio part for devices with
   cx23885/cx25840
- A DMA corruption fix at cx231xx
- two fixes at the winbond IR driver
- Several fixes for the EXYNOS media driver (s5p)
- two fixes at the OMAP3 preview driver
- one fix at the dvb core failure path
- an include missing (slab.h) at smiapp-core causing compilation
   breakage
- em28xx was not loading the IR driver driver anymore.

* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (31 commits)
  [media] Revert "[media] V4L: JPEG class documentation corrections"
  [media] s5p-fimc: Add missing FIMC-LITE file operations locking
  [media] omap3isp: preview: Fix contrast and brightness handling
  [media] omap3isp: preview: Fix output size computation depending on input format
  [media] winbond-cir: Initialise timeout, driver_type and allowed_protos
  [media] winbond-cir: Fix txandrx module info
  [media] cx23885: Silence unknown command warnings
  [media] cx23885: add support for HVR-1255 analog (cx23888 variant)
  [media] cx23885: make analog support work for HVR_1250 (cx23885 variant)
  [media] cx25840: fix vsrc/hsrc usage on cx23888 designs
  [media] cx25840: fix regression in HVR-1800 analog audio
  [media] cx25840: fix regression in analog support hue/saturation controls
  [media] cx25840: fix regression in HVR-1800 analog support
  [media] s5p-mfc: Fixed setup of custom controls in decoder and encoder
  [media] cx231xx: don't DMA to random addresses
  [media] em28xx: fix em28xx-rc load
  [media] dvb-core: Release semaphore on error path dvb_register_device()
  [media] s5p-fimc: Stop media entity pipeline if fimc_pipeline_validate fails
  [media] s5p-fimc: Fix compiler warning in fimc-lite.c
  [media] s5p-fimc: media_entity_pipeline_start() may fail
  ...

Merge tag 'mmc-fixes-for-3.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc

Pull MMC fixes from Chris Ball:
- Revert a patch that made failing to select power class fatal;
   it turns out that it fails non-fatally on Tegra boards.
   Regression against 3.5-rc1.
- Add the IRQF_ONESHOT flag to the cd-gpio driver, which turned
   into a regression in 3.5-rc1 when IRQF_ONESHOT became required
   for threaded IRQs with no handler.

* tag 'mmc-fixes-for-3.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc:
  mmc: cd-gpio: pass IRQF_ONESHOT to request_threaded_irq()
  mmc: core: Revert "skip card initialization if power class selection fails"

Merge tag 'for-linus-20120712' of git://git.infradead.org/linux-mtd

Pull late MTD fixes from David Woodhouse:
- fix 'sparse warning fix' regression which totally breaks MXC NAND
- fix GPMI NAND regression when used with UBI
- update/correct sysfs documentation for new 'bitflip_threshold' field
- fix nandsim build failure

* tag 'for-linus-20120712' of git://git.infradead.org/linux-mtd:
  mtd: nandsim: don't open code a do_div helper
  mtd: ABI documentation: clarification of bitflip_threshold
  mtd: gpmi-nand: fix read page when reading to vmalloced area
  mtd: mxc_nand: use 32bit copy functions

Merge tag 'mfd-for-linus-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6

Pull MFD Fixes from Samuel Ortiz:
- Three Palmas fixes, One of them being a build error fix.
- Two mc13xx fixes.  One for fixing an SPI regmap configuration and
   another one for working around an i.Mx hardware bug.
- One omap-usb regression fix.
- One twl6040 build breakage fix.
- One file deletion (ab5500-core.h) that was overlooked during the last
   merge window.

* tag 'mfd-for-linus-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6:
  mfd: Add missing hunk to change palmas irq to clear on read
  mfd: Fix palmas regulator pdata missing
  mfd: USB: Fix the omap-usb EHCI ULPI PHY reset fix issues.
  mfd: Update twl6040 Kconfig to avoid build breakage
  mfd: Delete ab5500-core.h
  mfd: mc13xxx workaround SPI hardware bug on i.Mx
  mfd: Fix mc13xxx SPI regmap
  mfd: Add terminating entry for i2c_device_id palmas table

Merge tag 'sh-for-linus' of git://github.com/pmundt/linux-sh

Pull SuperH fixes from Paul Mundt.

* tag 'sh-for-linus' of git://github.com/pmundt/linux-sh:
SH: Convert out[bwl] macros to inline functions
sh: Fix up se7721 GPIOLIB=y build warnings.

Merge git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull a couple of KVM fixes from Avi Kivity:
"One is an adjustment for an irq layer change that affected device
  assignment, the other a one-liner ppc fix."

* git://git.kernel.org/pub/scm/virt/kvm/kvm:
  powerpc/kvm: Fix "PR" KVM implementation of H_CEDE
  KVM: Fix device assignment threaded irq handler

block: fix infinite loop in __getblk_slow

Commit 080399aaaf35 ("block: don't mark buffers beyond end of disk as
mapped") exposed a bug in __getblk_slow that causes mount to hang as it
loops infinitely waiting for a buffer that lies beyond the end of the
disk to become uptodate.

The problem was initially reported by Torsten Hilbrich here:

    https://lkml.org/lkml/2012/6/18/54

and also reported independently here:

    http://www.sysresccd.org/forums/viewtopic.php?f=13&t=4511

and then Richard W.M.  Jones and Marcos Mello noted a few separate
bugzillas also associated with the same issue.  This patch has been
confirmed to fix:

    https://bugzilla.redhat.com/show_bug.cgi?id=835019

The main problem is here, in __getblk_slow:

        for (;;) {
                struct buffer_head * bh;
                int ret;

                bh = __find_get_block(bdev, block, size);
                if (bh)
                        return bh;

                ret = grow_buffers(bdev, block, size);
                if (ret < 0)
                        return NULL;
                if (ret == 0)
                        free_more_memory();
        }

__find_get_block does not find the block, since it will not be marked as
mapped, and so grow_buffers is called to fill in the buffers for the
associated page.  I believe the for (;;) loop is there primarily to
retry in the case of memory pressure keeping grow_buffers from
succeeding.  However, we also continue to loop for other cases, like the
block lying beond the end of the disk.  So, the fix I came up with is to
only loop when grow_buffers fails due to memory allocation issues
(return value of 0).

The attached patch was tested by myself, Torsten, and Rich, and was
found to resolve the problem in call cases.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reported-and-Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Josh Boyer <jwboyer@redhat.com>
Cc: Stable <stable@vger.kernel.org> # 3.0+
[ Jens is on vacation, taking this directly  - Linus ]
--
Stable Notes: this patch requires backport to 3.0, 3.2 and 3.3.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ARM: S3C24XX: Correct CAMIF interrupt definitions

Properly define the CAMIF interrupt resources. This device have two
interrupts - corresponding to the "codec" and "preview" data paths.
IRQ_CAM is handled internally by the architecture and demultiplexed
to IRQ_S3C2440_CAM_C and IRQ_S3C2440_CAM_P - these interrupts only
should be handled in the driver.

Signed-off-by: Sylwester Nawrocki <sylvester.nawrocki@gmail.com>
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>

ARM: S3C24XX: Correct AC97 clock control bit for S3C2440

Use correct gate control bit for AC97 clock which is
S3C2440_CLKCON_AC97, not S3C2440_CLKCON_CAMERA.

Signed-off-by: Sylwester Nawrocki <sylvester.nawrocki@gmail.com>
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>

ALSA: snd-usb: move calls to usb_set_interface

The rework of the snd-usb endpoint logic moved the calls to
snd_usb_set_interface() into the snd_usb_endpoint implemenation. This
changed the order in which these calls are issued to the device, and
thereby caused regressions for some webcams.

Fix this by moving the calls back to pcm.c for now to make it work again
and use snd_usb_endpoint_activate() to really tear down all remaining
URBs in the flight, consequently fixing another regression caused by USB
packets on the wire after altsetting 0 has been selected.

Signed-off-by: Daniel Mack <zonque@gmail.com>
Reported-and-tested-by: Philipp Dreimann <philipp@dreimann.net>
Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>

Input: xpad - add Andamiro Pump It Up pad

I couldn't find the vendor ID in any of the online databases, but this
mat has a Pump It Up logo on the top side of the controller compartment,
and a disclaimer stating that Andamiro will not be liable on the bottom.

Signed-off-by: Yuri Khan <yurivkhan@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

ARM: SAMSUNG: fix race in s3c_adc_start for ADC

Checking for adc->ts_pend already claimed should be done with the
lock held.

Signed-off-by: Todd Poynor <toddpoynor@google.com>
Acked-by: Ben Dooks <ben-linux@fluff.org>
Cc: Stable <stable@vger.kernel.org>
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>

ARM: SAMSUNG: Update default rate for xusbxti clock

The rate of xusbxti clock is set in individual machine files.
The default value should be defined at the clock definition
and individual machine files should modify it if required.

Division by zero in kernel.
[<c0011849>] (unwind_backtrace+0x1/0x9c) from [<c022c663>] (Ldiv0+0x9/0x12)
[<c022c663>] (Ldiv0+0x9/0x12) from [<c001a3c3>] (s3c_setrate_clksrc+0x33/0x78)
[<c001a3c3>] (s3c_setrate_clksrc+0x33/0x78) from [<c0019e67>] (clk_set_rate+0x2f/0x78)

Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
Cc: Stable <stable@vger.kernel.org>
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>

hwmon: (it87) Preserve configuration register bits on init

We were accidentally losing one bit in the configuration register on
device initialization. It was reported to freeze one specific system
right away. Properly preserve all bits we don't explicitly want to
change in order to prevent that.

Reported-by: Stevie Trujillo <stevie.trujillo@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>

cpufreq / ACPI: Fix not loading acpi-cpufreq driver regression

Commit d640113fe80e45ebd4a5b420b introduced a regression on SMP
systems where the processor core with ACPI id zero is disabled
(typically should be the case because of hyperthreading).
The regression got spread through stable kernels.
On 3.0.X it got introduced via 3.0.18.

Such platforms may be rare, but do exist.
Look out for a disabled processor with acpi_id 0 in dmesg:
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x10] disabled)

This problem has been observed on a:
HP Proliant BL280c G6 blade

This patch restricts the introduced workaround to platforms
with nr_cpu_ids <= 1.

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: stable@vger.kernel.org
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

ARM: EXYNOS: register devices in 'need_restore' state for pm_domains

Commit ca1d72f033 ('PM / Domains: Make it possible to add devices to
inactive domains') introduced possibility to add devices to inactive
power domains and added pm_genpd_dev_need_restore() function which lets
platform core to notify power domain core that the specified device must
be restored (with its runtime_resume() callback) before first use.

This patch adds the pm_genpd_dev_need_restore() call what brings back
the suspend/resume behaviour for the client devices known from the
previous power domain driver (removed by commit 91cfbd4ee0 - 'ARM:
EXYNOS: Hook up power domains to generic power domain infrastructure').
Client device drivers relay on that suspend/resume behaviour, thus this
patch fixes runtime pm operation for client devices.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>

ARM: EXYNOS: read initial state of power domain from hw registers

Some bootloaders disable unused power domains to reduce power
consuption. Power domain driver can easily read the actual state from
the hardware registers instead of assuming that their initial state is
always 'on'.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>

SH: Convert out[bwl] macros to inline functions

The macros just called BUG(), but that results in unused variable
warnings all over the place, like in the IPMI driver. The build
regression emails were annoying me, so here's the fix. I have
not even compile tested this, but it's rather obvious.

[ port type mangled to unsigned long ]

Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

Merge tag 'fbdev-fixes-for-3.5-2' of git://github.com/schandinat/linux-2.6

Pull fbdev fixes from Florian Tobias Schandinat:
"Two fixes for OMAPDSS by Tomi Valkeinen:
   - one to avoid warnings when runtime PM is not enabled
   - one workaround to dependancy issues during suspend/resume"

* tag 'fbdev-fixes-for-3.5-2' of git://github.com/schandinat/linux-2.6:
  OMAPDSS: fix warnings if CONFIG_PM_RUNTIME=n
  OMAPDSS: Use PM notifiers for system suspend

Merge branch 'akpm' (Andrew's patch-bomb)

Merge random patches from Andrew Morton.

* Merge emailed patches from Andrew Morton <akpm@linux-foundation.org>: (32 commits)
  memblock: free allocated memblock_reserved_regions later
  mm: sparse: fix usemap allocation above node descriptor section
  mm: sparse: fix section usemap placement calculation
  xtensa: fix incorrect memset
  shmem: cleanup shmem_add_to_page_cache
  shmem: fix negative rss in memcg memory.stat
  tmpfs: revert SEEK_DATA and SEEK_HOLE
  drivers/rtc/rtc-twl.c: fix threaded IRQ to use IRQF_ONESHOT
  fat: fix non-atomic NFS i_pos read
  MAINTAINERS: add OMAP CPUfreq driver to OMAP Power Management section
  sgi-xp: nested calls to spin_lock_irqsave()
  fs: ramfs: file-nommu: add SetPageUptodate()
  drivers/rtc/rtc-mxc.c: fix irq enabled interrupts warning
  mm/memory_hotplug.c: release memory resources if hotadd_new_pgdat() fails
  h8300/uaccess: add mising __clear_user()
  h8300/uaccess: remove assignment to __gu_val in unhandled case of get_user()
  h8300/time: add missing #include <asm/irq_regs.h>
  h8300/signal: fix typo "statis"
  h8300/pgtable: add missing #include <asm-generic/pgtable.h>
  drivers/rtc/rtc-ab8500.c: ensure correct probing of the AB8500 RTC when Device Tree is enabled
  ...

memblock: free allocated memblock_reserved_regions later

memblock_free_reserved_regions() calls memblock_free(), but
memblock_free() would double reserved.regions too, so we could free the
old range for reserved.regions.

Also tj said there is another bug which could be related to this.

| I don't think we're saving any noticeable
| amount by doing this "free - give it to page allocator - reserve
| again" dancing.  We should just allocate regions aligned to page
| boundaries and free them later when memblock is no longer in use.

in that case, when DEBUG_PAGEALLOC, will get panic:

     memblock_free: [0x0000102febc080-0x0000102febf080] memblock_free_reserved_regions+0x37/0x39
  BUG: unable to handle kernel paging request at ffff88102febd948
  IP: [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155
  PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 800000102febd160
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  CPU 0
  Pid: 0, comm: swapper Not tainted 3.5.0-rc2-next-20120614-sasha #447
  RIP: 0010:[<ffffffff836a5774>]  [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155

See the discussion at https://lkml.org/lkml/2012/6/13/469

So try to allocate with PAGE_SIZE alignment and free it later.

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: sparse: fix usemap allocation above node descriptor section

After commit f5bf18fa22f8 ("bootmem/sparsemem: remove limit constraint
in alloc_bootmem_section"), usemap allocations may easily be placed
outside the optimal section that holds the node descriptor, even if
there is space available in that section. This results in unnecessary
hotplug dependencies that need to have the node unplugged before the
section holding the usemap.

The reason is that the bootmem allocator doesn't guarantee a linear
search starting from the passed allocation goal but may start out at a
much higher address absent an upper limit.

Fix this by trying the allocation with the limit at the section end,
then retry without if that fails. This keeps the fix from f5bf18fa22f8
of not panicking if the allocation does not fit in the section, but
still makes sure to try to stay within the section at first.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org> [3.3.x, 3.4.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: sparse: fix section usemap placement calculation

Commit 238305bb4d41 ("mm: remove sparsemem allocation details from the
bootmem allocator") introduced a bug in the allocation goal calculation
that put section usemaps not in the same section as the node
descriptors, creating unnecessary hotplug dependencies between them:

  node 0 must be removed before remove section 16399
  node 1 must be removed before remove section 16399
  node 2 must be removed before remove section 16399
  node 3 must be removed before remove section 16399
  node 4 must be removed before remove section 16399
  node 5 must be removed before remove section 16399
  node 6 must be removed before remove section 16399

The reason is that it applies PAGE_SECTION_MASK to the physical address
of the node descriptor when finding a suitable place to put the usemap,
when this mask is actually intended to be used with PFNs.  Because the
PFN mask is wider, the target address will point beyond the wanted
section holding the node descriptor and the node must be offlined before
the section holding the usemap can go.

Fix this by extending the mask to address width before use.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

xtensa: fix incorrect memset

Addresses: https://bugzilla.kernel.org/show_bug.cgi?id=43871

Reported-by: <dcb314@hotmail.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

shmem: cleanup shmem_add_to_page_cache

shmem_add_to_page_cache() has three callsites, but only one of them wants
the radix_tree_preload() (an exceptional entry guarantees that the radix
tree node is present in the other cases), and only that site can achieve
mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a no-op in the
other cases). We did it this way originally to reflect
add_to_page_cache_locked(); but it's confusing now, so move the radix_tree
preloading and mem_cgroup uncharging to that one caller.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

shmem: fix negative rss in memcg memory.stat

When adding the page_private checks before calling shmem_replace_page(), I
did realize that there is a further race, but thought it too unlikely to
need a hurried fix.

But independently I've been chasing why a mem cgroup's memory.stat
sometimes shows negative rss after all tasks have gone: I expected it to
be a stats gathering bug, but actually it's shmem swapping's fault.

It's an old surprise, that when you lock_page(lookup_swap_cache(swap)),
the page may have been removed from swapcache before getting the lock; or
it may have been freed and reused and be back in swapcache; and it can
even be using the same swap location as before (page_private same).

The swapoff case is already secure against this (swap cannot be reused
until the whole area has been swapped off, and a new swapped on); and
shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check for
the expected radix_tree entry - but a little too late.

By that time, we might have already decided to shmem_replace_page(): I
don't know of a problem from that, but I'd feel more at ease not to do so
spuriously.  And we have already done mem_cgroup_cache_charge(), on
perhaps the wrong mem cgroup: and this charge is not then undone on the
error path, because PageSwapCache ends up preventing that.

It's this last case which causes the occasional negative rss in
memory.stat: the page is charged here as cache, but (sometimes) found to
be anon when eventually it's uncharged - and in between, it's an
undeserved charge on the wrong memcg.

Fix this by adding an earlier check on the radix_tree entry: it's
inelegant to descend the tree twice, but swapping is not the fast path,
and a better solution would need a pair (try+commit) of memcg calls, and a
rework of shmem_replace_page() to keep out of the swapcache.

We can use the added shmem_confirm_swap() function to replace the
find_get_page+page_cache_release we were already doing on the error path.
And add a comment on that -EEXIST: it seems a peculiar errno to be using,
but originates from its use in radix_tree_insert().

[It can be surprising to see positive rss left in a memcg's memory.stat
after all tasks have gone, since it is supposed to count anonymous but not
shmem.  Aside from sharing anon pages via fork with a task in some other
memcg, it often happens after swapping: because a swap page can't be freed
while under writeback, nor while locked.  So it's not an error, and these
residual pages are easily freed once pressure demands.]

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

tmpfs: revert SEEK_DATA and SEEK_HOLE

Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").  I believe
it's correct, and it's been nice to have from rc1 to rc6; but as the
original commit said:

I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
would be of any use to them on tmpfs.  This code adds 92 lines and 752
bytes on x86_64 - is that bloat or worthwhile?

Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs to
the dumb generic support for v3.5.  We can always reinstate it later if
useful, and anyone needing it in a hurry can just get it out of git.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Josef Bacik <josef@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Jeff liu <jeff.liu@oracle.com>
Cc: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

drivers/rtc/rtc-twl.c: fix threaded IRQ to use IRQF_ONESHOT

Requesting a threaded interrupt without a primary handler and without
IRQF_ONESHOT is dangerous, and after commit 1c6c6952 ("genirq: Reject
bogus threaded irq requests"), these requests are rejected. This causes
->probe() to fail, and the RTC driver not to be availble.

To fix, add IRQF_ONESHOT to the IRQ flags.

Tested on OMAP3730/OveroSTORM and OMAP4430/Panda board using rtcwake to
wake from system suspend multiple times.

Signed-off-by: Kevin Hilman <khilman@ti.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fat: fix non-atomic NFS i_pos read

fat_encode_fh() can fetch an invalid i_pos value on systems where 64-bit
accesses are not atomic. Make it use the same accessor as the rest of the
FAT code.

Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

MAINTAINERS: add OMAP CPUfreq driver to OMAP Power Management section

Add the OMAP CPUFreq driver to the list of files in the OMAP Power
Management section.

I've already been maintaining this driver, this just makes it official.

Signed-off-by: Kevin Hilman <khilman@ti.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

sgi-xp: nested calls to spin_lock_irqsave()

The code here has a nested spin_lock_irqsave(). It's not needed since
IRQs are already disabled and it causes a problem because it means that
IRQs won't be enabled again at the end. The second call to
spin_lock_irqsave() will overwrite the value of irq_flags and we can't
restore the proper settings.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: Jack Steiner <steiner@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs: ramfs: file-nommu: add SetPageUptodate()

There is a bug in the below scenario for !CONFIG_MMU:

1. create a new file
2. mmap the file and write to it
3. read the file can't get the correct value

Because

sys_read() -> generic_file_aio_read() -> simple_readpage() -> clear_page()

which causes the page to be zeroed.

Add SetPageUptodate() to ramfs_nommu_expand_for_mapping() so that
generic_file_aio_read() do not call simple_readpage().

Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

drivers/rtc/rtc-mxc.c: fix irq enabled interrupts warning

Fixes

  WARNING: at irq/handle.c:146 handle_irq_event_percpu+0x19c/0x1b8()
  irq 25 handler mxc_rtc_interrupt+0x0/0xac enabled interrupts
  Modules linked in:
   (unwind_backtrace+0x0/0xf0) from (warn_slowpath_common+0x4c/0x64)
   (warn_slowpath_common+0x4c/0x64) from (warn_slowpath_fmt+0x30/0x40)
   (warn_slowpath_fmt+0x30/0x40) from (handle_irq_event_percpu+0x19c/0x1b8)
   (handle_irq_event_percpu+0x19c/0x1b8) from (handle_irq_event+0x28/0x38)
   (handle_irq_event+0x28/0x38) from (handle_level_irq+0x80/0xc4)
   (handle_level_irq+0x80/0xc4) from (generic_handle_irq+0x24/0x38)
   (generic_handle_irq+0x24/0x38) from (handle_IRQ+0x30/0x84)
   (handle_IRQ+0x30/0x84) from (avic_handle_irq+0x2c/0x4c)
   (avic_handle_irq+0x2c/0x4c) from (__irq_svc+0x40/0x60)
  Exception stack(0xc050bf60 to 0xc050bfa8)
  bf60: 00000001 00000000 003c4208 c0018e20 c050a000 c050a000 c054a4c8 c050a000
  bf80: c05157a8 4117b363 80503bb4 00000000 01000000 c050bfa8 c0018e2c c000e808
  bfa0: 60000013 ffffffff
   (__irq_svc+0x40/0x60) from (default_idle+0x1c/0x30)
   (default_idle+0x1c/0x30) from (cpu_idle+0x68/0xa8)
   (cpu_idle+0x68/0xa8) from (start_kernel+0x22c/0x26c)

Signed-off-by: Benoît Thébaudeau <benoit.thebaudeau@advansee.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Sascha Hauer <kernel@pengutronix.de>
Acked-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/memory_hotplug.c: release memory resources if hotadd_new_pgdat() fails

We should goto error to release memory resource if hotadd_new_pgdat()
failed.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki ISIMATU <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

h8300/uaccess: add mising __clear_user()

Fix the build error:

include/linux/regset.h: In function 'user_regset_copyout_zero':
include/linux/regset.h:289:3: error: implicit declaration of function '__clear_user' [-Werror=implicit-function-declaration]

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

h8300/uaccess: remove assignment to __gu_val in unhandled case of get_user()

__gu_val is const if the passed ptr is const, giving:

  include/linux/pagemap.h: In function 'fault_in_pages_readable':
  include/linux/pagemap.h:442:2: error: assignment of read-only variable '__gu_val'
  include/linux/pagemap.h:448:4: error: assignment of read-only variable '__gu_val'
  include/linux/pagemap.h: In function 'fault_in_multipages_readable':
  include/linux/pagemap.h:499:3: error: assignment of read-only variable '__gu_val'
  include/linux/pagemap.h:508:3: error: assignment of read-only variable '__gu_val'
  make[4]: *** [init/main.o] Error 1

As we don't care about the actual value of __gu_val in the unhandled
case (it will cause a link error anyway), just remove the assignment.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

h8300/time: add missing #include <asm/irq_regs.h>

Fix the build error:

  arch/h8300/kernel/time.c: In function 'h8300_timer_tick':
  arch/h8300/kernel/time.c:39:2: error: implicit declaration of function 'get_irq_regs' [-Werror=implicit-function-declaration]
  arch/h8300/kernel/time.c:39:42: error: invalid type argument of '->' (have 'int')

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

h8300/signal: fix typo "statis"

The keyword is "static", not "statis":

  arch/h8300/kernel/signal.c:455:8: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'void'
  arch/h8300/kernel/signal.c: In function 'do_notify_resume':
  arch/h8300/kernel/signal.c:511:3: error: implicit declaration of function 'do_signal' [-Werror=implicit-function-declaration]
  arch/h8300/kernel/signal.c: At top level:
  arch/h8300/kernel/signal.c:414:1: warning: 'handle_signal' defined but not used [-Wunused-function]

Introduced in commit 7ae4e32a6514 ("h8300: switch to saved_sigmask-based
sigsuspend/rt_sigsuspend")

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

h8300/pgtable: add missing #include <asm-generic/pgtable.h>

Fix the h8300 build error:

kernel/sched/core.c: In function 'context_switch':
kernel/sched/core.c:2061:2: error: implicit declaration of function 'arch_start_context_switch' [-Werror=implicit-function-declaration]

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

drivers/rtc/rtc-ab8500.c: ensure correct probing of the AB8500 RTC when Device Tree is enabled

Without this patch, if Device Tree is enabled the AB8500 RTC wouldn't get
probed at all, as there is no reference to it from platform code. This
patch ensures the driver is probed during normal DT start-up.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

drivers/rtc/rtc-ab8500.c: use IRQF_ONESHOT when requesting a threaded IRQ

This driver's IRQ registration is failing because the kernel now forces
IRQs to be ONESHOT if no IRQ handler is passed.

Signed-off-by: Lee Jones <lee.jones@linaro.org>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm, thp: abort compaction if migration page cannot be charged to memcg

If page migration cannot charge the temporary page to the memcg,
migrate_pages() will return -ENOMEM.  This isn't considered in memory
compaction however, and the loop continues to iterate over all
pageblocks trying to isolate and migrate pages.  If a small number of
very large memcgs happen to be oom, however, these attempts will mostly
be futile leading to an enormous amout of cpu consumption due to the
page migration failures.

This patch will short circuit and fail memory compaction if
migrate_pages() returns -ENOMEM.  COMPACT_PARTIAL is returned in case
some migrations were successful so that the page allocator will retry.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c/r: prctl: less paranoid prctl_set_mm_exe_file()

"no other files mapped" requirement from my previous patch (c/r: prctl:
update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal) is too
paranoid, it forbids operation even if there mapped one shared-anon vma.

Let's check that current mm->exe_file already unmapped, in this case
exe_file symlink already outdated and its changing is reasonable.

Plus, this patch fixes exit code in case operation success.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ocfs2: fix NULL pointer dereference in __ocfs2_change_file_space()

As ocfs2_fallocate() will invoke __ocfs2_change_file_space() with a NULL
as the first parameter (file), it may trigger a NULL pointer dereferrence
due to a missing check.

Addresses http://bugs.launchpad.net/bugs/1006012

Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Reported-by: Bret Towe <magnade@gmail.com>
Tested-by: Bret Towe <magnade@gmail.com>
Cc: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mn10300: use "#elif defined(CONFIG_*)" instead of "#elif CONFIG_*"

Fix the warnings:

arch/mn10300/kernel/irq.c:173:7: warning: "CONFIG_MN10300_TTYSM1_TIMER9" is not defined [-Wundef]
arch/mn10300/kernel/irq.c:175:7: warning: "CONFIG_MN10300_TTYSM1_TIMER3" is not defined [-Wundef]

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mn10300: mm/dma-alloc.c needs <linux/export.h>

Fix the warnings:

  arch/mn10300/mm/dma-alloc.c: At top level:
  arch/mn10300/mm/dma-alloc.c:63:1: warning: data definition has no type or storage class [enabled by default]
  arch/mn10300/mm/dma-alloc.c:63:1: warning: type defaults to 'int' in declaration of 'EXPORT_SYMBOL' [-Wimplicit-int]
  arch/mn10300/mm/dma-alloc.c:63:1: warning: parameter names (without types) in function declaration [enabled by default]
  arch/mn10300/mm/dma-alloc.c:75:1: warning: data definition has no type or storage class [enabled by default]
  arch/mn10300/mm/dma-alloc.c:75:1: warning: type defaults to 'int' in declaration of 'EXPORT_SYMBOL' [-Wimplicit-int]
  arch/mn10300/mm/dma-alloc.c:75:1: warning: parameter names (without types) in function declaration [enabled by default]

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mn10300: kernel/traps.c needs <linux/export.h>

Fix the warning:

  arch/mn10300/kernel/traps.c:304:1: warning: data definition has no type or storage class [enabled by default]
  arch/mn10300/kernel/traps.c:304:1: warning: type defaults to 'int' in declaration of 'EXPORT_SYMBOL' [-Wimplicit-int]
  arch/mn10300/kernel/traps.c:304:1: warning: parameter names (without types) in function declaration [enabled by default]

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mn10300: kernel/internal.h needs <linux/irqreturn.h>

Fix the nm10300 build failure:

In file included from arch/mn10300/kernel/csrc-mn10300.c:14:0:
arch/mn10300/kernel/internal.h:42:1: error: unknown type name 'irqreturn_t'

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mn10300: remove duplicate definition of PTRACE_O_TRACESYSGOOD

Fix the warning:

include/linux/ptrace.h:66:0: warning: "PTRACE_O_TRACESYSGOOD" redefined [enabled by default]
arch/mn10300/include/asm/ptrace.h:85:0: note: this is the location of the previous definition

We already have it in <linux/ptrace.h>, so remove it from <asm/ptrace.h>

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mn10300: move setup_jiffies_interrupt() to cevt-mn10300.c

Move the static inline function setup_jiffies_interrupt() from
<asm/timex.h> to arch/mn10300/kernel/cevt-mn10300.c, which is its only
callsite.

This allows to remove the inclusion of <asm/hardirq.h> and <linux/irq.h>
from <asm/timex.h> and <unit/timex.h>, fixing include hell like:

  include/linux/jiffies.h:260:31: warning: "CLOCK_TICK_RATE" is not defined [-Wundef]
  include/linux/jiffies.h:260:31: warning: "CLOCK_TICK_RATE" is not defined [-Wundef]
  include/linux/jiffies.h:46:42: error: division by zero in #if
  ...
  make[4]: *** [arch/mn10300/kernel/asm-offsets.s] Error 1

and (after a quick hack for the above by defining CLOCK_TICK_RATE in
<linux/jiffies.h>):

  In file included from include/linux/notifier.h:15:0,
                 from include/linux/memory_hotplug.h:6,
                 from include/linux/mmzone.h:718,
                 from include/linux/gfp.h:4,
                 from include/linux/irq.h:20,
                 from arch/mn10300/unit-asb2303/include/unit/timex.h:15,
                 from arch/mn10300/include/asm/timex.h:15,
                 from include/linux/timex.h:174,
                 from include/linux/jiffies.h:8,
                 from include/linux/ktime.h:25,
                 from include/linux/timer.h:5,
                 from include/linux/workqueue.h:8,
  include/linux/srcu.h:55:22: error: field 'work' has incomplete type

As a consequence, we do need a few more inclusions of <asm/irq.h>, namely
in arch/mn10300/unit-asb2303/smc91111.c and
arch/mn10300/unit-asb2305/unit-init.c.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

drivers/rtc/rtc-spear.c: fix use-after-free in spear_rtc_remove()

`config' is freed and is then used in the rtc_device_unregister() call,
causing a kernel panic.

Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
Reviewed-by: Viresh Kumar <viresh.linux@gmail.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memory hotplug: fix invalid memory access caused by stale kswapd pointer

kswapd_stop() is called to destroy the kswapd work thread when all memory
of a NUMA node has been offlined.  But kswapd_stop() only terminates the
work thread without resetting NODE_DATA(nid)->kswapd to NULL.  The stale
pointer will prevent kswapd_run() from creating a new work thread when
adding memory to the memory-less NUMA node again.  Eventually the stale
pointer may cause invalid memory access.

An example stack dump as below. It's reproduced with 2.6.32, but latest
kernel has the same issue.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff81051a94>] exit_creds+0x12/0x78
  PGD 0
  Oops: 0000 [#1] SMP
  last sysfs file: /sys/devices/system/memory/memory391/state
  CPU 11
  Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
  Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
  RIP: 0010:exit_creds+0x12/0x78
  RSP: 0018:ffff8806044f1d78  EFLAGS: 00010202
  RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
  RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
  RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
  R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
  R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
  FS:  00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
  Stack:
   ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
   ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
   0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
  Call Trace:
    __put_task_struct+0x5d/0x97
    kthread_stop+0x50/0x58
    offline_pages+0x324/0x3da
    memory_block_change_state+0x179/0x1db
    store_mem_state+0x9e/0xbb
    sysfs_write_file+0xd0/0x107
    vfs_write+0xad/0x169
    sys_write+0x45/0x6e
    system_call_fastpath+0x16/0x1b
  Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
  RIP  exit_creds+0x12/0x78
   RSP <ffff8806044f1d78>
  CR2: 0000000000000000

[akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

hrtimer: Update hrtimer base offsets each hrtimer_interrupt

The update of the hrtimer base offsets on all cpus cannot be made
atomically from the timekeeper.lock held and interrupt disabled region
as smp function calls are not allowed there.

clock_was_set(), which enforces the update on all cpus, is called
either from preemptible process context in case of do_settimeofday()
or from the softirq context when the offset modification happened in
the timer interrupt itself due to a leap second.

In both cases there is a race window for an hrtimer interrupt between
dropping timekeeper lock, enabling interrupts and clock_was_set()
issuing the updates. Any interrupt which arrives in that window will
see the new time but operate on stale offsets.

So we need to make sure that an hrtimer interrupt always sees a
consistent state of time and offsets.

ktime_get_update_offsets() allows us to get the current monotonic time
and update the per cpu hrtimer base offsets from hrtimer_interrupt()
to capture a consistent state of monotonic time and the offsets. The
function replaces the existing ktime_get() calls in hrtimer_interrupt().

The overhead of the new function vs. ktime_get() is minimal as it just
adds two store operations.

This ensures that any changes to realtime or boottime offsets are
noticed and stored into the per-cpu hrtimer base structures, prior to
any hrtimer expiration and guarantees that timers are not expired early.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-8-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

timekeeping: Provide hrtimer update function

To finally fix the infamous leap second issue and other race windows
caused by functions which change the offsets between the various time
bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a
function which atomically gets the current monotonic time and updates
the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic
overhead. The previous patch which provides ktime_t offsets allows us
to make this function almost as cheap as ktime_get() which is going to
be replaced in hrtimer_interrupt().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

hrtimers: Move lock held region in hrtimer_interrupt()

We need to update the base offsets from this code and we need to do
that under base->lock. Move the lock held region around the
ktime_get() calls. The ktime_get() calls are going to be replaced with
a function which gets the time and the offsets atomically.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/1341960205-56738-6-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

timekeeping: Maintain ktime_t based offsets for hrtimers

We need to update the hrtimer clock offsets from the hrtimer interrupt
context. To avoid conversions from timespec to ktime_t maintain a
ktime_t based representation of those offsets in the timekeeper. This
puts the conversion overhead into the code which updates the
underlying offsets and provides fast accessible values in the hrtimer
interrupt.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-4-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

timekeeping: Fix leapsecond triggered load spike issue

The timekeeping code misses an update of the hrtimer subsystem after a
leap second happened. Due to that timers based on CLOCK_REALTIME are
either expiring a second early or late depending on whether a leap
second has been inserted or deleted until an operation is initiated
which causes that update. Unless the update happens by some other
means this discrepancy between the timekeeping and the hrtimer data
stays forever and timers are expired either early or late.

The reported immediate workaround - $ data -s "`date`" - is causing a
call to clock_was_set() which updates the hrtimer data structures.
See: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix

Add the missing clock_was_set() call to update_wall_time() in case of
a leap second event. The actual update is deferred to softirq context
as the necessary smp function call cannot be invoked from hard
interrupt context.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reported-by: Jan Engelhardt <jengelh@inai.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-3-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>