Jingoo Han [Wed, 20 Feb 2013 02:15:23 +0000 (13:15 +1100)]
backlight: vgg2432a4: use spi_get_drvdata and spi_set_drvdata
Use the wrapper functions for getting and setting the driver data using
spi_device instead of using dev_{get|set}_drvdata with &spi->dev, so we
can directly pass a struct spi_device.
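The conversion pattern, sketched with a hypothetical driver-private type (the same change is repeated in each driver below):

	struct driver_data *dd;	/* hypothetical driver-private state */

	/* before: open-coded through the embedded struct device */
	dev_set_drvdata(&spi->dev, dd);
	dd = dev_get_drvdata(&spi->dev);

	/* after: pass the struct spi_device directly */
	spi_set_drvdata(spi, dd);
	dd = spi_get_drvdata(spi);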
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Wed, 20 Feb 2013 02:15:23 +0000 (13:15 +1100)]
backlight: ams369fg06: use spi_get_drvdata and spi_set_drvdata
Use the wrapper functions for getting and setting the driver data using
spi_device instead of using dev_{get|set}_drvdata with &spi->dev, so we
can directly pass a struct spi_device.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Wed, 20 Feb 2013 02:15:22 +0000 (13:15 +1100)]
backlight: lms283gf05: use spi_get_drvdata and spi_set_drvdata
Use the wrapper functions for getting and setting the driver data using
spi_device instead of using dev_{get|set}_drvdata with &spi->dev, so we
can directly pass a struct spi_device.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Wed, 20 Feb 2013 02:15:22 +0000 (13:15 +1100)]
backlight: tdo24m: use spi_get_drvdata and spi_set_drvdata
Use the wrapper functions for getting and setting the driver data using
spi_device instead of using dev_{get|set}_drvdata with &spi->dev, so we
can directly pass a struct spi_device.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Wed, 20 Feb 2013 02:15:22 +0000 (13:15 +1100)]
backlight: ltv350qv: use spi_get_drvdata and spi_set_drvdata
Use the wrapper functions for getting and setting the driver data using
spi_device instead of using dev_{get|set}_drvdata with &spi->dev, so we
can directly pass a struct spi_device.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Wed, 20 Feb 2013 02:15:21 +0000 (13:15 +1100)]
backlight: s6e63m0: use spi_get_drvdata and spi_set_drvdata
Use the wrapper functions for getting and setting the driver data using
spi_device instead of using dev_{get|set}_drvdata with &spi->dev, so we
can directly pass a struct spi_device.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Wed, 20 Feb 2013 02:15:21 +0000 (13:15 +1100)]
backlight: ld9040: use spi_get_drvdata and spi_set_drvdata
Use the wrapper functions for getting and setting the driver data using
spi_device instead of using dev_{get|set}_drvdata with &spi->dev, so we
can directly pass a struct spi_device.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Wed, 20 Feb 2013 02:15:21 +0000 (13:15 +1100)]
backlight: l4f00242t03: use spi_get_drvdata and spi_set_drvdata
Use the wrapper functions for getting and setting the driver data using
spi_device instead of using dev_{get|set}_drvdata with &spi->dev, so we
can directly pass a struct spi_device.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
WARNING: please write a paragraph that describes the config symbol fully
#45: FILE: drivers/video/backlight/Kconfig:380:
+config BACKLIGHT_LP8788
ERROR: code indent should use tabs where possible
#446: FILE: include/linux/mfd/lp8788.h:242:
+^I^I Only valid when bl_mode is LP8788_BL_COMB_PWM_BASED$
total: 1 errors, 1 warnings, 403 lines checked
NOTE: whitespace errors detected, you may wish to use scripts/cleanpatch or
scripts/cleanfile
./patches/backlight-add-new-lp8788-backlight-driver.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: "Kim, Milo" <Milo.Kim@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kim, Milo [Wed, 20 Feb 2013 02:15:20 +0000 (13:15 +1100)]
backlight: add new lp8788 backlight driver
TI LP8788 PMU supports regulators, battery charger, RTC, ADC, backlight
driver and current sinks. This patch enables the LP8788 backlight module.
(Brightness mode)
The brightness is controlled by PWM input or I2C register.
All modes are supported in the driver.
(Platform data)
Configurable data can be defined in the platform side.
name : backlight driver name. (default: "lcd-backlight")
initial_brightness : initial value of backlight brightness
bl_mode : brightness control by PWM or lp8788 register
dim_mode : dimming mode selection
full_scale : full scale current setting
rise_time : brightness ramp up step time
fall_time : brightness ramp down step time
pwm_pol : PWM polarity setting when bl_mode is PWM based
period_ns : platform specific PWM period value. unit is nanoseconds.
The default values are set in case no platform data is defined.
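Sketched as a platform data structure (field names follow the list above; the exact types in include/linux/mfd/lp8788.h are assumptions here):

	struct lp8788_backlight_platform_data {
		char *name;			/* default: "lcd-backlight" */
		int initial_brightness;
		enum lp8788_bl_ctrl_mode bl_mode;	/* PWM or register based */
		enum lp8788_bl_dim_mode dim_mode;
		enum lp8788_bl_full_scale_current full_scale;
		enum lp8788_bl_ramp_step rise_time;	/* ramp up step time */
		enum lp8788_bl_ramp_step fall_time;	/* ramp down step time */
		enum lp8788_bl_pwm_polarity pwm_pol;	/* valid when bl_mode is PWM based */
		unsigned int period_ns;			/* platform specific PWM period */
	};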
Signed-off-by: Milo(Woogyom) Kim <milo.kim@ti.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Samuel Ortiz <sameo@linux.intel.com> Cc: Thierry Reding <thierry.reding@avionic-design.de> Cc: "devendra.aaru" <devendra.aaru@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
'before_power' was used to check the previous status when resume() is
called. However, FB_BLANK_POWERDOWN was used in suspend() all the time,
so there is no need to check the previous status. Also, redundant return
variables are removed to reduce the code.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove an unnecessary NULL check, because the pointer was already checked
in ams369fg06_probe(). Also, unnecessary parentheses are removed in
ams369fg06_power_is_on().
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
'before_power' was used to check the previous status when resume() is
called. However, FB_BLANK_POWERDOWN was used in suspend() all the time,
so there is no need to check the previous status. Also, redundant return
variables are removed to reduce the code.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Wed, 20 Feb 2013 02:15:15 +0000 (13:15 +1100)]
backlight-add-lms501kf03-lcd-driver-fix
remove unused variable `before_power'
Cc: "devendra.aaru" <devendra.aaru@gmail.com> Cc: Ilho Lee <Ilho215.lee@samsung.com> Cc: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jean Delvare [Wed, 20 Feb 2013 02:15:14 +0000 (13:15 +1100)]
MAINTAINERS: Remove Mark M. Hoffman
Mark M. Hoffman stopped working on the Linux kernel several years ago, so
he should no longer be listed as a driver maintainer. I'm not even sure
if his e-mail address still works.
I can take over the 3 drivers he was responsible for; the 4th one will
fall back to the subsystem maintainer.
Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: "Mark M. Hoffman" <mhoffman@lightlink.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Wolfram Sang <w.sang@pengutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cody P Schafer [Wed, 20 Feb 2013 02:15:13 +0000 (13:15 +1100)]
MAINTAINERS: mm: add additional include files to listing
Add gfp.h, mmzone.h, memory_hotplug.h & vmalloc.h to the "MEMORY
MANAGEMENT" section so scripts/get_maintainer.pl can do a better job of
making recommendations.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stephen Warren [Wed, 20 Feb 2013 02:15:13 +0000 (13:15 +1100)]
get_maintainer: allow keywords to match filenames
Allow K: entries in MAINTAINERS to match directly against filenames;
either those extracted from patch +++ or --- lines, or those specified on
the command-line using the -f option.
This potentially allows fewer lines in a MAINTAINERS entry, if all the
relevant files are scattered throughout the whole kernel tree, yet contain
some common keyword. An example would be using an ARM SoC name as the
keyword to catch all related drivers.
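For example, a hypothetical entry of this shape would catch Tegra drivers wherever they live in the tree:

	TEGRA SUPPORT
	M:	Stephen Warren <swarren@nvidia.com>
	L:	linux-tegra@vger.kernel.org
	K:	(?i)tegra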
I don't think setting exact_pattern_match_hash would be appropriate here;
at least for intended Tegra use case, this feature is to ensure that all
Tegra-related driver changes get Cc'd to the Tegra mailing list. Setting
exact_pattern_match_hash would prevent git history parsing for e.g. S-o-b
tags, which still seems like it would be useful. Hence, this flag isn't
set.
Signed-off-by: Stephen Warren <swarren@nvidia.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
get_maintainer.pl: find maintainers for removed files
For removed files, get_maintainer.pl doesn't find any maintainers (besides
the default linux-kernel@vger.kernel.org), as it only looks at the "+++"
lines, which are "/dev/null" for removals. Fix this by extending the
parsing to the "---" lines.
E.g. for the two-line test patch below, the real SCORE maintainers will now
be found:
lib/vsprintf.c: add %pa format specifier for phys_addr_t types
Add the %pa format specifier for printing a phys_addr_t type and its
derivative types (such as resource_size_t), since the physical address
size on some platforms can vary based on build options, regardless of the
native integer type.
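Usage, sketched (note that %pa dereferences a pointer to the value):

	phys_addr_t base = 0x80000000;	/* example value */

	pr_info("region base: %pa\n", &base);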
Signed-off-by: Stepan Moskovchenko <stepanm@codeaurora.org> Cc: Rob Landley <rob@landley.net> Cc: George Spelvin <linux@horizon.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Andrei Emeltchenko <andrei.emeltchenko@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fan Du [Wed, 20 Feb 2013 02:15:11 +0000 (13:15 +1100)]
include/linux/fs.h: disable preempt when acquire i_size_seqcount write lock
Two rt tasks bind to one CPU core.
The higher priority rt task A preempts a lower priority rt task B which
has already taken the write seq lock, and then the higher priority rt
task A tries to acquire the read seq lock; it's doomed to lockup.

rt task B (lower priority): calls write       rt task A (higher priority): calls sync, preempting task B
i_size_write
  write_seqcount_begin(&inode->i_size_seqcount);
  inode->i_size = i_size;                     i_size_read
                                                read_seqcount_begin <-- lockup here...

So disabling preemption around every i_size_seqcount *write* lock
acquisition will cure the problem.
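A sketch of the fix, showing only the 32-bit SMP branch of i_size_write() (the exact placement of the preempt_disable()/preempt_enable() pair is assumed from the description above):

	static inline void i_size_write(struct inode *inode, loff_t i_size)
	{
	#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
		/* A writer that cannot be preempted can never strand a
		 * same-CPU reader on an odd sequence count. */
		preempt_disable();
		write_seqcount_begin(&inode->i_size_seqcount);
		inode->i_size = i_size;
		write_seqcount_end(&inode->i_size_seqcount);
		preempt_enable();
	#else
		inode->i_size = i_size;
	#endif
	}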
Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chuansheng Liu [Wed, 20 Feb 2013 02:15:11 +0000 (13:15 +1100)]
smp: Give WARN()ing when calling smp_call_function_many()/single() in serving irq
Currently the functions smp_call_function_many()/single() will give a
WARN()ing only in the case of irqs_disabled(), but that check is not
enough to guarantee execution of the SMP cross-calls.
In many other cases such as softirq handling/interrupt handling, the two
APIs still can not be called, just as the smp_call_function_many()
comments say:
* You must not call this function with disabled interrupts or from a
* hardware interrupt handler or from a bottom half handler. Preemption
* must be disabled when calling this function.
There is a real case for softirq DEADLOCK case:
     CPUA                                   CPUB
                                            spin_lock(&spinlock)
                                            Any irq coming, call the irq handler
                                            irq_exit()
spin_lock_irq(&spinlock)
  <== Blocking here with irqs disabled,
      since CPUB holds the lock
                                            __do_softirq()
                                              run_timer_softirq()
                                                timer_cb()
                                                  call smp_call_function_many()
                                                    send IPI interrupt to CPUA
                                                    wait_csd()
Then both CPUA and CPUB are deadlocked here: CPUB spins in wait_csd()
waiting for CPUA to service the IPI, while CPUA spins with interrupts
disabled waiting for the lock that CPUB holds.
So we should give a warning in nmi, hardirq or softirq context as well.
Moreover, add one new macro, in_serving_irq(), which indicates that we
are processing an nmi, hardirq or softirq.
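A plausible definition, in terms of the existing hardirq.h helpers (the exact form in the patch is assumed):

	#define in_serving_irq() (in_nmi() || in_irq() || in_serving_softirq())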
Signed-off-by: liu chuansheng <chuansheng.liu@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christian Kujau [Wed, 20 Feb 2013 02:15:10 +0000 (13:15 +1100)]
sun.com documentation fixes
After I came across a help text for SUNGEM mentioning a broken sun.com
URL, I felt like fixing those up, as they are now pointing to oracle.com
URLs.
Signed-off-by: Christian Kujau <lists@nerdbynature.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Wed, 20 Feb 2013 02:15:10 +0000 (13:15 +1100)]
smp: make smp_call_function_many() use logic similar to smp_call_function_single()
I'm testing a swapout workload on a two-socket Xeon machine. The workload
has 10 threads, each thread sequentially accessing a separate memory region.
TLB flush overhead is very big in this workload. For each page, page
reclaim needs to move it off the active lru list and then unmap it. Both need
a TLB flush. And this is a multithreaded workload, so TLB flushes happen on
10 CPUs. On x86, TLB flush uses the generic smp_call_function. So this
workload stresses smp_call_function_many heavily.
With the patch, swapout throughput is around 1400M/s, about a 7%
improvement, and total cpu utilization doesn't change.
Without the patch, cfd_data is shared by all CPUs.
generic_smp_call_function_interrupt does read/write cfd_data several times,
which creates a lot of cache ping-pong. With the patch, the data
becomes per-cpu and the ping-pong is avoided. And from the perf data, this
doesn't make the call_single_queue lock contended.
Next step is to remove generic_smp_call_function_interrupt() from arch
code.
Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Update the alpha arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.
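For reference, the vm_unmapped_area() calling convention these conversions use; the limits shown are illustrative:

	struct vm_unmapped_area_info info;
	unsigned long addr;

	info.flags = 0;				/* bottom-up search */
	info.length = len;
	info.low_limit = TASK_UNMAPPED_BASE;	/* illustrative limits */
	info.high_limit = TASK_SIZE;
	info.align_mask = 0;
	info.align_offset = 0;
	addr = vm_unmapped_area(&info);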
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jan Kara [Wed, 20 Feb 2013 02:15:09 +0000 (13:15 +1100)]
ubifs: wait for page writeback to provide stable pages
When stable pages are required, we have to wait if the page is just going
to disk and we want to modify it. Add proper callback to
ubifs_vm_page_mkwrite().
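A minimal sketch of the change, assuming the wait_for_stable_page() helper added elsewhere in this series (the exact call site within ubifs_vm_page_mkwrite() is assumed):

	/* in ubifs_vm_page_mkwrite(), with the page locked: */
	wait_for_stable_page(page);	/* no-op unless the bdi requires stable pages */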
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jan Kara [Wed, 20 Feb 2013 02:15:08 +0000 (13:15 +1100)]
ocfs2: wait for page writeback to provide stable pages
When stable pages are required, we have to wait if the page is just going
to disk and we want to modify it. Add proper callback to
ocfs2_grab_pages_for_write().
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Darrick J. Wong [Wed, 20 Feb 2013 02:15:08 +0000 (13:15 +1100)]
block: optionally snapshot page contents to provide stable pages during write
This provides a band-aid to provide stable page writes on jbd without
needing to backport the fixed locking and page writeback bit handling
schemes of jbd2. The band-aid works by using bounce buffers to snapshot
page contents instead of waiting.
For those wondering about the ext3 bandage -- fixing the jbd locking
(which was done as part of ext4dev years ago) is a lot of surgery, and
setting PG_writeback on data pages when we actually hold the page lock
dropped ext3 performance by nearly an order of magnitude. If we're going
to migrate iscsi and raid to use stable page writes, the complaints about
high latency will likely return. We might as well centralize this page
snapshotting in one place.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Darrick J. Wong [Wed, 20 Feb 2013 02:15:08 +0000 (13:15 +1100)]
9pfs: fix filesystem to wait for stable page writeback
Fix up the ->page_mkwrite handler to provide stable page writes if necessary.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Darrick J. Wong [Wed, 20 Feb 2013 02:15:07 +0000 (13:15 +1100)]
mm: only enforce stable page writes if the backing device requires it
Create a helper function to check if a backing device requires stable page
writes and, if so, performs the necessary wait. Then, make it so that all
points in the memory manager that handle making pages writable use the
helper function. This should provide stable page write support to most
filesystems, while eliminating unnecessary waiting for devices that don't
require the feature.
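A sketch of what the helper can look like, assuming the bdi_cap_stable_pages_required() test introduced by this series:

	void wait_for_stable_page(struct page *page)
	{
		struct backing_dev_info *bdi =
				page_mapping(page)->backing_dev_info;

		/* Only wait when the backing device declares that pages
		 * must stay immutable during writeout. */
		if (!bdi_cap_stable_pages_required(bdi))
			return;

		wait_on_page_writeback(page);
	}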
Before this patchset, all filesystems would block, regardless of whether
or not it was necessary. ext3 would wait, but still generate occasional
checksum errors. The network filesystems were left to do their own thing,
so they'd wait too.
After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it. ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait. Network filesystems haven't been touched, so either they
provide their own stable page guarantees or they don't block at all. The
blocking behavior is back to what it was before 3.0 if you don't have a
disk requiring stable page writes.
Here's the result of using dbench to test latency on ext2:
Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms
As you can see, the maximum write latency drops considerably with this
patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
similarly, but see the cover letter for those results.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Darrick J. Wong [Wed, 20 Feb 2013 02:15:07 +0000 (13:15 +1100)]
bdi: allow block devices to say that they require stable page writes
This patchset ("stable page writes, part 2") makes some key modifications
to the original 'stable page writes' patchset. First, it provides
creators (devices and filesystems) of a backing_dev_info a flag that
declares whether or not it is necessary to ensure that page contents
cannot change during writeout. It is no longer assumed that this is true
of all devices (which was never true anyway). Second, the flag is used to
relax the wait_on_page_writeback calls so that waiting only occurs if the
device needs it. Third, it fixes up the remaining disk-backed filesystems
to use this improved conditional-wait logic to provide stable page writes
on those filesystems.
It is hoped that (for people not using checksumming devices, anyway) this
patchset will win back the performance that was lost to unnecessary waiting
when the original stable page write patchset went into 3.0. Sorry about not
fixing it sooner.
Complaints were registered by several people about the long write
latencies introduced by the original stable page write patchset.
Generally speaking, the kernel ought to allocate as little extra memory as
possible to facilitate writeout, but for people who simply cannot wait, a
second page stability strategy is (re)introduced: snapshotting page
contents. The waiting behavior is still the default strategy; to enable
page snapshotting, a superblock flag (MS_SNAP_STABLE) must be set. This
flag is used to bandaid^Henable stable page writeback on ext3[1], and is
not used anywhere else.
Given that there are already a few storage devices and network FSes that
have rolled their own page stability wait/page snapshot code, it would be
nice to move towards consolidating all of these. It seems possible that
iscsi and raid5 may wish to use the new stable page write support to
enable zero-copy writeout.
Thank you to Jan Kara for helping fix a couple more filesystems.
Per Andrew Morton's request, here are the result of using dbench to measure
latencies on ext2:
Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms
As you can see, for ext2 the maximum write latency decreases from ~60ms on a
laptop hard disk to ~4ms. I'm not sure why the flush latencies increase,
though I suspect that being able to dirty pages faster gives the flusher more
work to do.
On ext4, the average write latency decreases as well as all the maximum
latencies:
Throughput 34.0795 MB/sec 4 clients 4 procs max_latency=219.044 ms
Before this patchset, all filesystems would block, regardless of whether
or not it was necessary. ext3 would wait, but still generate occasional
checksum errors. The network filesystems were left to do their own thing,
so they'd wait too.
After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it. ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait. Network filesystems haven't been touched, so either they
provide their own wait code, or they don't block at all. The blocking
behavior is back to what it was before 3.0 if you don't have a disk
requiring stable page writes.
This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and xfs.
I've spot-checked 3.8.0-rc4 and seem to be getting the same results as -rc3.
[1] The alternative fixes to ext3 include fixing the locking order and page bit
handling like we did for ext4 (but then why not just use ext4?), or setting
PG_writeback so early that ext3 becomes extremely slow. I tried that, but the
number of write()s I could initiate dropped by nearly an order of magnitude.
That was a bit much even for the author of the stable page series! :)
This patch:
Creates a per-backing-device flag that tracks whether or not pages must be
held immutable during writeout. Eventually it will be used to waive
wait_for_page_writeback() if nothing requires stable pages.
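A sketch of the per-bdi capability this describes (the flag name matches the series; the bit value here is illustrative):

	#define BDI_CAP_STABLE_WRITES	0x00000800	/* bit value illustrative */

	static inline int bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
	{
		return bdi->capabilities & BDI_CAP_STABLE_WRITES;
	}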
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Update the frv arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Wed, 20 Feb 2013 02:15:06 +0000 (13:15 +1100)]
memcg: debugging facility to access dangling memcgs
If memcg is tracking anything other than plain user memory (swap, tcp buf
mem, or slab memory), it is possible - and normal - that a reference will
be held by the group after it is dead. Still, for developers, it would be
extremely useful to be able to query about those states during debugging.
This patch provides a debugging facility in the root memcg, so we can
inspect which memcgs still have pending objects, and what is the cause of
this state.
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xi Wang [Wed, 20 Feb 2013 02:15:05 +0000 (13:15 +1100)]
mm/dmapool.c: fix null dev in dma_pool_create()
A few drivers invoke dma_pool_create() with a null dev. Note that dev is
dereferenced in dev_to_node(dev), causing a null pointer dereference.
A long term solution is to disallow null dev. Once the drivers are fixed,
we can simplify the core code here. For now we add WARN_ON(!dev) to
notify the driver maintainers and avoid the null pointer dereference.
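One plausible shape of the check at the top of dma_pool_create() (whether it bails out or falls back to a default node is an assumption):

	if (WARN_ON(!dev))	/* taint the log so the driver gets fixed */
		return NULL;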
Signed-off-by: Xi Wang <xi.wang@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xi Wang [Wed, 20 Feb 2013 02:15:05 +0000 (13:15 +1100)]
drivers/usb/gadget/amd5536udc.c: avoid calling dma_pool_create() with NULL dev
Calling dma_pool_create() with dev==NULL will oops on a NUMA machine.
Rather than changing dma_pool_create() we wish to disallow passing
dev==NULL. This requires fixing up the small number of drivers which are
passing in dev==NULL.
Use &dev->pdev->dev instead of NULL.
Signed-off-by: Xi Wang <xi.wang@gmail.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 20 Feb 2013 02:15:05 +0000 (13:15 +1100)]
drop_caches: add some documentation and info message
I would like to resurrect Dave's patch. The last time it was posted was
here https://lkml.org/lkml/2010/9/16/250 and there didn't seem to be any
strong opposition.
Kosaki was worried about possible excessive logging when somebody drops
caches too often (but then he claimed he didn't have a strong opinion on
that), but I would say the opposite. If somebody does that then I would
really like to know it from the log when supporting a system, because it
almost certainly means that there is something fishy going on. It is also
worth mentioning that only root can write to drop_caches, so this is not a
flooding attack vector.
I am bringing that up again because this can be really helpful when
chasing strange performance issues which (surprise surprise) turn out to
be related to artificially dropped caches done because the admin thinks
this would help...
I have just refreshed the original patch on top of the current mm tree
but I could live with KERN_INFO as well if people think that KERN_NOTICE
is too hysterical.
: From: Dave Hansen <dave@linux.vnet.ibm.com>
: Date: Fri, 12 Oct 2012 14:30:54 +0200
:
: There is plenty of anecdotal evidence and a load of blog posts
: suggesting that using "drop_caches" periodically keeps your system
: running in "tip top shape". Perhaps adding some kernel
: documentation will increase the amount of accurate data on its use.
:
: If we are not shrinking caches effectively, then we have real bugs.
: Using drop_caches will simply mask the bugs and make them harder
: to find, but certainly does not fix them, nor is it an appropriate
: "workaround" to limit the size of the caches.
:
: It's a great debugging tool, and is really handy for doing things
: like repeatable benchmark runs. So, add a bit more documentation
: about it, and add a little KERN_NOTICE. It should help developers
: who are chasing down reclaim-related bugs.
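For reference, the interface being documented:

	echo 1 > /proc/sys/vm/drop_caches	# free page cache
	echo 2 > /proc/sys/vm/drop_caches	# free reclaimable slab (dentries, inodes)
	echo 3 > /proc/sys/vm/drop_caches	# free both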
[mhocko@suse.cz: refreshed to current -mm tree]
[akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 20 Feb 2013 02:15:05 +0000 (13:15 +1100)]
mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
Rob van der Heij reported the following (paraphrased) on private mail.
The scenario is that I want to avoid backups to fill up the page
cache and purge stuff that is more likely to be used again (this is
with s390x Linux on z/VM, so I don't give it as much memory that
we don't care anymore). So I have something with LD_PRELOAD that
intercepts the close() call (from tar, in this case) and issues
a posix_fadvise() just before closing the file.
This mostly works, except for small files (less than 14 pages)
that remain in page cache after the fact.
Unfortunately Rob has not had a chance to test this exact patch but the
test program below should be reproducing the problem he described.
The issue is the per-cpu pagevecs for LRU additions. If the pages are added
by one CPU but fadvise() is called on another then the pages remain resident
as the invalidate_mapping_pages() only drains the local pagevecs via its
call to pagevec_release(). The user-visible effect is that a program that
uses fadvise() properly is not obeyed.
A possible fix for this is to put the necessary smarts into
invalidate_mapping_pages() to globally drain the LRU pagevecs if a pagevec
page could not be discarded. The downside of this is that an inode cache
shrink would send a global IPI, and memory pressure potentially causing
global IPI storms is very undesirable.
Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
check if invalidate_mapping_pages() discarded all the requested pages. If a
subset of pages are discarded it drains the LRU pagevecs and tries again. If
the second attempt fails, it assumes it is due to the pages being mapped,
locked or dirty and does not care. With this patch, an application using
fadvise() correctly will be obeyed but there is a downside that a malicious
application can force the kernel to send global IPIs and increase overhead.
If accepted, I would like this to be considered as a -stable candidate.
It's not an urgent issue but it's a system call that is not working as
advertised, which is weak.
The following test program demonstrates the problem. It should never
report that pages are still resident but will without this patch. It
assumes that CPUs 0 and 1 exist.
#define _GNU_SOURCE	/* for sched_setaffinity() and CPU_SET() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Size of the test file in pages. The original value was not preserved
 * in this listing; 8 is assumed, matching the "less than 14 pages" case
 * described above. */
#define FILESIZE_PAGES 8

int main() {
	int fd;
	int pagesize = getpagesize();
	ssize_t written = 0, expected = FILESIZE_PAGES * pagesize;
	char *buf;
	unsigned char *vec;
	int resident, i;
	cpu_set_t set;

	/* Prepare the mincore vec */
	vec = malloc(FILESIZE_PAGES);
	if (vec == NULL) {
		printf("ENOMEM\n");
		exit(EXIT_FAILURE);
	}

	/* Bind ourselves to CPU 0 */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* open file, unlink and write buffer */
	fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR, 0644);
	if (fd == -1) {
		perror("open");
		exit(EXIT_FAILURE);
	}
	unlink("fadvise-test-file");

	/* The buffer setup was missing from the listing; a zero-filled
	 * buffer of the expected size is assumed. */
	buf = calloc(1, expected);
	if (buf == NULL) {
		printf("ENOMEM\n");
		exit(EXIT_FAILURE);
	}

	while (written < expected) {
		ssize_t this_write;
		this_write = write(fd, buf + written, expected - written);

		if (this_write == -1) {
			perror("write");
			exit(EXIT_FAILURE);
		}

		written += this_write;
	}
	free(buf);

	/*
	 * Force ourselves to another CPU. If fadvise only flushes the local
	 * CPUs pagevecs then the fadvise will fail to discard all file pages
	 */
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* sync and fadvise to discard the page cache */
	fsync(fd);
	if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) == -1) {
		perror("posix_fadvise");
		exit(EXIT_FAILURE);
	}

	/* map the file and use mincore to see which parts of it are resident */
	buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		exit(EXIT_FAILURE);
	}
	if (mincore(buf, expected, vec) == -1) {
		perror("mincore");
		exit(EXIT_FAILURE);
	}

	/* Check residency */
	for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
		if (vec[i])
			resident++;
	}
	if (resident != 0) {
		printf("Nr unexpected pages resident: %d\n", resident);
		exit(EXIT_FAILURE);
	}

	return 0;
}
Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Rob van der Heij <rvdheij@gmail.com> Tested-by: Rob van der Heij <rvdheij@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cliff Wickman [Wed, 20 Feb 2013 02:15:04 +0000 (13:15 +1100)]
mm: export mmu notifier invalidates
We at SGI have a need to address some very high physical address ranges
with our GRU (global reference unit), sometimes across partitioned machine
boundaries and sometimes with larger addresses than the cpu supports. We
do this with the aid of our own 'extended vma' module which mimics the
vma. When something (either unmap or exit) frees an 'extended vma' we use
the mmu notifiers to clean them up.
We had been able to mimic the functions
__mmu_notifier_invalidate_range_start() and
__mmu_notifier_invalidate_range_end() by locking the per-mm lock and
walking the per-mm notifier list. But with the change to a global srcu
lock (static in mmu_notifier.c) we can no longer do that. Our module has
no access to that lock.
So we request that these two functions be exported.
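The change itself is presumably just the two export lines (whether EXPORT_SYMBOL_GPL or plain EXPORT_SYMBOL is used is an assumption):

	EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
	EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);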
Signed-off-by: Cliff Wickman <cpw@sgi.com> Acked-by: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
at a time. When munlocking THP pages (or the huge zero page), this
resulted in taking the mm->page_table_lock 512 times in a row.
We can do better by making use of the page_mask returned by
follow_page_mask (for the huge zero page case), or the size of the page
munlock_vma_page() operated on (for the true THP page case).
Note - I am sending this as RFC only for now as I can't currently put my
finger on what, if anything, prevents split_huge_page() from operating
concurrently on the same page as munlock_vma_page(), which would mess up
our NR_MLOCK statistics. Is this a latent bug or is there a subtle point
I missed here?
Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: accelerate mm_populate() treatment of THP pages
This change adds a follow_page_mask function which is equivalent to
follow_page, but with an extra page_mask argument.
follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters a
THP page, and to 0 in other cases.
__get_user_pages() makes use of this in order to accelerate populating THP
ranges - that is, when both the pages and vmas arrays are NULL, we don't
need to iterate HPAGE_PMD_NR times to cover a single THP page (and we also
avoid taking mm->page_table_lock that many times).
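A sketch of how follow_page() can stay as a thin wrapper for existing callers (assumed shape):

	static inline struct page *follow_page(struct vm_area_struct *vma,
			unsigned long address, unsigned int foll_flags)
	{
		unsigned long page_mask;	/* discarded by legacy callers */

		return follow_page_mask(vma, address, foll_flags, &page_mask);
	}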
Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhang Yanfei [Wed, 20 Feb 2013 02:15:03 +0000 (13:15 +1100)]
mm: accurately document nr_free_*_pages functions with code comments
nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages()
are horribly badly named, so accurately document them with code comments
to guard against their misuse.
Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Wed, 20 Feb 2013 02:15:02 +0000 (13:15 +1100)]
HWPOISON: change order of error_states[]'s elements
error_states[] has two separate states, "unevictable LRU page" and "mlocked
LRU page", and the former one has the higher priority now. Because of
that, the latter one is rarely chosen, since pages with PageMlocked are
highly likely to have PG_unevictable set as well. On the other hand,
PG_unevictable without PageMlocked is common for ramfs or SHM_LOCKed shared
memory, so reversing the priority of these two states helps us clearly
distinguish them.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Wed, 20 Feb 2013 02:15:02 +0000 (13:15 +1100)]
HWPOISON: fix misjudgement of page_action() for errors on mlocked pages
memory_failure() can't handle memory errors on mlocked pages correctly,
because page_action() judges such errors as ones on "unknown pages"
instead of ones on "unevictable LRU page" or "mlocked LRU page". In order
to determine the page_state, page_action() checks the page flags at the time
of the judgement, but those page flags are not the same as the ones just
after memory_failure() is called, because memory_failure() unmaps the error
pages before doing page_action(). This unmapping changes the page state;
in particular, page_remove_rmap() (called from try_to_unmap_one()) clears
PG_mlocked, so page_action() can't catch mlocked pages after that.
With this patch, we store the page flags of the error page before doing the
unmap, and (only) if the first check, using the page flags at that time,
decided that the error page is unknown, we do a second check with the stored
page flags. This implementation doesn't change error handling for the page
types for which the first check can determine the page state correctly.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:15:01 +0000 (13:15 +1100)]
memcg: stop warning on memcg_propagate_kmem
Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
"mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhang Yanfei [Wed, 20 Feb 2013 02:15:01 +0000 (13:15 +1100)]
net: change type of virtio_chan->p9_max_pages
This member of struct virtio_chan is calculated from nr_free_buffer_pages,
so change its type to unsigned long to avoid overflow.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: David Miller <davem@davemloft.net> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhang Yanfei [Wed, 20 Feb 2013 02:14:59 +0000 (13:14 +1100)]
mm: fix return type for functions nr_free_*_pages
Currently, the amount of RAM that the nr_free_*_pages functions return
is held in an unsigned int. But on machines with big memory (exceeding
16TB), the amount may be incorrect because of overflow, so fix it.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: David Miller <davem@davemloft.net> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 20 Feb 2013 02:14:59 +0000 (13:14 +1100)]
memcg: cleanup mem_cgroup_init comment
We should encourage all memcg controller initialization that is independent
of a specific mem_cgroup to be done here, rather than exploiting the
css_alloc callback and assuming that nothing happens before the root cgroup
is created.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 20 Feb 2013 02:14:59 +0000 (13:14 +1100)]
memcg: move memcg_stock initialization to mem_cgroup_init
memcg_stock is currently initialized during the root cgroup allocation,
which is OK, but it pointlessly pollutes the memcg allocation code with
something that can be done when the memcg subsystem is initialized by
mem_cgroup_init along with other controller specific parts.
This patch wraps the current memcg_stock initialization code into a helper
and calls it from the controller subsystem initialization code.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 20 Feb 2013 02:14:58 +0000 (13:14 +1100)]
memcg: move mem_cgroup_soft_limit_tree_init to mem_cgroup_init
Per-node-zone soft limit tree is currently initialized when the root
cgroup is created which is OK but it pointlessly pollutes memcg allocation
code with something that can be called when the memcg subsystem is
initialized by mem_cgroup_init along with other controller specific parts.
While we are at it, let's make mem_cgroup_soft_limit_tree_init void, because
it doesn't make much sense to report a memory allocation failure: if we fail
to allocate memory that early during boot then we are screwed anyway
(this saves some code).
Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Wed, 20 Feb 2013 02:14:58 +0000 (13:14 +1100)]
mm: use up free swap space before reaching OOM kill
Recently, Luigi reported there are lots of free swap space when OOM
happens. It's easily reproduced on zram-over-swap, where many instance of
memory hogs are running and laptop_mode is enabled. He said there was no
problem when he disabled laptop_mode. What I found while investigating the
problem is as follows.
Assumption for easy explanation: There are no page cache page in system
because they all are already reclaimed.
1. try_to_free_pages disables may_writepage when laptop_mode is enabled.
2. shrink_inactive_list isolates victim pages from the inactive anon lru list.
3. shrink_page_list adds them to the swapcache via add_to_swap, but doesn't
pageout because sc->may_writepage is 0, so the pages are rotated back onto
the inactive anon lru list. add_to_swap made the pages dirty via SetPageDirty.
4. Step 3 couldn't reclaim any pages, so do_try_to_free_pages increases the
priority and retries reclaim at the higher priority.
5. shrink_inactive_list tries to isolate victim pages from the inactive anon
lru list, but fails because it isolates pages in ISOLATE_CLEAN mode and the
inactive anon lru list is full of dirty pages from step 3, so it just returns
without any reclaim progress.
6. do_try_to_free_pages doesn't set may_writepage due to zero total_scanned:
sc->nr_scanned is increased by shrink_page_list, but we don't call
shrink_page_list in step 5 due to the shortage of isolated pages.
Above loop is continued until OOM happens.
The problem didn't happen before [1] was merged, because the old logic's
isolation in shrink_inactive_list succeeded, and shrink_page_list was called
to page them out; that still failed because of may_writepage, but the
important point is that sc->nr_scanned was increased even though we couldn't
swap them out, so do_try_to_free_pages could set may_writepage.
Since f80c067 ("mm: zone_reclaim: make isolate_lru_page() filter-aware")
was introduced, it's no longer a good idea to depend only on the number of
scanned pages for setting may_writepage. So this patch adds a new trigger
point for setting may_writepage, below DEF_PRIORITY - 2, which is where the
VM signals significant memory pressure; that is a good fit for our purpose,
where it is better to lose power saving (or put up with some disk clatter)
than to OOM kill.
Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Luigi Semenzato <semenzato@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Robin Holt [Wed, 20 Feb 2013 02:14:57 +0000 (13:14 +1100)]
mmu_notifier_unregister NULL Pointer deref and multiple ->release() callouts
There is a race condition between mmu_notifier_unregister() and
__mmu_notifier_release().
Assume two tasks, one calling mmu_notifier_unregister() as a result of a
filp_close() ->flush() callout (task A), and the other calling
mmu_notifier_release() from an mmput() (task B).
     A (mmu_notifier_unregister)            B (__mmu_notifier_release)
t1                                          srcu_read_lock()
t2   if (!hlist_unhashed())
t3                                          srcu_read_unlock()
t4   srcu_read_lock()
t5                                          hlist_del_init_rcu()
t6                                          synchronize_srcu()
t7   srcu_read_unlock()
t8   hlist_del_rcu()  <--- NULL pointer deref.
Additionally, the list traversal in __mmu_notifier_release() is not
protected by the mmu_notifier_mm->hlist_lock, which can result in
callouts to the ->release() notifier from both mmu_notifier_unregister()
and __mmu_notifier_release().
-stable suggestions:
The stable trees prior to 3.7.y need commits 21a9273 and 7040030
cherry-picked in that order prior to cherry-picking this commit. The
3.7.y tree already has those two commits.
Signed-off-by: Robin Holt <holt@sgi.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Sagi Grimberg <sagig@mellanox.co.il> Cc: Haggai Eran <haggaie@mellanox.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cody P Schafer [Wed, 20 Feb 2013 02:14:56 +0000 (13:14 +1100)]
mm: add helper ensure_zone_is_initialized()
ensure_zone_is_initialized() checks if a zone is in an empty & not
initialized state (typically occurring after it is created by memory
hotplug), and, if so, calls init_currently_empty_zone() to initialize
the zone.
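A sketch consistent with that description (the zone_is_initialized() test and the return convention are assumptions):

	static int __meminit ensure_zone_is_initialized(struct zone *zone,
			unsigned long start_pfn, unsigned long num_pages)
	{
		if (!zone_is_initialized(zone))
			return init_currently_empty_zone(zone, start_pfn,
						num_pages, MEMMAP_HOTPLUG);
		return 0;
	}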
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cody P Schafer [Wed, 20 Feb 2013 02:14:55 +0000 (13:14 +1100)]
mm/page_alloc: add informative debugging message in page_outside_zone_boundaries()
Add a debug message which prints when a page is found outside of the
boundaries of the zone it should belong to. Format is:
"page $pfn outside zone [ $start_pfn - $end_pfn ]"
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cody P Schafer [Wed, 20 Feb 2013 02:14:54 +0000 (13:14 +1100)]
mm: add & use zone_end_pfn() and zone_spans_pfn()
Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
duplication.
This also switches to using them in compaction (where an additional
variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
kmemleak.
Note that in compaction.c I avoid calling zone_end_pfn() repeatedly because I
expect that at some point the synchronization issues with start_pfn &
spanned_pages will need fixing, either by actually using the seqlock or by
clever memory barrier usage.
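The helpers themselves are small (a sketch matching their description):

	static inline unsigned long zone_end_pfn(const struct zone *zone)
	{
		return zone->zone_start_pfn + zone->spanned_pages;
	}

	static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
	{
		return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
	}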
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 20 Feb 2013 02:14:53 +0000 (13:14 +1100)]
mm: refactor inactive_file_is_low() to use get_lru_size()
An inactive file list is considered low when its active counterpart is
bigger, regardless of whether it is a global zone LRU list or a memcg zone
LRU list. The only difference is in how the LRU size is assessed.
get_lru_size() does the right thing for both global and memcg reclaim
situations.
Get rid of inactive_file_is_low_global() and
mem_cgroup_inactive_file_is_low() by using get_lru_size() and compare the
numbers in common code.
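The resulting common code can be as small as this sketch:

	static int inactive_file_is_low(struct lruvec *lruvec)
	{
		unsigned long inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
		unsigned long active = get_lru_size(lruvec, LRU_ACTIVE_FILE);

		return active > inactive;
	}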
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:14:52 +0000 (13:14 +1100)]
ksm: stop hotremove lockdep warning
Complaints are rare, but lockdep still does not understand the way
ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds
it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a
problem because notifier callbacks are made under down_read of
blocking_notifier_head->rwsem (so first the mutex is taken while holding
the rwsem, then later the rwsem is taken while still holding the mutex);
but is not in fact a problem because mem_hotplug_mutex is held throughout
the dance.
There was an attempt to fix this with mutex_lock_nested(); but if that
happened to fool lockdep two years ago, apparently it does so no longer.
I had hoped to eradicate this issue in extending KSM page migration not to
need the ksm_thread_mutex. But then realized that although the page
migration itself is safe, we do still need to lock out ksmd and other
users of get_ksm_page() while offlining memory - at some point between
MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may vanish,
and get_ksm_page()'s accesses to them become a violation.
So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to
MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining()
checks, to achieve the same lockout without being caught by lockdep. This
is less elegant for KSM, but it's more important to keep lockdep useful to
other users - and I apologize for how long it took to fix.
Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:14:52 +0000 (13:14 +1100)]
mm: remove offlining arg to migrate_pages
No functional change, but the only purpose of the offlining argument to
migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
KSM page for memory hotremove (which took ksm_thread_mutex) but not for
other callers. Now all cases are safe, remove the arg.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>