Michal Hocko [Wed, 20 Feb 2013 02:14:04 +0000 (13:14 +1100)]
memcg,vmscan: do not break out targeted reclaim without reclaimed pages
Targeted (hard resp. soft) reclaim has traditionally tried to scan one
group with decreasing priority until nr_to_reclaim (SWAP_CLUSTER_MAX
pages) is reclaimed or all priorities are exhausted. The reclaim is then
retried until the limit is met.
This approach, however, doesn't work well with deeper hierarchies where
groups higher in the hierarchy do not have any or only very few pages
(this usually happens if those groups do not have any tasks and they have
only re-parented pages after some of their children is removed). Those
groups are reclaimed with decreasing priority pointlessly as there is
nothing to reclaim from them.
An easiest fix is to break out of the memcg iteration loop in shrink_zone
only if the whole hierarchy has been visited or sufficient pages have been
reclaimed. This is also more natural because the reclaimer expects that
the hierarchy under the given root is reclaimed. As a result we can
simplify the soft limit reclaim which does its own iteration.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Ying Han <yinghan@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sasha Levin [Wed, 20 Feb 2013 02:14:03 +0000 (13:14 +1100)]
mm/huge_memory.c: use new hashtable implementation
Switch hugemem to use the new hashtable implementation. This reduces the
amount of generic unrelated code in the hugemem.
This also removes the dymanic allocation of the hash table. The upside is
that we save a pointer dereference when accessing the hashtable, but we
lose 8KB if CONFIG_TRANSPARENT_HUGEPAGE is enabled but the processor
doesn't support hugepages.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 20 Feb 2013 02:14:03 +0000 (13:14 +1100)]
mm: compaction: do not accidentally skip pageblocks in the migrate scanner
Compaction uses the ALIGN macro incorrectly with the migrate scanner by
adding pageblock_nr_pages to a PFN. It happened to work when initially
implemented as the starting PFN was also aligned but with caching restarts
and isolating in smaller chunks this is no longer always true.
The impact is that the migrate scanner scans outside its current
pageblock. As pfn_valid() is still checked properly it does not cause any
failure and the impact of the bug is that in some cases it will scan more
than necessary when it crosses a page boundary but by no more than
COMPACT_CLUSTER_MAX. It is highly unlikely this is even measurable but
it's still wrong so this patch addresses the problem.
Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 20 Feb 2013 02:14:02 +0000 (13:14 +1100)]
mm: reduce rmap overhead for ex-KSM page copies created on swap faults
When ex-KSM pages are faulted from swap cache, the fault handler is not
capable of re-establishing anon_vma-spanning KSM pages. In this case, a
copy of the page is created instead, just like during a COW break.
These freshly made copies are known to be exclusive to the faulting VMA
and there is no reason to go look for this page in parent and sibling
processes during rmap operations.
Use page_add_new_anon_rmap() for these copies. This also puts them on the
proper LRU lists and marks them SwapBacked, so we can get rid of doing
this ad-hoc in the KSM copy code.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Satoru Moriya <satoru.moriya@hds.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 20 Feb 2013 02:14:01 +0000 (13:14 +1100)]
mm: vmscan: compaction works against zones, not lruvecs
The restart logic for when reclaim operates back to back with compaction
is currently applied on the lruvec level. But this does not make sense,
because the container of interest for compaction is a zone as a whole, not
the zone pages that are part of a certain memory cgroup.
Negative impact is bounded. For one, the code checks that the lruvec has
enough reclaim candidates, so it does not risk getting stuck on a
condition that can not be fulfilled. And the unfairness of hammering on
one particular memory cgroup to make progress in a zone will be amortized
by the round robin manner in which reclaim goes through the memory
cgroups. Still, this can lead to unnecessary allocation latencies when
the code elects to restart on a hard to reclaim or small group when there
are other, more reclaimable groups in the zone.
Move this logic to the zone level and restart reclaim for all memory
cgroups in a zone when compaction requires more free pages from it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Satoru Moriya <satoru.moriya@hds.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 20 Feb 2013 02:14:00 +0000 (13:14 +1100)]
mm: vmscan: clean up get_scan_count()
Reclaim pressure balance between anon and file pages is calculated through
a tuple of numerators and a shared denominator.
Exceptional cases that want to force-scan anon or file pages configure the
numerators and denominator such that one list is preferred, which is not
necessarily the most obvious way:
Johannes Weiner [Wed, 20 Feb 2013 02:14:00 +0000 (13:14 +1100)]
mm: vmscan: clarify how swappiness, highest priority, memcg interact
A swappiness of 0 has a slightly different meaning for global reclaim (may
swap if file cache really low) and memory cgroup reclaim (never swap,
ever).
In addition, global reclaim at highest priority will scan all LRU lists
equal to their size and ignore other balancing heuristics. UNLESS
swappiness forbids swapping, then the lists are balanced based on recent
reclaim effectiveness. UNLESS file cache is running low, then anonymous
pages are force-scanned.
This (total mess of a) behaviour is implicit and not obvious from the way
the code is organized. At least make it apparent in the code flow and
document the conditions. It will be it easier to come up with sane
semantics later.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Satoru Moriya <satoru.moriya@hds.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 20 Feb 2013 02:13:59 +0000 (13:13 +1100)]
mm: vmscan: save work scanning (almost) empty LRU lists
In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
amount of pages is scanned from the LRU lists on each iteration, to make
progress.
Do not make this minimum bigger than the respective LRU list size,
however, and save some busy work trying to isolate and reclaim pages that
are not there.
Empty LRU lists are quite common with memory cgroups in NUMA environments
because there exists a set of LRU lists for each zone for each memory
cgroup, while the memory of a single cgroup is expected to stay on just
one node. The number of expected empty LRU lists is thus
memcgs * (nodes - 1) * lru types
Each attempt to reclaim from an empty LRU list does expensive size
comparisons between lists, acquires the zone's lru lock etc. Avoid that.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Satoru Moriya <satoru.moriya@hds.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 20 Feb 2013 02:13:59 +0000 (13:13 +1100)]
mm: memcg: only evict file pages when we have plenty
e986850 ("mm, vmscan: only evict file pages when we have plenty") makes a
point of not going for anonymous memory while there is still enough
inactive cache around.
The check was added only for global reclaim, but it is just as useful to
reduce swapping in memory cgroup reclaim:
200M-memcg-defconfig-j2
vanilla patched
Real time 454.06 ( +0.00%) 453.71 ( -0.08%)
User time 668.57 ( +0.00%) 668.73 ( +0.02%)
System time 128.92 ( +0.00%) 129.53 ( +0.46%)
Swap in 1246.80 ( +0.00%) 814.40 ( -34.65%)
Swap out 1198.90 ( +0.00%) 827.00 ( -30.99%)
Pages allocated 16431288.10 ( +0.00%) 16434035.30 ( +0.02%)
Major faults 681.50 ( +0.00%) 593.70 ( -12.86%)
THP faults 237.20 ( +0.00%) 242.40 ( +2.18%)
THP collapse 241.20 ( +0.00%) 248.50 ( +3.01%)
THP splits 157.30 ( +0.00%) 161.40 ( +2.59%)
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Satoru Moriya <satoru.moriya@hds.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sha Zhengju [Wed, 20 Feb 2013 02:13:57 +0000 (13:13 +1100)]
memcg, oom: provide more precise dump info while memcg oom happening
Currentlt when a memcg oom is happening the oom dump messages is still
global state and provides few useful info for users. This patch prints
more pointed memcg page statistics for memcg-oom and take hierarchy into
consideration:
Based on Michal's advice, we take hierarchy into consideration:
supppose we trigger an OOM on A's limit
root_memcg
|
A (use_hierachy=1)
/ \
B C
|
D
then the printed info will be:
Memory cgroup stats for /A:...
Memory cgroup stats for /A/B:...
Memory cgroup stats for /A/C:...
Memory cgroup stats for /A/B/D:...
Sasha Levin [Wed, 20 Feb 2013 02:13:57 +0000 (13:13 +1100)]
watchdog: trigger all-cpu backtrace when locked up and going to panic
Send an NMI to all CPUs when a lockup is detected and the lockup watchdog
code is configured to panic. This gives us a fairly uptodate snapshot of
all CPUs in the system.
This lets us get stack trace of all CPUs which makes life easier trying to
debug a deadlock, and the NMI doesn't change anything since the next step
is a kernel panic.
Jan Kara [Wed, 20 Feb 2013 02:13:57 +0000 (13:13 +1100)]
ocfs2: add freeze protection to ocfs2_file_splice_write()
ocfs2_file_splice_write() was missed when adding freeze protection to all
write paths. Fix that.
Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Nikola Ciprich <nikola.ciprich@linuxbox.cz> Cc: Marco Stornelli <marco.stornelli@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jan Kara [Wed, 20 Feb 2013 02:13:56 +0000 (13:13 +1100)]
fs: fix hang with BSD accounting on frozen filesystem
When BSD process accounting is enabled and logs information to a
filesystem which gets frozen, system easily becomes unusable because each
attempt to account process information blocks. Thus e.g. every task gets
blocked in exit.
It seems better to drop accounting information (which can already happen
when filesystem is running out of space) instead of locking system up. So
we open the accounting file with O_NONBLOCK.
Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz> Tested-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Marco Stornelli <marco.stornelli@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jan Kara [Wed, 20 Feb 2013 02:13:56 +0000 (13:13 +1100)]
fs: return EAGAIN when O_NONBLOCK write should block on frozen fs
When user asks for O_NONBLOCK behavior for a file descriptor, return
EAGAIN instead of blocking on a frozen filesystem.
This is needed so we can fix a hang with BSD accounting on frozen
filesystems.
Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Nikola Ciprich <nikola.ciprich@linuxbox.cz> Cc: Marco Stornelli <marco.stornelli@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhao Hongjiang [Wed, 20 Feb 2013 02:13:55 +0000 (13:13 +1100)]
fs: change return values from -EACCES to -EPERM
According to SUSv3:
[EACCES] Permission denied. An attempt was made to access a file in a way
forbidden by its file access permissions.
[EPERM] Operation not permitted. An attempt was made to perform an operation
limited to processes with appropriate privileges or to the owner of a file
or other resource.
So -EPERM should be returned if capability checks fails.
Strictly speaking this is an API change since the error code user sees is
altered.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Josh Hunt [Wed, 20 Feb 2013 02:13:55 +0000 (13:13 +1100)]
block: restore /proc/partitions to not display non-partitionable removable devices
We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.
Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.
Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Guo Chao [Wed, 20 Feb 2013 02:13:54 +0000 (13:13 +1100)]
loopdev: remove an user triggerable oops
When loopdev is built as module and we pass an invalid parameter,
loop_init() will return directly without deregister misc device, which
will cause an oops when insert loop module next time because we left some
garbage in the misc device list.
Test case:
sudo modprobe loop max_part=1024
(failed due to invalid parameter)
sudo modprobe loop
(oops)
Clean up nicely to avoid such oops.
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Guo Chao [Wed, 20 Feb 2013 02:13:54 +0000 (13:13 +1100)]
loopdev: move common code into loop_figure_size()
Update block device size in accord with gendisk size and let userspace
know the change in loop_figure_size(). This is a clean up to remove
common code of loop_figure_size()'s two callers.
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
blockdev reports file size instead of sizelimit several out of 100 times.
The problems are:
- losetup set up the device in two ioctl:
LOOP_SET_FD and LOOP_SET_STATUS64.
- LOOP_SET_STATUS64 only update size of gendisk.
Block device size will be updated lazily when device comes to use. If udev
rushes in between the two ioctl, it will bring in a block device whose
size is backing file size. If the device is not released after
LOOP_SET_STATUS64 ioctl, blockdev will not see the updated size.
Update block size in LOOP_SET_STATUS64 ioctl.
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Reported-by: M. Hindess <hindessm@uk.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lockdep does not report it, because path #2 actually holds a subclass of
lo_ctl_mutex. This subclass seems creep into the code by mistake. The
patch author actually just mentioned it in the changelog, see commit f028f3b2 ("loop: fix circular locking in loop_clr_fd()"), also see:
Guo Chao [Wed, 20 Feb 2013 02:13:53 +0000 (13:13 +1100)]
block: use i_size_write() in bd_set_size()
blkdev_ioctl(GETBLKSIZE) uses i_size_read() to read size of block device.
If we update block size directly, reader may see intermediate result in
some machines and configurations. Use i_size_write() instead.
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Wed, 20 Feb 2013 02:13:53 +0000 (13:13 +1100)]
cfq: fix lock imbalance with failed allocations
While stress-running very-small container scenarios with the Kernel Memory
Controller, I've run into a lockdep-detected lock imbalance in
cfq-iosched.c.
I'll apologize beforehand for not posting a backlog: I didn't anticipate
it would be so hard to reproduce, so I didn't save my serial output and
went directly on debugging. Turns out that it did not happen again in
more than 20 runs, making it a quite rare pattern.
But here is my analysis:
When we are in very low-memory situations, we will arrive at
cfq_find_alloc_queue and may not find a queue, having to resort to the oom
queue, in an rcu-locked condition:
if (!cfqq || cfqq == &cfqd->oom_cfqq)
[ ... ]
Next, we will release the rcu lock, and try to allocate a queue, retrying
if we succeed:
And right before exiting, we'll issue rcu_read_unlock().
Being already unlocked, this is the likely source of our imbalance. Since
cfqq is either already NULL or made NULL in the first statement of the
outter branch, the only viable alternative here seems to be to return the
oom queue right away in case of allocation failure.
Please review the following patch and apply if you agree with my analysis.
Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Paul Bolle [Wed, 20 Feb 2013 02:13:52 +0000 (13:13 +1100)]
drivers/scsi/aacraid/src.c: silence two GCC warnings
Compiling src.o for a 32 bit system triggers two GCC warnings:
drivers/scsi/aacraid/src.c: In function `aac_src_deliver_message':
drivers/scsi/aacraid/src.c:410:3: warning: right shift count >= width of type [enabled by default]
drivers/scsi/aacraid/src.c:434:2: warning: right shift count >= width of type [enabled by default]
Silence these warnings by casting the 'address' variable (of type
dma_addr_t) to u64 on those two lines.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
which is clearly bogus. If lockdep is disabled in the config this would
cause a compile failure, if it is enabled then it compiles and causes a
puzzling warning about dereferencing without the correct protection.
Wrap the macro in "do { ... } while (0)" to also fail compile for this
when lockdep is enabled.
Signed-off-by: Johannes Berg <johannes.berg@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Jones <davej@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Jones <davej@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Peter Zijlstra <peterz@infradead.org>) Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nathan Zimmer [Wed, 20 Feb 2013 02:13:50 +0000 (13:13 +1100)]
sched: /proc/sched_debug fails on very very large machines
On systems with 4096 cores attemping to read /proc/sched_debug fails. We
are trying to push all the data into a single kmalloc buffer. The issue
is on these very large machines all the data will not fit in 4mb.
A better solution is to not us the single_open mechanism but to provide
our own seq_operations and treat each cpu as an individual record.
The output should be identical to previous version.
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Reported-by: Dave Jones <davej@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>) Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Jones <davej@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
v2: Took Andrew's suggestion to add comments, fix memleak
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Jones <davej@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nathan Zimmer [Wed, 20 Feb 2013 02:13:48 +0000 (13:13 +1100)]
sched: /proc/sched_stat fails on very very large machines
On systems with 4096 cores doing a cat /proc/sched_stat fails. We are
trying to push all the data into a single kmalloc buffer. The issue is on
these very large machines all the data will not fit in 4mb.
A better solution is to not use the single_open() mechanism but to provide
our own seq_operations.
The output should be identical to previous version and thus not need the
version number.
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Reported-by: Dave Jones <davej@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Wed, 20 Feb 2013 02:13:48 +0000 (13:13 +1100)]
ocfs2-remove-kfree-redundant-null-checks-fix
revert dubious change in ocfs2_begin_truncate_log_recovery()
Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tim Gardner [Wed, 20 Feb 2013 02:13:47 +0000 (13:13 +1100)]
ocfs2: remove kfree() redundant null checks
smatch analysis indicates a number of redundant NULL checks before
calling kfree(), e.g.,
fs/ocfs2/alloc.c:6138 ocfs2_begin_truncate_log_recovery() info:
redundant null check on *tl_copy calling kfree()
fs/ocfs2/alloc.c:6755 ocfs2_zero_range_for_truncate() info:
redundant null check on pages calling kfree()
etc....
Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kernel/time/timer_list.c: In function 'timer_list_show':
kernel/time/timer_list.c:262: warning: cast to pointer from integer of different size
kernel/time/timer_list.c:267: warning: cast to pointer from integer of different size
kernel/time/timer_list.c: In function 'timer_list_start':
kernel/time/timer_list.c:297: warning: cast to pointer from integer of different size
This code is revolting :(
Cc: Dave Jones <davej@redhat.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Jones <davej@redhat.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nathan Zimmer [Wed, 20 Feb 2013 02:13:45 +0000 (13:13 +1100)]
timer_list: convert timer list to be a proper seq_file
When running with 4096 cores attemping to read /proc/timer_list will fail
with an ENOMEM condition. On a sufficantly large systems the total amount
of data is more then 4mb, so it won't fit into a single buffer. The
failure can also occur on smaller systems when memory fragmentation is
high as reported by Dave Jones.
Convert /proc/timer_list to a proper seq_file with its own iterator. This
is a little more complex given that we have to make two passes with two
separate headers.
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Reported-by: Dave Jones <davej@redhat.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nathan Zimmer [Wed, 20 Feb 2013 02:13:45 +0000 (13:13 +1100)]
timer_list: split timer_list_show_tickdevices()
Split timer_list_show_tickdevices() out the header and just pull the rest up
to timer_list_show. Also tweak the location of the whitespace. This is all
to prep for the fix.
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Reported-by: Dave Jones <davej@redhat.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
How is the compiler even handling exported functions that are marked
inline? Anyway, these shouldn't be inline because of that, so remove that
marking.
Based on a larger patch by Mark Charlebois to get LLVM to build the
kernel.
Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mark Charlebois <mcharleb@qualcomm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: hank <pyu@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ondrej Zary [Wed, 20 Feb 2013 02:13:44 +0000 (13:13 +1100)]
cyber2000fb: avoid palette corruption at higher clocks
When 1280x1024@75Hz mode is set, console palette is not set properly -
sometimes the background is white, sometimes yellow and text colors are
also messed up. This does not happen at 1280x1024@60Hz and below.
It seems that the HW needs some time before setting the palette - maybe
the PLL needs more time to lock at higher speeds. This patch fixes the
problem but without knowing what register to check for PLL lock(?), the
delay might be excessive.
On Fri, 28 Jan 2011 18:15:37 +0000
Russell King <rmk@arm.linux.org.uk> wrote:
> On Tue, Jan 18, 2011 at 01:14:24PM -0800, Andrew Morton wrote:
> > Russell, I have an (old) note here that this is awaiting an ack from
> > yourself?
>
> Well, I can reproduce this problem on the Netwinders here. I'm not sure
> that we should delay all mode switches by one second - and any attempt
> to reduce this value does result in the palette not being set correctly.
>
> For 1280x1024-75, the dotclock is 135MHz, which gives a PLL values of
> 0x41 and 0x06. That's: M=0x41+1, N=0x06+1, P=0x00 (top 2 bits of 0x06)
> -> Q=1
>
> Fpll = 14.31818MHz * M / N
> Fout = Fpll / Q
>
> The PLL itself is formed by dividing the 14-ish MHz frequency by N and
> phase comparing the output of the VCO, divided by M, and adjusting the
> VCO until the two correlate. As VCOs typically tend to have a limited
> range, it's normal to divide the output frequency to produce a greater
> range - and in this case that's done by Q.
>
> For the 800x600-100 copied from /etc/fb.modes, this has a dotclock of
> 67.5MHz, which is exactly half this rate. The PLL values for this are:
> M=0x41+1, N=0x06+1, P=0x01, giving PLL values of 0x41 and 0x46.
>
> Booting with 800x600-100 does not suffer the problem. So it's not
> related to PLL lock time. There's something else going on.
>
> Another experiment I tried was forcing the PLL values to produce 108MHz
> instead of 135MHz. 108MHz is the dotclock for 1280x1024-60. This too
> doesn't suffer the problem.
>
> I've also tried chosing other delay values. 100ms is too short and
> produces the problem, but 1s works. 1s for a PLL to lock is a hell of
> a time, especially for a PLL operating in the MHz range.
>
> I've tried setting the PLL to a known good freqency, and then switching
> to 135MHz - the problem persists. It's not like 135MHz is reaching the
> limits - it'll go up to 206MHz.
>
> So, I don't think this has anything to do with PLL locking. I think
> there's something else going on which isn't immediately obvious - maybe
> bandwidth starvation preventing us from writing properly to the palette?
> As it's a horrible VGA, where you write the same register multiple times
> I wouldn't be surprised if some writes were going missing.
>
> I'll see if I can play around with it some more this evening, but I've
> spent an awful long time on just this issue already this afternoon...
>
> I think further investigation needs to happen on this patch before it's
> acceptable. Or maybe we should prevent the cyberpro coming up in
Signed-off-by: Ondrej Zary <linux@rainbow-software.org> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Wed, 20 Feb 2013 02:13:43 +0000 (13:13 +1100)]
video: exynos_dp: add missing of_node_put()
of_find_node_by_name() returns a node pointer with refcount incremented,
use of_node_put() on it when done.
of_find_node_by_name() will call of_node_put() against the node pass to
from parameter, thus we also need to call of_node_get(from) before calling
of_find_node_by_name().
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: Ajay Kumar <ajaykumar.rs@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jingoo Han [Wed, 20 Feb 2013 02:13:42 +0000 (13:13 +1100)]
video: s3c-fb: remove duplicated S3C_FB_MAX_WIN
S3C_FB_MAX_WIN is already defined in 'plat-samsung/include/plat/fb.h'.
So, this definition in 'include/video/samsung_fimd.h' should be removed to
avoid the duplication.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Tomasz Figa <t.figa@samsung.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sachin Kamat [Wed, 20 Feb 2013 02:13:41 +0000 (13:13 +1100)]
drivers/video/exynos/exynos_mipi_dsi.c: fix an error check condition
Checking an unsigned variable for negative value returns false. Hence use
the macro to fix it.
Fixes the following smatch warning:
drivers/video/exynos/exynos_mipi_dsi.c:417 exynos_mipi_dsi_probe() warn:
unsigned 'dsim->irq' is never less than zero.
Zhou Zhu [Wed, 20 Feb 2013 02:13:39 +0000 (13:13 +1100)]
video: mmpdisp: add spi port in display controller
Add spi port support in mmp display controller. This port is from display
controller and for panel usage. This driver implemented and registered as
a spi master.
Signed-off-by: Zhou Zhu <zzhu3@marvell.com> Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com> Cc: Lisa Du <cldu@marvell.com> Cc: Guoqing Li <ligq@marvell.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lisa Du [Wed, 20 Feb 2013 02:13:39 +0000 (13:13 +1100)]
video: mmp: add tpo hvga panel supported
Add tpo hvga panel support in marvell display framework. This panel
driver implements modes query and power on/off.
This panel driver gets panel config/ plat power on/off/ connected path
name from machine-info and registered as a spi device. This panel driver
uses mmp_disp supplied register_panel function to register panel to path
as machine-info defined.
Signed-off-by: Lisa Du <cldu@marvell.com> Signed-off-by: Zhou Zhu <zzhu3@marvell.com> Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com> Cc: Guoqing Li <ligq@marvell.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Guoqing Li [Wed, 20 Feb 2013 02:13:39 +0000 (13:13 +1100)]
video: mmp display controller support
Marvell mmp series display controller support in mmpdisp subsystem. This
driver focus on implementation of hardware operations of path/overlay,
which is defined in mmp display subsystem interface. This driver
registers all pathes to mmp display framework.
Signed-off-by: Guoqing Li <ligq@marvell.com> Signed-off-by: Lisa Du <cldu@marvell.com> Signed-off-by: Zhou Zhu <zzhu3@marvell.com> Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Arnd Bergmann [Wed, 20 Feb 2013 02:13:38 +0000 (13:13 +1100)]
fb: mmp: include linux/platform_device.h
Patch 16559ae "kgdb: remove #include <linux/serial_8250.h> from kgdb.h"
changes the kgdb.h file so that drivers including it do not implicitly
include linux/platform_device.h. The mmp framebuffer driver is new,
so Greg did not have a chance to fix it up when introducing his change.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Zhou Zhu <zzhu3@marvell.com> Cc: Lisa Du <cldu@marvell.com> Cc: Guoqing Li <ligq@marvell.com> Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhou Zhu [Wed, 20 Feb 2013 02:13:38 +0000 (13:13 +1100)]
video: mmp fb support
Add fb support for Marvell mmp display subsystem. This driver is
configured using "buffer driver mach info". With configured name of path,
this driver get path using using exported interface of mmp display driver.
Then this driver get overlay using configured id and operates on this
overlay to show buffers on display devices.
Signed-off-by: Zhou Zhu <zzhu3@marvell.com> Signed-off-by: Lisa Du <cldu@marvell.com> Cc: Guoqing Li <ligq@marvell.com> Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
1. fb folder contains implementation of fb. fb get path and overlay
from common interface and operates on these structures.
2. core.c provides common interface for a hardware abstraction. Major
parts of this interface are:
a) Path: path is a output device connected to a panel or HDMI TV. Main
operations of the path is set/get timing/output color. fb operates
output device through path structure.
b) Ovly: Ovly is a buffer shown on the path.
Ovly describes frame buffer and its source/destination size, offset,
input color, buffer address, z-order, and so on. Each fb device maps
to one overlay.
3. hw folder contains implementation of hardware operations defined by
core.c. It registers paths for fb use.
4. panel folder contains implementation of panels. It's connected to
path. Panel drivers would also regiester panels and linked to path
when probe.
Signed-off-by: Zhou Zhu <zzhu3@marvell.com> Signed-off-by: Lisa Du <cldu@marvell.com> Cc: Guoqing Li <ligq@marvell.com> Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Wed, 20 Feb 2013 02:13:37 +0000 (13:13 +1100)]
goldfish-framebuffer-driver-fix
fix (silly) sparse warnings
Cc: Alan Cox <alan@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Bruce Beare <bruce.j.beare@intel.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: Jun Nakajima <jun.nakajima@intel.com> Cc: Mike A. Chan <mikechan@google.com> Cc: Sheng Yang <sheng@linux.intel.com> Cc: Tom Keel <thomas.keel@intel.com> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com> Cc: Xiaohui Xin <xiaohui.xin@intel.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Arve Hjønnevåg [Wed, 20 Feb 2013 02:13:37 +0000 (13:13 +1100)]
goldfish: framebuffer driver
Framebuffer support for the Goldfish emulator. This takes the Google
emulator and applies the x86 cleanups as well as moving the blank methods
to the usual Linux place and dropping the Android early suspend logic (for
now at least, that can be looked at as Android and upstream converge).
Dropped various oddities like setting MTRRs on a virtual frame buffer
emulation...
With the drivers so far you can now boot a Linux initrd and have fun.
[sheng@linux.intel.com: cleaned up to handle x86]
[thomas.keel@intel.com: ported to 3.4]
[alan@linux.intel.com: cleaned up for style and 3.7, moved blank methods] Signed-off-by: Mike A. Chan <mikechan@google.com> Signed-off-by: Arve Hjønnevåg <arve@android.com> Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Xiaohui Xin <xiaohui.xin@intel.com> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com> Signed-off-by: Bruce Beare <bruce.j.beare@intel.com> Signed-off-by: Tom Keel <thomas.keel@intel.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Daniel Vetter [Wed, 20 Feb 2013 02:13:37 +0000 (13:13 +1100)]
drm/fb-helper: don't sleep for screen unblank when an oops is in progress
Otherwise the system will burn even brighter and worse, leave the user
wondering what's going on exactly.
Since we already have a panic handler which will (try) to restore the
entire fbdev console mode, we can just bail out. Inspired by a patch from
Konstantin Khlebnikov. The callchain leading to this, cut&pasted from
Konstantin's original patch:
Note that the entire locking in the fb helper around panic/sysrq and kdbg
is ... non-existant. So we have a decent change of blowing up
everything. But since reworking this ties in with funny concepts like the
fbdev notifier chain or the impressive things which happen around
console_lock while oopsing, I'll leave that as an exercise for braver
souls than me.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Dave Airlie <airlied@gmail.com> Reviewed-by: Rob Clark <robdclark@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: use vm_unmapped_area() on powerpc architecture
Update the powerpc slice_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.
Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: remove free_area_cache use in powerpc architecture
As all other architectures have been converted to use vm_unmapped_area(),
we are about to retire the free_area_cache.
This change simply removes the use of that cache in
slice_get_unmapped_area(), which will most certainly have a
performance cost. Next one will convert that function to use the
vm_unmapped_area() infrastructure and regain the performance.
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
pcmcia: move unbind/rebind into dev_pm_ops.complete
Move the device rebind procedures for cardbus devices from the pm.resume
into the pm.complete callback.
The reason for moving the code is: "[...] The PM code needs to send
suspend and resume messages to every device in the right order, and it
can't do that if new devices are being added at the same time. [...]"
However the situation really isn't quite that rigid. In particular,
adding new children during a resume callback shouldn't cause much of
problem because the children don't need to be resumed anyway (since they
were never suspended). On the other hand, if you do it you will get a
dev_warn() from the PM core, something like 'parent should not be
sleeping'.
Still, it is considered bad form and should be avoided if possible."
(Alan Stern's full comment about the topic can
be found here: <https://lkml.org/lkml/2012/7/10/254>)
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Greg KH <greg@kroah.com> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On cris-linux-gcc, __SIZE_TYPE__ expands to "unsigned int", as
gcc-4.6.3-nolibc/cris-linux/lib/gcc/cris-linux/4.6.3/plugin/include/config/cris/linux.h
has
#define SIZE_TYPE "unsigned int"
Hence __kernel_size_t is also "unsigned int". But __kernel_ssize_t is
"long", which has a different base type, causing compiler warnings like:
fs/quota/quota_tree.c:372:4: warning: format '%zd' expects argument of type 'signed size_t', but argument 4 has type 'ssize_t' [-Wformat]
To fix this, __kernel_ssize_t should be changed to "int". Hence cris can
just use the generic 32-bit versions from include/asm-generic/posix_types.h
for all size-related types.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Mikael Starvik <starvik@axis.com> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Hans-Peter Nilsson <hans-peter.nilsson@axis.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
drivers/md/persistent-data/dm-transaction-manager.c:28:1: warning: "HASH_SIZE" redefined
In file included from include/linux/elevator.h:5,
from include/linux/blkdev.h:216,
from drivers/md/persistent-data/dm-block-manager.h:11,
from drivers/md/persistent-data/dm-transaction-manager.h:10,
from drivers/md/persistent-data/dm-transaction-manager.c:6:
include/linux/hashtable.h:22:1: warning: this is the location of the previous definition Cc: Alasdair Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wen Congyang [Wed, 20 Feb 2013 02:13:34 +0000 (13:13 +1100)]
x86: make 'mem=' option to work for efi platform
Current mem boot option only can work for non efi environment. If the
user specifies add_efi_memmap, it cannot work for efi environment. In the
efi environment, we call e820_add_region() to add the memory map. So we
can modify __e820_add_region() and the mem boot option can work for efi
environment.
Note: Only E820_RAM is limited, and BOOT_SERVICES_{CODE,DATA} are always
mapped(If its address >= mem_limit, the memory won't be freed in
efi_free_boot_services()).
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Rob Landley <rob@landley.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Yasuaki ISIMATU <isimatu.yasuaki@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Matthew Garrett <mjg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kernel will crash if inside the "crash tool" one would try to read
the memory at the not present address.
crash> p crash_address
crash_address = $8 = (long unsigned int *) 0xffff88023c000000
crash> rd 0xffff88023c000000
[ *lockup* ]
The lockup happens because _PAGE_GLOBAL and _PAGE_PROTNONE shares the
same bit, and pageattr leaves _PAGE_GLOBAL set on a kernel pte which
is then mistaken as _PAGE_PROTNONE (so pte_present returns true by
mistake and the kernel fault then gets confused and loops).
With THP the same can happen after we taught pmd_present to check
_PAGE_PROTNONE and _PAGE_PSE in commit 027ef6c87853b0a9df5317 ("mm: thp:
fix pmd_present for split_huge_page and PROT_NONE with THP"). THP has the
same problem with _PAGE_GLOBAL as the 4k pages, but it also has a problem
with _PAGE_PSE, which must be cleared too.
After the patch is applied copy_user correctly returns -EFAULT and
doesn't lockup anymore.
Andrea Arcangeli [Wed, 20 Feb 2013 02:13:34 +0000 (13:13 +1100)]
Revert "x86, mm: Make spurious_fault check explicitly check the PRESENT bit"
I got a report for a minor regression introduced by commit 027ef6c87853b
("mm: thp: fix pmd_present for split_huge_page and PROT_NONE with THP").
So the problem is, pageattr creates kernel pagetables (pte and pmds) that
breaks pte_present/pmd_present and the patch above exposed this invariant
breakage for pmd_present.
The same problem already existed for the pte and pte_present and it was
fixed by commit 660a293ea9be709 ("x86, mm: Make spurious_fault check
explicitly check the PRESENT bit") (if it wasn't for that commit, it
wouldn't even be a regression). That fix avoids the pagefault to use
pte_present. I could follow through by stopping using
pmd_present/pmd_huge too.
However I think it's more robust to fix pageattr and to clear the
PSE/GLOBAL bitflags too in addition to the present bitflag. So the kernel
page fault can keep using the regular pte_present/pmd_present/pmd_huge.
The confusion arises because _PAGE_GLOBAL and _PAGE_PROTNONE are sharing
the same bit, and in the pmd case we pretend _PAGE_PSE to be set only in
present pmds (to facilitate split_huge_page final tlb flush).
This patch:
Revert commit 660a293ea9be709 ("x86, mm: Make spurious_fault check
explicitly check the PRESENT bit").
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Shaohua Li <shaohua.li@intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wen Congyang [Wed, 20 Feb 2013 02:13:33 +0000 (13:13 +1100)]
x86 numa: don't check if node is NUMA_NO_NODE
If we aren't debugging per_cpu maps, the cpu's node is stored in per_cpu
variable numa_node. If `node' is NUMA_NO_NODE, it means the caller wants
to clear the cpu's node. So we should also call set_cpu_numa_node() in
this case.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
MITSUNARI Shigeo [Wed, 20 Feb 2013 02:13:33 +0000 (13:13 +1100)]
fs/block_dev.c: page cache wrongly left invalidated after revalidate_disk()
We found that bdev->bd_invalidated was left set once revalidate_disk() is
called, which results in page cache flush every time that device is open.
Specifically, we found this problem in MD block device. Once we resize a
MD device, mdadm --monitor periodically flush all page cache for that
device every 60 or 1000 seconds when it opens the device.
This bug lies since at least 3.2.0 till the latest kernel(3.6.2).
Patch is attached.
The following steps will reproduce the problem.
1. prepair a block device(ex. /dev/sdb).
2. create two partitions.
Jim Somerville [Wed, 20 Feb 2013 02:13:33 +0000 (13:13 +1100)]
inotify: remove broken mask checks causing unmount to be EINVAL
Running the command:
inotifywait -e unmount /mnt/disk
immediately aborts with a -EINVAL return code. This is however a valid
parameter. This abort occurs only if unmount is the sole event parameter.
If other event parameters are supplied, then the unmount event wait will
work.
The problem was introduced by commit 44b350fc23e ("inotify: Fix mask
checks"). In that commit, it states:
The mask checks in inotify_update_existing_watch() and
inotify_new_watch() are useless because inotify_arg_to_mask()
sets FS_IN_IGNORED and FS_EVENT_ON_CHILD bits anyway.
But instead of removing the useless checks, it did this:
mask = inotify_arg_to_mask(arg);
- if (unlikely(!mask))
+ if (unlikely(!(mask & IN_ALL_EVENTS)))
return -EINVAL;
The problem is that IN_ALL_EVENTS doesn't include IN_UNMOUNT, and other
parts of the code keep IN_UNMOUNT separate from IN_ALL_EVENTS. So the
check should be:
if (unlikely(!(mask & (IN_ALL_EVENTS | IN_UNMOUNT))))
But inotify_arg_to_mask(arg) always sets the IN_UNMOUNT bit in the mask
anyway, so the check is always going to pass and thus should simply be
removed. Also note that inotify_arg_to_mask completely controls what mask
bits get set from arg, there's no way for invalid bits to get enabled
there.
Lets fix it by simply removing the useless broken checks.
Signed-off-by: Jim Somerville <Jim.Somerville@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Cc: Eric Paris <eparis@parisplace.org> Cc: <stable@vger.kernel.org> [2.6.37+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Maxim Patlasov [Wed, 20 Feb 2013 02:13:32 +0000 (13:13 +1100)]
proc: avoid extra pde_put() in proc_fill_super()
If proc_get_inode() succeeded, but d_make_root() failed, pde_put() for
proc_root will be called twice: the first time due to iput() called from
d_make_root() and the second time directly in the end of
proc_fill_super().
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>