]> git.karo-electronics.de Git - karo-tx-linux.git/log
karo-tx-linux.git
12 years agomm: account reaped page cache on inode cache pruning
Konstantin Khlebnikov [Wed, 30 Nov 2011 04:11:23 +0000 (15:11 +1100)]
mm: account reaped page cache on inode cache pruning

Inode cache pruning indirectly reclaims page-cache by invalidating mapping
pages.  Let's account them into reclaim-state to notice this progress in
memory reclaimer.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm: avoid livelock on !__GFP_FS allocations
Mel Gorman [Wed, 30 Nov 2011 04:11:07 +0000 (15:11 +1100)]
mm: avoid livelock on !__GFP_FS allocations

Colin Cross reported;

  Under the following conditions, __alloc_pages_slowpath can loop forever:
  gfp_mask & __GFP_WAIT is true
  gfp_mask & __GFP_FS is false
  reclaim and compaction make no progress
  order <= PAGE_ALLOC_COSTLY_ORDER

  These conditions happen very often during suspend and resume,
  when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
  allocations into __GFP_WAIT.

  The oom killer is not run because gfp_mask & __GFP_FS is false,
  but should_alloc_retry will always return true when order is less
  than PAGE_ALLOC_COSTLY_ORDER.

In his fix, he avoided retrying the allocation if reclaim made no progress
and __GFP_FS was not set.  The problem is that this would result in
GFP_NOIO allocations failing that previously succeeded which would be very
unfortunate.

The big difference between GFP_NOIO and suspend converting GFP_KERNEL to
behave like GFP_NOIO is that normally flushers will be cleaning pages and
kswapd reclaims pages allowing GFP_NOIO to succeed after a short delay.
The same does not necessarily apply during suspend as the storage device
may be suspended.

This patch special cases the suspend case to fail the page allocation if
reclaim cannot make progress and adds some documentation on how
gfp_allowed_mask is currently used.  Failing allocations like this may
cause suspend to abort but that is better than a livelock.

[mgorman@suse.de: Rework fix to be suspend specific]
[rientjes@google.com: Move suspended device check to should_alloc_retry]
Reported-by: Colin Cross <ccross@android.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm-reduce-the-amount-of-work-done-when-updating-min_free_kbytes-checkpatch-fixes
Andrew Morton [Wed, 30 Nov 2011 04:11:06 +0000 (15:11 +1100)]
mm-reduce-the-amount-of-work-done-when-updating-min_free_kbytes-checkpatch-fixes

Cc: Mel Gorman <mgorman@suse.de>
WARNING: line over 80 characters
#42: FILE: mm/page_alloc.c:3464:
+ /* Blocks with reserved pages will never free, skip them. */

WARNING: line over 80 characters
#61: FILE: mm/page_alloc.c:3477:
+ set_pageblock_migratetype(page, MIGRATE_RESERVE);

WARNING: line over 80 characters
#62: FILE: mm/page_alloc.c:3478:
+ move_freepages_block(zone, page, MIGRATE_RESERVE);

total: 0 errors, 3 warnings, 44 lines checked

./patches/mm-reduce-the-amount-of-work-done-when-updating-min_free_kbytes.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm: reduce the amount of work done when updating min_free_kbytes
Mel Gorman [Wed, 30 Nov 2011 04:11:06 +0000 (15:11 +1100)]
mm: reduce the amount of work done when updating min_free_kbytes

When min_free_kbytes is updated, some pageblocks are marked
MIGRATE_RESERVE.  Ordinarily, this work is unnoticable as it happens early
in boot but on large machines with 1TB of memory, this has been reported
to delay boot times, probably due to the NUMA distances involved.

The bulk of the work is due to calling calling pageblock_is_reserved() an
unnecessary amount of times and accessing far more struct page metadata
than is necessary.  This patch significantly reduces the amount of work
done by setup_zone_migrate_reserve() improving boot times on 1TB machines.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm-do-not-stall-in-synchronous-compaction-for-thp-allocations-v3
Mel Gorman [Wed, 30 Nov 2011 04:11:06 +0000 (15:11 +1100)]
mm-do-not-stall-in-synchronous-compaction-for-thp-allocations-v3

Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm: do not stall in synchronous compaction for THP allocations
Mel Gorman [Wed, 30 Nov 2011 04:11:05 +0000 (15:11 +1100)]
mm: do not stall in synchronous compaction for THP allocations

Occasionally during large file copies to slow storage, there are still
reports of user-visible stalls when THP is enabled.  Reports on this have
been intermittent and not reliable to reproduce locally but;

Andy Isaacson reported a problem copying to VFAT on SD Card
https://lkml.org/lkml/2011/11/7/2

In this case, it was stuck in munmap for betwen 20 and 60
seconds in compaction. It is also possible that khugepaged
was holding mmap_sem on this process if CONFIG_NUMA was set.

Johannes Weiner reported stalls on USB
https://lkml.org/lkml/2011/7/25/378

In this case, there is no stack trace but it looks like the
same problem. The USB stick may have been using NTFS as a
filesystem based on other work done related to writing back
to USB around the same time.

Internally in SUSE, I received a bug report related to stalls in firefox
when using Java and Flash heavily while copying from NFS
to VFAT on USB. It has not been confirmed to be the same problem
but if it looks like a duck and quacks like a duck.....

In the past, commit [11bc82d6: mm: compaction: Use async migration for
__GFP_NO_KSWAPD and enforce no writeback] forced that sync compaction
would never be used for THP allocations.  This was reverted in commit
[c6a140bf: mm/compaction: reverse the change that forbade sync migraton
with __GFP_NO_KSWAPD] on the grounds that it was uncertain it was
beneficial.

While user-visible stalls do not happen for me when writing to USB, I
setup a test running postmark while short-lived processes created
anonymous mapping.  The objective was to exercise the paths that allocate
transparent huge pages.  I then logged when processes were stalled for
more than 1 second, recorded a stack strace and did some analysis to
aggregate unique "stall events" which revealed

Time stalled in this event:    47369 ms
Event count:                      20
usemem               sleep_on_page          3690 ms
usemem               sleep_on_page          2148 ms
usemem               sleep_on_page          1534 ms
usemem               sleep_on_page          1518 ms
usemem               sleep_on_page          1225 ms
usemem               sleep_on_page          2205 ms
usemem               sleep_on_page          2399 ms
usemem               sleep_on_page          2398 ms
usemem               sleep_on_page          3760 ms
usemem               sleep_on_page          1861 ms
usemem               sleep_on_page          2948 ms
usemem               sleep_on_page          1515 ms
usemem               sleep_on_page          1386 ms
usemem               sleep_on_page          1882 ms
usemem               sleep_on_page          1850 ms
usemem               sleep_on_page          3715 ms
usemem               sleep_on_page          3716 ms
usemem               sleep_on_page          4846 ms
usemem               sleep_on_page          1306 ms
usemem               sleep_on_page          1467 ms
[<ffffffff810ef30c>] wait_on_page_bit+0x6c/0x80
[<ffffffff8113de9f>] unmap_and_move+0x1bf/0x360
[<ffffffff8113e0e2>] migrate_pages+0xa2/0x1b0
[<ffffffff81134273>] compact_zone+0x1f3/0x2f0
[<ffffffff811345d8>] compact_zone_order+0xa8/0xf0
[<ffffffff811346ff>] try_to_compact_pages+0xdf/0x110
[<ffffffff810f773a>] __alloc_pages_direct_compact+0xda/0x1a0
[<ffffffff810f7d5d>] __alloc_pages_slowpath+0x55d/0x7a0
[<ffffffff810f8151>] __alloc_pages_nodemask+0x1b1/0x1c0
[<ffffffff811331db>] alloc_pages_vma+0x9b/0x160
[<ffffffff81142bb0>] do_huge_pmd_anonymous_page+0x160/0x270
[<ffffffff814410a7>] do_page_fault+0x207/0x4c0
[<ffffffff8143dde5>] page_fault+0x25/0x30

The stall times are approximate at best but the estimates represent 25% of
the worst stalls and even if the estimates are off by a factor of 10, it's
severe.

This patch once again prevents sync migration for transparent hugepage
allocations as it is preferable to fail a THP allocation than stall.

It was suggested that __GFP_NORETRY be used instead of __GFP_NO_KSWAPD to
look less like a special case.  This would prevent THP allocation using
sync compaction but it would have other side-effects.  There are existing
users of __GFP_NORETRY that are doing high-order allocations and while
they can handle allocation failure, it seems reasonable that they continue
to use sync compaction unless there is a deliberate reason to change that.
 To help clarify this for the future, this patch updates the comment for
__GFP_NO_KSWAPD.

If accepted, this is a -stable candidate.

Reported-by: Andy Isaacson <adi@hexapodia.org>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm: migrate: one less atomic operation
Jacobo Giralt [Wed, 30 Nov 2011 04:11:05 +0000 (15:11 +1100)]
mm: migrate: one less atomic operation

migrate_page_move_mapping() drops a reference from the old page after
unfreezing its counter.  Both operations can be merged into a single
atomic operation by directly unfreezing to one less reference.

The same applies to migrate_huge_page_move_mapping().

Signed-off-by: Jacobo Giralt <jacobo.giralt@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm-add-extra-free-kbytes-tunable-update-checkpatch-fixes
Andrew Morton [Wed, 30 Nov 2011 04:11:05 +0000 (15:11 +1100)]
mm-add-extra-free-kbytes-tunable-update-checkpatch-fixes

ERROR: trailing whitespace
#98: FILE: mm/page_alloc.c:5303:
+ * free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so $

ERROR: trailing whitespace
#103: FILE: mm/page_alloc.c:5307:
+int free_kbytes_sysctl_handler(ctl_table *table, int write, $

ERROR: need consistent spacing around '*' (ctx:WxV)
#103: FILE: mm/page_alloc.c:5307:
+int free_kbytes_sysctl_handler(ctl_table *table, int write,
                                          ^

total: 3 errors, 0 warnings, 69 lines checked

NOTE: whitespace errors detected, you may wish to use scripts/cleanpatch or
      scripts/cleanfile

./patches/mm-add-extra-free-kbytes-tunable-update.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm-add-extra-free-kbytes-tunable-update
Rik van Riel [Wed, 30 Nov 2011 04:11:04 +0000 (15:11 +1100)]
mm-add-extra-free-kbytes-tunable-update

All the fixes suggested by Andrew Morton.   Not much of a changelog
since the patch should probably be folded into
mm-add-extra-free-kbytes-tunable.patch

Thank you for pointing these out, Andrew.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm: add extra free kbytes tunable
Rik van Riel [Wed, 30 Nov 2011 04:11:04 +0000 (15:11 +1100)]
mm: add extra free kbytes tunable

Add a userspace visible knob to tell the VM to keep an extra amount of
memory free, by increasing the gap between each zone's min and low
watermarks.

This is useful for realtime applications that call system calls and have a
bound on the number of allocations that happen in any short time period.
In this application, extra_free_kbytes would be left at an amount equal to
or larger than than the maximum number of allocations that happen in any
burst.

It may also be useful to reduce the memory use of virtual machines
(temporarily?), in a way that does not cause memory fragmentation like
ballooning does.

Testing results from Satoru Moriya:

: I ran some sample workloads and measure memory allocation latency
: (latency of __alloc_page_nodemask()).
: The test is like following:
:
:  - CPU: 1 socket, 4 core
:  - Memory: 4GB
:
:  - Background load:
:    $ dd if=3D/dev/zero of=3D/tmp/tmp1
:    $ dd if=3D/dev/zero of=3D/tmp/tmp2
:    $ dd if=3D/dev/zero of=3D/tmp/tmp3
:
:  - Main load:
:    $ mapped-file-stream 1 $((1024 * 1024 * 640))  --(*)
:
:  (*) This is made by Johannes Weiner
:      https://lkml.org/lkml/2010/8/30/226
:
:      It allocates/access 640MByte memory at a burst.
:
: The result is follwoing:
:
:                                |         |  extra   |
:                                | default |  kbytes  |
: --------------------------------------------------------------
: min_free_kbytes                |    8113 |   8113   |
: extra_free_kbytes              |       0 | 640*1024 | (KB)
: --------------------------------------------------------------
: worst latency                  | 517.762 |  20.775  | (usec)
: --------------------------------------------------------------
: vmstat result                  |         |          |
:  nr_vmscan_write               |       0 |      0   |
:  pgsteal_dma                   |       0 |      0   |
:  pgsteal_dma32                 |  143667 | 144882   |
:  pgsteal_normal                |   31486 |  27001   |
:  pgsteal_movable               |       0 |      0   |
:  pgscan_kswapd_dma             |       0 |      0   |
:  pgscan_kswapd_dma32           |  138617 | 156351   |
:  pgscan_kswapd_normal          |   30593 |  27955   |
:  pgscan_kswapd_movable         |       0 |      0   |
:  pgscan_direct_dma             |       0 |      0   |
:  pgscan_direct_dma32           |    5050 |      0   |
:  pgscan_direct_normal          |     896 |      0   |
:  pgscan_direct_movable         |       0 |      0   |
:  kswapd_steal                  |  169207 | 171883   |
:  kswapd_inodesteal             |       0 |      0   |
:  kswapd_low_wmark_hit_quickly  |      43 |     45   |
:  kswapd_high_wmark_hit_quickly |       1 |      0   |
:  allocstall                    |      32 |      0   |
:
:
: As you can see, in the default case there were 32 direct reclaim
: (allocstal= l) and its worst latency was 517.762 usecs.  This value may be
: larger if a process would sleep or issue I/O in the direct reclaim path.
: OTOH, ii the other case where I add extra free bytes, there were no direct
: reclaim and its worst latency was 20.775 usecs.
:
: In this test case, we can avoid direct reclaim and keep a latency low.

Signed-off-by: Rik van Riel<riel@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Tested-by: Satoru Moriya <satoru.moriya@hds.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm: fix page-faults detection in swap-token logic
Konstantin Khlebnikov [Wed, 30 Nov 2011 04:11:03 +0000 (15:11 +1100)]
mm: fix page-faults detection in swap-token logic

After commit v2.6.36-5896-gd065bd8 "mm: retry page fault when blocking on
disk transfer" we usually wait in page-faults without mmap_sem held, so
all swap-token logic was broken, because it based on using
rwsem_is_locked(&mm->mmap_sem) as sign of in progress page-faults.

Add an atomic counter of in progress page-faults for mm to the mm_struct
with swap-token.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm-tracepoint: fix documentation and examples
Konstantin Khlebnikov [Wed, 30 Nov 2011 04:11:03 +0000 (15:11 +1100)]
mm-tracepoint: fix documentation and examples

We renamed the page-free mm tracepoints.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm-tracepoint: rename page-free events
Konstantin Khlebnikov [Wed, 30 Nov 2011 04:11:03 +0000 (15:11 +1100)]
mm-tracepoint: rename page-free events

Rename mm_page_free_direct into mm_page_free and mm_pagevec_free into
mm_page_free_batched

Since v2.6.33-5426-gc475dab the kernel triggers mm_page_free_direct for
all freed pages, not only for directly freed.  So, let's name it properly.
 For pages freed via page-list we also trigger mm_page_free_batched event.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm: remove unused pagevec_free
Konstantin Khlebnikov [Wed, 30 Nov 2011 04:11:02 +0000 (15:11 +1100)]
mm: remove unused pagevec_free

It not exported and now nobody uses it.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm-add-free_hot_cold_page_list-helper-v3
Konstantin Khlebnikov [Wed, 30 Nov 2011 04:11:02 +0000 (15:11 +1100)]
mm-add-free_hot_cold_page_list-helper-v3

v3: Always free pages in reverse order.
    The most recently added struct page, the most likely to be hot.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm-add-free_hot_cold_page_list-helper-v2
Konstantin Khlebnikov [Wed, 30 Nov 2011 04:11:02 +0000 (15:11 +1100)]
mm-add-free_hot_cold_page_list-helper-v2

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm: add free_hot_cold_page_list() helper
Konstantin Khlebnikov [Wed, 30 Nov 2011 04:11:01 +0000 (15:11 +1100)]
mm: add free_hot_cold_page_list() helper

This patch adds helper free_hot_cold_page_list() to free list of 0-order
pages.  It frees pages directly from list without temporary page-vector.
It also calls trace_mm_pagevec_free() to simulate pagevec_free()
behaviour.

bloat-o-meter:

add/remove: 1/1 grow/shrink: 1/3 up/down: 267/-295 (-28)
function                                     old     new   delta
free_hot_cold_page_list                        -     264    +264
get_page_from_freelist                      2129    2132      +3
__pagevec_free                               243     239      -4
split_free_page                              380     373      -7
release_pages                                606     510     -96
free_page_list                               188       -    -188

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agovmscan: activate executable pages after first usage
Konstantin Khlebnikov [Wed, 30 Nov 2011 04:11:01 +0000 (15:11 +1100)]
vmscan: activate executable pages after first usage

Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages
the first class citizen") was noticeably weakened in commit
645747462435d84 ("vmscan: detect mapped file pages used only once").

Currently these pages can become "first class citizens" only after second
usage.  After this patch page_check_references() will activate they after
first usage, and executable code gets yet better chance to stay in memory.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agovmscan: promote shared file mapped pages
Konstantin Khlebnikov [Wed, 30 Nov 2011 04:11:01 +0000 (15:11 +1100)]
vmscan: promote shared file mapped pages

Commit 645747462435 ("vmscan: detect mapped file pages used only once")
greatly decreases lifetime of single-used mapped file pages.
Unfortunately it also decreases life time of all shared mapped file pages.
Because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed
in fault path") page-fault handler does not mark page active or even
referenced.

Thus page_check_references() activates file page only if it was used twice
while it stays in inactive list, meanwhile it activates anon pages after
first access.  Inactive list can be small enough, this way reclaimer can
accidentally throw away any widely used page if it wasn't used twice in
short period.

After this patch page_check_references() also activate file mapped page at
first inactive list scan if this page is already used multiple times via
several ptes.

I found this while trying to fix degragation in rhel6 (~2.6.32) from rhel5
(~2.6.18).  There a complete mess with >100 web/mail/spam/ftp containers,
they share all their files but there a lot of anonymous pages: ~500mb
shared file mapped memory and 15-20Gb non-shared anonymous memory.  In
this situation major-pagefaults are very costly, because all containers
share the same page.  In my load kernel created a disproportionate
pressure on the file memory, compared with the anonymous, they equaled
only if I raise swappiness up to 150 =)

These patches actually wasn't helped a lot in my problem, but I saw
noticable (10-20 times) reduce in count and average time of
major-pagefault in file-mapped areas.

Actually both patches are fixes for commit v2.6.33-5448-g6457474, because
it was aimed at one scenario (singly used pages), but it breaks the logic
in other scenarios (shared and/or executable pages)

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm/page-writeback.c: make determine_dirtyable_memory static again
Johannes Weiner [Wed, 30 Nov 2011 04:10:54 +0000 (15:10 +1100)]
mm/page-writeback.c: make determine_dirtyable_memory static again

The tracing ring-buffer used this function briefly, but not anymore.
Make it local to the writeback code again.

Also, move the function so that no forward declaration needs to be
reintroduced.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoMAINTAINERS: Staging: cx25821: Add L: linux-media
Joe Perches [Wed, 30 Nov 2011 04:08:00 +0000 (15:08 +1100)]
MAINTAINERS: Staging: cx25821: Add L: linux-media

Send patches to a mailing list.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agofs: remove unneeded plug in mpage_readpages()
Namjae Jeon [Wed, 30 Nov 2011 04:08:00 +0000 (15:08 +1100)]
fs: remove unneeded plug in mpage_readpages()

The block plug in mpage_readpages() is duplicates the one in read_pages().

Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodrivers/message/fusion/mptbase.c: ensure NUL-termination of MptCallbacksName elements
Ferenc Wagner [Wed, 30 Nov 2011 04:07:59 +0000 (15:07 +1100)]
drivers/message/fusion/mptbase.c: ensure NUL-termination of MptCallbacksName elements

I just stumbled upon this while pondering over
https://bugzilla.kernel.org/show_bug.cgi?id=26692 and thought this could
be made better.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Ferenc Wagner <wferi@niif.hu>
Cc: Desai <kashyap.desai@lsi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodrivers/scsi/mpt2sas/mpt2sas_base.c: fix mismatch in mpt2sas_base_hard_reset_handler...
Alexey Khoroshilov [Wed, 30 Nov 2011 04:07:59 +0000 (15:07 +1100)]
drivers/scsi/mpt2sas/mpt2sas_base.c: fix mismatch in mpt2sas_base_hard_reset_handler() mutex lock-unlock

If ioc->pci_error_recovery is set, goto out in
mpt2sas_base_hard_reset_handler() leads to unlock unheld
ioc->reset_in_progress_mutex.

Fix the issue by jumping afer mutex_unlock() call.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Kashyap Desai <kashyap.desai@lsi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodrivers/scsi/sg.c: convert to kstrtoul_from_user()
Stephen Boyd [Wed, 30 Nov 2011 04:07:59 +0000 (15:07 +1100)]
drivers/scsi/sg.c: convert to kstrtoul_from_user()

Instead of open coding this function use kstrtoul_from_user() directly.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Douglas Gilbert <dougg@torque.net>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodrivers/scsi/aacraid/commctrl.c: fix mem leak in aac_send_raw_srb()
Jesper Juhl [Wed, 30 Nov 2011 04:07:58 +0000 (15:07 +1100)]
drivers/scsi/aacraid/commctrl.c: fix mem leak in aac_send_raw_srb()

We leak in drivers/scsi/aacraid/commctrl.c::aac_send_raw_srb() :

We allocate memory:
        ...
                        struct user_sgmap* usg;
                        usg = kmalloc(actual_fibsize - sizeof(struct aac_srb)
                          + sizeof(struct sgmap), GFP_KERNEL);
and then neglect to free it:
        ...
                        for (i = 0; i < usg->count; i++) {
                                u64 addr;
                                void* p;
                                if (usg->sg[i].count >
                                    ((dev->adapter_info.options &
                                     AAC_OPT_NEW_COMM) ?
                                      (dev->scsi_host_ptr->max_sectors << 9) :
                                      65536)) {
                                        rcode = -EINVAL;
                                        goto cleanup;
        ... this 'goto' makes 'usg' go out of scope and leak the memory we
            allocated.
            Other exits properly kfree(usg), it's just here it is neglected.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodrivers/scsi/megaraid.c: fix sparse warnings
Randy Dunlap [Wed, 30 Nov 2011 04:07:58 +0000 (15:07 +1100)]
drivers/scsi/megaraid.c: fix sparse warnings

Fix sparse warnings of right shift bigger than source value size:

drivers/scsi/megaraid.c:311:65: warning: right shift by bigger than source value
drivers/scsi/megaraid.c:313:65: warning: right shift by bigger than source value
drivers/scsi/megaraid.c:317:67: warning: right shift by bigger than source value
drivers/scsi/megaraid.c:319:67: warning: right shift by bigger than source value

Patch suggestion from email by Al Viro:

"Since both are claimed to be strings, I really suspect that this >> 8 is
misspelled >> 4 and they have a character followed by pair of two-digit
packed decimals in there..."

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Neela Syam Kolli <megaraidlinux@lsi.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoscsi: fix a header to include linux/types.h
Alexander Shishkin [Wed, 30 Nov 2011 04:07:58 +0000 (15:07 +1100)]
scsi: fix a header to include linux/types.h

For headers that get exported to userland and make use of u32 style
type names, it is advised to include linux/types.h.

This fixes a headers_check warning.

Signed-off-by: Alexander Shishkin <virtuoso@slind.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoparisc, exec: remove redundant set_fs(USER_DS)
Mathias Krause [Wed, 30 Nov 2011 04:07:57 +0000 (15:07 +1100)]
parisc, exec: remove redundant set_fs(USER_DS)

The address limit is already set in flush_old_exec() so those calls to
set_fs(USER_DS) are redundant.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoocfs2: avoid unaligned access to dqc_bitmap
Akinobu Mita [Wed, 30 Nov 2011 04:07:57 +0000 (15:07 +1100)]
ocfs2: avoid unaligned access to dqc_bitmap

The dqc_bitmap field of struct ocfs2_local_disk_chunk is 32-bit aligned,
but not 64-bit aligned.  The dqc_bitmap is accessed by ocfs2_set_bit(),
ocfs2_clear_bit(), ocfs2_test_bit(), or ocfs2_find_next_zero_bit().  These
are wrapper macros for ext2_*_bit() which need to take an unsigned long
aligned address (though some architectures are able to handle unaligned
address correctly)

So some 64bit architectures may not be able to access the dqc_bitmap
correctly.

This avoids such unaligned access by using another wrapper functions for
ext2_*_bit().  The code is taken from fs/ext4/mballoc.c which also need to
handle unaligned bitmap access.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoext4: use proper little-endian bitops
Akinobu Mita [Wed, 30 Nov 2011 04:07:57 +0000 (15:07 +1100)]
ext4: use proper little-endian bitops

ext4_{set,clear}_bit() is defined as __test_and_{set,clear}_bit_le() for
ext4.  Only two ext4_{set,clear}_bit() calls check the return value.  The
rest of calls ignore the return value and they can be replaced with
__{set,clear}_bit_le().

This changes ext4_{set,clear}_bit() from __test_and_{set,clear}_bit_le()
to __{set,clear}_bit_le() and introduces ext4_test_and_{set,clear}_bit()
for the two places where old bit needs to be returned.

This ext4_{set,clear}_bit() change is considered safe, because if someone
uses these macros without noticing the change, new ext4_{set,clear}_bit
don't have return value and causes compiler errors where the return value
is used.

This also removes unused ext4_find_first_zero_bit().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agokernel/timer.c: use debugobjects to catch deletion of uninitialized timers
Christine Chan [Wed, 30 Nov 2011 04:07:56 +0000 (15:07 +1100)]
kernel/timer.c: use debugobjects to catch deletion of uninitialized timers

del_timer_sync() calls debug_object_assert_init() to assert that a timer
has been initialized before calling lock_timer_base().  lock_timer_base()
would spin forever on a NULL(uninit-ed) base.  The check is added to
del_timer() to prevent silent failure, even though it would not get stuck
in an infinite loop.

[sboyd@codeaurora.org: remove WARN, intialize timer function]
Signed-off-by: Christine Chan <cschan@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodebugobjects: extend to assert that an object is initialized
Christine Chan [Wed, 30 Nov 2011 04:07:56 +0000 (15:07 +1100)]
debugobjects: extend to assert that an object is initialized

Calling del_timer_sync() on an uninitialized timer leads to a never ending
loop in lock_timer_base() that spins checking for a non-NULL timer base.
Add an assertion to debugobjects to catch usage of uninitialized objects
so that we can initialize timers in the del_timer_sync() path before it
calls lock_timer_base().

[sboyd@codeaurora.org: clarify commit message]
Signed-off-by: Christine Chan <cschan@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodebugobjects: be smarter about static objects
Stephen Boyd [Wed, 30 Nov 2011 04:07:56 +0000 (15:07 +1100)]
debugobjects: be smarter about static objects

Remove the WARN_ON() in timer_fixup_activate() and actually use the return
code from fixup to tell the debugobjects code to print a warning.  This
provides better diagnostic information via a nice debugobjects warning
instead of a simple WARN_ON(1) in the timer code with no information as to
what is wrong.  We also assign a dummy timer callback so that if the timer
is actually set to fire we don't oops.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Christine Chan <cschan@codeaurora.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoinclude/net/netprio_cgroup.h: various fixes
Andrew Morton [Wed, 30 Nov 2011 04:07:55 +0000 (15:07 +1100)]
include/net/netprio_cgroup.h: various fixes

- make task_netprio_state() stub an inline C function

- Untangle strange ifdef arrangement - remove mention of CONFIG_CGROUPS
  and make everything depend on CONFIG_NETPRIO_CGROUP.

- Don't build sock_update_netprioidx() when CONFIG_NETPRIO_CGROUP=n

- make the skb_update_prio() stub an inlined C function

Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Robert Love <robert.w.love@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoipc-mqueue-update-maximums-for-the-mqueue-subsystem-checkpatch-fixes
Andrew Morton [Wed, 30 Nov 2011 04:07:55 +0000 (15:07 +1100)]
ipc-mqueue-update-maximums-for-the-mqueue-subsystem-checkpatch-fixes

Cc: Amerigo Wang <amwang@redhat.com>
ERROR: Macros with complex values should be enclosed in parenthesis
#87: FILE: include/linux/ipc_namespace.h:126:
+#define DFLT_MSGSIZEMAX 1024*1024

ERROR: Macros with complex values should be enclosed in parenthesis
#88: FILE: include/linux/ipc_namespace.h:127:
+#define HARD_MSGSIZEMAX      16*1024*1024

total: 2 errors, 0 warnings, 75 lines checked

./patches/ipc-mqueue-update-maximums-for-the-mqueue-subsystem.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Doug Ledford <dledford@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoipc-mqueue-update-maximums-for-the-mqueue-subsystem-fix
Stephen Rothwell [Wed, 30 Nov 2011 04:07:55 +0000 (15:07 +1100)]
ipc-mqueue-update-maximums-for-the-mqueue-subsystem-fix

ipc/mqueue.c: In function 'mqueue_get_inode':
ipc/mqueue.c:154:4: error: implicit declaration of function 'vmalloc'
ipc/mqueue.c:154:19: warning: assignment makes pointer from integer without=
 a cast
ipc/mqueue.c: In function 'mqueue_evict_inode':
ipc/mqueue.c:278:3: error: implicit declaration of function 'vfree'

Caused by commit 8a53f9442429 ("ipc/mqueue: update maximums for the
mqueue subsystem").  See Rule 1 in Documentation/SubmitChecklist.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Doug Ledford <dledford@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoipc/mqueue: update maximums for the mqueue subsystem
Doug Ledford [Wed, 30 Nov 2011 04:07:54 +0000 (15:07 +1100)]
ipc/mqueue: update maximums for the mqueue subsystem

Commit b231cca4381ee ("message queues: increase range limits") changed the
maximum size of a message in a message queue from INT_MAX to 8192*128.
Unfortunately, we had customers that relied on a size much larger than
8192*128 on their production systems.  After reviewing POSIX, we found
that it is silent on the maximum message size.  We did find a couple other
areas in which it was not silent.  Fix up the mqueue maximums so that the
customer's system can continue to work, and document both the POSIX and
real world requirements in ipc_namespace.h so that we don't have this
issue crop back up.

Also, commit 9cf18e1dd74c ("ipc: HARD_MSGMAX should be higher not lower on
64bit") fiddled with HARD_MSGMAX without realizing that the number was
intentionally in place to limit the msg queue depth to one that was small
enough to kmalloc an array of pointers (hence why we divided 128k by
sizeof(long)).  If we wish to meet POSIX requirements, we have no choice
but to change our allocation to a vmalloc instead (at least for the large
queue size case).  With that, it's possible to increase our allowed
maximum to the POSIX requirements (or more if we choose).

Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Joe Korty <joe.korty@ccur.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoipc/mqueue: enforce hard limits
Doug Ledford [Wed, 30 Nov 2011 04:07:54 +0000 (15:07 +1100)]
ipc/mqueue: enforce hard limits

In two places we don't enforce the hard limits for CAP_SYS_RESOURCE apps.
In preparation for making more reasonable hard limits, start enforcing
them even on CAP_SYS_RESOURCE.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Joe Korty <joe.korty@ccur.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoipc/mqueue: switch back to using non-max values on create
Doug Ledford [Wed, 30 Nov 2011 04:07:54 +0000 (15:07 +1100)]
ipc/mqueue: switch back to using non-max values on create

Commit b231cca4381ee15e ("message queues: increase range limits") changed
how we create a queue that does not include an attr struct passed to open
so that it creates the queue with whatever the maximum values are.
However, if the admin has set the maximums to allow flexibility in
creating a queue (aka, both a large size and large queue are allowed, but
combined they create a queue too large for the RLIMIT_MSGQUEUE of the
user), then attempts to create a queue without an attr struct will fail.
Switch back to using acceptable defaults regardless of what the maximums
are.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Joe Korty <joe.korty@ccur.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoipc/mqueue: cleanup definition names and locations
Doug Ledford [Wed, 30 Nov 2011 04:07:53 +0000 (15:07 +1100)]
ipc/mqueue: cleanup definition names and locations

We had a customer come up with a problem while trying to upgrade from our
2.6.18 kernel to our 2.6.32 kernel.  In diagnosing their problem, it was
determined that when commit b231cca4 ("message queues: increase range
limits") changed the msg size max from INT_MAX to 8192*128, that's what
broke their setup.

While fixing this problem, testing showed that if you increase the max
values of a msg queue, then attempt to create one without an attr struct
passed in to the open call, it could fail because it sets the queue size
to the max of both the msg size and queue size.  If these are large
enough, they over run the default RLIMIT_MSGQUEUE.  This change was also
introduced in the b231cca4 ("message queues: increase range limits")
commit.

We then found that the msg queue limits were not all being enforced on
CAP_SYS_RESOURCE apps.

Finally, we found that commit 9cf18e1d ("ipc: HARD_MSGMAX should be higher
not lower on 64bit") fiddled with HARD_MSGMAX without realizing that the
reason it was set to what it was, was to avoid trying to kmalloc a chunk
larger than 128K.

So this series of patches cleans up the various defines, takes us back to
having a larger HARD_MSGSIZEMAX, goes back to using a separate define for
the case where a user doesn't pass in an attr struct in case the maxes
have been raised too large for RLIMIT_MSGQUEUE, enforces the maximums on
CAP_SYS_RESOURCE apps, uses vmalloc instead of kmalloc when the msg
pointer array is too large, and documents all of this so it shouldn't
happen again.

This patch:

The various defines for minimums and maximums of the sysctl controllable
mqueue values are scattered amongst different files and named
inconsistently.  Move them all into ipc_namespace.h and make them have
consistent names.  Additionally, make the number of queues per namespace
also have a minimum and maximum and use the same sysctl function as the
other two settable variables.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Joe Korty <joe.korty@ccur.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomerge_config.sh: whitespace cleanup
Darren Hart [Wed, 30 Nov 2011 04:07:53 +0000 (15:07 +1100)]
merge_config.sh: whitespace cleanup

Fix whitespace usage in the clean_up routine.

Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Michal Marek <mmarek@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomerge_config.sh: use signal names compatible with dash and bash
Darren Hart [Wed, 30 Nov 2011 04:07:53 +0000 (15:07 +1100)]
merge_config.sh: use signal names compatible with dash and bash

The SIGHUP SIGINT and SIGTERM names caused failures when running
merge_config.sh with the dash shell.  Dropping the "SIG" component makes
the script work in both bash and dash.

Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Michal Marek <mmarek@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agokconfig: add merge_config.sh script
john stultz [Wed, 30 Nov 2011 04:07:50 +0000 (15:07 +1100)]
kconfig: add merge_config.sh script

After noticing almost every distro has their own method of managing config
fragments, I went looking at some best practices, and wanted to try to
consolidate some of the different approaches so this fairly simple
infrastructure can be shared (and new distros/build systems don't have to
implement yet another config fragment merge script).

This script is most influenced by the Windriver tools used in the Yocto
Project, reusing some portions found there.

This script merges multiple config fragments, warning on any overridden
values.  It then sets any unspecified values to their default, then
finally checks to make sure no specified value was dropped due to
unsatisfied dependencies.

I'm sure this implementation won't work for everyone, and I expect it will
need to evolve to adapt for various use cases.  But I think its a
reasonable starting point.

Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: <tartler@cs.fau.de>
Cc: Dmitry Fink <Dmitry.Fink@palm.com>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Bruce Ashfield <Bruce.Ashfield@windriver.com>
Cc: Michal Marek <mmarek@suse.cz>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoia64, exec: remove redundant set_fs(USER_DS)
Mathias Krause [Wed, 30 Nov 2011 04:07:21 +0000 (15:07 +1100)]
ia64, exec: remove redundant set_fs(USER_DS)

The address limit is already set in flush_old_exec() so this
set_fs(USER_DS) is redundant.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agohrtimers: Special-case zero length sleeps
Matthew Garrett [Wed, 30 Nov 2011 04:07:20 +0000 (15:07 +1100)]
hrtimers: Special-case zero length sleeps

sleep(0) is a common construct used by applications that want to trigger
the scheduler.  sched_yield() might make more sense, but only appeared in
POSIX.1-2001 and so plenty of example code still uses the sleep(0) form.

This wouldn't normally be a problem, but it means that event-driven
applications that are merely trying to avoid starving other processes may
actually end up sleeping due to having large timer_slack values.  Special-
casing this seems reasonable.

Signed-off-by: Matthew Garrett <mjg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agointel-iommu: Fix __init section missmatch of dmar_parse_rmrr_atsr_dev
Witold Baryluk [Wed, 30 Nov 2011 04:07:20 +0000 (15:07 +1100)]
intel-iommu: Fix __init section missmatch of dmar_parse_rmrr_atsr_dev

dmar_parse_rmrr_atsr_dev() (drivers/iommu/dmar.c) is called from
dmar_dev_scope_init() (drivers/iommu/intel-iommu.c), but
dmar_dev_scope_init() is annotated with __init, when
dmar_parse_rmrr_atsr_dev() is not, causing full section missmatch
analsysis to abort compilation.

Fix problem by adding __init annotation to dmar_parse_rmrr_atsr_dev.

Signed-off-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Allen Kay <allen.m.kay@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoarm, exec: remove redundant set_fs(USER_DS)
Mathias Krause [Wed, 30 Nov 2011 04:07:20 +0000 (15:07 +1100)]
arm, exec: remove redundant set_fs(USER_DS)

The address limit is already set in flush_old_exec() so this
set_fs(USER_DS) is redundant.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoarch/arm/mach-ux500/mbox-db5500.c: world-writable sysfs fifo file
Vasiliy Kulikov [Wed, 30 Nov 2011 04:07:19 +0000 (15:07 +1100)]
arch/arm/mach-ux500/mbox-db5500.c: world-writable sysfs fifo file

Don't allow everybody to use a modem.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Srinidhi Kasagar <srinidhi.kasagar@stericsson.com>
Cc: Linus Walleij <linus.walleij@stericsson.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodrivers-platform-x86-sony-laptopc-fix-scancodes-checkpatch-fixes
Andrew Morton [Wed, 30 Nov 2011 04:07:16 +0000 (15:07 +1100)]
drivers-platform-x86-sony-laptopc-fix-scancodes-checkpatch-fixes

ERROR: do not use assignment in if condition
#59: FILE: drivers/platform/x86/sony-laptop.c:384:
+ if ((scancode = sony_laptop_input_index[event]) != -1) {

total: 1 errors, 0 warnings, 39 lines checked

./patches/drivers-platform-x86-sony-laptopc-fix-scancodes.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodrivers/platform/x86/sony-laptop.c: fix scancodes
John Hughes [Wed, 30 Nov 2011 04:03:09 +0000 (15:03 +1100)]
drivers/platform/x86/sony-laptop.c: fix scancodes

The scancodes returned by the sony-laptop driver for function keys did not
match the scancodes used to remap keys.  Also, since the scancode was sent
to the input subsystem after the mapped keysym the /lib/udev/keymap
utility was confused about which scancode to report for which keysym.

This patch fixes the driver so the correct scancode is shown for each key.
 It also adds to the documentation a description of where to find the
scancodes.

Before the patch FN/E returned scancode 0x1B, but to remap scancode 0x14
had to be used.

Signed-off-by: John Hughes <john@calva.com>
Cc: Mattia Dongili <malattia@linux.it>
Cc: Matthew Garrett <mjg@redhat.com>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agox86: mpparse: account for bus types other than ISA and PCI
Bjorn Helgaas [Wed, 30 Nov 2011 04:03:09 +0000 (15:03 +1100)]
x86: mpparse: account for bus types other than ISA and PCI

In commit f8924e770e04 ("x86: unify mp_bus_info"), the 32-bit and 64-bit
versions of MP_bus_info were rearranged to match each other better.
Unfortunately it introduced a regression: prior to that change we used to
always set the mp_bus_not_pci bit, then clear it if we found a PCI bus.
After it, we set mp_bus_not_pci for ISA buses, clear it for PCI buses, and
leave it alone otherwise.

In the cases of ISA and PCI, there's not much difference.  But ISA is not
the only non-PCI bus, so it's better to always set mp_bus_not_pci and
clear it only for PCI.

Without this change, Dan's Dell PowerEdge 4200 panics on boot with a log
indicating interrupt routing trouble unless the "noapic" option is
supplied.  With this change, the machine boots reliably without "noapic".

Fixes http://bugs.debian.org/586494

[jrnieder@gmail.com: clarified commit message]
Reported-bisected-and-tested-by: Dan McGrath <troubledaemon@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org> # 2.6.26+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm-vmallocc-eliminate-extra-loop-in-pcpu_get_vm_areas-error-path-fix
Andrew Morton [Wed, 30 Nov 2011 04:03:08 +0000 (15:03 +1100)]
mm-vmallocc-eliminate-extra-loop-in-pcpu_get_vm_areas-error-path-fix

remove now-unneeded tests

Cc: Kautuk Consul <consul.kautuk@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm/vmalloc.c: eliminate extra loop in pcpu_get_vm_areas error path
Kautuk Consul [Wed, 30 Nov 2011 04:03:08 +0000 (15:03 +1100)]
mm/vmalloc.c: eliminate extra loop in pcpu_get_vm_areas error path

If either of the vas or vms arrays are not properly kzalloced, then the
code jumps to the err_free label.

The err_free label runs a loop to check and free each of the array members
of the vas and vms arrays which is not required for this situation as none
of the array members have been allocated till this point.

Eliminate the extra loop we have to go through by introducing a new label
err_free2 and then jumping to it.

Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agox86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of...
Konrad Rzeszutek Wilk [Wed, 30 Nov 2011 04:03:08 +0000 (15:03 +1100)]
x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode

Fix an outstanding issue that has been reported since 2.6.37.  Under a
heavy loaded machine processing "fork()" calls could crash with:

BUG: unable to handle kernel paging request at f573fc8c
IP: [<c01abc54>] swap_count_continued+0x104/0x180
*pdpt = 000000002a3b9027 *pde = 0000000001bed067 *pte = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in:
Pid: 1638, comm: apache2 Not tainted 3.0.4-linode37 #1
EIP: 0061:[<c01abc54>] EFLAGS: 00210246 CPU: 3
EIP is at swap_count_continued+0x104/0x180
.. snip..
Call Trace:
 [<c01ac222>] ? __swap_duplicate+0xc2/0x160
 [<c01040f7>] ? pte_mfn_to_pfn+0x87/0xe0
 [<c01ac2e4>] ? swap_duplicate+0x14/0x40
 [<c01a0a6b>] ? copy_pte_range+0x45b/0x500
 [<c01a0ca5>] ? copy_page_range+0x195/0x200
 [<c01328c6>] ? dup_mmap+0x1c6/0x2c0
 [<c0132cf8>] ? dup_mm+0xa8/0x130
 [<c013376a>] ? copy_process+0x98a/0xb30
 [<c013395f>] ? do_fork+0x4f/0x280
 [<c01573b3>] ? getnstimeofday+0x43/0x100
 [<c010f770>] ? sys_clone+0x30/0x40
 [<c06c048d>] ? ptregs_clone+0x15/0x48
 [<c06bfb71>] ? syscall_call+0x7/0xb

The problem is that in copy_page_range() we turn lazy mode on, and then in
swap_entry_free() we call swap_count_continued() which ends up in:

         map = kmap_atomic(page, KM_USER0) + offset;

and then later we touch *map.

Since we are running in batched mode (lazy) we don't actually set up the
PTE mappings and the kmap_atomic is not done synchronously and ends up
trying to dereference a page that has not been set.

Looking at kmap_atomic_prot_pfn(), it uses 'arch_flush_lazy_mmu_mode' and
doing the same in kmap_atomic_prot() and __kunmap_atomic() makes the problem
go away.

Interestingly, commit b8bcfe997e4615 ("x86/paravirt: remove lazy mode in
interrupts") removed part of this to fix an interrupt issue - but it went
to far and did not consider this scenario.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agox86: reduce clock calibration time during slave cpu startup
Jack Steiner [Wed, 30 Nov 2011 04:03:07 +0000 (15:03 +1100)]
x86: reduce clock calibration time during slave cpu startup

Reduce the startup time for slave cpus.

Adds hooks for an arch-specific function for clock calibration.  These
hooks are used on x86.  If a newly started cpu has the same phys_proc_id
as a core already active, uses the TSC for the delay loop and has a
CONSTANT_TSC, use the already-calculated value of loops_per_jiffy.

This patch reduces the time required to start slave cpus on a 4096 cpu
system from: 465 sec OLD 62 sec NEW

This reduces boot time on a 4096p system by almost 7 minutes.  Nice...

[akpm@linux-foundation.org: fix CONFIG_SMP=n build]
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agox86: tlb flush avoid superflous leave_mm()
Shaohua Li [Wed, 30 Nov 2011 04:03:07 +0000 (15:03 +1100)]
x86: tlb flush avoid superflous leave_mm()

If just one page VA tlb is required to be flushed and current task is in
lazy TLB state, doing leave_mm() is superfluous because it flushes the
whole TLB.  This can reduce some TLB miss.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoarch/x86/mm/pageattr.c: quiet sparse noise; local functions should be static
H Hartley Sweeten [Wed, 30 Nov 2011 04:03:07 +0000 (15:03 +1100)]
arch/x86/mm/pageattr.c: quiet sparse noise; local functions should be static

Local functions should be marked static. Â This quiets the following
sparse noise:

warning: symbol '_set_memory_array' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoarch/x86/kernel/ptrace.c: quiet sparse noise
H Hartley Sweeten [Wed, 30 Nov 2011 04:03:06 +0000 (15:03 +1100)]
arch/x86/kernel/ptrace.c: quiet sparse noise

ptrace_set_debugreg() is only used in this file and should be static.
This quiets the following sparse warning:

warning: symbol 'ptrace_set_debugreg' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoarch/x86/kernel/e820.c: quiet sparse noise about plain integer as NULL pointer
H Hartley Sweeten [Wed, 30 Nov 2011 04:03:06 +0000 (15:03 +1100)]
arch/x86/kernel/e820.c: quiet sparse noise about plain integer as NULL pointer

The last parameter to sort() is a pointer to the function used to swap
items.  This parameter should be NULL, not 0, when not used.  This quiets
the following sparse warning:

warning: Using plain integer as NULL pointer

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agox86: rtc: don't register a platform RTC device for Intel MID platforms
Mathias Nyman [Wed, 30 Nov 2011 04:03:05 +0000 (15:03 +1100)]
x86: rtc: don't register a platform RTC device for Intel MID platforms

Intel MID x86 platforms have a memory mapped virtual RTC instead.  No MID
platform have the default ports (and accessing them may do weird stuff)

Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoarch/x86/kernel/e820.c: eliminate bubble sort from sanitize_e820_map
Mike Ditto [Wed, 30 Nov 2011 04:03:05 +0000 (15:03 +1100)]
arch/x86/kernel/e820.c: eliminate bubble sort from sanitize_e820_map

Replace the bubble sort in sanitize_e820_map() with a call to the generic
kernel sort function to avoid pathological performance with large maps.

On large (thousands of entries) E820 maps, the previous code took minutes
to run; with this change it's now milliseconds.

Signed-off-by: Mike Ditto <mditto@google.com>
Cc: Stefan Assmann <sassmann@kpanic.de>
Cc: Nancy Yuen <yuenn@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agox86: fix mmap random address range
Ludwig Nussel [Wed, 30 Nov 2011 04:03:05 +0000 (15:03 +1100)]
x86: fix mmap random address range

On x86_32 casting the unsigned int result of get_random_int() to long may
result in a negative value.  On x86_32 the range of mmap_rnd() therefore
was -255 to 255.  The 32bit mode on x86_64 used 0 to 255 as intended.

The bug was introduced by 675a081 ("x86: unify mmap_{32|64}.c") in January
2008.

Signed-off-by: Ludwig Nussel <ludwig.nussel@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoarch/x86/platform/iris/iris.c: register a platform device and a platform driver
Shérab [Wed, 30 Nov 2011 04:03:05 +0000 (15:03 +1100)]
arch/x86/platform/iris/iris.c: register a platform device and a platform driver

This makes the iris driver use the platform API, so it is properly exposed
in /sys.

[akpm@linux-foundation.org: remove commented-out code, add missing space to printk, clean up code layout]
Signed-off-by: Shérab <Sebastien.Hinderer@ens-lyon.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoacerhdf: lowered default temp fanon/fanoff values
Peter Feuerer [Wed, 30 Nov 2011 04:03:04 +0000 (15:03 +1100)]
acerhdf: lowered default temp fanon/fanoff values

Due to new supported hardware, of which the actual temperature limits of
processor, harddisk and other components are unknown, it feels safer with
lower fanon / fanoff settings.

It won't change much for most people, already using acerhdf, as they use
their own fanon/fanoff variable settings when loading the module.

Furthermore seems like kernel and userspace tools have been improved to
work more efficient and netbooks don't get so hot anymore.

Signed-off-by: Peter Feuerer <peter@piie.net>
Acked-by: Borislav Petkov <petkovbb@gmail.com>
Cc: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoacerhdf: add support for new hardware
Peter Feuerer [Wed, 30 Nov 2011 04:03:04 +0000 (15:03 +1100)]
acerhdf: add support for new hardware

Add support for new hardware:
Acer Aspire LT-10Q/531/751/1810/1825,
Acer Travelmate 7730,
Packard Bell ENBFT/DOTVR46

Signed-off-by: Peter Feuerer <peter@piie.net>
Acked-by: Borislav Petkov <petkovbb@gmail.com>
Cc: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoacerhdf: add support for Aspire 1410 BIOS v1.3314
Clay Carpenter [Wed, 30 Nov 2011 04:03:03 +0000 (15:03 +1100)]
acerhdf: add support for Aspire 1410 BIOS v1.3314

Add support for Aspire 1410 BIOS v1.3314.  Fixes the following error:

acerhdf: unknown (unsupported) BIOS version Acer/Aspire 1410/v1.3314,
please report, aborting!

Signed-off-by: Clay Carpenter <claycarpenter@gmail.com>
Signed-off-by: Peter Feuerer <peter@piie.net>
Cc: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agonet/netfilter/nf_conntrack_netlink.c: fix Oops on container destroy
Alex Bligh [Wed, 30 Nov 2011 04:03:03 +0000 (15:03 +1100)]
net/netfilter/nf_conntrack_netlink.c: fix Oops on container destroy

Problem:

A repeatable Oops can be caused if a container with networking
unshared is destroyed when it has nf_conntrack entries yet to expire.

A copy of the oops follows below. A perl program generating the oops
repeatably is attached inline below.

Analysis:

The oops is called from cleanup_net when the namespace is
destroyed. conntrack iterates through outstanding events and calls
death_by_timeout on each of them, which in turn produces a call to
ctnetlink_conntrack_event. This calls nf_netlink_has_listeners, which
oopses because net->nfnl is NULL.

The perl program generates the container through fork() then
clone(NS_NEWNET). I does not explicitly set up netlink
explicitly set up netlink, but I presume it was set up else net->nfnl
would have been NULL earlier (i.e. when an earlier connection
timed out). This would thus suggest that net->nfnl is made NULL
during the destruction of the container, which I think is done by
nfnetlink_net_exit_batch.

I can see that the various subsystems are deinitialised in the opposite
order to which the relevant register_pernet_subsys calls are called,
and both nf_conntrack and nfnetlink_net_ops register their relevant
subsystems. If nfnetlink_net_ops registered later than nfconntrack,
then its exit routine would have been called first, which would cause
the oops described. I am not sure there is anything to prevent this
happening in a container environment.

Whilst there's perhaps a more complex problem revolving around ordering
of subsystem deinit, it seems to me that missing a netlink event on a
container that is dying is not a disaster. An early check for net->nfnl
being non-NULL in ctnetlink_conntrack_event appears to fix this. There
may remain a potential race condition if it becomes NULL immediately
after being checked (I am not sure any lock is held at this point or
how synchronisation for subsystem deinitialization works).

Patch:

The patch attached should apply on everything from 2.6.26 (if not before)
onwards; it appears to be a problem on all kernels. This was taken against
Ubuntu-3.0.0-11.17 which is very close to 3.0.4. I have torture-tested it
with the above perl script for 15 minutes or so; the perl script hung the
machine within 20 seconds without this patch.

Applicability:

If this is the right solution, it should be applied to all stable kernels
as well as head. Apart from the minor overhead of checking one variable
against NULL, it can never 'do the wrong thing', because if net->nfnl
is NULL, an oops will inevitably result. Therefore, checking is a reasonable
thing to do unless it can be proven than net->nfnl will never be NULL.

Check net->nfnl for NULL in ctnetlink_conntrack_event to avoid Oops on
container destroy

Signed-off-by: Alex Bligh <alex@alex.org.uk>
Cc: Patrick McHardy <kaber@trash.net>
Cc: David Miller <davem@davemloft.net>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm/migrate.c: pair unlock_page() and lock_page() when migrating huge pages
Hillf Danton [Wed, 30 Nov 2011 04:03:03 +0000 (15:03 +1100)]
mm/migrate.c: pair unlock_page() and lock_page() when migrating huge pages

Avoid unlocking and unlocked page if we failed to lock it.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agothp: set compound tail page _count to zero
Youquan Song [Wed, 30 Nov 2011 04:03:02 +0000 (15:03 +1100)]
thp: set compound tail page _count to zero

70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
page_tail->_count zero at all times.  But the current kernel does not set
page_tail->_count to zero if a 1GB page is utilized.  So when an IOMMU 1GB
page is used at KVM, it wil result in a kernel oops because a tail page's
_count does not equal zero.

kernel BUG at include/linux/mm.h:386!
invalid opcode: 0000 [#1] SMP
Call Trace:
 [<ffffffff81072f7f>] gup_pud_range+0xb8/0x19d
 [<ffffffff8107312f>] get_user_pages_fast+0xcb/0x192
 [<ffffffff810bc450>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff81006a24>] hva_to_pfn+0x119/0x2f2
 [<ffffffff81006c29>] gfn_to_pfn_memslot+0x2c/0x2e
 [<ffffffff8100b909>] kvm_iommu_map_pages+0xfd/0x1c1
 [<ffffffff8100ba49>] kvm_iommu_map_memslots+0x7c/0xbd
 [<ffffffff8100b9cd>] ? kvm_iommu_map_pages+0x1c1/0x1c1
 [<ffffffff8100bb34>] kvm_iommu_map_guest+0xaa/0xbf
 [<ffffffff8100aeb0>] kvm_vm_ioctl_assigned_device+0x2ef/0xa47
 [<ffffffff8100ac6d>] ? kvm_vm_ioctl_assigned_device+0xac/0xa47
 [<ffffffff8104f2a6>] ? native_sched_clock+0x32/0x6b
 [<ffffffff810b0c02>] ? sched_clock_cpu+0x45/0xd4
 [<ffffffff810bc450>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff810b0cd2>] ? local_clock+0x41/0x5a
 [<ffffffff810bc8a1>] ? lock_release_holdtime+0x2c/0x129
 [<ffffffff8115762d>] ? cmpxchg_double_slab+0xd0/0x12b
 [<ffffffff81248f47>] ? avc_has_perm_noaudit+0x388/0x399
 [<ffffffff8104f2a6>] ? native_sched_clock+0x32/0x6b
 [<ffffffff8104f2e8>] ? sched_clock+0x9/0xd
 [<ffffffff81007dcb>] kvm_vm_ioctl+0x36c/0x3a2
 [<ffffffff8104f2a6>] ? native_sched_clock+0x32/0x6b
 [<ffffffff8104f2e8>] ? sched_clock+0x9/0xd
 [<ffffffff81174b10>] do_vfs_ioctl+0x49e/0x4e4
 [<ffffffff81174bb0>] sys_ioctl+0x5a/0x7c
 [<ffffffff81500e02>] system_call_fastpath+0x16/0x1b
RIP  [<ffffffff81072d13>] gup_huge_pud+0xf2/0x159

Signed-off-by: Youquan Song <youquan.song@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agothp: add compound tail page _mapcount when mapped
Youquan Song [Wed, 30 Nov 2011 04:03:02 +0000 (15:03 +1100)]
thp: add compound tail page _mapcount when mapped

With the 3.2-rc kernel, the IOMMU 2M page in KVM works.  While I try to us
IOMMU 1GB page in KVM, I encounter a oops and 1GB page total fail to be
used.  The root cause is that 1GB page allocation calls gup_huge_pud()
while 2M page calls gup_huge_pmd.  If compound pages are used and the page
is tail page, gup_huge_pmd increase _mapcount to record tail page are
mapped while gup_huge_pud does not include this process.  So when the
mapped page is relesed, it will result in kernel oops because the page
does not mark mapped.

This patch add tail process for compound page in 1GB huge page which keeps
the same process as 2M page.

Reproduce like:
1. Add grub boot option: hugepagesz=1G hugepages=8
2. mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages
3.qemu-kvm -m 2048 -hda os-kvm.img -cpu kvm64 -smp 4 -mem-path /dev/hugepages
 -net none -device pci-assign,host=07:00.1

kernel BUG at mm/swap.c:114!
invalid opcode: 0000 [#1] SMP
Call Trace:
 [<ffffffff81127482>] put_page+0x15/0x37
 [<ffffffff810067c4>] kvm_release_pfn_clean+0x31/0x36
 [<ffffffff8100b69c>] kvm_iommu_put_pages+0x94/0xb1
 [<ffffffff8100b739>] kvm_iommu_unmap_memslots+0x80/0xb6
 [<ffffffff8100b6b9>] ? kvm_iommu_put_pages+0xb1/0xb1
 [<ffffffff81425cf3>] ? intel_iommu_attach_device+0x13b/0x144
 [<ffffffff8100bc03>] kvm_assign_device+0xba/0x117
 [<ffffffff8100aec2>] kvm_vm_ioctl_assigned_device+0x301/0xa47
 [<ffffffff8100ac6d>] ? kvm_vm_ioctl_assigned_device+0xac/0xa47
 [<ffffffff8104f2a6>] ? native_sched_clock+0x32/0x6b
 [<ffffffff810b0be2>] ? sched_clock_cpu+0x45/0xd4
 [<ffffffff810bc430>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff810b0cb2>] ? local_clock+0x41/0x5a
 [<ffffffff810bc881>] ? lock_release_holdtime+0x2c/0x129
 [<ffffffff8115760d>] ? cmpxchg_double_slab+0xd0/0x12b
 [<ffffffff81248f27>] ? avc_has_perm_noaudit+0x388/0x399
 [<ffffffff8104f2a6>] ? native_sched_clock+0x32/0x6b
 [<ffffffff8104f2e8>] ? sched_clock+0x9/0xd
 [<ffffffff81007dcb>] kvm_vm_ioctl+0x36c/0x3a2
 [<ffffffff8104f2a6>] ? native_sched_clock+0x32/0x6b
 [<ffffffff8104f2e8>] ? sched_clock+0x9/0xd
 [<ffffffff81174af0>] do_vfs_ioctl+0x49e/0x4e4
 [<ffffffff81174b90>] sys_ioctl+0x5a/0x7c
 [<ffffffff81500dc2>] system_call_fastpath+0x16/0x1b
RIP  [<ffffffff811273d9>] put_compound_page+0xd4/0x168

Signed-off-by: Youquan Song <youquan.song@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoprintk: avoid double lock acquire
Peter Zijlstra [Wed, 30 Nov 2011 04:03:02 +0000 (15:03 +1100)]
printk: avoid double lock acquire

Commit 4f2a8d3cf5e ("printk: Fix console_sem vs logbuf_lock unlock race")
introduced another silly bug where we would want to acquire an already
held lock.  Avoid this.

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomemcg: update maintainers
KAMEZAWA Hiroyuki [Wed, 30 Nov 2011 04:03:01 +0000 (15:03 +1100)]
memcg: update maintainers

More players joined to memory cgroup developments and Johannes' great work
changed internal design of memory cgroup dramatically.  And he will do
more works.  Michal Hokko did many bug fixes and know memory cgroup very
well.  Daisuke Nishimura helped us very much but he seems busy now.
Thanks to his works.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodrivers/rtc/rtc-s3c.c: fix driver clock enable/disable balance issues
Jonghwan Choi [Wed, 30 Nov 2011 04:03:01 +0000 (15:03 +1100)]
drivers/rtc/rtc-s3c.c: fix driver clock enable/disable balance issues

If an error occurs after the clock is enabled, the enable/disable state
can become unbalanced.

Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Kukjin Kim <kgene.kim@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
David Rientjes [Wed, 30 Nov 2011 04:03:01 +0000 (15:03 +1100)]
cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask

c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
nodes from changing for a thread.  This causes any update to a set of
allowed nodes to stall until put_mems_allowed() is called.

This stall is unncessary, however, if at least one node remains unchanged
in the update to the set of allowed nodes.  This was addressed by
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
node remains set"), but it's still possible that an empty nodemask may be
read from a mempolicy because the old nodemask may be remapped to the new
nodemask during rebind.  To prevent this, only avoid the stall if there is
no mempolicy for the thread being changed.

This is a temporary solution until all reads from mempolicy nodemasks can
be guaranteed to not be empty without the get_mems_allowed()
synchronization.

Also moves the check for nodemask intersection inside task_lock() so that
tsk->mems_allowed cannot change.  This ensures that nothing can set this
tsk's mems_allowed out from under us and also protects tsk->mempolicy.

Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoCREDITS: update Kees's expired fingerprint and fix details
Kees Cook [Wed, 30 Nov 2011 04:03:00 +0000 (15:03 +1100)]
CREDITS: update Kees's expired fingerprint and fix details

Small clean-up for my CREDITS entry; the GPG fingerprint was not up to
date, so I fixed other details at the same time too.

Signed-off-by: Kees Cook <kees@outflux.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agothp: reduce khugepaged freezing latency
Andrea Arcangeli [Wed, 30 Nov 2011 04:03:00 +0000 (15:03 +1100)]
thp: reduce khugepaged freezing latency

Use wait_event_freezable_timeout() instead of
schedule_timeout_interruptible() to avoid missing freezer wakeups.  A
try_to_freeze() would have been needed in the khugepaged_alloc_hugepage
tight loop too in case of the allocation failing repeatedly, and
wait_event_freezable_timeout will provide it too.

khugepaged would still freeze just fine by trying again the next minute
but it's better if it freezes immediately.

Reported-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Jiri Slaby <jslaby@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@suse.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agofs/proc/meminfo.c: fix compilation error
Claudio Scordino [Wed, 30 Nov 2011 04:03:00 +0000 (15:03 +1100)]
fs/proc/meminfo.c: fix compilation error

Fix the error message "directives may not be used inside a macro argument"
which appears when the kernel is compiled for the cris architecture.

Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agovmscan: use atomic-long for shrinker batching
Konstantin Khlebnikov [Wed, 30 Nov 2011 04:02:59 +0000 (15:02 +1100)]
vmscan: use atomic-long for shrinker batching

Use atomic-long operations instead of looping around cmpxchg().

[akpm@linux-foundation.org: massage atomic.h inclusions]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agovmscan: fix initial shrinker size handling
Konstantin Khlebnikov [Wed, 30 Nov 2011 04:02:59 +0000 (15:02 +1100)]
vmscan: fix initial shrinker size handling

A shrinker function can return -1, means that it cannot do anything
without a risk of deadlock.  For example prune_super() does this if it
cannot grab a superblock refrence, even if nr_to_scan=0.  Currently we
interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan'
according to this.  So the next time around this shrinker can cause really
big pressure.  Let's skip such shrinkers instead.

Also make total_scan signed, otherwise the check (total_scan < 0) below
never works.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoMerge remote-tracking branch 'kvmtool/master'
Stephen Rothwell [Wed, 30 Nov 2011 03:58:35 +0000 (14:58 +1100)]
Merge remote-tracking branch 'kvmtool/master'

Conflicts:
include/net/9p/9p.h
scripts/kconfig/Makefile

12 years agoMerge remote-tracking branch 'vhost/linux-next'
Stephen Rothwell [Wed, 30 Nov 2011 03:51:44 +0000 (14:51 +1100)]
Merge remote-tracking branch 'vhost/linux-next'

Conflicts:
arch/hexagon/Kconfig
arch/m68k/Kconfig

12 years agoMerge remote-tracking branch 'pinctrl/for-next'
Stephen Rothwell [Wed, 30 Nov 2011 03:50:18 +0000 (14:50 +1100)]
Merge remote-tracking branch 'pinctrl/for-next'

12 years agoMerge remote-tracking branch 'writeback/writeback-for-next'
Stephen Rothwell [Wed, 30 Nov 2011 03:48:53 +0000 (14:48 +1100)]
Merge remote-tracking branch 'writeback/writeback-for-next'

12 years agoMerge remote-tracking branch 'tmem/tmem'
Stephen Rothwell [Wed, 30 Nov 2011 03:43:32 +0000 (14:43 +1100)]
Merge remote-tracking branch 'tmem/tmem'

Conflicts:
mm/swapfile.c

12 years agoMerge remote-tracking branch 'char-misc/char-misc-next'
Stephen Rothwell [Wed, 30 Nov 2011 03:41:55 +0000 (14:41 +1100)]
Merge remote-tracking branch 'char-misc/char-misc-next'

12 years agoMerge remote-tracking branch 'staging/staging-next'
Stephen Rothwell [Wed, 30 Nov 2011 03:40:19 +0000 (14:40 +1100)]
Merge remote-tracking branch 'staging/staging-next'

Conflicts:
drivers/staging/iio/adc/ad799x_core.c
drivers/staging/iio/industrialio-core.c

12 years agoMerge remote-tracking branch 'usb/usb-next'
Stephen Rothwell [Wed, 30 Nov 2011 03:38:15 +0000 (14:38 +1100)]
Merge remote-tracking branch 'usb/usb-next'

12 years agoMerge remote-tracking branch 'tty/tty-next'
Stephen Rothwell [Wed, 30 Nov 2011 03:36:24 +0000 (14:36 +1100)]
Merge remote-tracking branch 'tty/tty-next'

Conflicts:
drivers/tty/serial/Kconfig
drivers/tty/serial/Makefile

12 years agoMerge remote-tracking branch 'driver-core/driver-core-next'
Stephen Rothwell [Wed, 30 Nov 2011 03:30:13 +0000 (14:30 +1100)]
Merge remote-tracking branch 'driver-core/driver-core-next'

12 years agoMerge remote-tracking branch 'hsi/for-next'
Stephen Rothwell [Wed, 30 Nov 2011 03:28:47 +0000 (14:28 +1100)]
Merge remote-tracking branch 'hsi/for-next'

12 years agoMerge remote-tracking branch 'regmap/for-next'
Stephen Rothwell [Wed, 30 Nov 2011 03:27:15 +0000 (14:27 +1100)]
Merge remote-tracking branch 'regmap/for-next'

12 years agoMerge remote-tracking branch 'namespace/master'
Stephen Rothwell [Wed, 30 Nov 2011 03:25:43 +0000 (14:25 +1100)]
Merge remote-tracking branch 'namespace/master'

12 years agoMerge remote-tracking branch 'sysctl/master'
Stephen Rothwell [Wed, 30 Nov 2011 03:24:12 +0000 (14:24 +1100)]
Merge remote-tracking branch 'sysctl/master'

12 years agoMerge remote-tracking branch 'xen-two/linux-next'
Stephen Rothwell [Wed, 30 Nov 2011 03:18:12 +0000 (14:18 +1100)]
Merge remote-tracking branch 'xen-two/linux-next'

Conflicts:
arch/x86/xen/Kconfig

12 years agoMerge remote-tracking branch 'xen/upstream/xen'
Stephen Rothwell [Wed, 30 Nov 2011 03:16:48 +0000 (14:16 +1100)]
Merge remote-tracking branch 'xen/upstream/xen'

Conflicts:
arch/x86/xen/Kconfig

12 years agoMerge remote-tracking branch 'kmemleak/kmemleak'
Stephen Rothwell [Wed, 30 Nov 2011 03:10:01 +0000 (14:10 +1100)]
Merge remote-tracking branch 'kmemleak/kmemleak'

12 years agoMerge remote-tracking branch 'uprobes/for-next'
Stephen Rothwell [Wed, 30 Nov 2011 03:03:08 +0000 (14:03 +1100)]
Merge remote-tracking branch 'uprobes/for-next'

12 years agoMerge remote-tracking branch 'tip/auto-latest'
Stephen Rothwell [Wed, 30 Nov 2011 02:56:23 +0000 (13:56 +1100)]
Merge remote-tracking branch 'tip/auto-latest'

Conflicts:
arch/mips/kernel/perf_event_mipsxx.c

12 years agoMerge remote-tracking branch 'fsnotify/for-next'
Stephen Rothwell [Wed, 30 Nov 2011 02:52:32 +0000 (13:52 +1100)]
Merge remote-tracking branch 'fsnotify/for-next'