]> git.karo-electronics.de Git - karo-tx-linux.git/log
karo-tx-linux.git
10 years agofs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
Roman Pen [Thu, 22 May 2014 00:42:45 +0000 (10:42 +1000)]
fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write

In case of wbc->sync_mode == WB_SYNC_ALL we need to do data integrity
write, thus mark request as WRITE_SYNC.

akpm: afaict this change will cause the data integrity write bios to be
placed onto the second queue in cfq_io_cq.cfqq[], which presumably results
in special treatment.  The documentation for REQ_SYNC is horrid.

Signed-off-by: Roman Pen <r.peniaev@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/compaction.c:isolate_freepages_block(): small tuneup
Andrew Morton [Thu, 22 May 2014 00:42:44 +0000 (10:42 +1000)]
mm/compaction.c:isolate_freepages_block(): small tuneup

- remove unneeded `continue'

- expand the scope if the `if (isloated)' test, to optimise a code path
  which is rarely actualyl taken.

Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm-introduce-do_shared_fault-and-drop-do_fault-fix-fix
Andrew Morton [Thu, 22 May 2014 00:42:44 +0000 (10:42 +1000)]
mm-introduce-do_shared_fault-and-drop-do_fault-fix-fix

add comment which may not be true :(

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: softdirty: clear VM_SOFTDIRTY flag inside clear_refs_write() instead of clear_sof...
Cyrill Gorcunov [Thu, 22 May 2014 00:42:44 +0000 (10:42 +1000)]
mm: softdirty: clear VM_SOFTDIRTY flag inside clear_refs_write() instead of clear_soft_dirty()

clear_refs_write() is called earlier than clear_soft_dirty() and it is
more natural to clear VM_SOFTDIRTY (which belongs to VMA entry but not
PTEs) that early instead of clearing it a way deeper inside call chain.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: softdirty: don't forget to save file map softdiry bit on unmap
Cyrill Gorcunov [Thu, 22 May 2014 00:42:44 +0000 (10:42 +1000)]
mm: softdirty: don't forget to save file map softdiry bit on unmap

pte_file_mksoft_dirty operates with argument passed by a value and returns
modified result thus we need to assign @ptfile here, otherwise itis a no-op
which may lead to loss of the softdirty bit.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: softdirty: make freshly remapped file pages being softdirty unconditionally
Cyrill Gorcunov [Thu, 22 May 2014 00:42:43 +0000 (10:42 +1000)]
mm: softdirty: make freshly remapped file pages being softdirty unconditionally

Hugh reported:

 | I noticed your soft_dirty work in install_file_pte(): which looked
 | good at first, until I realized that it's propagating the soft_dirty
 | of a pte it's about to zap completely, to the unrelated entry it's
 | about to insert in its place.  Which seems very odd to me.

Indeed this code ends up being nop in result -- pte_file_mksoft_dirty()
operates with pte_t argument and returns new pte_t which were never used
after.  After looking more I think what we need is to soft-dirtify all
newely remapped file pages because it should look like a new mapping for
memory tracker.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/pagewalk.c: move pte null check
Naoya Horiguchi [Thu, 22 May 2014 00:42:43 +0000 (10:42 +1000)]
mm/pagewalk.c: move pte null check

huge_pte_offset() can return NULL, so we need check it before trying to
take page table lock to avoid a crash.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: add !pte_present() check on existing hugetlb_entry callbacks
Naoya Horiguchi [Thu, 22 May 2014 00:42:43 +0000 (10:42 +1000)]
mm: add !pte_present() check on existing hugetlb_entry callbacks

Page table walker doesn't check non-present hugetlb entry in common path,
so hugetlb_entry() callbacks must check it.  The reason for this behavior
is that some callers want to handle it in its own way.

However, some callers don't check it now, which causes unpredictable
result, for example when we have a race between migrating hugepage and
reading /proc/pid/numa_maps.  This patch fixes it by adding !pte_present
checks on buggy callbacks.

This bug exists for years and got visible by introducing hugepage migration.

ChangeLog v2:
- fix if condition (check !pte_present() instead of pte_present())

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomempolicy: apply page table walker on queue_pages_range()
Naoya Horiguchi [Thu, 22 May 2014 00:42:43 +0000 (10:42 +1000)]
mempolicy: apply page table walker on queue_pages_range()

queue_pages_range() does page table walking in its own way now, so this
patch rewrites it with walk_page_range().  One difficulty was that
queue_pages_range() needed to check vmas to determine whether we queue
pages from a given vma or skip it.  Now we have test_walk() callback in
mm_walk for that purpose, so we can do the replacement cleanly.
queue_pages_test_walk() depends on not only the current vma but also the
previous one, so we use queue_pages->prev to keep it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agopagewalk-remove-argument-hmask-from-hugetlb_entry-fix-fix
Andrew Morton [Thu, 22 May 2014 00:42:42 +0000 (10:42 +1000)]
pagewalk-remove-argument-hmask-from-hugetlb_entry-fix-fix

remove the assertion altogether - the null deref is sufficient

Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/proc/task_mmu.c: assume non-NULL vma in pagemap_hugetlb()
Naoya Horiguchi [Thu, 22 May 2014 00:42:42 +0000 (10:42 +1000)]
fs/proc/task_mmu.c: assume non-NULL vma in pagemap_hugetlb()

Fengguang reported smatch error about potential NULL pointer access.

In updated page table walker, we never run ->hugetlb_entry() callback
on the address without vma. This is because __walk_page_range() checks
it in advance. So we can assume non-NULL vma in pagemap_hugetlb().

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agopagewalk: remove argument hmask from hugetlb_entry()
Naoya Horiguchi [Thu, 22 May 2014 00:42:42 +0000 (10:42 +1000)]
pagewalk: remove argument hmask from hugetlb_entry()

hugetlb_entry() doesn't use the argument hmask any more, so let's remove
it now.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoarch/powerpc/mm/subpage-prot.c: use walk_page_vma() instead of walk_page_range()
Naoya Horiguchi [Thu, 22 May 2014 00:42:42 +0000 (10:42 +1000)]
arch/powerpc/mm/subpage-prot.c: use walk_page_vma() instead of walk_page_range()

We don't have to use mm_walk->private to pass vma to the callback function
because of mm_walk->vma.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomemcg: redefine callback functions for page table walker
Naoya Horiguchi [Thu, 22 May 2014 00:42:41 +0000 (10:42 +1000)]
memcg: redefine callback functions for page table walker

Move code around pte loop in mem_cgroup_count_precharge_pte_range() into
mem_cgroup_count_precharge_pte() connected to pte_entry().

We don't change the callback mem_cgroup_move_charge_pte_range() for now,
because we can't do the same replacement easily due to 'goto retry'.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agonuma_maps: redefine callback functions for page table walker
Naoya Horiguchi [Thu, 22 May 2014 00:42:41 +0000 (10:42 +1000)]
numa_maps: redefine callback functions for page table walker

gather_pte_stats() connected to pmd_entry() does both of pmd loop and pte
loop.  So this patch moves pte part into pte_entry().

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agopagemap-redefine-callback-functions-for-page-table-walker-fix
Andrew Morton [Thu, 22 May 2014 00:42:41 +0000 (10:42 +1000)]
pagemap-redefine-callback-functions-for-page-table-walker-fix

Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agopagemap: redefine callback functions for page table walker
Naoya Horiguchi [Thu, 22 May 2014 00:42:41 +0000 (10:42 +1000)]
pagemap: redefine callback functions for page table walker

pagemap_pte_range() connected to pmd_entry() does both of pmd loop and pte
loop.  So this patch moves pte part into pagemap_pte() on pte_entry().

We remove VM_SOFTDIRTY check in pagemap_pte_range(), because in the new
page table walker we call __walk_page_range() for each vma separately, so
we never experience multiple vmas in single pgd/pud/pmd/pte loop.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoclear_refs: redefine callback functions for page table walker
Naoya Horiguchi [Thu, 22 May 2014 00:42:40 +0000 (10:42 +1000)]
clear_refs: redefine callback functions for page table walker

Currently clear_refs_pte_range() is connected to pmd_entry() to split thps
if found.  But now this work can be done in core page table walker code.
So we have no reason to keep this callback on pmd_entry().  This patch
moves pte handling code on pte_entry() callback.

clear_refs_write() has some prechecks about if we really walk over a given
vma.  It's fine to let them done by test_walk() callback, so let's define
it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agosmaps: redefine callback functions for page table walker
Naoya Horiguchi [Thu, 22 May 2014 00:42:40 +0000 (10:42 +1000)]
smaps: redefine callback functions for page table walker

smaps_pte_range() connected to pmd_entry() does both of pmd loop and pte
loop.  So this patch moves pte part into smaps_pte() on pte_entry() as
expected by the name.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agopagewalk: add walk_page_vma()
Naoya Horiguchi [Thu, 22 May 2014 00:42:40 +0000 (10:42 +1000)]
pagewalk: add walk_page_vma()

Introduces walk_page_vma(), which is useful for the callers which want to
walk over a given vma.  It's used by later patches.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agopagewalk-update-page-table-walker-core-fix-end-address-calculation-in-walk_page_range-fix
Andrew Morton [Thu, 22 May 2014 00:42:40 +0000 (10:42 +1000)]
pagewalk-update-page-table-walker-core-fix-end-address-calculation-in-walk_page_range-fix

s/min_t/min/

Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/pagewalk.c: fix end address calculation in walk_page_range()
Naoya Horiguchi [Thu, 22 May 2014 00:42:39 +0000 (10:42 +1000)]
mm/pagewalk.c: fix end address calculation in walk_page_range()

When we try to walk over inside a vma, walk_page_range() tries to walk
until vma->vm_end even if a given end is before that point.
So this patch takes the smaller one as an end address.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agopagewalk: update page table walker core
Naoya Horiguchi [Thu, 22 May 2014 00:42:39 +0000 (10:42 +1000)]
pagewalk: update page table walker core

This patch updates mm/pagewalk.c to make code less complex and more
maintainable.  The basic idea is unchanged and there's no userspace visible
effect.

Most of existing callback functions need access to vma to handle each
entry.  So we had better add a new member vma in struct mm_walk instead of
using mm_walk->private, which makes code simpler.

One problem in current page table walker is that we check vma in pgd loop.
 Historically this was introduced to support hugetlbfs in the strange
manner.  It's better and cleaner to do the vma check outside pgd loop.

Another problem is that many users of page table walker now use only
pmd_entry(), although it does both pmd-walk and pte-walk.  This makes code
duplication and fluctuation among callers, which worsens the
maintenability.

One difficulty of code sharing is that the callers want to determine
whether they try to walk over a specific vma or not in their own way.  To
solve this, this patch introduces test_walk() callback.

When we try to use multiple callbacks in different levels, skip control is
also important.  For example we have thp enabled in normal configuration,
and we are interested in doing some work for a thp.  But sometimes we want
to split it and handle as normal pages, and in another time user would
handle both at pmd level and pte level.  What we need is that when we've
done pmd_entry() we want to decide whether to go down to pte level
handling based on the pmd_entry()'s result.  So this patch introduces a
skip control flag in mm_walk.  We can't use the returned value for this
purpose, because we already defined the meaning of whole range of returned
values (>0 is to terminate page table walk in caller's specific manner, =0
is to continue to walk, and <0 is to abort the walk in the general
manner.)

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v3-fix
Andrew Morton [Thu, 22 May 2014 00:42:39 +0000 (10:42 +1000)]
mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v3-fix

tweak code comment

Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v3
Naoya Horiguchi [Thu, 22 May 2014 00:42:39 +0000 (10:42 +1000)]
mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v3

- add comment on page_size_order()
- use compound_order(compound_head(page)) instead of huge_page_size_order()
- use page_pgoff() in rmap_walk_file() too
- use page_size_order() in kill_proc()
- fix space indent

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v2
Naoya Horiguchi [Thu, 22 May 2014 00:42:38 +0000 (10:42 +1000)]
mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v2

- fix wrong shift direction
- introduce page_size_order() and huge_page_size_order()
- move the declaration of PageHuge() to include/linux/hugetlb_inline.h
  to avoid macro definition.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
Naoya Horiguchi [Thu, 22 May 2014 00:42:38 +0000 (10:42 +1000)]
mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()

page->index stores pagecache index when the page is mapped into file
mapping region, and the index is in pagecache size unit, so it depends on
the page size.  Some of users of reverse mapping obviously assumes that
page->index is in PAGE_CACHE_SHIFT unit, so they don't work for anonymous
hugepage.

For example, consider that we have 3-hugepage vma and try to mbind the 2nd
hugepage to migrate to another node.  Then the vma is split and
migrate_page() is called for the 2nd hugepage (belonging to the middle
vma.) In migrate operation, rmap_walk_anon() tries to find the relevant
vma to which the target hugepage belongs, but here we miscalculate pgoff.
So anon_vma_interval_tree_foreach() grabs invalid vma, which fires
VM_BUG_ON.

This patch introduces a new API that is usable both for normal page and
hugepage to get PAGE_SIZE offset from page->index.  Users should clearly
distinguish page_index for pagecache index and page_pgoff for page offset.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: get rid of __GFP_KMEMCG fix
Stephen Rothwell [Thu, 22 May 2014 00:42:38 +0000 (10:42 +1000)]
mm: get rid of __GFP_KMEMCG fix

export kmalloc_order() to modules

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: get rid of __GFP_KMEMCG
Vladimir Davydov [Thu, 22 May 2014 00:42:37 +0000 (10:42 +1000)]
mm: get rid of __GFP_KMEMCG

Currently to allocate a page that should be charged to kmemcg (e.g.
threadinfo), we pass __GFP_KMEMCG flag to the page allocator.  The page
allocated is then to be freed by free_memcg_kmem_pages.  Apart from
looking asymmetrical, this also requires intrusion to the general
allocation path.  So let's introduce separate functions that will
alloc/free pages charged to kmemcg.

The new functions are called alloc_kmem_pages and free_kmem_pages.  They
should be used when the caller actually would like to use kmalloc, but has
to fall back to the page allocator for the allocation is large.  They only
differ from alloc_pages and free_pages in that besides allocating or
freeing pages they also charge them to the kmem resource counter of the
current memory cgroup.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agosl[au]b: charge slabs to kmemcg explicitly
Vladimir Davydov [Thu, 22 May 2014 00:42:37 +0000 (10:42 +1000)]
sl[au]b: charge slabs to kmemcg explicitly

We have only a few places where we actually want to charge kmem so instead
of intruding into the general page allocation path with __GFP_KMEMCG it's
better to explictly charge kmem there.  All kmem charges will be easier to
follow that way.

This is a step towards removing __GFP_KMEMCG.  It removes __GFP_KMEMCG
from memcg caches' allocflags.  Instead it makes slab allocation path call
memcg_charge_kmem directly getting memcg to charge from the cache's memcg
params.

This also eliminates any possibility of misaccounting an allocation going
from one memcg's cache to another memcg, because now we always charge
slabs against the memcg the cache belongs to.  That's why this patch
removes the big comment to memcg_kmem_get_cache.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/slub.c: fix some indenting in cmpxchg_double_slab()
Dan Carpenter [Thu, 22 May 2014 00:42:37 +0000 (10:42 +1000)]
mm/slub.c: fix some indenting in cmpxchg_double_slab()

The return statement goes with the cmpxchg_double() condition so it needs
to be indented another tab.

Also these days the fashion is to line function parameters up, and it
looks nicer that way because then the "freelist_new" is not at the same
indent level as the "return 1;".

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: slub: fix ALLOC_SLOWPATH stat
Dave Hansen [Thu, 22 May 2014 00:42:37 +0000 (10:42 +1000)]
mm: slub: fix ALLOC_SLOWPATH stat

There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH
got bumped in that exit path.  Now there are two, and a bunch of gotos.
ALLOC_SLOWPATH can now get set more than once during a single call to
__slab_alloc() which is pretty bogus.  Here's the sequence:

1. Enter __slab_alloc(), fall through all the way to the
   stat(s, ALLOC_SLOWPATH);
2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to
   new_slab (goto #1)
3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo
   (goto #2)
4. Fall through in the same path we did before all the way to
   stat(s, ALLOC_SLOWPATH)
5. bump ALLOC_REFILL stat, then return

Doing this is obviously bogus.  It keeps us from being able to accurately
compare ALLOC_SLOWPATH vs.  ALLOC_FASTPATH.  It also means that the total
number of allocs always exceeds the total number of frees.

This patch moves stat(s, ALLOC_SLOWPATH) to be called from the same place
that __slab_alloc() is.  This makes it much less likely that
ALLOC_SLOWPATH will get botched again in the spaghetti-code inside
__slab_alloc().

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm-slab-suppress-out-of-memory-warning-unless-debug-is-enabled-fix-2
David Rientjes [Thu, 22 May 2014 00:42:36 +0000 (10:42 +1000)]
mm-slab-suppress-out-of-memory-warning-unless-debug-is-enabled-fix-2

Only define count_free() when CONFIG_SLUB_DEBUG since that's the only
context in which it is referenced.  Only define count_partial() when
CONFIG_SLUB_DEBUG or CONFIG_SYSFS since the sysfs interface still uses it
for partial slab counts.

Also only define node_nr_objs() when CONFIG_SLUB_DEBUG since that's the
only context in which it is referenced.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm, slab: suppress out of memory warning unless debug is enabled
David Rientjes [Thu, 22 May 2014 00:42:36 +0000 (10:42 +1000)]
mm, slab: suppress out of memory warning unless debug is enabled

When the slab or slub allocators cannot allocate additional slab pages,
they emit diagnostic information to the kernel log such as current number
of slabs, number of objects, active objects, etc.  This is always coupled
with a page allocation failure warning since it is controlled by
!__GFP_NOWARN.

Suppress this out of memory warning if the allocator is configured without
debug supported.  The page allocation failure warning will indicate it is
a failed slab allocation, the order, and the gfp mask, so this is only
useful to diagnose allocator issues.

Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
allocator, there is no functional change with this patch.  If debug is
disabled, however, the warnings are now suppressed.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/slub.c: convert vnsprintf-static to va_format
Fabian Frederick [Thu, 22 May 2014 00:42:36 +0000 (10:42 +1000)]
mm/slub.c: convert vnsprintf-static to va_format

Inspired by Joe Perches suggestion in ntfs logging clean-up.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Joe Perches <joe@perches.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/slub.c: convert printk to pr_foo()
Fabian Frederick [Thu, 22 May 2014 00:42:36 +0000 (10:42 +1000)]
mm/slub.c: convert printk to pr_foo()

-All printk(KERN_foo converted to pr_foo()

-Default printk converted to pr_warn()

-Coalesce format fragments

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Joe Perches <joe@perches.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/ext4/fsync.c: generic_file_fsync call based on barrier flag
Fabian Frederick [Thu, 22 May 2014 00:42:35 +0000 (10:42 +1000)]
fs/ext4/fsync.c: generic_file_fsync call based on barrier flag

generic_file_fsync has been updated to issue a flush for older
filesystems.

This patch tests for barrier flag in ext4 mount flags and calls the right
function.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs-add-generic-data-flush-to-fsync-fix-fix
Andrew Morton [Thu, 22 May 2014 00:42:35 +0000 (10:42 +1000)]
fs-add-generic-data-flush-to-fsync-fix-fix

fix warning

Cc: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs-add-generic-data-flush-to-fsync-fix
Andrew Morton [Thu, 22 May 2014 00:42:35 +0000 (10:42 +1000)]
fs-add-generic-data-flush-to-fsync-fix

nuke ifdef

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/libfs.c: add generic data flush to fsync
Fabian Frederick [Thu, 22 May 2014 00:42:35 +0000 (10:42 +1000)]
fs/libfs.c: add generic data flush to fsync

Description by Jan Kara:
"A lot of older filesystems don't properly flush volatile disk caches
on fsync(2) which can lead to loss of fsynced data after power failure.

This patch makes generic_file_fsync() issue proper cache flush to fix the
problem.  Sysadmin can use /sys/devices/.../cache_type to tell the system
it should not send the cache flush."

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/9p: kerneldoc fixes
Fabian Frederick [Thu, 22 May 2014 00:42:34 +0000 (10:42 +1000)]
fs/9p: kerneldoc fixes

Function parameters comment fixing.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/9p/v9fs.c: add __init to v9fs_sysfs_init
Fabian Frederick [Thu, 22 May 2014 00:42:34 +0000 (10:42 +1000)]
fs/9p/v9fs.c: add __init to v9fs_sysfs_init

v9fs_sysfs_init is only called by __init init_v9fs

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoblock: restore /proc/partitions to not display non-partitionable removable devices
Josh Hunt [Thu, 22 May 2014 00:42:34 +0000 (10:42 +1000)]
block: restore /proc/partitions to not display non-partitionable removable devices

We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.

Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior.  It's not
clear to me from the commit's changelog if this change was intentional or
not.  This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.

Signed-off-by: Josh Hunt <johunt@akamai.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoMAINTAINERS: update IBM ServeRAID RAID info
Michael Opdenacker [Thu, 22 May 2014 00:42:33 +0000 (10:42 +1000)]
MAINTAINERS: update IBM ServeRAID RAID info

- Invalid maintainer e-mail address:
  Mail server reply:
  Recipient address rejected: User unknown in virtual alias table
- Remove no longer working webpage URL
- Remove obsolete "Person" field
- Move status to "Orphan"
- Add Dave Jeffery and Jack Hammer to the CREDITS file

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Reviewed-by: Jean Delvare <jdelvare@suse.de>
Cc: David Jeffery <dhjeffery@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Paul Bolle <pebolle@tiscali.nl>
Reviewed-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2: manually do the iput once ocfs2_add_entry failed in ocfs2_symlink and ocfs2_mknod
jiangyiwen [Thu, 22 May 2014 00:42:33 +0000 (10:42 +1000)]
ocfs2: manually do the iput once ocfs2_add_entry failed in ocfs2_symlink and ocfs2_mknod

When the call to ocfs2_add_entry() failed in ocfs2_symlink() and
ocfs2_mknod(), iput() will not be called during dput(dentry) because no
d_instantiate(), and this will lead to umount hung.

Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2: do not return DLM_MIGRATE_RESPONSE_MASTERY_REF to avoid endless,loop during...
jiangyiwen [Thu, 22 May 2014 00:42:33 +0000 (10:42 +1000)]
ocfs2: do not return DLM_MIGRATE_RESPONSE_MASTERY_REF to avoid endless,loop during umount

The following case may lead to endless loop during umount.

node A         node B               node C       node D
umount volume,
migrate lockres1
to B
                                                 want to lock lockres1,
                                                 send
                                                 MASTER_REQUEST_MSG
                                                 to C
                                    init block mle
               send
               MIGRATE_REQUEST_MSG
               to C
                                    find a block
                                    mle, and then
                                    return
                                    DLM_MIGRATE_RESPONSE_MASTERY_REF
                                    to B
               set C in refmap
                                    umount successfully
               try to umount, endless
               loop occurs when migrate
               lockres1 since C is in
               refmap

So we can fix this endless loop case by only returning
DLM_MIGRATE_RESPONSE_MASTERY_REF if it has a mastery mle when receiving
MIGRATE_REQUEST_MSG.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Xue jiufei <xuejiufei@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2: fix a tiny race when running dirop_fileop_racer
Yiwen Jiang [Thu, 22 May 2014 00:42:33 +0000 (10:42 +1000)]
ocfs2: fix a tiny race when running dirop_fileop_racer

When running dirop_fileop_racer we found a dead lock case.

2 nodes, say Node A and Node B, mount the same ocfs2 volume.  Create
/race/16/1 in the filesystem, and let the inode number of dir 16 is less
than the inode number of dir race.

Node A                            Node B
mv /race/16/1 /race/
                                  right after Node A has got the
                                  EX mode of /race/16/, and tries to
                                  get EX mode of /race
                                  ls /race/16/

In this case, Node A has got the EX mode of /race/16/, and wants to get EX
mode of /race/.  Node B has got the PR mode of /race/, and wants to get
the PR mode of /race/16/.  Since EX and PR are mutually exclusive, dead
lock happens.

This patch fixes this case by locking in ancestor order before trying
inode number order.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2/o2net: incorrect to terminate accepting connections loop upon rejecting an...
Tariq Saeed [Thu, 22 May 2014 00:42:32 +0000 (10:42 +1000)]
ocfs2/o2net: incorrect to terminate accepting connections loop upon rejecting an invalid one

Orabug: 17489469

When o2net-accept-one() rejects an illegal connection, it terminates the
loop picking up the remaining queued connections.  This fix will continue
accepting connections till the queue is emtpy.

Signed-off-by: Tariq Saseed <tariq.x.saeed@oracle.com>
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2/o2net: incorrect to terminate accepting connections loop upon rejecting an...
Tariq Saeed [Thu, 22 May 2014 00:42:32 +0000 (10:42 +1000)]
ocfs2/o2net: incorrect to terminate accepting connections loop upon rejecting an invalid one

When o2net-accept-one() rejects an illegal connection, it terminates the
loop picking up the remaining queued connections.  This fix will continue
accepting connections till the queue is emtpy.

Addresses Orabug 17489469.

Signed-off-by: Tariq Saseed <tariq.x.saeed@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2: fix deadlock when two nodes are converting same lock from PR to EX and idletim...
Tariq Saeed [Thu, 22 May 2014 00:42:32 +0000 (10:42 +1000)]
ocfs2: fix deadlock when two nodes are converting same lock from PR to EX and idletimeout closes conn

Orabug: 18639535

Two node cluster and both nodes hold a lock at PR level and both want to
convert to EX at the same time.  Master node 1 has sent BAST and then
closes the connection due to idletime out.  Node 0 receives BAST, sends
unlock req with cancel flag but gets error -ENOTCONN.  The problem is this
error is ignored in dlm_send_remote_unlock_request() on the **incorrect**
assumption that the master is dead.  See NOTE in comment why it returns
DLM_NORMAL.  Upon getting DLM_NORMAL, node 0 proceeds to sends convert
(without cancel flg) which fails with -ENOTCONN.  waits 5 sec and resends.
 This time gets DLM_IVLOCKID from the master since lock not found in grant
, it had been moved to converting queue in response to conv PR->EX req.
No way out.

Node 1 (master) Node 0
============== ======

lock mode PR PR

convert PR -> EX
mv grant -> convert and que BAST
...
                     <-------- convert PR -> EX
convert que looks like this: ((node 1, PR -> EX) (node 0, PR -> EX))
...
BAST (want PR -> NL)
                     ------------------>
...
idle timout, conn closed
                                ...
In response to BAST,
sends unlock with cancel convert flag
gets -ENOTCONN. Ignores and
                                sends remote convert request
                                gets -ENOTCONN, waits 5 Sec, retries
...
reconnects
                   <----------------- convert req goes through on next try
does not find lock on grant que
                   status DLM_IVLOCKID
                   ------------------>
...

No way out.  Fix is to keep retrying unlock with cancel flag until it
succeeds or the master dies.

Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2: fix umount hang while shutting down truncate log
Xue jiufei [Thu, 22 May 2014 00:42:32 +0000 (10:42 +1000)]
ocfs2: fix umount hang while shutting down truncate log

Revert XXXX because it may cause umount hang while shutting down truncate
log.

fix NULL pointer dereference when dismount and ocfs2rec simultaneously

The situation is as followes:
ocfs2_dismout_volume
-> ocfs2_recovery_exit
  -> free osb->recovery_map
-> ocfs2_truncate_shutdown
  -> lock global bitmap inode
    -> ocfs2_wait_for_recovery
  -> check whether osb->recovery_map->rm_used is zero
Because osb->recovery_map is already freed, rm_used can be any other
values, so it may yield umount hang.

To prevent NULL pointer dereference while getting sys_root_inode, we
use a osb_tl_disable flag to disable schedule osb_truncate_log_wq after
truncate log shutdown.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2/dlm: fix possible convert=sion deadlock
Xue jiufei [Thu, 22 May 2014 00:42:31 +0000 (10:42 +1000)]
ocfs2/dlm: fix possible convert=sion deadlock

We found there is a conversion deadlock when the owner of lockres happened
to crash before send DLM_PROXY_AST_MSG for a downconverting lock.  The
situation is as follows:

Node1                            Node2                  Node3
                           the owner of lockresA
lock_1 granted at EX mode
and call ocfs2_cluster_unlock
to decrease ex_holders.
                                                 converting lock_3 from
                                                 NL to EX
                           send DLM_PROXY_AST_MSG
                           to Node1, asking Node 1
                           to downconvert.
receiving DLM_PROXY_AST_MSG,
thread ocfs2dc send
DLM_CONVERT_LOCK_MSG
to Node2 to downconvert
lock_1(EX->NL).
                           lock_1 can be granted and
                           put it into pending_asts
                           list, return DLM_NORMAL.
                           then something happened
                           and Node2 crashed.
received DLM_NORMAL, waiting
for DLM_PROXY_AST_MSG.
                                               selected as the recovery
                                               master, receving migrate
                                               lock from Node1, queue
                                               lock_1 to the tail of
                                               converting list.

After dlm recovery, converting list in the master of lockresA(Node3) will
be: converting list head <-> lock_3(NL->EX) <->lock_1(EX<->NL).  Requested
mode of lock_3 is not compatible with the granted mode of lock_1, so it
can not be granted.  and lock_1 can not downconvert because covnerting
queue is strictly FIFO.  So a deadlock is created.  We think function
dlm_process_recovery_data() should queue_ast for lock_1 or alter the order
of lock_1 and lock_3, so dlm_thread can process lock_1 first.  And if
there are multiple downconverting locks, they must convert form PR to NL,
so no need to sort them.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2: should add inode into orphan dir after updating entry in ocfs2_rename()
alex chen [Thu, 22 May 2014 00:42:31 +0000 (10:42 +1000)]
ocfs2: should add inode into orphan dir after updating entry in ocfs2_rename()

There are two files a and b in dir /mnt/ocfs2.
    node A                           node B
mv a b
In ocfs2_rename(), after calling
ocfs2_orphan_add(), the inode of
file b will be added into orphan
dir.

If ocfs2_update_entry() fails,
ocfs2_rename return error and mv
operation fails. But file b still
exists in the parent dir.

ocfs2_queue_orphan_scan
-> ocfs2_queue_recovery_completion
-> ocfs2_complete_recovery
-> ocfs2_recover_orphans
The inode of the file b will be
put with iput().

ocfs2_evict_inode
-> ocfs2_delete_inode
-> ocfs2_wipe_inode
-> ocfs2_remove_inode
OCFS2_VALID_FL in the inode
i_flags will be cleared.

                                   The file b still can be accessed
                                   on node B.
                                   ls /mnt/ocfs2
                                   When first read the file b with
                                   ocfs2_read_inode_block(). It will
                                   validate the inode using
                                   ocfs2_validate_inode_block().
                                   Because OCFS2_VALID_FL not set in
                                   the inode i_flags, so the file
                                   system will be readonly.

So we should add inode into orphan dir after updating entry in
ocfs2_rename().

Signed-off-by: alex.chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2-limit-printk-when-journal-is-aborted-fix
Andrew Morton [Thu, 22 May 2014 00:42:31 +0000 (10:42 +1000)]
ocfs2-limit-printk-when-journal-is-aborted-fix

document the msleep

Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2: limit printk when journal is aborted
Joseph Qi [Thu, 22 May 2014 00:42:31 +0000 (10:42 +1000)]
ocfs2: limit printk when journal is aborted

Once JBD2_ABORT is set, ocfs2_commit_cache will fail in
ocfs2_commit_thread.  Then it will get into a loop with mass logs.  This
will meaninglessly consume a larger number of resource and may lead to the
system hanging.  So limit printk in this case.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2: remove some redundant casting
George Spelvin [Thu, 22 May 2014 00:42:30 +0000 (10:42 +1000)]
ocfs2: remove some redundant casting

There are two standard techniques for dereferencing structures pointed
to by void *: cast to the right type each time they're used, or assign
to local variables of the right type.

But there's no need to do *both*.

Signed-off-by: George Spelvin <linux@horizon.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/ocfs2/super.c: use OCFS2_MAX_VOL_LABEL_LEN and strlcpy
Fabian Frederick [Thu, 22 May 2014 00:42:30 +0000 (10:42 +1000)]
fs/ocfs2/super.c: use OCFS2_MAX_VOL_LABEL_LEN and strlcpy

Replace strncpy(size 63) by defined value.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2: remove NULL assignments on static
Fabian Frederick [Thu, 22 May 2014 00:42:30 +0000 (10:42 +1000)]
ocfs2: remove NULL assignments on static

Static values are automatically initialized to NULL.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agodrivers/net/irda/donauboe.c: convert to module_pci_driver
Libo Chen [Thu, 22 May 2014 00:42:30 +0000 (10:42 +1000)]
drivers/net/irda/donauboe.c: convert to module_pci_driver

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/configfs: use pr_fmt
Fabian Frederick [Thu, 22 May 2014 00:42:29 +0000 (10:42 +1000)]
fs/configfs: use pr_fmt

Add pr_fmt based on module name.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/configfs: convert printk to pr_foo()
Fabian Frederick [Thu, 22 May 2014 00:42:29 +0000 (10:42 +1000)]
fs/configfs: convert printk to pr_foo()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/configs/item.c: kernel-doc fixes + clean-up
Fabian Frederick [Thu, 22 May 2014 00:42:29 +0000 (10:42 +1000)]
fs/configs/item.c: kernel-doc fixes + clean-up

-Fix function parameter documentation

-EXPORT_SYMBOLS moved after corresponding functions

-Small coding style and checkpatch warning fixes

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoarch/unicore32/mm/ioremap.c: return NULL on invalid pfn
Andrew Morton [Thu, 22 May 2014 00:42:29 +0000 (10:42 +1000)]
arch/unicore32/mm/ioremap.c: return NULL on invalid pfn

__uc32_ioremap_pfn_caller() should return NULL when the pfn is found to be
invalid.

From a recommendation by Guan Xuetao.

Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoarch-unicore32-mm-ioremapc-convert-printk-warn_on-to-warn1-fix
Andrew Morton [Thu, 22 May 2014 00:42:28 +0000 (10:42 +1000)]
arch-unicore32-mm-ioremapc-convert-printk-warn_on-to-warn1-fix

undo crazy long line

Cc: Fabian Frederick <fabf@skynet.be>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoarch/unicore32/mm/ioremap.c: convert printk/warn_on to warn(1
Fabian Frederick [Thu, 22 May 2014 00:42:28 +0000 (10:42 +1000)]
arch/unicore32/mm/ioremap.c: convert printk/warn_on to warn(1

+coalesce formats

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/squashfs/squashfs.h: replace pr_warning by pr_warn
Fabian Frederick [Thu, 22 May 2014 00:42:28 +0000 (10:42 +1000)]
fs/squashfs/squashfs.h: replace pr_warning by pr_warn

Update the last pr_warning callsite in fs branch

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agosh: Replace __get_cpu_var uses
Christoph Lameter [Thu, 22 May 2014 00:42:28 +0000 (10:42 +1000)]
sh: Replace __get_cpu_var uses

__get_cpu_var() is used for multiple purposes in the kernel source.  One
of them is address calculation via the form &__get_cpu_var(x).  This
calculates the address for the instance of the percpu variable of the
current processor based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area.  __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

__get_cpu_var() always only does an address determination.  However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset.  Thereby address calculations are avoided and less
registers are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.

The patch set includes passes over all arches as well.  Once these
operations are used throughout then specialized macros can be defined in
non -x86 arches as well in order to optimize per cpu access by f.e.  using
a global register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);

    Converts to

int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);

    Converts to

int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)

   Converts to

int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);

   Converts to

memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;

   Converts to

__this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++

   Converts to

__this_cpu_inc(y)

Signed-off-by: Christoph Lameter <cl@linux.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> [compilation only]
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agontfs: remove NULL value assignments
Fabian Frederick [Thu, 22 May 2014 00:42:27 +0000 (10:42 +1000)]
ntfs: remove NULL value assignments

Static values are automatically initialized to NULL.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoinput-route-kbd-leds-through-the-generic-leds-layer-fix
Samuel Thibault [Thu, 22 May 2014 00:42:27 +0000 (10:42 +1000)]
input-route-kbd-leds-through-the-generic-leds-layer-fix

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoinput: route kbd LEDs through the generic LEDs layer
Samuel Thibault [Thu, 22 May 2014 00:42:27 +0000 (10:42 +1000)]
input: route kbd LEDs through the generic LEDs layer

This permits to reassign keyboard LEDs to something else than keyboard
"leds" state, by adding keyboard led and modifier triggers connected to a
series of VT input LEDs, themselves connected to VT input triggers, which
per-input device LEDs use by default.  Userland can thus easily change the
LED behavior of (a priori) all input devices, or of particular input
devices.

This also permits to fix #7063 from userland by using a modifier to
implement proper CapsLock behavior and have the keyboard caps lock led
show that modifier state.

[ebroder@mokafive.com: Rebased to 3.2-rc1 or so, cleaned up some includes, and fixed some constants]
[blogic@openwrt.org: CONFIG_INPUT_LEDS stubs should be static inline]
[akpm@linux-foundation.org: remove unneeded `extern', fix comment layout]
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Signed-off-by: Evan Broder <evan@ebroder.net>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Tested-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Bryan Wu <cooloney@gmail.com>
Cc: Arnaud Patard <arnaud.patard@rtp-net.org>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Matt Sealey <matt@genesi-usa.com>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Niels de Vos <devos@fedoraproject.org>
Cc: Steev Klimaszewski <steev@genesi-usa.com>
Signed-off-by: John Crispin <blogic@openwrt.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agosched_clock: document 4Mhz vs 1Mhz decision
Stephen Boyd [Thu, 22 May 2014 00:42:26 +0000 (10:42 +1000)]
sched_clock: document 4Mhz vs 1Mhz decision

Bo Shen sent a patch to change this to 1Mhz instead of 4Mhz but according
to Russell King the use of 4Mhz was intentional.  Add a comment to this
effect so that others don't try to change the code as well.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Bo Shen <voice.shen@atmel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agokernel/time/ntp.c: convert simple_strtol to kstrtol
Fabian Frederick [Thu, 22 May 2014 00:42:26 +0000 (10:42 +1000)]
kernel/time/ntp.c: convert simple_strtol to kstrtol

Replace obsolete function.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agokernel-posix-timersc-code-clean-up-checkpatch-fixes
Andrew Morton [Thu, 22 May 2014 00:42:26 +0000 (10:42 +1000)]
kernel-posix-timersc-code-clean-up-checkpatch-fixes

WARNING: space prohibited between function name and open parenthesis '('
#55: FILE: kernel/posix-timers.c:345:
+        sizeof (struct k_itimer), 0,

ERROR: do not use assignment in if condition
#70: FILE: kernel/posix-timers.c:504:
+ if ((event->sigev_notify & SIGEV_THREAD_ID) &&

total: 1 errors, 1 warnings, 192 lines checked

./patches/kernel-posix-timersc-code-clean-up.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Fabian Frederick <fabf@skynet.be>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agokernel/posix-timers.c: code clean-up
Fabian Frederick [Thu, 22 May 2014 00:42:26 +0000 (10:42 +1000)]
kernel/posix-timers.c: code clean-up

Fixing some checkpatch warnings:
-Convert printk to pr_foo()
-Remove spaces between function and (
-Split lines > 80 characters

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofanotify: check file flags passed in fanotify_init
Heinrich Schuchardt [Thu, 22 May 2014 00:42:25 +0000 (10:42 +1000)]
fanotify: check file flags passed in fanotify_init

Without this patch fanotify_init does not validate the value passed in
event_f_flags.

When a fanotify event is read from the fanotify file descriptor a new file
descriptor is created where file.f_flags = event_f_flags.

Internal and external open flags are stored together in field f_flags of
struct file.  Hence, an application might create file descriptors with
internal flags like FMODE_EXEC, FMODE_NOCMTIME set.

Jan Kara and Eric Paris both aggreed that this is a bug and the value of
event_f_flags should be checked:
https://lkml.org/lkml/2014/4/29/522
https://lkml.org/lkml/2014/4/29/539

This updated patch version considers the comments by Michael Kerrisk in
https://lkml.org/lkml/2014/5/4/10

With the patch the value of event_f_flags is checked.
When specifying an invalid value error EINVAL is returned.

Internal flags are disallowed.

File creation flags are disallowed:
O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TRUNC, and O_TTY_INIT.

Flags which do not make sense with fanotify are disallowed:
__O_TMPFILE, O_PATH, FASYNC, and O_DIRECT.

This leaves us with the following allowed values:

O_RDONLY, O_WRONLY, O_RDWR are basic functionality. The are stored in the
bits given by O_ACCMODE.

O_APPEND is working as expected. The value might be useful in a logging
application which appends the current status each time the log is opened.

O_LARGEFILE is needed for files exceeding 4GB on 32bit systems.

O_NONBLOCK may be useful when monitoring slow devices like tapes.

O_NDELAY is equal to O_NONBLOCK except for platform parisc.
To avoid code breaking on parisc either both flags should be
allowed or none. The patch allows both.

__O_SYNC and O_DSYNC may be used to avoid data loss on power disruption.

O_NOATIME may be useful to reduce disk activity.

O_CLOEXEC may be useful, if separate processes shall be used to scan files.

Once this patch is accepted, the fanotify_init.2 manpage has to be updated.

Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/notify/fanotify/fanotify_user.c: fix FAN_MARK_FLUSH flag checking
Heinrich Schuchardt [Thu, 22 May 2014 00:42:25 +0000 (10:42 +1000)]
fs/notify/fanotify/fanotify_user.c: fix FAN_MARK_FLUSH flag checking

If fanotify_mark is called with illegal value of arguments flags and marks
it usually returns EINVAL.

When fanotify_mark is called with FAN_MARK_FLUSH the argument flags is not
checked for irrelevant flags like FAN_MARK_IGNORED_MASK.

The patch removes this inconsistency.

If an irrelevant flag is set error EINVAL is returned.

Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/notify/mark.c: trivial cleanup
David Cohen [Thu, 22 May 2014 00:42:25 +0000 (10:42 +1000)]
fs/notify/mark.c: trivial cleanup

Do not initialize private_destroy_list twice.  list_replace_init() already
takes care of initializing private_destroy_list.  We don't need to
initialize it with LIST_HEAD() beforehand.

Signed-off-by: David Cohen <david.a.cohen@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofanotify: create FAN_ACCESS event for readdir
Heinrich Schuchardt [Thu, 22 May 2014 00:42:25 +0000 (10:42 +1000)]
fanotify: create FAN_ACCESS event for readdir

Before the patch,
read creates FAN_ACCESS_PERM and FAN_ACCESS events,
readdir creates only FAN_ACCESS_PERM events.

This is inconsistent.

After the patch,
readdir creates FAN_ACCESS_PERM and FAN_ACCESS events.

Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofanotify: FAN_MARK_FLUSH: avoid having to provide a fake/invalid fd and path
Heinrich Schuchardt [Thu, 22 May 2014 00:42:24 +0000 (10:42 +1000)]
fanotify: FAN_MARK_FLUSH: avoid having to provide a fake/invalid fd and path

Originally from Tvrtko Ursulin (https://lkml.org/lkml/2011/1/12/112)

Avoid having to provide a fake/invalid fd and path when flushing marks

Currently for a group to flush marks it has set it needs to provide a fake
or invalid (but resolvable) file descriptor and path when calling
fanotify_mark.  This patch pulls the flush handling a bit up so file
descriptor and path are completely ignored when flushing.

I reworked the patch to be applicable again (the signature of fanotify_mark
has changed since Tvrtko's work).

Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/fscache: replace seq_printf by seq_puts
Fabian Frederick [Thu, 22 May 2014 00:42:24 +0000 (10:42 +1000)]
fs/fscache: replace seq_printf by seq_puts

Replace seq_printf where possible + coalesce formats from 2 existing
seq_puts

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/fscache: convert printk to pr_foo()
Fabian Frederick [Thu, 22 May 2014 00:42:24 +0000 (10:42 +1000)]
fs/fscache: convert printk to pr_foo()

-All printk converted to pr_foo() except internal.h: printk(KERN_DEBUG

-Coalesce formats.

-Add pr_fmt

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/jfs/super.c: remove 0 assignement to static + code clean-up
Fabian Frederick [Thu, 22 May 2014 00:42:24 +0000 (10:42 +1000)]
fs/jfs/super.c: remove 0 assignement to static + code clean-up

-Static values are automatically initialized to NULL
-Coalesce format fragments
-Remove unnecessary {}
-Small typo fixes
-Fix lines > 80 characters

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Dave Kleikamp <shaggy@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/jfs/jfs_logmgr.c: remove NULL assignment on static
Fabian Frederick [Thu, 22 May 2014 00:42:23 +0000 (10:42 +1000)]
fs/jfs/jfs_logmgr.c: remove NULL assignment on static

Static values are automatically initialized to NULL

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Dave Kleikamp <shaggy@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/cifs: remove obsolete __constant
Fabian Frederick [Thu, 22 May 2014 00:42:23 +0000 (10:42 +1000)]
fs/cifs: remove obsolete __constant

Replace all __constant_foo to foo() except in smb2status.h (1700 lines to
update).

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Steve French <sfrench@samba.org>
Cc: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/ceph/debugfs.c: replace seq_printf by seq_puts
Fabian Frederick [Thu, 22 May 2014 00:42:23 +0000 (10:42 +1000)]
fs/ceph/debugfs.c: replace seq_printf by seq_puts

Replace seq_printf where possible.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/ceph: replace pr_warning by pr_warn
Fabian Frederick [Thu, 22 May 2014 00:42:22 +0000 (10:42 +1000)]
fs/ceph: replace pr_warning by pr_warn

Update the last pr_warning callsites in fs branch

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agox86, mm: probe memory block size for generic x86 64bit
Yinghai Lu [Thu, 22 May 2014 00:42:22 +0000 (10:42 +1000)]
x86, mm: probe memory block size for generic x86 64bit

On system with 2TiB ram, current x86_64 have 128M as section size, and one
memory_block only include one section.  So will have 16400 entries under
/sys/devices/system/memory/.

Current code try to use block id to find block pointer in /sys
for any section, and reuse that block pointer. that finding will take some time
even after

| commit 7c243c7168dcc1bc2081fc0494923cd7cc808fb6
| Author: Russ Anderson <rja@sgi.com>
| Date:   Mon Apr 29 15:07:59 2013 -0700
|    mm: speedup in __early_pfn_to_nid

that will skip the search in that case during booting up.

So solution could be increase block size just like SGI UV system did.
(harded code to 2g).

This patch is trying to probe the block size to make it match mmio remap
size.  for example, Intel Nehalem later system will have memory range [0,
TOML), [4g, TOMH].  If the memory hole is 2g and total is 128g, TOM will
be 2g, and TOM2 will be 130g.

We could use 2g as block size instead of default 128M.  That will reduce
number of entries in /sys/devices/system/memory/

On system 6TiB system will reduce boot time by 35 seconds.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agox86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels -fix 2
Mel Gorman [Thu, 22 May 2014 00:42:22 +0000 (10:42 +1000)]
x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels -fix 2

powerpc has NUMA_BALANCING and non-NUMA_BALANCING versions of pte_present
and I missed that when testing cross-compiling.  This patch replaces
x86-define-_page_numa-by-reusing-software-bits-on-the-pmd-and-pte-levels-fix.patch

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agox86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels
Mel Gorman [Thu, 22 May 2014 00:42:22 +0000 (10:42 +1000)]
x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels

_PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
faults on x86.  Care is taken such that _PAGE_NUMA is used only in
situations where the VMA flags distinguish between NUMA hinting faults and
prot_none faults.  This decision was x86-specific and conceptually it is
difficult requiring special casing to distinguish between PROTNONE and
NUMA ptes based on context.

Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
between an entry that is really unmapped and a page that is protected for
NUMA hinting faults as if the PTE is not present then a fault will be
trapped.

Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.  This
patch shrinks the maximum possible swap size and uses the bit to uniquely
distinguish between NUMA hinting ptes and swap ptes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agox86: require x86-64 for automatic NUMA balancing
Mel Gorman [Thu, 22 May 2014 00:42:21 +0000 (10:42 +1000)]
x86: require x86-64 for automatic NUMA balancing

32-bit support for NUMA is an oddity on its own but with automatic NUMA
balancing on top there is a reasonable risk that the CPUPID information
cannot be stored in the page flags.  This patch removes support for
automatic NUMA support on 32-bit x86.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agotools/vm/page-types.c: catch sigbus if raced with truncate
Konstantin Khlebnikov [Thu, 22 May 2014 00:42:21 +0000 (10:42 +1000)]
tools/vm/page-types.c: catch sigbus if raced with truncate

Recently added page-cache dumping is known to be a little bit racy.
But after race with truncate it just dies due to unhandled SIGBUS
when it tries to poke pages beyond the new end of file.
This patch adds handler for SIGBUS which skips the rest of the file.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoMAINTAINERS: add closing angle bracket to Vince Bridgers' email address
Tobias Klauser [Thu, 22 May 2014 00:42:21 +0000 (10:42 +1000)]
MAINTAINERS: add closing angle bracket to Vince Bridgers' email address

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Vince Bridgers <vbridgers2013@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoDocumentation: fix DOCBOOKS=... building
Johannes Berg [Thu, 22 May 2014 00:42:21 +0000 (10:42 +1000)]
Documentation: fix DOCBOOKS=... building

Prior to
commit 4266129964b8238526936d723de65b419d8069c6
Author: Mauro Carvalho Chehab <mchehab@redhat.com>
Date:   Tue May 31 16:27:44 2011 -0300

    [media] DocBook: Move all media docbook stuff into its own directory

it was possible to build only a single (or more) book(s) by calling, for
example

make htmldocs DOCBOOKS=80211.xml

This now fails:
cp: target `.../Documentation/DocBook//media_api' is not a directory

Ignore errors from that copy to make this possible again.

Fixes: 4266129964b8 ("[media] DocBook: Move all media docbook stuff into its own directory")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoocfs2: fix double kmem_cache_destroy in dlm_init
Joseph Qi [Thu, 22 May 2014 00:42:20 +0000 (10:42 +1000)]
ocfs2: fix double kmem_cache_destroy in dlm_init

In dlm_init, if create dlm_lockname_cache failed in
dlm_init_master_caches, it will destroy dlm_lockres_cache which created
before twice.  And this will cause system die when loading modules.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/memory-failure.c: fix memory leak by race between poison and unpoison
Naoya Horiguchi [Thu, 22 May 2014 00:42:20 +0000 (10:42 +1000)]
mm/memory-failure.c: fix memory leak by race between poison and unpoison

When a memory error happens on an in-use page or (free and in-use)
hugepage, the victim page is isolated with its refcount set to one.  When
you try to unpoison it later, unpoison_memory() calls put_page() for it
twice in order to bring the page back to free page pool (buddy or free
hugepage list.) However, if another memory error occurs on the page which
we are unpoisoning, memory_failure() returns without releasing the
refcount which was incremented in the same call at first, which results in
memory leak and unconsistent num_poisoned_pages statistics.  This patch
fixes it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: <stable@vger.kernel.org> [2.6.32+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomemcg: fix swapcache charge from kernel thread context
Michal Hocko [Thu, 22 May 2014 00:42:20 +0000 (10:42 +1000)]
memcg: fix swapcache charge from kernel thread context

284f39afeaa4 (mm: memcg: push !mm handling out to page cache charge
function) explicitly checks for page cache charges without any mm context
(from kernel thread context[1]).

This seemed to be the only possible case where memory could be charged
without mm context so 03583f1a631c (memcg: remove unnecessary !mm check
from try_get_mem_cgroup_from_mm()) removed the mm check from
get_mem_cgroup_from_mm.  This however caused another NULL ptr dereference
during early boot when loopback kernel thread splices to tmpfs as reported
by Stephan Kulow:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000360
IP: [<ffffffff81196aab>] get_mem_cgroup_from_mm.isra.42+0x2b/0x60
PGD 5082067 PUD 83c3067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in: btrfs dm_multipath dm_mod scsi_dh multipath raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod parport_pc parport nls_utf8 isofs usb_storage iscsi_ibft iscsi_boot_sysfs arc4 ecb fan thermal nfs lockd fscache nls_iso8859_1 nls_cp437 sg st hid_generic usbhid af_packet sunrpc sr_mod cdrom ata_generic uhci_hcd virtio_net virtio_blk ehci_hcd usbcore ata_piix floppy processor button usb_common virtio_pci virtio_ring virtio edd squashfs loop
 ppa]
CPU: 0 PID: 97 Comm: loop1 Not tainted 3.15.0-rc5-5-default #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff880039b7a390 ti: ffff880038efe000 task.ti: ffff880038efe000
RIP: 0010:[<ffffffff81196aab>]  [<ffffffff81196aab>] get_mem_cgroup_from_mm.isra.42+0x2b/0x60
RSP: 0018:ffff880038effa40  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffea00001e5140 RCX: 0000000000000020
RDX: ffff88003c365020 RSI: ffffea00001e5140 RDI: 0000000000000360
RBP: ffff880038effa78 R08: 0000000000000ab3 R09: ffff880039572248
R10: 0000000000002ace R11: 0000000000000000 R12: 0000000000000010
R13: 0000000000000000 R14: ffff880038c72448 R15: 00000000fffffffe
FS:  00007fb0042ed880(0000) GS:ffff88003c000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000360 CR3: 0000000005e2b000 CR4: 00000000000006f0
Stack:
 ffffffff8119bae0 0000000000000000 0000000000000000 ffffea00001e5140
 0000000000000001 00000000ffffffef ffffffff8119c04b 0000000000000000
 ffff880038c722f8 0000000000000ab3 ffffffff8115129b 00000000000001d7
Call Trace:
 [<ffffffff8119bae0>] __mem_cgroup_try_charge_swapin+0x40/0xe0
 [<ffffffff8119c04b>] mem_cgroup_charge_file+0x8b/0xd0
 [<ffffffff8115129b>] shmem_getpage_gfp+0x66b/0x7b0
 [<ffffffff811518cf>] shmem_file_splice_read+0x18f/0x430
 [<ffffffff811ceff2>] splice_direct_to_actor+0xa2/0x1c0
 [<ffffffffa00019ea>] do_lo_receive+0x5a/0x60 [loop]
 [<ffffffffa0002158>] loop_thread+0x298/0x720 [loop]
 [<ffffffff810778d6>] kthread+0xc6/0xe0
 [<ffffffff815c0dbc>] ret_from_fork+0x7c/0xb0
Code: 66 66 66 66 90 eb 24 66 0f 1f 84 00 00 00 00 00 f6 40 48 01 75 3a 48 8b 50 18 f6 c2 03 75 32 65 ff 02 ba 01 00 00 00 84 d2 75 25 <48> 8b 07 48 85 c0 74 10 48 8b 80 70 08 00 00 48 8b 40 60 48 85
RIP  [<ffffffff81196aab>] get_mem_cgroup_from_mm.isra.42+0x2b/0x60
 RSP <ffff880038effa40>
CR2: 0000000000000360

Also Branimir Maksimovic reported the following oops which is tiggered
for the swapcache charge path from the accounting code for kernel threads:

[  388.522494] CPU: 1 PID: 160 Comm: kworker/u8:5 Tainted: P           OE 3.15.0-rc5-core2-custom #159
[  388.522496] Hardware name: System manufacturer System Product Name/MAXIMUSV GENE, BIOS 1903 08/19/2013
[  388.522498] task: ffff880404e349b0 ti: ffff88040486a000 task.ti: ffff88040486a000
[  388.522500] RIP: 0010:[<ffffffff81185b0b>] [<ffffffff81185b0b>] get_mem_cgroup_from_mm.isra.42+0x2b/0x60
[  388.522504] RSP: 0000:ffff88040486bab8  EFLAGS: 00010246
[  388.522506] RAX: 0000000000000000 RBX: ffffea000a416340 RCX: 0000000000000a40
[  388.522508] RDX: ffff88041efe8a40 RSI: ffffea000a416340 RDI: 0000000000000340
[  388.522509] RBP: ffff88040486bab8 R08: 000000000001cb56 R09: 0000000000072d5a
[  388.522511] R10: 0000000000000000 R11: 0000000000000005 R12: ffff88040486bb00
[  388.522512] R13: 00000000000000d0 R14: 0000000000000000 R15: ffff8803f3fe82f8
[  388.522515] FS:  0000000000000000(0000) GS:ffff88041ec80000(0000) knlGS:0000000000000000
[  388.522517] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  388.522518] CR2: 0000000000000340 CR3: 00000003ee44d000 CR4: 00000000001407e0
[  388.522520] Stack:
[  388.522521]  ffff88040486baf0 ffffffff8118abf5 ffffffff8112ce1a 0000000000000000
[  388.522524]  ffffea000a416340 0000000000000003 00000000ffffffef ffff88040486bb18
[  388.522527]  ffffffff8118b1cc ffff88040486baf8 000000000001cb56 0000000000000000
[  388.522530] Call Trace:
[  388.522536]  [<ffffffff8118abf5>] __mem_cgroup_try_charge_swapin+0x45/0xf0
[  388.522539]  [<ffffffff8112ce1a>] ? __lock_page+0x6a/0x70
[  388.522543]  [<ffffffff8118b1cc>] mem_cgroup_charge_file+0x9c/0xe0
[  388.522548]  [<ffffffff8114599c>] shmem_getpage_gfp+0x62c/0x770
[  388.522552]  [<ffffffff81145b18>] shmem_write_begin+0x38/0x40
[  388.522555]  [<ffffffff8112d1c5>] generic_perform_write+0xc5/0x1c0
[  388.522559]  [<ffffffff811ad53a>] ? file_update_time+0x8a/0xd0
[  388.522563]  [<ffffffff8112f211>] __generic_file_aio_write+0x1d1/0x3f0
[  388.522567]  [<ffffffff81084fc1>] ? enqueue_entity+0x291/0xb90
[  388.522570]  [<ffffffff8112f47f>] generic_file_aio_write+0x4f/0xc0
[  388.522574]  [<ffffffff81192eaa>] do_sync_write+0x5a/0x90
[  388.522578]  [<ffffffff810c53c1>] do_acct_process+0x4b1/0x550
[  388.522582]  [<ffffffff810c5acd>] acct_process+0x6d/0xa0
[  388.522587]  [<ffffffff810667d0>] ? manage_workers.isra.25+0x2a0/0x2a0
[  388.522590]  [<ffffffff8104d937>] do_exit+0x827/0xa70
[  388.522594]  [<ffffffff8106699e>] ? worker_thread+0x1ce/0x3a0
[  388.522597]  [<ffffffff810667d0>] ? manage_workers.isra.25+0x2a0/0x2a0
[  388.522600]  [<ffffffff8106cad3>] kthread+0xc3/0xf0
[  388.522604]  [<ffffffff8106ca10>] ? kthread_create_on_node+0x180/0x180
[  388.522608]  [<ffffffff816bfe6c>] ret_from_fork+0x7c/0xb0
[  388.522611]  [<ffffffff8106ca10>] ? kthread_create_on_node+0x180/0x180

This patch fixes the issue by reintroducing mm check into
get_mem_cgroup_from_mm.  We could do the same trick in
__mem_cgroup_try_charge_swapin as we do for the regular page cache path
but it is not worth troubles.  The check is not that expensive and it is
better to have get_mem_cgroup_from_mm more robust.

[1] - http://marc.info/?l=linux-mm&m=139463617808941&w=2

Fixes: 03583f1a631c ("memcg: remove unnecessary !mm check from try_get_mem_cgroup_from_mm()")
Reported-and-tested-by: Stephan Kulow <coolo@suse.com>
Reported-by: Branimir Maksimovic <branimir.maksimovic@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: madvise: fix MADV_WILLNEED on shmem swapouts
Johannes Weiner [Thu, 22 May 2014 00:42:20 +0000 (10:42 +1000)]
mm: madvise: fix MADV_WILLNEED on shmem swapouts

MADV_WILLNEED currently does not read swapped out shmem pages back in.

0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix
trees") made find_get_page() filter exceptional radix tree entries but
failed to convert all find_get_page() callers that WANT exceptional
entries over to find_get_entry().  One of them is shmem swap readahead in
madvise, which now skips over any swap-out records.

Convert it to find_get_entry().

Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT
Jens Axboe [Thu, 22 May 2014 00:42:19 +0000 (10:42 +1000)]
mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT

In some testing I ran today (some fio jobs that spread over two nodes), we
end up spending 40% of the time in filemap_check_errors().  That smells
fishy.  Looking further, this is basically what happens:

blkdev_aio_read()
    generic_file_aio_read()
        filemap_write_and_wait_range()
            if (!mapping->nr_pages)
                filemap_check_errors()

and filemap_check_errors() always attempts two test_and_clear_bit() on the
mapping flags, thus dirtying it for every single invocation.  The patch
below tests each of these bits before clearing them, avoiding this issue.
In my test case (4-socket box), performance went from 1.7M IOPS to 4.0M
IOPS.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoHWPOSION, hugetlb: lock_page/unlock_page does not match for handling a free hugepage
Chen Yucong [Thu, 22 May 2014 00:42:19 +0000 (10:42 +1000)]
HWPOSION, hugetlb: lock_page/unlock_page does not match for handling a free hugepage

For handling a free hugepage in memory failure, the race will happen if
another thread hwpoisoned this hugepage concurrently.  So we need to check
PageHWPoison instead of !PageHWPoison.

If hwpoison_filter(p) returns true or a race happens, then we need to
unlock_page(hpage).

Signed-off-by: Chen Yucong <slaoub@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: <stable@vger.kernel.org> [2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoMerge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab...
Linus Torvalds [Wed, 21 May 2014 10:01:08 +0000 (19:01 +0900)]
Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:
 "Most of the changes are drivers fixes (rtl28xuu, fc2580, ov7670,
  davinci, gspca, s5p-fimc and s5c73m3).

  There is also a compat32 fix and one infoleak fixup at the media
  controller"

* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  [media] V4L2: fix VIDIOC_CREATE_BUFS in 64- / 32-bit compatibility mode
  [media] V4L2: ov7670: fix a wrong index, potentially Oopsing the kernel from user-space
  [media] media-device: fix infoleak in ioctl media_enum_entities()
  [media] fc2580: fix tuning failure on 32-bit arch
  [media] Prefer gspca_sonixb over sn9c102 for all devices
  [media] media: davinci: vpfe: make sure all the buffers unmapped and released
  [media] staging: media: davinci: vpfe: make sure all the buffers are released
  [media] media: davinci: vpbe_display: fix releasing of active buffers
  [media] media: davinci: vpif_display: fix releasing of active buffers
  [media] media: davinci: vpif_capture: fix releasing of active buffers
  [media] s5p-fimc: Fix YUV422P depth
  [media] s5c73m3: Add missing rename of v4l2_of_get_next_endpoint() function
  [media] rtl28xxu: silence error log about disabled rtl2832_sdr module
  [media] rtl28xxu: do not hard depend on staging SDR module