Up to this point we tried to keep patchset bisectable, but next patches
are going to change how core of THP refcounting work.
It would be beneficial to split the change into several patches and make
it more reviewable. Unfortunately, I don't see how we can achieve that
while keeping THP working.
Let's hide THP under CONFIG_BROKEN for now and bring it back when new
refcounting get established.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The patch replaces THP_SPLIT with tree events: THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD. It reflects the fact that we are
going to be able split PMD without the compound page and that
split_huge_page() can fail.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Christoph Lameter <cl@linux.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
thp, mlock: do not allow huge pages in mlocked area
With new refcounting THP can belong to several VMAs. This makes tricky to
track THP pages, when they partially mlocked. It can lead to leaking
mlocked pages to non-VM_LOCKED vmas and other problems.
With this patch we will split all pages on mlock and avoid
fault-in/collapse new THP in VM_LOCKED vmas.
I've tried alternative approach: do not mark THP pages mlocked and keep
them on normal LRUs. This way vmscan could try to split huge pages on
memory pressure and free up subpages which doesn't belong to VM_LOCKED
vmas. But this is user-visible change: we screw up Mlocked accouting
reported in meminfo, so I had to leave this approach aside.
We can bring something better later, but this should be good enough for
now.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton
With new refcounting we are going to see THP tail pages mapped with PTE.
Generic fast GUP rely on page_cache_get_speculative() to obtain reference
on page. page_cache_get_speculative() always fails on tail pages, because
->_count on tail pages is always zero.
Let's handle tail pages in gup_pte_range().
New split_huge_page() will rely on migration entries to freeze page's
counts. Recheck PTE value after page_cache_get_speculative() on head page
should be enough to serialize against split.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm, thp: adjust conditions when we can reuse the page on WP fault
With new refcounting we will be able map the same compound page with PTEs
and PMDs. It requires adjustment to conditions when we can reuse the page
on write-protection fault.
For PTE fault we can't reuse the page if it's part of huge page.
For PMD we can only reuse the page if nobody else maps the huge page or
it's part. We can do it by checking page_mapcount() on each sub-page, but
it's expensive.
The cheaper way is to check page_count() to be equal 1: every mapcount
takes page reference, so this way we can guarantee, that the PMD is the
only mapping.
This approach can give false negative if somebody pinned the page, but
that doesn't affect correctness.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As with rmap, with new refcounting we cannot rely on PageTransHuge() to
check if we need to charge size of huge page form the cgroup. We need to
get information from caller to know whether it was mapped with PMD or PTE.
We do uncharge when last reference on the page gone. At that point if we
see PageTransHuge() it means we need to unchange whole huge page.
The tricky part is partial unmap -- when we try to unmap part of huge
page. We don't do a special handing of this situation, meaning we don't
uncharge the part of huge page unless last user is gone or
split_huge_page() is triggered. In case of cgroup memory pressure happens
the partial unmapped page will be split through shrinker. This should be
good enough.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We're going to allow mapping of individual 4k pages of THP compound page.
It means we cannot rely on PageTransHuge() check to decide if map/unmap
small page or THP.
The patch adds new argument to rmap functions to indicate whether we want
to operate on whole compound page or only the small page.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The goal of this patchset is to make refcounting on THP pages cheaper with
simpler semantics and allow the same THP compound page to be mapped with
PMD and PTEs. This is required to get reasonable THP-pagecache
implementation.
With the new refcounting design it's much easier to protect against
split_huge_page(): simple reference on a page will make you the deal. It
makes gup_fast() implementation simpler and doesn't require special-case
in futex code to handle tail THP pages.
It should improve THP utilization over the system since splitting THP in
one process doesn't necessary lead to splitting the page in all other
processes have the page mapped.
The patchset drastically lower complexity of get_page()/put_page()
codepaths. I encourage people look on this code before-and-after to
justify time budget on reviewing this patchset.
This patch (of 37):
With new refcounting all subpages of the compound page are not necessary
have the same mapcount. We need to take into account mapcount of every
sub-page.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We don't define meaning of page->mapping for tail pages. Currently it's
always NULL, which can be inconsistent with head page and potentially lead
to problems.
Let's poison the pointer to catch all illigal uses.
page_rmapping(), page_mapping() and page_anon_vma() are changed to look on
head page.
The only illegal use I've caught so far is __GPF_COMP pages from sound
subsystem, mapped with PTEs. do_shared_fault() is changed to use
page_rmapping() instead of direct access to fault_page->mapping.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb: clear PG_reserved before setting PG_head on gigantic pages
PF_NO_COMPOUND for PG_reserved assumes we don't use PG_reserved for
compound pages. And we generally don't. But during allocation of
gigantic pages we set PG_head before clearing PG_reserved and
__ClearPageReserved() steps on the VM_BUG_ON_PAGE().
The fix is trivial: set PG_head after PG_reserved.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
page-flags: define behavior of FS/IO-related flags on compound pages
It seems we don't have compound page on FS/IO path currently.
Use PF_NO_COMPOUND to catch if we have.
The odd exception is PG_dirty: sound uses compound pages and maps them
with PTEs. PF_NO_COMPOUND triggers VM_BUG_ON() in set_page_dirty() on
handling shared fault. Let's use PF_HEAD for PG_dirty.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
page-flags: define PG_locked behavior on compound pages
lock_page() must operate on the whole compound page. It doesn't make much
sense to lock part of compound page. Change code to use head page's
PG_locked, if tail page is passed.
This patch also gets rid of custom helper functions -- __set_page_locked()
and __clear_page_locked(). They are replaced with helpers generated by
__SETPAGEFLAG/__CLEARPAGEFLAG. Tail pages to these helper would trigger
VM_BUG_ON().
SLUB uses PG_locked as a bit spin locked. IIUC, tail pages should never
appear there. VM_BUG_ON() is added to make sure that this assumption is
correct.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
page-flags: do not corrupt caller 'page' in PF_NO_TAIL
Andrew noticed that PF_NO_TAIL() modifies caller's 'page'. This doesn't
trigger any bad results, because all users are inline functions which
doesn't use the variable beyond the point. But still not good.
The patch changes PF_NO_TAIL() to always return head page, regardless
'enforce'. This makes operations of page flags with PF_NO_TAIL more
symmetrical: modifications and checks goes to head page. It gives better
chance to recover in case of bug for non-DEBUG_VM kernel.
DEBUG_VM kernel will still trigger VM_BUG_ON() on modifications to tail
pages.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch adds a third argument to macros which create function
definitions for page flags. This argument defines how page-flags helpers
behave on compound functions.
For now we define four policies:
- PF_ANY: the helper function operates on the page it gets, regardless
if it's non-compound, head or tail.
- PF_HEAD: the helper function operates on the head page of the compound
page if it gets tail page.
- PF_NO_TAIL: only head and non-compond pages are acceptable for this
helper function.
- PF_NO_COMPOUND: only non-compound pages are acceptable for this helper
function.
For now we use policy PF_ANY for all helpers, which matches current
behaviour.
We do not enforce the policy for TESTPAGEFLAG, because we have flags
checked for random pages all over the kernel. Noticeable exception to
this is PageTransHuge() which triggers VM_BUG_ON() for tail page.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stephen Rothwell [Wed, 21 Oct 2015 22:03:30 +0000 (09:03 +1100)]
mm-use-unsigned-int-for-page-order-fix
fix build (type of pageblock_order)
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: avoid false-positive PageTail() during meminit
Since compound_head() rework we encode PageTail() into bit 0 of
page->lru.next (aka page->compound_head). We need to make sure that
page->lru is initialized before first use of compound_head() or
PageTail().
My page-flags patchset makes sure that we don't use PG_reserved on
compound pages. That means we have PageTail() check as eary as in
SetPageReserved() in reserve_bootmem_region()
Let's initialize page->lru before that to avoid false positive from
PageTail().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The race is pure theoretical. I don't it's possible to trigger it in
practice. But who knows.
We can fix the race by changing how encode PageTail() and compound_head()
within struct page to be able to update them in one shot.
The patch introduces page->compound_head into third double word block in
front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
the rest bits are pointer to head page if bit zero is set.
The patch moves page->pmd_huge_pte out of word, just in case if an
architecture defines pgtable_t into something what can have the bit 0
set.
hugetlb_cgroup uses page->lru.next in the second tail page to store
pointer struct hugetlb_cgroup. The patch switch it to use page->private
in the second tail page instead. The space is free since ->first_page is
removed from the union.
The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
limitation, since there's now space in first tail page to store struct
hugetlb_cgroup pointer. But that's out of scope of the patch.
That means page->compound_head shares storage space with:
That's too long list to be absolutely sure, but looks like nobody uses
bit 0 of the word.
page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use
call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future
call_rcu_lazy() is not allowed as it makes use of the bit and we can
get false positive PageTail().
mm: pack compound_dtor and compound_order into one word in struct page
The patch halves space occupied by compound_dtor and compound_order in
struct page.
For compound_order, it's trivial long -> short conversion.
For get_compound_page_dtor(), we now use hardcoded table for destructor
lookup and store its index in the struct page instead of direct pointer
to destructor. It shouldn't be a big trouble to maintain the table: we
have only two destructor and NULL currently.
This patch free up one word in tail pages for reuse. This is preparation
for the next patch.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Each `struct size_class' contains `struct zs_size_stat': an array of
NR_ZS_STAT_TYPE `unsigned long'. For zsmalloc built with no
CONFIG_ZSMALLOC_STAT this results in a waste of `2 * sizeof(unsigned
long)' per-class.
The patch removes unneeded `struct zs_size_stat' members by redefining
NR_ZS_STAT_TYPE (max stat idx in array).
Since both NR_ZS_STAT_TYPE and zs_stat_type are compile time constants,
GCC can eliminate zs_stat_inc()/zs_stat_dec() calls that use zs_stat_type
larger than NR_ZS_STAT_TYPE: CLASS_ALMOST_EMPTY and CLASS_ALMOST_FULL at
the moment.
./scripts/bloat-o-meter mm/zsmalloc.o.old mm/zsmalloc.o.new
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-39 (-39)
function old new delta
fix_fullness_group 97 94 -3
insert_zspage 100 86 -14
remove_zspage 141 119 -22
To summarize:
a) each class now uses less memory
b) we avoid a number of dec/inc stats (a minor optimization,
but still).
The gain will increase once we introduce additional stats.
zsmalloc: don't test shrinker_enabled in zs_shrinker_count()
We don't let user to disable shrinker in zsmalloc (once it's been
enabled), so no need to check ->shrinker_enabled in zs_shrinker_count(),
at the moment at least.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit c60369f01125 ("staging: zsmalloc: prevent mappping in interrupt
context") added in_interrupt() check to zs_map_object() and 'hardirq.h'
include; but in_interrupt() macro is defined in 'preempt.h' not in
'hardirq.h', so include it instead.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hui Zhu [Wed, 21 Oct 2015 22:03:28 +0000 (09:03 +1100)]
zsmalloc: fix obj_to_head use page_private(page) as value but not pointer
In obj_malloc():
if (!class->huge)
/* record handle in the header of allocated chunk */
link->handle = handle;
else
/* record handle in first_page->private */
set_page_private(first_page, handle);
In the hugepage we save handle to private directly.
But in obj_to_head():
if (class->huge) {
VM_BUG_ON(!is_first_page(page));
return *(unsigned long *)page_private(page);
} else
return *(unsigned long *)obj;
It is used as a pointer.
The reason why there is no problem until now is huge-class page is born
with ZS_FULL so it can't be migrated. However, we need this patch for
future work: "VM-aware zsmalloced page migration" to reduce external
fragmentation.
Signed-off-by: Hui Zhu <zhuhui@xiaomi.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Make the return type of zpool_get_type const; the string belongs to the
zpool driver and should not be modified. Remove the redundant type field
in the struct zpool; it is private to zpool.c and isn't needed since
->driver->type can be used directly. Add comments indicating strings must
be null-terminated.
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Streetman [Wed, 21 Oct 2015 22:03:27 +0000 (09:03 +1100)]
zswap: use charp for zswap param strings
Instead of using a fixed-length string for the zswap params, use charp.
This simplifies the code and uses less memory, as most zswap param strings
will be less than the current maximum length.
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zram: keep the exact overcommited value in mem_used_max
`mem_used_max' is designed to store the max amount of memory zram consumed
to store the data. However, it does not represent the actual
'overcommited' (max) value. The existing code goes to -ENOMEM
overcommited case before it updates `->stats.max_used_pages', which hides
the reason we went to -ENOMEM in the first place -- we actually used more
memory than `->limit_pages':
alloced_pages = zs_get_total_pages(meta->mem_pool);
if (zram->limit_pages && alloced_pages > zram->limit_pages) {
zs_free(meta->mem_pool, handle);
ret = -ENOMEM;
goto out;
}
update_used_max(zram, alloced_pages);
Which is misleading. User will see -ENOMEM, check `->limit_pages', check
`->stats.max_used_pages', which will keep the value BEFORE zram passed
`->limit_pages', and see:
`->stats.max_used_pages' < `->limit_pages'
Move update_used_max() before we do `->limit_pages' check, so that
user will see:
`->stats.max_used_pages' > `->limit_pages'
should the overcommit and -ENOMEM happen.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When the user supplies an unsupported compression algorithm, keep the
previously selected one (knowingly supported) or the default one (if the
compression algorithm hasn't been changed yet).
Note that previously this operation (i.e. setting an invalid algorithm)
would result in no algorithm being selected, which means that this
represents a small change in the default behaviour.
Minchan said:
For initializing zram, we need to set up 3 optional parameters in advance.
1. the number of compression streams
2. memory limitation
3. compression algorithm
Although user pass completely wrong value to set up for 1 and 2
parameters, it's okay because they have default value so zram will be
initialized with the default value (of course, when user passes a wrong
value via *echo*, sysfs returns -EINVAL so the user can notice it).
But 3 is not consistent with other optional parameters. IOW, if the
user passes a wrong value to set up 3 parameter, zram's initialization
would fail unlike other optional parameters.
So this patch makes them consistent.
Signed-off-by: Luis Henriques <luis.henriques@canonical.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Eric B Munson [Wed, 21 Oct 2015 22:03:25 +0000 (09:03 +1100)]
selftests: vm: add tests for lock on fault
Test the mmap() flag, and the mlockall() flag. These tests ensure that
pages are not faulted in until they are accessed, that the pages are
unevictable once faulted in, and that VMA splitting and merging works with
the new VM flag. The second test ensures that mlock limits are respected.
Note that the limit test needs to be run a normal user.
Also add tests to use the new mlock2 family of system calls.
[treding@nvidia.com: : Fix mlock2-tests for 32-bit architectures]
[treding@nvidia.com: ensure the mlock2 syscall number can be found]
[treding@nvidia.com: use the right arguments for main()] Signed-off-by: Eric B Munson <emunson@akamai.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Eric B Munson [Wed, 21 Oct 2015 22:03:25 +0000 (09:03 +1100)]
mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
The previous patch introduced a flag that specified pages in a VMA should
be placed on the unevictable LRU, but they should not be made present when
the area is created. This patch adds the ability to set this state via
the new mlock system calls.
We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED.
MCL_ONFAULT should be used as a modifier to the two other mlockall flags.
When used with MCL_CURRENT, all current mappings will be marked with
VM_LOCKED | VM_LOCKONFAULT. When used with MCL_FUTURE, the mm->def_flags
will be marked with VM_LOCKED | VM_LOCKONFAULT. When used with both
MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be
marked with VM_LOCKED | VM_LOCKONFAULT.
Prior to this patch, mlockall() will unconditionally clear the
mm->def_flags any time it is called without MCL_FUTURE. This behavior is
maintained after adding MCL_ONFAULT. If a call to mlockall(MCL_FUTURE) is
followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and
new VMAs will be unlocked. This remains true with or without MCL_ONFAULT
in either mlockall() invocation.
munlock() will unconditionally clear both vma flags. munlockall()
unconditionally clears for VMA flags on all VMAs and in the mm->def_flags
field.
Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Eric B Munson [Wed, 21 Oct 2015 22:03:25 +0000 (09:03 +1100)]
mm: introduce VM_LOCKONFAULT
The cost of faulting in all memory to be locked can be very high when
working with large mappings. If only portions of the mapping will be used
this can incur a high penalty for locking.
For the example of a large file, this is the usage pattern for a large
statical language model (probably applies to other statical or graphical
models as well). For the security example, any application transacting in
data that cannot be swapped out (credit card data, medical records, etc).
This patch introduces the ability to request that pages are not
pre-faulted, but are placed on the unevictable LRU when they are finally
faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
flag for a VMA will cause pages faulted into that VMA to be added to the
unevictable LRU when they are faulted or if they are already present, but
will not cause any missing pages to be faulted in.
Exposing this new lock state means that we cannot overload the meaning of
the FOLL_POPULATE flag any longer. Prior to this patch it was used to
mean that the VMA for a fault was locked. This means we need the new
FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
will now only control if the VMA should be populated and in the case of
VM_LOCKONFAULT, it will not be set.
Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Eric B Munson [Wed, 21 Oct 2015 22:03:24 +0000 (09:03 +1100)]
mm: mlock: add new mlock system call
With the refactored mlock code, introduce a new system call for mlock.
The new call will allow the user to specify what lock states are being
added. mlock2 is trivial at the moment, but a follow on patch will add a
new mlock state making it useful.
Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Eric B Munson [Wed, 21 Oct 2015 22:03:24 +0000 (09:03 +1100)]
mm: mlock: refactor mlock, munlock, and munlockall code
mlock() allows a user to control page out of program memory, but this
comes at the cost of faulting in the entire mapping when it is allocated.
For large mappings where the entire area is not necessary this is not
ideal. Instead of forcing all locked pages to be present when they are
allocated, this set creates a middle ground. Pages are marked to be
placed on the unevictable LRU (locked) when they are first used, but they
are not faulted in by the mlock call.
This series introduces a new mlock() system call that takes a flags
argument along with the start address and size. This flags argument gives
the caller the ability to request memory be locked in the traditional way,
or to be locked after the page is faulted in. A new MCL flag is added to
mirror the lock on fault behavior from mlock() in mlockall().
There are two main use cases that this set covers. The first is the
security focussed mlock case. A buffer is needed that cannot be written
to swap. The maximum size is known, but on average the memory used is
significantly less than this maximum. With lock on fault, the buffer is
guaranteed to never be paged out without consuming the maximum size every
time such a buffer is created.
The second use case is focussed on performance. Portions of a large file
are needed and we want to keep the used portions in memory once accessed.
This is the case for large graphical models where the path through the
graph is not known until run time. The entire graph is unlikely to be
used in a given invocation, but once a node has been used it needs to stay
resident for further processing. Given these constraints we have a number
of options. We can potentially waste a large amount of memory by mlocking
the entire region (this can also cause a significant stall at startup as
the entire file is read in). We can mlock every page as we access them
without tracking if the page is already resident but this introduces large
overhead for each access. The third option is mapping the entire region
with PROT_NONE and using a signal handler for SIGSEGV to
mprotect(PROT_READ) and mlock() the needed page. Doing this page at a
time adds a significant performance penalty. Batching can be used to
mitigate this overhead, but in order to safely avoid trying to mprotect
pages outside of the mapping, the boundaries of each mapping to be used in
this way must be tracked and available to the signal handler. This is
precisely what the mm system in the kernel should already be doing.
For mlock(MLOCK_ONFAULT) the user is charged against RLIMIT_MEMLOCK as if
mlock(MLOCK_LOCKED) or mmap(MAP_LOCKED) was used, so when the VMA is
created not when the pages are faulted in. For mlockall(MCL_ONFAULT) the
user is charged as if MCL_FUTURE was used. This decision was made to keep
the accounting checks out of the page fault path.
To illustrate the benefit of this set I wrote a test program that mmaps a
5 GB file filled with random data and then makes 15,000,000 accesses to
random addresses in that mapping. The test program was run 20 times for
each setup. Results are reported for two program portions, setup and
execution. The setup phase is calling mmap and optionally mlock on the
entire region. For most experiments this is trivial, but it highlights
the cost of faulting in the entire region. Results are averages across
the 20 runs in milliseconds.
mmap with mlock(MLOCK_LOCKED) on entire range:
Setup avg: 8228.666
Processing avg: 8274.257
mmap with mlock(MLOCK_LOCKED) before each access:
Setup avg: 0.113
Processing avg: 90993.552
mmap with PROT_NONE and signal handler and batch size of 1 page:
With the default value in max_map_count, this gets ENOMEM as I attempt
to change the permissions, after upping the sysctl significantly I get:
Setup avg: 0.058
Processing avg: 69488.073
mmap with PROT_NONE and signal handler and batch size of 8 pages:
Setup avg: 0.068
Processing avg: 38204.116
mmap with PROT_NONE and signal handler and batch size of 16 pages:
Setup avg: 0.044
Processing avg: 29671.180
mmap with mlock(MLOCK_ONFAULT) on entire range:
Setup avg: 0.189
Processing avg: 17904.899
The signal handler in the batch cases faulted in memory in two steps to
avoid having to know the start and end of the faulting mapping. The first
step covers the page that caused the fault as we know that it will be
possible to lock. The second step speculatively tries to mlock and
mprotect the batch size - 1 pages that follow. There may be a clever way
to avoid this without having the program track each mapping to be covered
by this handeler in a globally accessible structure, but I could not find
it. It should be noted that with a large enough batch size this two step
fault handler can still cause the program to crash if it reaches far
beyond the end of the mapping.
These results show that if the developer knows that a majority of the
mapping will be used, it is better to try and fault it in at once,
otherwise mlock(MLOCK_ONFAULT) is significantly faster.
The performance cost of these patches are minimal on the two benchmarks I
have tested (stream and kernbench). The following are the average values
across 20 runs of stream and 10 runs of kernbench after a warmup run whose
results were discarded.
Avg throughput in MB/s from stream using 1000000 element arrays
Test 4.2-rc1 4.2-rc1+lock-on-fault
Copy: 10,566.5 10,421
Scale: 10,685 10,503.5
Add: 12,044.1 11,814.2
Triad: 12,064.8 11,846.3
Kernbench optimal load
4.2-rc1 4.2-rc1+lock-on-fault
Elapsed Time 78.453 78.991
User Time 64.2395 65.2355
System Time 9.7335 9.7085
Context Switches 22211.5 22412.1
Sleeps 14965.3 14956.1
This patch (of 6):
Extending the mlock system call is very difficult because it currently
does not take a flags argument. A later patch in this set will extend
mlock to support a middle ground between pages that are locked and faulted
in immediately and unlocked pages. To pave the way for the new system
call, the code needs some reorganization so that all the actual entry
point handles is checking input and translating to VMA flags.
Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Ryabinin [Wed, 21 Oct 2015 22:03:24 +0000 (09:03 +1100)]
kasan: always taint kernel on report
Currently we already taint the kernel in some cases. E.g. if we hit some
bug in slub memory we call object_err() which will taint the kernel with
TAINT_BAD_PAGE flag. But for other kind of bugs kernel left untainted.
Always taint with TAINT_BAD_PAGE if kasan found some bug. This is useful
for automated testing.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Ryabinin [Wed, 21 Oct 2015 22:03:24 +0000 (09:03 +1100)]
mm, slub, kasan: enable user tracking by default with KASAN=y
It's recommended to have slub's user tracking enabled with CONFIG_KASAN,
because:
a) User tracking disables slab merging which improves
detecting out-of-bounds accesses.
b) User tracking metadata acts as redzone which also improves
detecting out-of-bounds accesses.
c) User tracking provides additional information about object.
This information helps to understand bugs.
Currently it is not enabled by default. Besides recompiling the kernel
with KASAN and reinstalling it, user also have to change the boot cmdline,
which is not very handy.
Enable slub user tracking by default with KASAN=y, since there is no good
reason to not do this.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xishi Qiu [Wed, 21 Oct 2015 22:03:23 +0000 (09:03 +1100)]
kasan: use IS_ALIGNED in memory_is_poisoned_8()
Use IS_ALIGNED() to determine whether the shadow span two bytes. It
generates less code and more readable. Also add some comments in shadow
check functions.
the cause of the problem is the type conversion error in
*memory_is_poisoned_n* function. So this patch fix that.
Signed-off-by: Wang Long <long.wanglong@huawei.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wang Long [Wed, 21 Oct 2015 22:03:23 +0000 (09:03 +1100)]
lib: test_kasan: add some testcases
Add some out of bounds testcases to test_kasan module.
Signed-off-by: Wang Long <long.wanglong@huawei.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
WARNING: 'happend' may be misspelled - perhaps 'happened'?
#79: FILE: Documentation/kasan.txt:121:
+The header of the report discribe what kind of bug happend and what kind of
total: 0 errors, 1 warnings, 82 lines checked
./patches/kasan-various-fixes-in-documentation.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Wed, 21 Oct 2015 22:03:22 +0000 (09:03 +1100)]
kasan: accurately determine the type of the bad access
Makes KASAN accurately determine the type of the bad access. If the shadow
byte value is in the [0, KASAN_SHADOW_SCALE_SIZE) range we can look at
the next shadow byte to determine the type of the access.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Konstantin Serebryany <kcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Wed, 21 Oct 2015 22:03:22 +0000 (09:03 +1100)]
kasan: update reported bug types for kernel memory accesses
Update the names of the bad access types to better reflect the type of
the access that happended and make these error types "literals" that can
be used for classification and deduplication in scripts.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Konstantin Serebryany <kcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Wed, 21 Oct 2015 22:03:22 +0000 (09:03 +1100)]
kasan: update reported bug types for not user nor kernel memory accesses
Each access with address lower than
kasan_shadow_to_mem(KASAN_SHADOW_START) is reported as user-memory-access.
This is not always true, the accessed address might not be in user space.
Fix this by reporting such accesses as null-ptr-derefs or
wild-memory-accesses.
There's another reason for this change. For userspace ASan we have a
bunch of systems that analyze error types for the purpose of
classification and deduplication. Sooner of later we will write them to
KASAN as well. Then clearly and explicitly stated error types will bring
value.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Konstantin Serebryany <kcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When we end up calling kasan_report in real mode, our shadow mapping for
the spinlock variable will show poisoned. This will result in us calling
kasan_report_error with lock_report spin lock held. To prevent this
disable kasan reporting when we are priting error w.r.t kasan.
mm/kasan: don't use kasan shadow pointer in generic functions
We can't use generic functions like print_hex_dump to access kasan shadow
region. This require us to setup another kasan shadow region for the
address passed (kasan shadow address). Some architectures won't be able
to do that. Hence make a copy of the shadow region row and pass that to
generic functions.
mm/kasan: rename kasan_enabled() to kasan_report_enabled()
The function only disable/enable reporting. In the later patch we will be
adding a kasan early enable/disable. Rename kasan_enabled to properly
reflect its function.
Johannes Weiner [Wed, 21 Oct 2015 22:03:20 +0000 (09:03 +1100)]
mm: memcontrol: eliminate root memory.current
memory.current on the root level doesn't add anything that wouldn't be
more accurate and detailed using system statistics. It already doesn't
include slabs, and it'll be a pain to keep in sync when further memory
types are accounted in the memory controller. Remove it.
Note that this applies to the new unified hierarchy interface only.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Wed, 21 Oct 2015 22:03:20 +0000 (09:03 +1100)]
mm/hugetlb: unmap pages to remove if page fault raced with hole punch
Page faults can race with fallocate hole punch. If a page fault happens
between the unmap and remove operations, the page is not removed and
remains within the hole. This is not the desired behavior. If a page is
mapped, the remove operation (remove_inode_hugepages) will unmap the page
before removing. The unmap within remove_inode_hugepages occurs with the
hugetlb_fault_mutex held so that no other faults can occur until the page
is removed.
The (unmodified) routine hugetlb_vmdelete_list was moved ahead of
remove_inode_hugepages to satisfy the new reference.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Wed, 21 Oct 2015 22:03:20 +0000 (09:03 +1100)]
mm/hugetlb: page faults check for fallocate hole punch in progress and wait
At page fault time, check i_private which indicates a fallocate hole punch
is in progress. If the fault falls within the hole, wait for the hole
punch operation to complete before proceeding with the fault.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Wed, 21 Oct 2015 22:03:20 +0000 (09:03 +1100)]
mm/hugetlb: setup hugetlb_falloc during fallocate hole punch
When performing a fallocate hole punch, set up a hugetlb_falloc struct and
make i_private point to it. i_private will point to this struct for the
duration of the operation. At the end of the operation, wake up anyone
who faulted on the hole and is on the waitq.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Wed, 21 Oct 2015 22:03:20 +0000 (09:03 +1100)]
mm/hugetlb: define hugetlb_falloc structure for hole punch race
The hugetlbfs fallocate hole punch code can race with page faults. The
result is that after a hole punch operation, pages may remain within the
hole. No other side effects of this race were observed.
In preparation for adding userfaultfd support to hugetlbfs, it is
desirable to close the window of this race. This patch set starts by
using the same mechanism employed in shmem (see commit f00cdc6df7). This
greatly reduces the race window. However, it is still possible for the
race to occur.
The current hugetlbfs code to remove pages did not deal with pages that
were mapped (because of such a race). This patch set also adds code to
unmap pages in this rare case. This unmapping of a single page happens
under the hugetlb_fault_mutex, so it can not be faulted again until the
end of the operation.
This patch (of 4):
A hugetlb_falloc structure is pointed to by i_private during fallocate
hole punch operations. Page faults check this structure and if they are
in the hole, wait for the operation to finish.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dave Hansen [Wed, 21 Oct 2015 22:03:19 +0000 (09:03 +1100)]
mm, hugetlbfs: optimize when NUMA=n
My recent patch "mm, hugetlb: use memory policy when available" added some
bloat to hugetlb.o. This patch aims to get some of the bloat back,
especially when NUMA is not in play.
It does this with an implicit #ifdef and marking some things static that
should have been static in my first patch. It also makes the warnings
only VM_WARN_ON()s. They were responsible for a pretty big chunk of the
bloat.
Doing this gets our NUMA=n text size back to a wee bit _below_ where we
started before the original patch.
It also shaves a bit of space off the NUMA=y case, but not much.
Enforcing the mempolicy definitely takes some text and it's hard to avoid.
Dave Hansen [Wed, 21 Oct 2015 22:03:19 +0000 (09:03 +1100)]
mm, hugetlb: use memory policy when available
I have a hugetlbfs user which is never explicitly allocating huge pages
with 'nr_hugepages'. They only set 'nr_overcommit_hugepages' and then let
the pages be allocated from the buddy allocator at fault time.
This works, but they noticed that mbind() was not doing them any good and
the pages were being allocated without respect for the policy they
specified.
dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
But, it only grabs _existing_ huge pages from the huge page pool. If the
pool is empty, we fall back to alloc_buddy_huge_page() which obviously
can't do anything with the VMA's policy because it isn't even passed the
VMA.
Almost everybody preallocates huge pages. That's probably why nobody has
ever noticed this. Looking back at the git history, I don't think this
_ever_ worked from when alloc_buddy_huge_page() was introduced in 7893d1d5, 8 years ago.
The fix is to pass vma/addr down in to the places where we actually call
in to the buddy allocator. It's fairly straightforward plumbing. This
has been lightly tested.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rasmus Villemoes [Wed, 21 Oct 2015 22:03:19 +0000 (09:03 +1100)]
mm/maccess.c: actually return -EFAULT from strncpy_from_unsafe
As far as I can tell, strncpy_from_unsafe never returns -EFAULT. ret is
the result of a __copy_from_user_inatomic(), which is 0 for success and
positive (in this case necessarily 1) for access error - it is never
negative. So we were always returning the length of the, possibly
truncated, destination string.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 21 Oct 2015 22:03:19 +0000 (09:03 +1100)]
mm: migrate dirty page without clear_page_dirty_for_io etc
clear_page_dirty_for_io() has accumulated writeback and memcg subtleties
since v2.6.16 first introduced page migration; and the set_page_dirty()
which completed its migration of PageDirty, later had to be moderated to
__set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too.
No actual problems seen with this procedure recently, but if you look into
what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually
achieving, it turns out to be nothing more than moving the PageDirty flag,
and its NR_FILE_DIRTY stat from one zone to another.
It would be good to avoid a pile of irrelevant decrementations and
incrementations, and improper event counting, and unnecessary descent of
the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which
radix_tree_replace_slot() left in place anyway).
Do the NR_FILE_DIRTY movement, like the other stats movements, while
interrupts still disabled in migrate_page_move_mapping(); and don't even
bother if the zone is the same. Do the PageDirty movement there under
tree_lock too, where old page is frozen and newpage not yet visible:
bearing in mind that as soon as newpage becomes visible in radix_tree, an
un-page-locked set_page_dirty() might interfere (or perhaps that's just
not possible: anything doing so should already hold an additional
reference to the old page, preventing its migration; but play safe).
But we do still need to transfer PageDirty in migrate_page_copy(), for
those who don't go the mapping route through migrate_page_move_mapping().
Hugh Dickins [Wed, 21 Oct 2015 22:03:18 +0000 (09:03 +1100)]
mm: page migration avoid touching newpage until no going back
We have had trouble in the past from the way in which page migration's
newpage is initialized in dribs and drabs - see commit 8bdd63809160 ("mm:
fix direct reclaim writeback regression") which proposed a cleanup.
We have no actual problem now, but I think the procedure would be clearer
(and alternative get_new_page pools safer to implement) if we assert that
newpage is not touched until we are sure that it's going to be used -
except for taking the trylock on it in __unmap_and_move().
So shift the early initializations from move_to_new_page() into
migrate_page_move_mapping(), mapping and NULL-mapping paths. Similarly
migrate_huge_page_move_mapping(), but its NULL-mapping path can just be
deleted: you cannot reach hugetlbfs_migrate_page() with a NULL mapping.
Adjust stages 3 to 8 in the Documentation file accordingly.
Hugh Dickins [Wed, 21 Oct 2015 22:03:18 +0000 (09:03 +1100)]
mm: page migration use migration entry for swapcache too
Hitherto page migration has avoided using a migration entry for a
swapcache page mapped into userspace, apparently for historical reasons.
So any page blessed with swapcache would entail a minor fault when it's
next touched, which page migration otherwise tries to avoid. Swapcache in
an mlocked area is rare, so won't often matter, but still better fixed.
Just rearrange the block in try_to_unmap_one(), to handle TTU_MIGRATION
before checking PageAnon, that's all (apart from some reindenting).
Well, no, that's not quite all: doesn't this by the way fix a soft_dirty
bug, that page migration of a file page was forgetting to transfer the
soft_dirty bit? Probably not a serious bug: if I understand correctly,
soft_dirty afficionados usually have to handle file pages separately
anyway; but we publish the bit in /proc/<pid>/pagemap on file mappings as
well as anonymous, so page migration ought not to perturb it.
Hugh Dickins [Wed, 21 Oct 2015 22:03:18 +0000 (09:03 +1100)]
mm: simplify page migration's anon_vma comment and flow
__unmap_and_move() contains a long stale comment on page_get_anon_vma()
and PageSwapCache(), with an odd control flow that's hard to follow.
Mostly this reflects our confusion about the lifetime of an anon_vma, in
the early days of page migration, before we could take a reference to one.
Nowadays this seems quite straightforward: cut it all down to essentials.
I cannot see the relevance of swapcache here at all, so don't treat it any
differently: I believe the old comment reflects in part our anon_vma
confusions, and in part the original v2.6.16 page migration technique,
which used actual swap to migrate anon instead of swap-like migration
entries. Why should a swapcache page not be migrated with the aid of
migration entry ptes like everything else? So lose that comment now, and
enable migration entries for swapcache in the next patch.
Hugh Dickins [Wed, 21 Oct 2015 22:03:18 +0000 (09:03 +1100)]
mm: page migration remove_migration_ptes at lock+unlock level
Clean up page migration a little more by calling remove_migration_ptes()
from the same level, on success or on failure, from __unmap_and_move() or
from unmap_and_move_huge_page().
Don't reset page->mapping of a PageAnon old page in move_to_new_page(),
leave that to when the page is freed. Except for here in page migration,
it has been an invariant that a PageAnon (bit set in page->mapping) page
stays PageAnon until it is freed, and I think we're safer to keep to that.
And with the above rearrangement, it's necessary because zap_pte_range()
wants to identify whether a migration entry represents a file or an anon
page, to update the appropriate rss stats without waiting on it.
Hugh Dickins [Wed, 21 Oct 2015 22:03:18 +0000 (09:03 +1100)]
mm: page migration trylock newpage at same level as oldpage
Clean up page migration a little by moving the trylock of newpage from
move_to_new_page() into __unmap_and_move(), where the old page has been
locked. Adjust unmap_and_move_huge_page() and balloon_page_migrate()
accordingly.
But make one kind-of-functional change on the way: whereas trylock of
newpage used to BUG() if it failed, now simply return -EAGAIN if so.
Cutting out BUG()s is good, right? But, to be honest, this is really to
extend the usefulness of the custom put_new_page feature, allowing a pool
of new pages to be shared perhaps with racing uses.
Hugh Dickins [Wed, 21 Oct 2015 22:03:17 +0000 (09:03 +1100)]
mm: page migration use the put_new_page whenever necessary
I don't know of any problem from the way it's used in our current tree,
but there is one defect in page migration's custom put_new_page feature.
An unused newpage is expected to be released with the put_new_page(), but
there was one MIGRATEPAGE_SUCCESS (0) path which released it with
putback_lru_page(): which can be very wrong for a custom pool.
Fixed more easily by resetting put_new_page once it won't be needed, than
by adding a further flag to modify the rc test.