mtd: ofpart: Fix incorrect NULL check in parse_ofoldpart_partitions()
The pointer returned by kzalloc should be tested for NULL
to avoid potential NULL pointer dereference later. Incorrect
pointer was being tested for NULL. Bug introduced by commit fbcf62a3
(mtd: physmap_of: move parse_obsolete_partitions to become separate
parser).
This patch fixes this bug.
A combination of the following two commits caused a regression in 3.7-rc1
when identifying some Samsung NAND, so that some previously working NAND
were no longer detected properly:
Particularly, a regression was seen on Samsung K9F2G08U0B, with the
following full 8-byte READ ID string:
ec da 10 95 44 00 ec da
The basic problem is that Samsung manufactures both SLC and MLC NAND
that use a non-standard decoding table for deriving information from
their IDs. I have heuristically determined that all the chips that use
the new table have ID strings which wrap around after the 6th byte.
Unfortunately, I overlooked the fact that some older Samsung SLC (which
use a different decoding table) have "5 byte ID strings" which also wrap
around after the 6th byte.
This patch re-introduces a distinction between these old and new Samsung
NAND by checking that the 6th byte is non-zero, allowing both old and
new Samsung NAND to be detected properly.
Signed-off-by: Brian Norris <computersforpeace@gmail.com> Tested-by: Brian Norris <computersforpeace@gmail.com> Reported-by: Marek Vasut <marex@denx.de> Tested-by: Marek Vasut <marex@denx.de> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Linus Torvalds [Wed, 10 Oct 2012 01:51:35 +0000 (10:51 +0900)]
Merge tag 'for-linus-20121009' of git://git.infradead.org/mtd-2.6
Pull MTD updates from David Woodhouse:
- Disable broken mtdchar mmap() on MMU systems
- Additional ECC tests for NAND flash, and some test cleanups
- New NAND and SPI chip support
- Fixes/cleanup for SH FLCTL NAND controller driver
- Improved hardware support for GPMI NAND controller
- Conversions to device-tree support for various drivers
- Removal of obsolete drivers (sbc8xxx, bcmring, etc.)
- New LPC32xx drivers for MLC and SLC NAND
- Further cleanup of NAND OOB/ECC handling
- UAPI cleanup merge from David Howells (just moving files, since MTD
headers were sorted out long ago to separate user-visible from kernel
bits)
* tag 'for-linus-20121009' of git://git.infradead.org/mtd-2.6: (168 commits)
mtd: Disable mtdchar mmap on MMU systems
UAPI: (Scripted) Disintegrate include/mtd
mtd: nand: detect Samsung K9GBG08U0A, K9GAG08U0F ID
mtd: nand: decode Hynix MLC, 6-byte ID length
mtd: nand: increase max OOB size to 640
mtd: nand: add generic READ ID length calculation functions
mtd: nand: split simple ID decode into its own function
mtd: nand: split extended ID decoding into its own function
mtd: nand: split BB marker options decoding into its own function
mtd: nand: remove redundant ID read
mtd: nand: remove unnecessary variable
mtd: docg4: add missing HAS_IOMEM dependency
mtd: gpmi: initialize the timing registers only one time
mtd: gpmi: add EDO feature for imx6q
mtd: gpmi: do not set the default values for the extra clocks
mtd: gpmi: simplify the DLL setting code
mtd: gpmi: add a new field for HW_GPMI_CTRL1
mtd: gpmi: do not get the clock frequency in gpmi_begin()
mtd: gpmi: add a new field for HW_GPMI_TIMING1
mtd: add helpers to get the supportted ONFI timing mode
...
Linus Torvalds [Wed, 10 Oct 2012 01:49:20 +0000 (10:49 +0900)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
"This is a large pull, with the bulk of the updates coming from:
- Hole punching
- send/receive fixes
- fsync performance
- Disk format extension allowing more hardlinks inside a single
directory (btrfs-progs patch required to enable the compat bit for
this one)
I'm cooking more unrelated RAID code, but I wanted to make sure this
original batch makes it in. The largest updates here are relatively
old and have been in testing for some time."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (121 commits)
btrfs: init ref_index to zero in add_inode_ref
Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer
Btrfs: fix page leakage
Btrfs: do not warn_on when we cannot alloc a page for an extent buffer
Btrfs: don't bug on enomem in readpage
Btrfs: cleanup pages properly when ENOMEM in compression
Btrfs: make filesystem read-only when submitting barrier fails
Btrfs: detect corrupted filesystem after write I/O errors
Btrfs: make compress and nodatacow mount options mutually exclusive
btrfs: fix message printing
Btrfs: don't bother committing delayed inode updates when fsyncing
btrfs: move inline function code to header file
Btrfs: remove unnecessary IS_ERR in bio_readpage_error()
btrfs: remove unused function btrfs_insert_some_items()
Btrfs: don't commit instead of overcommitting
Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called
Btrfs: be smarter about dropping things from the tree log
Btrfs: don't lookup csums for prealloc extents
Btrfs: cache extent state when writing out dirty metadata pages
Btrfs: do not hold the file extent leaf locked when adding extent item
...
Linus Torvalds [Wed, 10 Oct 2012 01:48:32 +0000 (10:48 +0900)]
Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French.
* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
cifs: reinstate the forcegid option
Convert properly UTF-8 to UTF-16
[CIFS] WARN_ON_ONCE if kernel_sendmsg() returns -ENOSPC
David Woodhouse [Tue, 9 Oct 2012 14:08:10 +0000 (15:08 +0100)]
mtd: Disable mtdchar mmap on MMU systems
This code was broken because it assumed that all MTD devices were map-based.
Disable it for now, until it can be fixed properly for the next merge window.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Wang Sheng-Hui [Mon, 8 Oct 2012 13:26:15 +0000 (07:26 -0600)]
Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer
In csum_dirty_buffer, we first get eb from page->private.
Then we check if the page is the first page of eb. Later
we check it again. Remove the repeated check here.
Stefan Behrens [Wed, 1 Aug 2012 16:56:49 +0000 (18:56 +0200)]
Btrfs: make filesystem read-only when submitting barrier fails
So far the return code of barrier_all_devices() is ignored, which
means that errors are ignored. The result can be a corrupt
filesystem which is not consistent.
This commit adds code to evaluate the return code of
barrier_all_devices(). The normal btrfs_error() mechanism is used to
switch the filesystem into read-only mode when errors are detected.
In order to decide whether barrier_all_devices() should return
error or success, the number of disks that are allowed to fail the
barrier submission is calculated. This calculation accounts for the
worst RAID level of metadata, system and data. If single, dup or
RAID0 is in use, a single disk error is already considered to be
fatal. Otherwise a single disk error is tolerated.
The calculation of the number of disks that are tolerated to fail
the barrier operation is performed when the filesystem gets mounted,
when a balance operation is started and finished, and when devices
are added or removed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Andrei Popa [Thu, 20 Sep 2012 14:42:11 +0000 (08:42 -0600)]
Btrfs: make compress and nodatacow mount options mutually exclusive
If a filesystem is mounted with compression and then remounted by adding nodatacow,
the compression is disabled but the compress flag is still visible.
Also, if a filesystem is mounted with nodatacow and then remounted with compression,
nodatacow flag is still present but it's not active.
This patch:
- removes compress flags and notifies that the compression has been disabled if the
filesystem is mounted with nodatacow
- removes nodatacow and nodatasum flags if mounted with compress.
Robin Dong [Sat, 29 Sep 2012 08:07:47 +0000 (02:07 -0600)]
btrfs: move inline function code to header file
When building btrfs from kernel code, it will report:
fs/btrfs/extent_io.h:281: warning: 'extent_buffer_page' declared inline after being called
fs/btrfs/extent_io.h:281: warning: previous declaration of 'extent_buffer_page' was here
fs/btrfs/extent_io.h:280: warning: 'num_extent_pages' declared inline after being called
fs/btrfs/extent_io.h:280: warning: previous declaration of 'num_extent_pages' was here
because of the wrong declaration of inline functions.
Josef Bacik [Fri, 28 Sep 2012 20:04:19 +0000 (16:04 -0400)]
Btrfs: don't commit instead of overcommitting
I don't think we have the same problem that this was supposed to fix
originally since we can allocate chunks in the enospc path now. This code
is causing us to constantly commit the transaction as we get close to using
all of our available space in our currently allocated chunks, instead of
allocating another chunk and carrying on with life, which is not nice for
performance. Thanks,
Josef Bacik [Fri, 28 Sep 2012 15:56:28 +0000 (11:56 -0400)]
Btrfs: be smarter about dropping things from the tree log
When we truncate existing items in the tree log we've been searching for
each individual item and removing them. This is unnecessary churn and
searching, just keep track of the slot we are on and how many items we need
to delete and delete them all at once. Thanks,
Josef Bacik [Thu, 27 Sep 2012 21:07:30 +0000 (17:07 -0400)]
Btrfs: cache extent state when writing out dirty metadata pages
Everytime we write out dirty pages we search for an offset in the tree,
convert the bits in the state, and then when we wait we search for the
offset again and clear the bits. So for every dirty range in the io tree we
are doing 4 rb searches, which is suboptimal. With this patch we are only
doing 2 searches for every cycle (modulo weird things happening). Thanks,
Josef Bacik [Tue, 25 Sep 2012 19:26:16 +0000 (15:26 -0400)]
Btrfs: do not hold the file extent leaf locked when adding extent item
For some reason we unlock everything except the leaf we are on, set the path
blocking and then add the extent item for the extent we just finished
writing. I can't for the life of me figure out why we would want to do
this, and the history doesn't really indicate that there was a real reason
for it, so just remove it. This will reduce our tree lock contention on
heavy writes. Thanks,
Josef Bacik [Tue, 25 Sep 2012 18:25:58 +0000 (14:25 -0400)]
Btrfs: do not async metadata csumming in certain situations
There are a coule scenarios where farming metadata csumming off to an async
thread doesn't help. The first is if our processor supports crc32c, in
which case the csumming will be fast and so the overhead of the async model
is not worth the cost. The other case is for our tree log. We will be
making that stuff dirty and writing it out and waiting for it immediately.
Even with software crc32c this gives me a ~15% increase in speed with O_SYNC
workloads. Thanks,
Josef Bacik [Mon, 24 Sep 2012 17:42:00 +0000 (13:42 -0400)]
Btrfs: run delayed refs first when out of space
Running delayed refs is faster than running delalloc, so lets do that first
to try and reclaim space. This makes my fs_mark test about 20% faster.
Thanks,
I found there was a orphan transaction after the freeze operation was done.
It is because the transaction may not be committed when the transaction handle
end even though it is the last handle of the current transaction. This design
avoid committing the transaction frequently, but also introduce the above
problem.
So I add btrfs_attach_transaction() which can catch the current transaction
and commit it. If there is no transaction, it will return ENOENT, and do not
anything.
This function also can be used to instead of btrfs_join_transaction_freeze()
because it don't increase the writer counter and don't start a new transaction,
so it also can fix the deadlock between sync and freeze.
Besides that, it is used to instead of btrfs_join_transaction() in
transaction_kthread(), because if there is no transaction, the transaction
kthread needn't anything.
Btrfs: add a type field for the transaction handle
This patch add a type field into the transaction handle structure,
in this way, we needn't implement various end-transaction functions
and can make the code more simple and readable.
Mark Fasheh [Wed, 8 Aug 2012 18:33:54 +0000 (11:33 -0700)]
btrfs: extended inode ref iteration
The iterate_irefs in backref.c is used to build path components from inode
refs. This patch adds code to iterate extended refs as well.
I had modify the callback function signature to abstract out some of the
differences between ref structures. iref_to_path() also needed similar
changes.
Mark Fasheh [Wed, 8 Aug 2012 18:32:27 +0000 (11:32 -0700)]
btrfs: extended inode refs
This patch adds basic support for extended inode refs. This includes support
for link and unlink of the refs, which basically gets us support for rename
as well.
Inode creation does not need changing - extended refs are only added after
the ref array is full.
Linus Torvalds [Tue, 9 Oct 2012 12:06:41 +0000 (21:06 +0900)]
Fix staging driver use of VM_RESERVED
The VM_RESERVED flag was killed off in commit 314e51b9851b ("mm: kill
vma flag VM_RESERVED and mm->reserved_vm counter"), and replaced by the
proper semantic flags (eg "don't core-dump" etc). But there was a new
use of VM_RESERVED that got missed by the merge.
Fix the remaining use of VM_RESERVED in the vfio_pci driver, replacing
the VM_RESERVED flag with VM_DONTEXPAND | VM_DONTDUMP.
David Howells [Tue, 9 Oct 2012 08:49:09 +0000 (09:49 +0100)]
UAPI: (Scripted) Disintegrate include/mtd
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>
Linus Torvalds [Tue, 9 Oct 2012 07:23:15 +0000 (16:23 +0900)]
Merge branch 'akpm' (Andrew's patch-bomb)
Merge patches from Andrew Morton:
"A few misc things and very nearly all of the MM tree. A tremendous
amount of stuff (again), including a significant rbtree library
rework."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (160 commits)
sparc64: Support transparent huge pages.
mm: thp: Use more portable PMD clearing sequenece in zap_huge_pmd().
mm: Add and use update_mmu_cache_pmd() in transparent huge page code.
sparc64: Document PGD and PMD layout.
sparc64: Eliminate PTE table memory wastage.
sparc64: Halve the size of PTE tables
sparc64: Only support 4MB huge pages and 8KB base pages.
memory-hotplug: suppress "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning
mm: memcg: clean up mm_match_cgroup() signature
mm: document PageHuge somewhat
mm: use %pK for /proc/vmallocinfo
mm, thp: fix mlock statistics
mm, thp: fix mapped pages avoiding unevictable list on mlock
memory-hotplug: update memory block's state and notify userspace
memory-hotplug: preparation to notify memory block's state at memory hot remove
mm: avoid section mismatch warning for memblock_type_name
make GFP_NOTRACK definition unconditional
cma: decrease cc.nr_migratepages after reclaiming pagelist
CMA: migrate mlocked pages
kpageflags: fix wrong KPF_THP on non-huge compound pages
...
David Miller [Mon, 8 Oct 2012 23:34:29 +0000 (16:34 -0700)]
sparc64: Support transparent huge pages.
This is relatively easy since PMD's now cover exactly 4MB of memory.
Our PMD entries are 32-bits each, so we use a special encoding. The
lowest bit, PMD_ISHUGE, determines the interpretation. This is possible
because sparc64's page tables are purely software entities so we can use
whatever encoding scheme we want. We just have to make the TLB miss
assembler page table walkers aware of the layout.
set_pmd_at() works much like set_pte_at() but it has to operate in two
page from a table of non-huge PTEs, so we have to queue up TLB flushes
based upon what mappings are valid in the PTE table. In the second regime
we are going from huge-page to non-huge-page, and in that case we need
only queue up a single TLB flush to push out the huge page mapping.
We still have 5 bits remaining in the huge PMD encoding so we can very
likely support any new pieces of THP state tracking that might get added
in the future.
With lots of help from Johannes Weiner.
Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Miller [Mon, 8 Oct 2012 23:34:26 +0000 (16:34 -0700)]
mm: thp: Use more portable PMD clearing sequenece in zap_huge_pmd().
Invalidation sequences are handled in various ways on various
architectures.
One way, which sparc64 uses, is to let the set_*_at() functions accumulate
pending flushes into a per-cpu array. Then the flush_tlb_range() et al.
calls process the pending TLB flushes.
In this regime, the __tlb_remove_*tlb_entry() implementations are
essentially NOPs.
And we properly accomodate TLB flush mechanims like the one described
above.
Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Miller [Mon, 8 Oct 2012 23:34:25 +0000 (16:34 -0700)]
mm: Add and use update_mmu_cache_pmd() in transparent huge page code.
The transparent huge page code passes a PMD pointer in as the third
argument of update_mmu_cache(), which expects a PTE pointer.
This never got noticed because X86 implements update_mmu_cache() as a
macro and thus we don't get any type checking, and X86 is the only
architecture which supports transparent huge pages currently.
Before other architectures can support transparent huge pages properly we
need to add a new interface which will take a PMD pointer as the third
argument rather than a PTE pointer.
[akpm@linux-foundation.org: implement update_mm_cache_pmd() for s390] Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Miller [Mon, 8 Oct 2012 23:34:23 +0000 (16:34 -0700)]
sparc64: Document PGD and PMD layout.
We're going to be messing around with the PMD interpretation and layout
for the sake of transparent huge pages, so we better clearly document what
we're starting with.
Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Miller [Mon, 8 Oct 2012 23:34:22 +0000 (16:34 -0700)]
sparc64: Eliminate PTE table memory wastage.
We've split up the PTE tables so that they take up half a page instead of
a full page. This is in order to facilitate transparent huge page
support, which works much better if our PMDs cover 4MB instead of 8MB.
What we do is have a one-behind cache for PTE table allocations in the
mm struct.
This logic triggers only on allocations. For example, we don't try to
keep track of free'd up page table blocks in the style that the s390 port
does.
There were only two slightly annoying aspects to this change:
1) Changing pgtable_t to be a "pte_t *". There's all of this special
logic in the TLB free paths that needed adjustments, as did the
PMD populate interfaces.
2) init_new_context() needs to zap the pointer, since the mm struct
just gets copied from the parent on fork.
Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Miller [Mon, 8 Oct 2012 23:34:20 +0000 (16:34 -0700)]
sparc64: Halve the size of PTE tables
The reason we want to do this is to facilitate transparent huge page
support.
Right now PMD's cover 8MB of address space, and our huge page size is 4MB.
The current transparent hugepage support is not able to handle HPAGE_SIZE
!= PMD_SIZE.
So make PTE tables be sized to half of a page instead of a full page.
We can still map properly the whole supported virtual address range which
on sparc64 requires 44 bits. Add a compile time CPP test which ensures
that this requirement is always met.
There is a minor inefficiency added by this change. We only use half of
the page for PTE tables. It's not trivial to use only half of the page
yet still get all of the pgtable_page_{ctor,dtor}() stuff working
properly. It is doable, and that will come in a subsequent change.
Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Miller [Mon, 8 Oct 2012 23:34:19 +0000 (16:34 -0700)]
sparc64: Only support 4MB huge pages and 8KB base pages.
Narrowing the scope of the page size configurations will make the
transparent hugepage changes much simpler.
In the end what we really want to do is have the kernel support multiple
huge page sizes and use whatever is appropriate as the context dictactes.
Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memory-hotplug: suppress "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning
When our x86 box calls __remove_pages(), release_mem_region() shows many
warnings. And x86 box cannot unregister iomem_resource.
"Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>"
release_mem_region() has been changed to be called in each
PAGES_PER_SECTION by commit de7f0cba9678 ("memory hotplug: release
memory regions in PAGES_PER_SECTION chunks"). Because powerpc registers
iomem_resource in each PAGES_PER_SECTION chunk. But when I hot add
memory on x86 box, iomem_resource is register in each _CRS not
PAGES_PER_SECTION chunk. So x86 box unregisters iomem_resource.
The patch fixes the problem.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Christoph Lameter <cl@linux.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@austin.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Mon, 8 Oct 2012 23:34:12 +0000 (16:34 -0700)]
mm: memcg: clean up mm_match_cgroup() signature
It really should return a boolean for match/no match. And since it takes
a memcg, not a cgroup, fix that parameter name as well.
[akpm@linux-foundation.org: mm_match_cgroup() is not a macro] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Mon, 8 Oct 2012 23:34:06 +0000 (16:34 -0700)]
mm, thp: fix mlock statistics
NR_MLOCK is only accounted in single page units: there's no logic to
handle transparent hugepages. This patch checks the appropriate number of
pages to adjust the statistics by so that the correct amount of memory is
reflected.
David Rientjes [Mon, 8 Oct 2012 23:34:03 +0000 (16:34 -0700)]
mm, thp: fix mapped pages avoiding unevictable list on mlock
When a transparent hugepage is mapped and it is included in an mlock()
range, follow_page() incorrectly avoids setting the page's mlock bit and
moving it to the unevictable lru.
This is evident if you try to mlock(), munlock(), and then mlock() a
range again. Currently:
Notice that less than 2MB was added to the unevictable list; this is
because these pages in the range are not transparent hugepages since the
4GB range was allocated with mmap() and has no specific alignment. If
posix_memalign() were used instead, unevictable would not have grown at
all on the second mlock().
The fix is to call mlock_vma_page() so that the mlock bit is set and the
page is added to the unevictable list. With this patch:
Wen Congyang [Mon, 8 Oct 2012 23:34:01 +0000 (16:34 -0700)]
memory-hotplug: update memory block's state and notify userspace
remove_memory() will be called when hot removing a memory device. But
even if offlining memory, we cannot notice it. So the patch updates the
memory block's state and sends notification to userspace.
Additionally, the memory device may contain more than one memory block.
If the memory block has been offlined, __offline_pages() will fail. So we
should try to offline one memory block at a time.
Thus remove_memory() also check each memory block's state. So there is no
need to check the memory block's state before calling remove_memory().
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wen Congyang [Mon, 8 Oct 2012 23:33:58 +0000 (16:33 -0700)]
memory-hotplug: preparation to notify memory block's state at memory hot remove
remove_memory() is called in two cases:
1. echo offline >/sys/devices/system/memory/memoryXX/state
2. hot remove a memory device
In the 1st case, the memory block's state is changed and the notification
that memory block's state changed is sent to userland after calling
remove_memory(). So user can notice memory block is changed.
But in the 2nd case, the memory block's state is not changed and the
notification is not also sent to userspcae even if calling
remove_memory(). So user cannot notice memory block is changed.
For adding the notification at memory hot remove, the patch just prepare
as follows:
1st case uses offline_pages() for offlining memory.
2nd case uses remove_memory() for offlining memory and changing memory block's
state and notifing the information.
The patch does not implement notification to remove_memory().
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: avoid section mismatch warning for memblock_type_name
Following section mismatch warning is thrown during build;
WARNING: vmlinux.o(.text+0x32408f): Section mismatch in reference from the function memblock_type_name() to the variable .meminit.data:memblock
The function memblock_type_name() references
the variable __meminitdata memblock.
This is often because memblock_type_name lacks a __meminitdata
annotation or the annotation of memblock is wrong.
This is because memblock_type_name makes reference to memblock variable
with attribute __meminitdata. Hence, the warning (even if the function is
inline).
[akpm@linux-foundation.org: remove inline] Signed-off-by: Raghavendra D Prabhu <rprabhu@wnohang.net> Cc: Tejun Heo <tj@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Glauber Costa [Mon, 8 Oct 2012 23:33:52 +0000 (16:33 -0700)]
make GFP_NOTRACK definition unconditional
There was a general sentiment in a recent discussion (See
https://lkml.org/lkml/2012/9/18/258) that the __GFP flags should be
defined unconditionally. Currently, the only offender is GFP_NOTRACK,
which is conditional to KMEMCHECK.
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Mon, 8 Oct 2012 23:33:51 +0000 (16:33 -0700)]
cma: decrease cc.nr_migratepages after reclaiming pagelist
reclaim_clean_pages_from_list() reclaims clean pages before migration so
cc.nr_migratepages should be updated. Currently, there is no problem but
it can be wrong if we try to use the value in future.
Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Mon, 8 Oct 2012 23:33:48 +0000 (16:33 -0700)]
CMA: migrate mlocked pages
Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
contiguous memory space.
This patch makes mlocked pages be migrated out. Of course, it can affect
realtime processes but in CMA usecase, contiguous memory allocation failing
is far worse than access latency to an mlocked page being variable while
CMA is running. If someone wants to make the system realtime, he shouldn't
enable CMA because stalls can still happen at random times.
[akpm@linux-foundation.org: tweak comment text, per Mel] Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Mon, 8 Oct 2012 23:33:47 +0000 (16:33 -0700)]
kpageflags: fix wrong KPF_THP on non-huge compound pages
KPF_THP can be set on non-huge compound pages (like slab pages or pages
allocated by drivers with __GFP_COMP) because PageTransCompound only
checks PG_head and PG_tail. Obviously this is a bug and breaks user space
applications which look for thp via /proc/kpageflags.
This patch rules out setting KPF_THP wrongly by additionally checking
PageLRU on the head pages.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Mon, 8 Oct 2012 23:33:42 +0000 (16:33 -0700)]
mm: remove unevictable_pgs_mlockfreed
Simply remove UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed line
from /proc/vmstat: Johannes and Mel point out that it was very unlikely to
have been used by any tool, and of course we can restore it easily enough
if that turns out to be wrong.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Mon, 8 Oct 2012 23:33:39 +0000 (16:33 -0700)]
memory-hotplug: fix zone stat mismatch
During memory-hotplug, I found NR_ISOLATED_[ANON|FILE] are increasing,
causing the kernel to hang. When the system doesn't have enough free
pages, it enters reclaim but never reclaim any pages due to
too_many_isolated()==true and loops forever.
The cause is that when we do memory-hotadd after memory-remove,
__zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
although the vm_stat_diff of all CPUs still have values.
In addtion, when we offline all pages of the zone, we reset them in
zone_pcp_reset without draining so we loss some zone stat item.
Minchan Kim [Mon, 8 Oct 2012 23:33:38 +0000 (16:33 -0700)]
mm: revert 0def08e3 ("mm/mempolicy.c: check return code of check_range")
Revert commit 0def08e3acc2 because check_range can't fail in
migrate_to_node with considering current usecases.
Quote from Johannes
: I think it makes sense to revert. Not because of the semantics, but I
: just don't see how check_range() could even fail for this callsite:
:
: 1. we pass mm->mmap->vm_start in there, so we should not fail due to
: find_vma()
:
: 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
: and so can not fail
:
: 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
: continue until addr == end, so we never fail with -EIO
And I added a new VM_BUG_ON for checking migrate_to_node's future usecase
which might pass to MPOL_MF_STRICT.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vasiliy Kulikov <segooon@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Haggai Eran [Mon, 8 Oct 2012 23:33:35 +0000 (16:33 -0700)]
mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end
In order to allow sleeping during invalidate_page mmu notifier calls, we
need to avoid calling when holding the PT lock. In addition to its direct
calls, invalidate_page can also be called as a substitute for a change_pte
call, in case the notifier client hasn't implemented change_pte.
This patch drops the invalidate_page call from change_pte, and instead
wraps all calls to change_pte with invalidate_range_start and
invalidate_range_end calls.
Note that change_pte still cannot sleep after this patch, and that clients
implementing change_pte should not take action on it in case the number of
outstanding invalidate_range_start calls is larger than one, otherwise
they might miss a later invalidation.
Signed-off-by: Haggai Eran <haggaie@mellanox.com> Cc: Andrea Arcangeli <andrea@qumranet.com> Cc: Sagi Grimberg <sagig@mellanox.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Haggai Eran <haggaie@mellanox.com> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Liran Liss <liranl@mellanox.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sagi Grimberg [Mon, 8 Oct 2012 23:33:33 +0000 (16:33 -0700)]
mm: move all mmu notifier invocations to be done outside the PT lock
In order to allow sleeping during mmu notifier calls, we need to avoid
invoking them under the page table spinlock. This patch solves the
problem by calling invalidate_page notification after releasing the lock
(but before freeing the page itself), or by wrapping the page invalidation
with calls to invalidate_range_begin and invalidate_range_end.
To prevent accidental changes to the invalidate_range_end arguments after
the call to invalidate_range_begin, the patch introduces a convention of
saving the arguments in consistently named locals:
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
The patch changes code to use this convention for all calls to
mmu_notifier_invalidate_range_start/end, except those where the calls are
close enough so that anyone who glances at the code can see the values
aren't changing.
This patchset is a preliminary step towards on-demand paging design to be
added to the RDMA stack.
Why do we want on-demand paging for Infiniband?
Applications register memory with an RDMA adapter using system calls,
and subsequently post IO operations that refer to the corresponding
virtual addresses directly to HW. Until now, this was achieved by
pinning the memory during the registration calls. The goal of on demand
paging is to avoid pinning the pages of registered memory regions (MRs).
This will allow users the same flexibility they get when swapping any
other part of their processes address spaces. Instead of requiring the
entire MR to fit in physical memory, we can allow the MR to be larger,
and only fit the current working set in physical memory.
Why should anyone care? What problems are users currently experiencing?
This can make programming with RDMA much simpler. Today, developers
that are working with more data than their RAM can hold need either to
deregister and reregister memory regions throughout their process's
life, or keep a single memory region and copy the data to it. On demand
paging will allow these developers to register a single MR at the
beginning of their process's life, and let the operating system manage
which pages needs to be fetched at a given time. In the future, we
might be able to provide a single memory access key for each process
that would provide the entire process's address as one large memory
region, and the developers wouldn't need to register memory regions at
all.
Is there any prospect that any other subsystems will utilise these
infrastructural changes? If so, which and how, etc?
As for other subsystems, I understand that XPMEM wanted to sleep in
MMU notifiers, as Christoph Lameter wrote at
http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
perhaps Andrea knows about other use cases.
Scheduling in mmu notifications is required since we need to sync the
hardware with the secondary page tables change. A TLB flush of an IO
device is inherently slower than a CPU TLB flush, so our design works by
sending the invalidation request to the device, and waiting for an
interrupt before exiting the mmu notifier handler.
Avi said:
kvm may be a buyer. kvm::mmu_lock, which serializes guest page
faults, also protects long operations such as destroying large ranges.
It would be good to convert it into a spinlock, but as it is used inside
mmu notifiers, this cannot be done.
(there are alternatives, such as keeping the spinlock and using a
generation counter to do the teardown in O(1), which is what the "may"
is doing up there).
[akpm@linux-foundation.orgpossible speed tweak in hugetlb_cow(), cleanups] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Haggai Eran <haggaie@mellanox.com> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Liran Liss <liranl@mellanox.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 8 Oct 2012 23:33:31 +0000 (16:33 -0700)]
hugetlb: do not use vma_hugecache_offset() for vma_prio_tree_foreach
Commit 0c176d52b0b2 ("mm: hugetlb: fix pgoff computation when unmapping
page from vma") fixed pgoff calculation but it has replaced it by
vma_hugecache_offset() which is not approapriate for offsets used for
vma_prio_tree_foreach() because that one expects index in page units
rather than in huge_page_shift.
Johannes said:
: The resulting index may not be too big, but it can be too small: assume
: hpage size of 2M and the address to unmap to be 0x200000. This is regular
: page index 512 and hpage index 1. If you have a VMA that maps the file
: only starting at the second huge page, that VMAs vm_pgoff will be 512 but
: you ask for offset 1 and miss it even though it does map the page of
: interest. hugetlb_cow() will try to unmap, miss the vma, and retry the
: cow until the allocation succeeds or the skipped vma(s) go away.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: thp: fix pmd_present for split_huge_page and PROT_NONE with THP
In many places !pmd_present has been converted to pmd_none. For pmds
that's equivalent and pmd_none is quicker so using pmd_none is better.
However (unless we delete pmd_present) we should provide an accurate
pmd_present too. This will avoid the risk of code thinking the pmd is non
present because it's under __split_huge_page_map, see the pmd_mknotpresent
there and the comment above it.
If the page has been mprotected as PROT_NONE, it would also lead to a
pmd_present false negative in the same way as the race with
split_huge_page.
Because the PSE bit stays on at all times (both during split_huge_page and
when the _PAGE_PROTNONE bit get set), we could only check for the PSE bit,
but checking the PROTNONE bit too is still good to remember pmd_present
must always keep PROT_NONE into account.
This explains a not reproducible BUG_ON that was seldom reported on the
lists.
The same issue is in pmd_large, it would go wrong with both PROT_NONE and
if it races with split_huge_page.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiri Kosina [Mon, 8 Oct 2012 23:33:25 +0000 (16:33 -0700)]
memory.txt: remove stray information
Andi removed some outedated documentation from Documentation/memory.txt
back in 2009 by commit 3b2b9a875ddc ("Documentation/memory.txt: remove
some very outdated recommendations"), but the resulting document is not
in a nice shape either.
It seems to me like we are not losing anything by completely removing the
file now.
David Rientjes [Mon, 8 Oct 2012 23:33:24 +0000 (16:33 -0700)]
mm, numa: reclaim from all nodes within reclaim distance
RECLAIM_DISTANCE represents the distance between nodes at which it is
deemed too costly to allocate from; it's preferred to try to reclaim from
a local zone before falling back to allocating on a remote node with such
a distance.
To do this, zone_reclaim_mode is set if the distance between any two
nodes on the system is greather than this distance. This, however, ends
up causing the page allocator to reclaim from every zone regardless of
its affinity.
What we really want is to reclaim only from zones that are closer than
RECLAIM_DISTANCE. This patch adds a nodemask to each node that
represents the set of nodes that are within this distance. During the
zone iteration, if the bit for a zone's node is set for the local node,
then reclaim is attempted; otherwise, the zone is skipped.
[akpm@linux-foundation.org: fix CONFIG_NUMA=n build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Mon, 8 Oct 2012 23:33:21 +0000 (16:33 -0700)]
mm: remove free_page_mlock
We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So
remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
checking it, reporting "BUG: Bad page state" if it's ever found set.
Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0.
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Mon, 8 Oct 2012 23:33:19 +0000 (16:33 -0700)]
mm: use clear_page_mlock() in page_remove_rmap()
We had thought that pages could no longer get freed while still marked as
mlocked; but Johannes Weiner posted this program to demonstrate that
truncating an mlocked private file mapping containing COWed pages is still
mishandled:
The anon COWed pages are not caught by truncation's clear_page_mlock() of
the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
look out for them there in page_remove_rmap(). Indeed, why should
truncation or invalidation be doing the clear_page_mlock() when removing
from pagecache? mlock is a property of mapping in userspace, not a
property of pagecache: an mlocked unmapped page is nonsensical.
Reported-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Mon, 8 Oct 2012 23:33:18 +0000 (16:33 -0700)]
mm: remove vma arg from page_evictable
page_evictable(page, vma) is an irritant: almost all its callers pass
NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
explicitly in the couple of places it's needed. But in those places we
don't even need page_evictable() itself! They're dealing with a freshly
allocated anonymous page, which has no "mapping" and cannot be mlocked yet.
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Mon, 8 Oct 2012 23:33:14 +0000 (16:33 -0700)]
mm: fix invalidate_complete_page2() lock ordering
In fuzzing with trinity, lockdep protested "possible irq lock inversion
dependency detected" when isolate_lru_page() reenabled interrupts while
still holding the supposedly irq-safe tree_lock:
isolate_lru_page() is correct to enable interrupts unconditionally:
invalidate_complete_page2() is incorrect to call clear_page_mlock() while
holding tree_lock, which is supposed to nest inside lru_lock.
Both truncate_complete_page() and invalidate_complete_page() call
clear_page_mlock() before taking tree_lock to remove page from radix_tree.
I guess invalidate_complete_page2() preferred to test PageDirty (again)
under tree_lock before committing to the munlock; but since the page has
already been unmapped, its state is already somewhat inconsistent, and no
worse if clear_page_mlock() moved up.
Reported-by: Sasha Levin <levinsasha928@gmail.com> Deciphered-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 8 Oct 2012 23:33:13 +0000 (16:33 -0700)]
memcg: move mem_cgroup_is_root upwards
kmem code uses this function and it is better to not use forward
declarations for static inline functions as some (older) compilers don't
like it:
gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux)
mm/memcontrol.c:421: warning: `mem_cgroup_is_root' declared inline after being called
mm/memcontrol.c:421: warning: previous declaration of `mem_cgroup_is_root' was here
Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Cc: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 8 Oct 2012 23:33:10 +0000 (16:33 -0700)]
memcg: cleanup kmem tcp ifdefs
TCP kmem accounting is currently guarded by CONFIG_MEMCG_KMEM ifdefs but
the code is not used if !CONFIG_INET so we should rather test for both.
The same applies to net/sock.h, net/ip.h and net/tcp_memcontrol.h but
let's keep those outside of any ifdefs because it is considered safer wrt.
future maintainability.
Michael Kerrisk [Mon, 8 Oct 2012 23:33:09 +0000 (16:33 -0700)]
memcg: trivial fixes for Documentation/cgroups/memory.txt
While reading through Documentation/cgroups/memory.txt, I found a number
of minor wordos and typos. The patch below is a conservative handling of
some of these: it provides just a number of "obviously correct" fixes to
the English that improve the readability of the document somewhat.
Obviously some more significant fixes need to be made to the document, but
some of those may not be in the "obvious correct" category.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
but is now:
zone->present_pages = spanned pages - absent pages - memmap pages.
spanned pages: total size, including holes.
absent pages: holes.
bootmem pages: pages used in system boot, managed by bootmem allocator.
memmap pages: pages used by page structs.
This may cause zone->present_pages less than it should be. For example,
numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other
bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's
present_pages should be spanned pages - absent pages, but now it also
minus memmap pages(free_area_init_core), which are actually allocated from
ZONE_MOVABLE. When offlining all memory of a zone, this will cause
zone->present_pages less than 0, because present_pages is unsigned long
type, it is actually a very large integer, it indirectly caused
zone->watermark[WMARK_MIN] becomes a large
integer(setup_per_zone_wmarks()), than cause totalreserve_pages become a
large integer(calculate_totalreserve_pages()), and finally cause memory
allocating failure when fork process(__vm_enough_memory()).
Rik van Riel [Mon, 8 Oct 2012 23:33:03 +0000 (16:33 -0700)]
mm: enable CONFIG_COMPACTION by default
Now that lumpy reclaim has been removed, compaction is the only way left
to free up contiguous memory areas. It is time to just enable
CONFIG_COMPACTION by default.
Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Catalin Marinas [Mon, 8 Oct 2012 23:33:01 +0000 (16:33 -0700)]
mm: thp: fix the update_mmu_cache() last argument passing in mm/huge_memory.c
The update_mmu_cache() takes a pointer (to pte_t by default) as the last
argument but the huge_memory.c passes a pmd_t value. The patch changes
the argument to the pmd_t * pointer.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Steve Capper <steve.capper@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.cz> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Mon, 8 Oct 2012 23:32:54 +0000 (16:32 -0700)]
memory-hotplug: don't replace lowmem pages with highmem
The changelog for commit 6a6dccba2fdc ("mm: cma: don't replace lowmem
pages with highmem") mentioned that lowmem pages can be replaced by
highmem pages during CMA migration. 6a6dccba2fdc fixed that issue.
Quote from that changelog:
: The filesystem layer expects pages in the block device's mapping to not
: be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
: currently replace lowmem pages with highmem pages, leading to crashes in
: filesystem code such as the one below:
:
: Unable to handle kernel NULL pointer dereference at virtual address 00000400
: pgd = c0c98000
: [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
: Internal error: Oops: 817 [#1] PREEMPT SMP ARM
: CPU: 0 Not tainted (3.5.0-rc5+ #80)
: PC is at __memzero+0x24/0x80
: ...
: Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
: Backtrace:
: [<c010e3f0>] (ext4_getblk+0x0/0x180) from [<c010e58c>] (ext4_bread+0x1c/0x98)
: [<c010e570>] (ext4_bread+0x0/0x98) from [<c0117944>] (ext4_mkdir+0x160/0x3bc)
: r4:c15337f0
: [<c01177e4>] (ext4_mkdir+0x0/0x3bc) from [<c00c29e0>] (vfs_mkdir+0x8c/0x98)
: [<c00c2954>] (vfs_mkdir+0x0/0x98) from [<c00c2a60>] (sys_mkdirat+0x74/0xac)
: r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
: [<c00c29ec>] (sys_mkdirat+0x0/0xac) from [<c00c2ab8>] (sys_mkdir+0x20/0x24)
: r6:beccdcf0 r5:00074000 r4:beccdbbc
: [<c00c2a98>] (sys_mkdir+0x0/0x24) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30)
Memory-hotplug has same problem as CMA has so the same fix can be applied
to memory-hotplug as well.
Fix it by reusing.
Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Mon, 8 Oct 2012 23:32:47 +0000 (16:32 -0700)]
mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
Compaction caches if a pageblock was scanned and no pages were isolated so
that the pageblocks can be skipped in the future to reduce scanning. This
information is not cleared by the page allocator based on activity due to
the impact it would have to the page allocator fast paths. Hence there is
a requirement that something clear the cache or pageblocks will be skipped
forever. Currently the cache is cleared if there were a number of recent
allocation failures and it has not been cleared within the last 5 seconds.
Time-based decisions like this are terrible as they have no relationship
to VM activity and is basically a big hammer.
Unfortunately, accurate heuristics would add cost to some hot paths so
this patch implements a rough heuristic. There are two cases where the
cache is cleared.
1. If a !kswapd process completes a compaction cycle (migrate and free
scanner meet), the zone is marked compact_blockskip_flush. When kswapd
goes to sleep, it will clear the cache. This is expected to be the
common case where the cache is cleared. It does not really matter if
kswapd happens to be asleep or going to sleep when the flag is set as
it will be woken on the next allocation request.
2. If there have been multiple failures recently and compaction just
finished being deferred then a process will clear the cache and start a
full scan. This situation happens if there are multiple high-order
allocation requests under heavy memory pressure.
The clearing of the PG_migrate_skip bits and other scans is inherently
racy but the race is harmless. For allocations that can fail such as THP,
they will simply fail. For requests that cannot fail, they will retry the
allocation. Tests indicated that scanning rates were roughly similar to
when the time-based heuristic was used and the allocation success rates
were similar.
Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Mon, 8 Oct 2012 23:32:45 +0000 (16:32 -0700)]
mm: compaction: Restart compaction from near where it left off
This is almost entirely based on Rik's previous patches and discussions
with him about how this might be implemented.
Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced. When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.
However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.
This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.
This patch caches where the migration and free scanner should start from
on subsequent compaction invocations using the pageblock-skip information.
When compaction starts it begins from the cached restart points and will
update the cached restart points until a page is isolated or a pageblock
is skipped that would have been scanned by synchronous compaction.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Mon, 8 Oct 2012 23:32:41 +0000 (16:32 -0700)]
mm: compaction: cache if a pageblock was scanned and no pages were isolated
When compaction was implemented it was known that scanning could
potentially be excessive. The ideal was that a counter be maintained for
each pageblock but maintaining this information would incur a severe
penalty due to a shared writable cache line. It has reached the point
where the scanning costs are a serious problem, particularly on
long-lived systems where a large process starts and allocates a large
number of THPs at the same time.
Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip. If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock is
marked to be skipped in the future. When scanning, this bit is checked
before any scanning takes place and the block skipped if set.
The main difficulty with a patch like this is "when to ignore the cached
information?" If it's ignored too often, the scanning rates will still be
excessive. If the information is too stale then allocations will fail
that might have otherwise succeeded. In this patch
o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will
be discarded if it's at least 5 seconds since the last time the cache
was discarded
o If there are a large number of allocation failures, discard the cache.
The time-based heuristic is very clumsy but there are few choices for a
better event. Depending solely on multiple allocation failures still
allows excessive scanning when THP allocations are failing in quick
succession due to memory pressure. Waiting until memory pressure is
relieved would cause compaction to continually fail instead of using
reclaim/compaction to try allocate the page. The time-based mechanism is
clumsy but a better option is not obvious.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Mon, 8 Oct 2012 23:32:40 +0000 (16:32 -0700)]
revert "mm: have order > 0 compaction start off where it left"
This reverts commit 7db8889ab05b ("mm: have order > 0 compaction start
off where it left") and commit de74f1cc ("mm: have order > 0 compaction
start near a pageblock with free pages"). These patches were a good
idea and tests confirmed that they massively reduced the amount of
scanning but the implementation is complex and tricky to understand. A
later patch will cache what pageblocks should be skipped and
reimplements the concept of compact_cached_free_pfn on top for both
migration and free scanners.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Mon, 8 Oct 2012 23:32:36 +0000 (16:32 -0700)]
mm: compaction: acquire the zone->lock as late as possible
Compaction's free scanner acquires the zone->lock when checking for
PageBuddy pages and isolating them. It does this even if there are no
PageBuddy pages in the range.
This patch defers acquiring the zone lock for as long as possible. In the
event there are no free pages in the pageblock then the lock will not be
acquired at all which reduces contention on zone->lock.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Peter Ujfalusi <peter.ujfalusi@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Mon, 8 Oct 2012 23:32:33 +0000 (16:32 -0700)]
mm: compaction: acquire the zone->lru_lock as late as possible
Richard Davies and Shaohua Li have both reported lock contention problems
in compaction on the zone and LRU locks as well as significant amounts of
time being spent in compaction. This series aims to reduce lock
contention and scanning rates to reduce that CPU usage. Richard reported
at https://lkml.org/lkml/2012/9/21/91 that this series made a big
different to a problem he reported in August:
Patch 1 defers acquiring the zone->lru_lock as long as possible.
Patch 2 defers acquiring the zone->lock as lock as possible.
Patch 3 reverts Rik's "skip-free" patches as the core concept gets
reimplemented later and the remaining patches are easier to
understand if this is reverted first.
Patch 4 adds a pageblock-skip bit to the pageblock flags to cache what
pageblocks should be skipped by the migrate and free scanners.
This drastically reduces the amount of scanning compaction has
to do.
Patch 5 reimplements something similar to Rik's idea except it uses the
pageblock-skip information to decide where the scanners should
restart from and does not need to wrap around.
I tested this on 3.6-rc6 + linux-next/akpm. Kernels tested were
Stress high-order allocation tests looked ok. Success rates are more or
less the same with the full series applied but there is an expectation
that there is less opportunity to race with other allocation requests if
there is less scanning. The time to complete the tests did not vary that
much and are uninteresting as were the vmstat statistics so I will not
present them here.
Using ftrace I recorded how much scanning was done by compaction and got this
The efficiency is worthless because of the nature of the test and the
number of failures. The really interesting point as far as this patch
series is concerned is the number of pages scanned. Note that reverting
Rik's patches massively increases the number of pages scanned indicating
that those patches really did make a difference to CPU usage.
However, caching what pageblocks should be skipped has a much higher
impact. With patches 1-8 applied, free page and migrate page scanning are
both reduced by 95% in comparison to the akpm kernel. If the basic
concept of Rik's patches are implemened on top then scanning then the free
scanner barely changed but migrate scanning was further reduced. That
said, tests on 3.6-rc5 indicated that the last patch had greater impact
than what was measured here so it is a bit variable.
One way or the other, this series has a large impact on the amount of
scanning compaction does when there is a storm of THP allocations.
This patch:
Compaction's migrate scanner acquires the zone->lru_lock when scanning a
range of pages looking for LRU pages to acquire. It does this even if
there are no LRU pages in the range. If multiple processes are compacting
then this can cause severe locking contention. To make matters worse
commit b2eef8c0 ("mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration") releases the lru_lock every
SWAP_CLUSTER_MAX pages that are scanned.
This patch makes two changes to how the migrate scanner acquires the LRU
lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages
if the lock is contended. This reduces the number of times it
unnecessarily disables and re-enables IRQs. The second is that it defers
acquiring the LRU lock for as long as possible. If there are no LRU pages
or the only LRU pages are transhuge then the LRU lock will not be acquired
at all which reduces contention on zone->lru_lock.
[minchan@kernel.org: augment comment]
[akpm@linux-foundation.org: tweak comment text] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Mon, 8 Oct 2012 23:32:30 +0000 (16:32 -0700)]
mm: compaction: move fatal signal check out of compact_checklock_irqsave
Commit c67fe3752abe ("mm: compaction: Abort async compaction if locks
are contended or taking too long") addressed a lock contention problem
in compaction by introducing compact_checklock_irqsave() that effecively
aborting async compaction in the event of compaction.
To preserve existing behaviour it also moved a fatal_signal_pending()
check into compact_checklock_irqsave() but that is very misleading. It
"hides" the check within a locking function but has nothing to do with
locking as such. It just happens to work in a desirable fashion.
This patch moves the fatal_signal_pending() check to
isolate_migratepages_range() where it belongs. Arguably the same check
should also happen when isolating pages for freeing but it's overkill.
Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Shaohua Li <shli@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shaohua Li [Mon, 8 Oct 2012 23:32:27 +0000 (16:32 -0700)]
mm: compaction: abort compaction loop if lock is contended or run too long
isolate_migratepages_range() might isolate no pages if for example when
zone->lru_lock is contended and running asynchronous compaction. In this
case, we should abort compaction, otherwise, compact_zone will run a
useless loop and make zone->lru_lock is even contended.
An additional check is added to ensure that cc.migratepages and
cc.freepages get properly drained whan compaction is aborted.
[minchan@kernel.org: Putback pages isolated for migration if aborting]
[akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
[akpm@linux-foundation.org: make compact_zone_order() require non-NULL arg `contended']
[minchan@kernel.org: Putback pages isolated for migration if aborting] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shaohua Li [Mon, 8 Oct 2012 23:32:19 +0000 (16:32 -0700)]
readahead: fault retry breaks mmap file read random detection
.fault now can retry. The retry can break state machine of .fault. In
filemap_fault, if page is miss, ra->mmap_miss is increased. In the second
try, since the page is in page cache now, ra->mmap_miss is decreased. And
these are done in one fault, so we can't detect random mmap file access.
Add a new flag to indicate .fault is tried once. In the second try, skip
ra->mmap_miss decreasing. The filemap_fault state machine is ok with it.
I only tested x86, didn't test other archs, but looks the change for other
archs is obvious, but who knows :)
Signed-off-by: Shaohua Li <shaohua.li@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The x86 implementation of atomic_dec_if_positive is quite generic, so make
it available to all architectures.
This is needed for "swap: add a simple detector for inappropriate swapin
readahead".
[akpm@linux-foundation.org: do the "#define foo foo" trick in the conventional manner] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Mon, 8 Oct 2012 23:32:16 +0000 (16:32 -0700)]
memory-hotplug: fix pages missed by race rather than failing
If race between allocation and isolation in memory-hotplug offline
happens, some pages could be in MIGRATE_MOVABLE of free_list although the
pageblock's migratetype of the page is MIGRATE_ISOLATE.
The race could be detected by get_freepage_migratetype in
__test_page_isolated_in_pageblock. If it is detected, now EBUSY gets
bubbled all the way up and the hotplug operations fails.
But better idea is instead of returning and failing memory-hotremove, move
the free page to the correct list at the time it is detected. It could
enhance memory-hotremove operation success ratio although the race is
really rare.
free_hot_cold_page(Page A)
/* without zone->lock */
migratetype = get_pageblock_migratetype(Page A);
/*
* Page could be moved into MIGRATE_MOVABLE
* of per_cpu_pages
*/
list_add_tail(&page->lru, &pcp->lists[migratetype]);
/* Page A could be in MIGRATE_MOVABLE of free_list. */
check_pages_isolated
__test_page_isolated_in_pageblock
/*
* We can't catch freed page which
* is free_list[MIGRATE_MOVABLE]
*/
if (PageBuddy(page A))
pfn += 1 << page_order(page A);
/* So, Page A could be allocated */
__offline_isolated_pages
/*
* BUG_ON hit or offline page
* which is used by someone
*/
BUG_ON(!PageBuddy(page A));
This patch checks page's migratetype in freelist in
__test_page_isolated_in_pageblock. So now
__test_page_isolated_in_pageblock can check the page caused by above race
and can fail of memory offlining.
Minchan Kim [Mon, 8 Oct 2012 23:32:11 +0000 (16:32 -0700)]
mm: remain migratetype in freed page
The page allocator caches the pageblock information in page->private while
it is in the PCP freelists but this is overwritten with the order of the
page when freed to the buddy allocator. This patch stores the migratetype
of the page in the page->index field so that it is available at all times
when the page remain in free_list.
This patch adds a new call site in __free_pages_ok so it might be overhead
a bit but it's for high order allocation. So I believe damage isn't hurt.