Mel Gorman [Thu, 25 Oct 2012 01:14:33 +0000 (12:14 +1100)]
mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
Jiri Slaby reported the following:
(It's an effective revert of "mm: vmscan: scale number of pages
reclaimed by reclaim/compaction based on failures".)
Given kswapd had hours of runtime in ps/top output yesterday in the
morning and after the revert it's now 2 minutes in sum for the last 24h,
I would say, it's gone.
The intention of the patch in question was to compensate for the loss of
lumpy reclaim. Part of the reason lumpy reclaim worked is because it
aggressively reclaimed pages and this patch was meant to be a sane
compromise.
When compaction fails, it gets deferred and both compaction and
reclaim/compaction is deferred avoid excessive reclaim. However, since
commit c6543459 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up each
time and continues reclaiming which was not taken into account when the
patch was developed.
As it is not taking deferred compaction into account in this path it scans
aggressively before falling out and making the compaction_deferred check
in compaction_ready. This patch avoids kswapd scaling pages for reclaim
and leaves the aggressive reclaim to the process attempting the THP
allocation.
Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Jiri Slaby <jslaby@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
epoll: support for disabling items, and a self-test app
Pending resolution of the issues identifier by Michael Kerrisk:
: I've taken a look at this patch as it currently stands in 3.7-rc1, and
: done a bit of testing. (By the way, the test program
: tools/testing/selftests/epoll/test_epoll.c does not compile...)
:
: There are one or two places where the behavior seems a little strange,
: so I have a question or two at the end of this mail. But other than
: that, I want to check my understanding so that the interface can be
: correctly documented.
:
: Just to go though my understanding, the problem is the following
: scenario in a multithreaded application:
:
: 1. Multiple threads are performing epoll_wait() operations,
: and maintaining a user-space cache that contains information
: corresponding to each file descriptor being monitored by
: epoll_wait().
:
: 2. At some point, a thread wants to delete (EPOLL_CTL_DEL)
: a file descriptor from the epoll interest list, and
: delete the corresponding record from the user-space cache.
:
: 3. The problem with (2) is that some other thread may have
: previously done an epoll_wait() that retrieved information
: about the fd in question, and may be in the middle of using
: information in the cache that relates to that fd. Thus,
: there is a potential race.
:
: 4. The race can't solved purely in user space, because doing
: so would require applying a mutex across the epoll_wait()
: call, which would of course blow thread concurrency.
:
: Right?
:
: Your solution is the EPOLL_CTL_DISABLE operation. I want to
: confirm my understanding about how to use this flag, since
: the description that has accompanied the patches so far
: has been a bit sparse
:
: 0. In the scenario you're concerned about, deleting a file
: descriptor means (safely) doing the following:
: (a) Deleting the file descriptor from the epoll interest list
: using EPOLL_CTL_DEL
: (b) Deleting the corresponding record in the user-space cache
:
: 1. It's only meaningful to use this EPOLL_CTL_DISABLE in
: conjunction with EPOLLONESHOT.
:
: 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in
: conjunction is a logical error.
:
: 3. The correct way to code multithreaded applications using
: EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows:
:
: a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should
: should EPOLLONESHOT.
:
: b. When a thread wants to delete a file descriptor, it
: should do the following:
:
: [1] Call epoll_ctl(EPOLL_CTL_DISABLE)
: [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE)
: was zero, then the file descriptor can be safely
: deleted by the thread that made this call.
: [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY,
: then the descriptor is in use. In this case, the calling
: thread should set a flag in the user-space cache to
: indicate that the thread that is using the descriptor
: should perform the deletion operation.
:
: Is all of the above correct?
:
: The implementation depends on checking on whether
: (events & ~EP_PRIVATE_BITS) == 0
: This replies on the fact that EPOLL_CTL_AD and EPOLL_CTL_MOD always
: set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT
: causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be
: cleared.
:
: A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE
: is only useful in conjunction with EPOLLONESHOT. However, as things
: stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does
: not have EPOLLONESHOT set in 'events' This results in the following
: (slightly surprising) behavior:
:
: (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0
: (the indicator that the file descriptor can be safely deleted).
: (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY.
:
: This doesn't seem particularly useful, and in fact is probably an
: indication that the user made a logic error: they should only be using
: epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which
: EPOLLONESHOT was set in 'events'. If that is correct, then would it
: not make sense to return an error to user space for this case?
Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: "Paton J. Lewis" <palewis@adobe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The compat ioctl for VIDEO_SET_SPU_PALETTE was missing an error check
while converting ioctl arguments. This could lead to leaking kernel stack
contents into userspace.
Patch extracted from existing fix in grsecurity.
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: David Miller <davem@davemloft.net> Cc: Brad Spengler <spender@grsecurity.net> Cc: PaX Team <pageexec@freemail.hu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kees Cook [Thu, 25 Oct 2012 01:13:58 +0000 (12:13 +1100)]
gen_init_cpio: avoid stack overflow when expanding
Fix possible overflow of the buffer used for expanding environment
variables when building file list. In the extremely unlikely case of an
attacker having control over the environment variables visible to
gen_init_cpio, control over the contents of the file gen_init_cpio parses,
and gen_init_cpio was built without compiler hardening, the attacker can
gain arbitrary execution control via a stack buffer overflow.
Signed-off-by: Jan Luebbe <jlu@pengutronix.de> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Roland Stigge <stigge@antcom.de> Cc: Grant Likely <grant.likely@secretlab.ca> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Thu, 25 Oct 2012 01:13:57 +0000 (12:13 +1100)]
mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distant
Commit 957f822a0ab9 ("mm, numa: reclaim from all nodes within reclaim
distance") caused zone_reclaim_mode to be set for all systems where two
nodes are within RECLAIM_DISTANCE of each other. This is the opposite of
what we actually want: zone_reclaim_mode should be set if two nodes are
sufficiently distant.
Andrew Vagin [Thu, 25 Oct 2012 01:13:57 +0000 (12:13 +1100)]
pidns: limit the nesting depth of pid namespaces
'struct pid' is a "variable sized struct" - a header with an array of
upids at the end.
The size of the array depends on a level (depth) of pid namespaces. Now a
level of pidns is not limited, so 'struct pid' can be more than one page.
Looks reasonable, that it should be less than a page. MAX_PIS_NS_LEVEL is
not calculated from PAGE_SIZE, because in this case it depends on
architectures, config options and it will be reduced, if someone adds a
new fields in struct pid or struct upid.
I suggest to set MAX_PIS_NS_LEVEL = 32, because it saves ability to expand
"struct pid" and it's more than enough for all known for me use-cases.
When someone finds a reasonable use case, we can add a config option or a
sysctl parameter.
In addition it will reduce the effect of another problem, when we have
many nested namespaces and the oldest one starts dying.
zap_pid_ns_processe will be called for each namespace and find_vpid will
be called for each process in a namespace. find_vpid will be called
minimum max_level^2 / 2 times. The reason of that is that when we found a
bit in pidmap, we can't determine this pidns is top for this process or it
isn't.
vpid is a heavy operation, so a fork bomb, which create many nested
namespace, can make a system inaccessible for a long time. For example my
system becomes inaccessible for a few minutes with 4000 processes.
Signed-off-by: Andrew Vagin <avagin@openvz.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hein Tibosch [Thu, 25 Oct 2012 01:13:56 +0000 (12:13 +1100)]
drivers/dma/dw_dmac: make driver's endianness configurable
The dw_dmac driver was originally developed for avr32 to be used with the
Synopsys DesignWare AHB DMA controller. Starting from 2.6.38, access to
the device's i/o memory was done with the little-endian readl/writel
functions(1)
This broke the driver for the avr32 platform, because it needs big
(native) endian accessors. This patch makes the endianness configurable
using 'DW_DMAC_BIG_ENDIAN_IO', which will default be true for AVR32
I submitted this patch before(2) but then waited for Andy to finish other
changes to the same module(3).
Gavin Shan [Thu, 25 Oct 2012 01:13:56 +0000 (12:13 +1100)]
mm/mmu_notifier: allocate mmu_notifier in advance
While allocating mmu_notifier with parameter GFP_KERNEL, swap would start
to work in case of tight available memory. Eventually, that would lead to
a deadlock while the swap deamon swaps anonymous pages. It was caused by
commit e0f3c3f78da29b ("mm/mmu_notifier: init notifier if necessary").
Latest Linus head run of "make selftests" in the tools directory failed
with references to undefined variables. Reference was to
'write_thread_data' which is the name of a struct that is being used, not
the variable itself. Change reference so it points to the variable.
Signed-off-by: Daniel Hazelton <dshadowwolf@gmail.com> Cc: "Paton J. Lewis" <palewis@adobe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Howells [Thu, 25 Oct 2012 01:13:55 +0000 (12:13 +1100)]
UAPI: fix tools/vm/page-types.c
Fix tools/vm/page-types.c to use the UAPI variant of linux/kernel-page-flags.h
lest the following error appear:
In file included from page-types.c:38:0:
../../include/linux/kernel-page-flags.h:4:42: fatal error:
uapi/linux/kernel-page-flags.h: No such file or directory
Reported-by: Daniel Hazelton <dshadowwolf@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Daniel Hazelton <dshadowwolf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bob Liu [Thu, 25 Oct 2012 01:13:55 +0000 (12:13 +1100)]
mm/page_alloc.c:alloc_contig_range(): return early for err path
If start_isolate_page_range() failed, unset_migratetype_isolate() has been
done inside it.
Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Ni zhan Chen <nizhan.chen@gmail.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Will Deacon [Thu, 25 Oct 2012 01:13:54 +0000 (12:13 +1100)]
rbtree: include linux/compiler.h for definition of __always_inline
rb_erase_augmented() is a static function annotated with __always_inline.
This causes a compile failure when attempting to use the rbtree
implementation as a library (e.g. kvm tool):
rbtree_augmented.h:125:24: error: expected `=', `,', `;', `asm' or `__attribute__' before `void'
Include linux/compiler.h in rbtree_augmented.h so that the __always_inline
macro is resolved correctly.
Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Pekka Enberg <penberg@kernel.org> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
genalloc: stop crashing the system when destroying a pool
The genalloc code uses the bitmap API from include/linux/bitmap.h and
lib/bitmap.c, which is based on long values. Both bitmap_set from
lib/bitmap.c and bitmap_set_ll, which is the lockless version from
genalloc.c, use BITMAP_LAST_WORD_MASK to set the first bits in a long in
the bitmap.
That one uses (1 << bits) - 1, 0b111, if you are setting the first three
bits. This means that the API counts from the least significant bits
(LSB from now on) to the MSB. The LSB in the first long is bit 0, then.
The same works for the lookup functions.
The genalloc code uses longs for the bitmap, as it should. In
include/linux/genalloc.h, struct gen_pool_chunk has unsigned long
bits[0] as its last member. When allocating the struct, genalloc should
reserve enough space for the bitmap. This should be a proper number of
longs that can fit the amount of bits in the bitmap.
However, genalloc allocates an integer number of bytes that fit the
amount of bits, but may not be an integer amount of longs. 9 bytes, for
example, could be allocated for 70 bits.
This is a problem in itself if the Least Significat Bit in a long is in
the byte with the largest address, which happens in Big Endian machines.
This means genalloc is not allocating the byte in which it will try to
set or check for a bit.
This may end up in memory corruption, where genalloc will try to set the
bits it has not allocated. In fact, genalloc may not set these bits
because it may find them already set, because they were not zeroed
since they were not allocated. And that's what causes a BUG when
gen_pool_destroy is called and check for any set bits.
What really happens is that genalloc uses kmalloc_node with __GFP_ZERO
on gen_pool_add_virt. With SLAB and SLUB, this means the whole slab will
be cleared, not only the requested bytes. Since struct gen_pool_chunk
has a size that is a multiple of 8, and slab sizes are multiples of 8,
we get lucky and allocate and clear the right amount of bytes.
Hower, this is not the case with SLOB or with older code that did memset
after allocating instead of using __GFP_ZERO.
So, a simple module as this (running 3.6.0), will cause a crash when
rmmod'ed.
Jingoo Han [Thu, 25 Oct 2012 01:13:54 +0000 (12:13 +1100)]
backlight: ili9320: add missing SPI dependency
Add this missing SPI dependency and prevent the driver from building
without SPI, because functions of the spi driver are used in this driver.
drivers/video/backlight/ili9320.c:51: undefined reference to `spi_sync'
Also, a prompt string for CONFIG_LCD_ILI9320 is added for explicit
selection.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Ben Dooks <ben-linux@fluff.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Aristeu Rozanski [Thu, 25 Oct 2012 01:13:53 +0000 (12:13 +1100)]
device_cgroup: stop using simple_strtoul()
Convert the code to use kstrtou32() instead of simple_strtoul() which is
deprecated. The real size of the variables are u32, so use kstrtou32
instead of kstrtoul
Signed-off-by: Aristeu Rozanski <aris@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: James Morris <jmorris@namei.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Aristeu Rozanski [Thu, 25 Oct 2012 01:13:53 +0000 (12:13 +1100)]
device_cgroup: rename deny_all to behavior
This was done in a v2 patch but v1 ended up being committed. The variable
name is less confusing and stores the default behavior when no matching
exception exists.
Signed-off-by: Aristeu Rozanski <aris@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: James Morris <jmorris@namei.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiri Slaby [Thu, 25 Oct 2012 01:13:52 +0000 (12:13 +1100)]
cgroup: fix invalid rcu dereference
Commit ad676077a2ae ("device_cgroup: convert device_cgroup internally to
policy + exceptions") removed rcu locks which are needed in task_devcgroup
called in this chain: devcgroup_inode_mknod OR
__devcgroup_inode_permission -> __devcgroup_inode_permission ->
task_devcgroup -> task_subsys_state -> task_subsys_state_check.
Change the code so that task_devcgroup is safely called with rcu read
lock held.
Jan Kara [Thu, 25 Oct 2012 01:13:52 +0000 (12:13 +1100)]
mm: fix XFS oops due to dirty pages without buffers on s390
On s390 any write to a page (even from kernel itself) sets architecture
specific page dirty bit. Thus when a page is written to via buffered
write, HW dirty bit gets set and when we later map and unmap the page,
page_remove_rmap() finds the dirty bit and calls set_page_dirty().
Dirtying of a page which shouldn't be dirty can cause all sorts of
problems to filesystems. The bug we observed in practice is that buffers
from the page get freed, so when the page gets later marked as dirty and
writeback writes it, XFS crashes due to an assertion
BUG_ON(!PagePrivate(page)) in page_buffers() called from
xfs_count_page_state().
Similar problem can also happen when zero_user_segment() call from
xfs_vm_writepage() (or block_write_full_page() for that matter) set the
hardware dirty bit during writeback, later buffers get freed, and then
page unmapped.
Fix the issue by ignoring s390 HW dirty bit for page cache pages of
mappings with mapping_cap_account_dirty(). This is safe because for such
mappings when a page gets marked as writeable in PTE it is also marked
dirty in do_wp_page() or do_page_fault(). When the dirty bit is cleared
by clear_page_dirty_for_io(), the page gets writeprotected in
page_mkclean(). So pagecache page is writeable if and only if it is
dirty.
Thanks to Hugh Dickins for pointing out mapping has to have
mapping_cap_account_dirty() for things to work and proposing a cleaned up
variant of the patch.
The patch has survived about two hours of running fsx-linux on tmpfs while
heavily swapping and several days of running on out build machines where
the original problem was triggered.
Signed-off-by: Jan Kara <jack@suse.cz> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <stable@vger.kernel.org> [3.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>