Arun Sharma [Thu, 22 Dec 2011 00:15:40 +0000 (16:15 -0800)]
sched/tracing: Add a new tracepoint for sleeptime
If CONFIG_SCHEDSTATS is defined, the kernel maintains
information about how long the task was sleeping or
in the case of iowait, blocking in the kernel before
getting woken up.
This will be useful for sleep time profiling.
Note: this information is only provided for sched_fair.
Other scheduling classes may choose to provide this in
the future.
Note: the delay includes the time spent on the runqueue
as well.
Signed-off-by: Arun Sharma <asharma@fb.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1324512940-32060-2-git-send-email-asharma@fb.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
As a result, vruntime of the process becomes far bigger than min_vruntime,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.
This patch fixes this problem by just ignoring such process in
task_move_group_fair(), because the vruntime has already been normalized in
task_waking_fair().
sched: Fix cgroup movement of newly created process
There is a small race between do_fork() and sched_move_task(), which is
trying to move the child.
do_fork() sched_move_task()
--------------------------------+---------------------------------
copy_process()
sched_fork()
task_fork_fair()
-> vruntime of the child is initialized
based on that of the parent.
-> we can see the child in "tasks" file now.
task_rq_lock()
task_move_group_fair()
-> child.se.vruntime
-= (old)cfs_rq->min_vruntime
+= (new)cfs_rq->min_vruntime
task_rq_unlock()
wake_up_new_task()
...
enqueue_entity()
child.se.vruntime += cfs_rq->min_vruntime
As a result, vruntime of the child becomes far bigger than min_vruntime,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.
This patch fixes this problem by just ignoring such process in
task_move_group_fair(), because the vruntime has already been normalized in
task_fork_fair().
There is a small race between task_fork_fair() and sched_move_task(),
which is trying to move the parent.
task_fork_fair() sched_move_task()
--------------------------------+---------------------------------
cfs_rq = task_cfs_rq(current)
-> cfs_rq is the "old" one.
curr = cfs_rq->curr
-> curr is set to the parent.
task_rq_lock()
dequeue_task()
->parent.se.vruntime -= (old)cfs_rq->min_vruntime
enqueue_task()
->parent.se.vruntime += (new)cfs_rq->min_vruntime
task_rq_unlock()
raw_spin_lock_irqsave(rq->lock)
se->vruntime = curr->vruntime
-> vruntime of the child is set to that of the parent
which has already been updated by sched_move_task().
se->vruntime -= (old)cfs_rq->min_vruntime.
raw_spin_unlock_irqrestore(rq->lock)
As a result, vruntime of the child becomes far bigger than expected,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.
This patch fixes this problem by setting "cfs_rq" and "curr" after
holding the rq->lock.
Kamalesh Babulal [Sat, 10 Dec 2011 13:59:25 +0000 (19:29 +0530)]
sched: Remove cfs bandwidth period check in tg_set_cfs_period()
Remove cfs bandwidth period check from tg_set_cfs_period.
Invalid bandwidth period's lower/upper limits are denoted
by min_cfs_quota_period/max_cfs_quota_period repsectively,
and are checked against valid period in tg_set_cfs_bandwidth().
As pjt pointed out, negative input will result in very large unsigned
numbers and will be caught by the max allowed period test.
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Acked-by: Paul Turner <pjt@google.com>
[ammended changelog to mention negative values] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111210135925.GA14593@linux.vnet.ibm.com
--
kernel/sched/core.c | 3 ---
1 file changed, 3 deletions(-)
Peter Zijlstra [Thu, 22 Sep 2011 13:30:18 +0000 (15:30 +0200)]
sched: Fix load-balance lock-breaking
The current lock break relies on contention on the rq locks, something
which might never come because we've got IRQs disabled. Or will be
very likely because on anything with more than 2 cpus a synchronized
load-balance pass will very likely cause contention on the rq locks.
Also the sched_nr_migrate thing fails when it gets trapped the loops
of either the cgroup muck in load_balance_fair() or the move_tasks()
load condition.
Instead, use the new lb_flags field to propagate break/abort
conditions for all these loops and create a new loop outside the irq
disabled on the break being required.
For 32-bit architectures using standard jiffies the idletime calculation
in uptime_proc_show will quickly overflow. It takes (2^32 / HZ) seconds
of idle-time, or e.g. 12.45 days with no load on a quad-core with HZ=1000.
Switch to 64-bit calculations.
Cc: stable@vger.kernel.org Cc: Michael Abbott <michael.abbott@diamond.ac.uk> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Make cputime_t and cputime64_t nocast to enable sparse checking to
detect incorrect use of cputime. Drop the cputime macros for simple
scalar operations. The conversion macros are still needed.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Linus Torvalds [Thu, 15 Dec 2011 02:25:58 +0000 (18:25 -0800)]
Merge tag 'tytso-for-linus-20111214' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* tag 'tytso-for-linus-20111214' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: handle EOF correctly in ext4_bio_write_page()
ext4: remove a wrong BUG_ON in ext4_ext_convert_to_initialized
ext4: correctly handle pages w/o buffers in ext4_discard_partial_buffers()
ext4: avoid potential hang in mpage_submit_io() when blocksize < pagesize
ext4: avoid hangs in ext4_da_should_update_i_disksize()
ext4: display the correct mount option in /proc/mounts for [no]init_itable
ext4: Fix crash due to getting bogus eh_depth value on big-endian systems
ext4: fix ext4_end_io_dio() racing against fsync()
.. using the new signed tag merge of git that now verifies the gpg
signature automatically. Yay. The branchname was just 'dev', which is
prettier. I'll tell Ted to use nicer tag names for future cases.
Linus Torvalds [Thu, 15 Dec 2011 02:22:55 +0000 (18:22 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/ncpfs: fix error paths and goto statements in ncp_fill_super()
configfs: register_filesystem() called too early
fuse: register_filesystem() called too early
ubifs: too early register_filesystem()
... and the same kind of leak for mqueue
procfs: fix a vfsmount longterm reference leak
Yongqiang Yang [Wed, 14 Dec 2011 02:51:55 +0000 (21:51 -0500)]
ext4: avoid potential hang in mpage_submit_io() when blocksize < pagesize
If there is an unwritten but clean buffer in a page and there is a
dirty buffer after the buffer, then mpage_submit_io does not write the
dirty buffer out. As a result, da_writepages loops forever.
This patch fixes the problem by checking dirty flag.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
Andrea Arcangeli [Wed, 14 Dec 2011 02:41:15 +0000 (21:41 -0500)]
ext4: avoid hangs in ext4_da_should_update_i_disksize()
If the pte mapping in generic_perform_write() is unmapped between
iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(), the
"copied" parameter to ->end_write can be zero. ext4 couldn't cope with
it with delayed allocations enabled. This skips the i_disksize
enlargement logic if copied is zero and no new data was appeneded to
the inode.
gdb> bt
#0 0xffffffff811afe80 in ext4_da_should_update_i_disksize (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x1\
08000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2467
#1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
#2 0xffffffff810d97f1 in generic_perform_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value o\
ptimized out>, pos=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2440
#3 generic_file_buffered_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value optimized out>, p\
os=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2482
#4 0xffffffff810db5d1 in __generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, ppos=0\
xffff88001e26be40) at mm/filemap.c:2600
#5 0xffffffff810db853 in generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=<value optimi\
zed out>, pos=<value optimized out>) at mm/filemap.c:2632
#6 0xffffffff811a71aa in ext4_file_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, pos=0x108000) a\
t fs/ext4/file.c:136
#7 0xffffffff811375aa in do_sync_write (filp=0xffff88003f606a80, buf=<value optimized out>, len=<value optimized out>, \
ppos=0xffff88001e26bf48) at fs/read_write.c:406
#8 0xffffffff81137e56 in vfs_write (file=0xffff88003f606a80, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x4\
000, pos=0xffff88001e26bf48) at fs/read_write.c:435
#9 0xffffffff8113816c in sys_write (fd=<value optimized out>, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x\
4000) at fs/read_write.c:487
#10 <signal handler called>
#11 0x00007f120077a390 in __brk_reservation_fn_dmi_alloc__ ()
#12 0x0000000000000000 in ?? ()
gdb> print offset
$22 = 0xffffffffffffffff
gdb> print idx
$23 = 0xffffffff
gdb> print inode->i_blkbits
$24 = 0xc
gdb> up
#1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
2512 if (ext4_da_should_update_i_disksize(page, end)) {
gdb> print start
$25 = 0x0
gdb> print end
$26 = 0xffffffffffffffff
gdb> print pos
$27 = 0x108000
gdb> print new_i_size
$28 = 0x108000
gdb> print ((struct ext4_inode_info *)((char *)inode-((int)(&((struct ext4_inode_info *)0)->vfs_inode))))->i_disksize
$29 = 0xd9000
gdb> down
2467 for (i = 0; i < idx; i++)
gdb> print i
$30 = 0xd44acbee
This is 100% reproducible with some autonuma development code tuned in
a very aggressive manner (not normal way even for knumad) which does
"exotic" changes to the ptes. It wouldn't normally trigger but I don't
see why it can't happen normally if the page is added to swap cache in
between the two faults leading to "copied" being zero (which then
hangs in ext4). So it should be fixed. Especially possible with lumpy
reclaim (albeit disabled if compaction is enabled) as that would
ignore the young bits in the ptes.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
Linus Torvalds [Tue, 13 Dec 2011 23:02:31 +0000 (15:02 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "x86, efi: Calling __pa() with an ioremap()ed address is invalid"
x86, efi: Make efi_call_phys_{prelog,epilog} CONFIG_RELOCATABLE-aware
Linus Torvalds [Tue, 13 Dec 2011 22:59:42 +0000 (14:59 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: add missing spin_unlock at ceph_mdsc_build_path()
ceph: fix SEEK_CUR, SEEK_SET regression
crush: fix mapping calculation when force argument doesn't exist
ceph: use i_ceph_lock instead of i_lock
rbd: remove buggy rollback functionality
rbd: return an error when an invalid header is read
ceph: fix rasize reporting by ceph_show_options
Linus Torvalds [Tue, 13 Dec 2011 22:58:56 +0000 (14:58 -0800)]
Merge branch 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
* 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: set max_pause to lowest value on zero bdi_dirty
writeback: permit through good bdi even when global dirty exceeded
writeback: comment on the bdi dirty threshold
fs: Make write(2) interruptible by a fatal signal
writeback: Fix issue on make htmldocs
Sage Weil [Tue, 13 Dec 2011 17:19:26 +0000 (09:19 -0800)]
ceph: fix SEEK_CUR, SEEK_SET regression
Commit 06222e491e663dac939f04b125c9dc52126a75c4 got the if wrong so that
it always evaluates as true. This is semantically harmless, but makes
SEEK_CUR and SEEK_SET needlessly query the server.
Rewrite the if to explicitly enumerate the cases we DO need a valid i_size
to make this code less fragile.
Reported-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
Miklos Szeredi [Tue, 13 Dec 2011 10:40:59 +0000 (11:40 +0100)]
fuse: llseek fix race
Fix race between lseek(fd, 0, SEEK_CUR) and read/write. This was fixed in
generic code by commit 5b6f1eb97d (vfs: lseek(fd, 0, SEEK_CUR) race condition).
Roel Kluin [Tue, 13 Dec 2011 09:37:00 +0000 (10:37 +0100)]
fuse: fix llseek bug
The test in fuse_file_llseek() "not SEEK_CUR or not SEEK_SET" always evaluates
to true.
This was introduced in 3.1 by commit 06222e49 (fs: handle SEEK_HOLE/SEEK_DATA
properly in all fs's that define their own llseek) and changed the behavior of
SEEK_CUR and SEEK_SET to always retrieve the file attributes. This is a
performance regression.
Fix the test so that it makes sense.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@vger.kernel.org CC: Josef Bacik <josef@redhat.com> CC: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Tue, 13 Dec 2011 06:06:55 +0000 (22:06 -0800)]
linux/log2.h: Fix rounddown_pow_of_two(1)
Exactly like roundup_pow_of_two(1), the rounddown version was buggy for
the case of a compile-time constant '1' argument. Probably because it
originated from the same code, sharing history with the roundup version
from before the bugfix (for that one, see commit 1a06a52ee1b0: "Fix
roundup_pow_of_two(1)").
However, unlike the roundup version, the fix for rounddown is to just
remove the broken special case entirely. It's simply not needed - the
generic code
1UL << ilog2(n)
does the right thing for the constant '1' argment too. The only reason
roundup needed that special case was because rounding up does so by
subtracting one from the argument (and then adding one to the result)
causing the obvious problems with "ilog2(0)".
But rounddown doesn't do any of that, since ilog2() naturally truncates
(ie "rounds down") to the right rounded down value. And without the
ilog2(0) case, there's no reason for the special case that had the wrong
value.
tl;dr: rounddown_pow_of_two(1) should be 1, not 0.
Linus Torvalds [Tue, 13 Dec 2011 04:06:13 +0000 (20:06 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc:
mmc: core: Fix deadlock when the CONFIG_MMC_UNSAFE_RESUME is not defined
mmc: sdhci-s3c: Remove old and misprototyped suspend operations
mmc: tmio: fix clock gating on platforms with a .set_pwr() method
mmc: sh_mmcif: fix clock gating on platforms with a .down_pwr() method
mmc: core: Fix typo at mmc_card_sleep
mmc: core: Fix power_off_notify during suspend
mmc: core: Fix setting power notify state variable for non-eMMC
mmc: core: Add quirk for long data read time
mmc: Add module.h include to sdhci-cns3xxx.c
mmc: mxcmmc: fix falling back to PIO
mmc: omap_hsmmc: DMA unmap only once in case of MMC error
Theodore Ts'o [Tue, 13 Dec 2011 03:06:18 +0000 (22:06 -0500)]
ext4: display the correct mount option in /proc/mounts for [no]init_itable
/proc/mounts was showing the mount option [no]init_inode_table when
the correct mount option that will be accepted by parse_options() is
[no]init_itable.
Paul Mackerras [Mon, 12 Dec 2011 16:00:56 +0000 (11:00 -0500)]
ext4: Fix crash due to getting bogus eh_depth value on big-endian systems
Commit 1939dd84b3 ("ext4: cleanup ext4_ext_grow_indepth code") added a
reference to ext4_extent_header.eh_depth, but forget to pass the value
read through le16_to_cpu. The result is a crash on big-endian
machines, such as this crash on a POWER7 server:
This is due to getting ext_depth(inode) == 0x101 and therefore running
off the end of the path array in ext4_ext_drop_refs into following
unallocated structures.
This fixes it by adding the necessary le16_to_cpu.
Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Theodore Ts'o [Mon, 12 Dec 2011 15:53:02 +0000 (10:53 -0500)]
ext4: fix ext4_end_io_dio() racing against fsync()
We need to make sure iocb->private is cleared *before* we put the
io_end structure on i_completed_io_list. Otherwise fsync() could
potentially run on another CPU and free the iocb structure out from
under us.
Reported-by: Kent Overstreet <koverstreet@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
arm_dma_zone_size is used by arm_bootmem_free() which is called by
paging_init(). Thus it needs to be set before calling it.
Signed-off-by: Arnaud Patard <arnaud.patard@rtp-net.org> Acked-by: Nicolas Pitre <nico@linaro.org> Cc: stable@kernel.org Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
mmc: core: Fix deadlock when the CONFIG_MMC_UNSAFE_RESUME is not defined
mmc_suspend_host() tries to claim host during suspend
and release it only when the bus suspend operation is
compeleted. If CONFIG_MMC_UNSAFE_RESUME is defined and
the host is flagged as removable, mmc_suspend_host()
tries to remove the card. In this process, the file system
sync can get blocked trying to acquire host which is already
claimed by mmc_suspend_host() causing deadlock.
Fix this deadlock by releasing host before ->remove() is called.
Mark Brown [Mon, 21 Nov 2011 18:00:51 +0000 (18:00 +0000)]
mmc: sdhci-s3c: Remove old and misprototyped suspend operations
Now that the driver is using dev_pm_ops the suspend operations in the
platform_driver structure won't get called so don't need to be there,
and certainly shouldn't be the same function as dev_pm_ops since the
signatures are different.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Acked-by: Jaehoon Chung <jh80.chung@samsung.com> Signed-off-by: Chris Ball <cjb@laptop.org>
Kyungmin Park [Thu, 17 Nov 2011 04:34:33 +0000 (13:34 +0900)]
mmc: core: Fix typo at mmc_card_sleep
Fix wrong bus_ops->sleep check. (This isn't expected to have real-world
consequences, because the mmc core always defines both 'awake' and
'sleep' ops.)
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Chris Ball <cjb@laptop.org>
Girish K S [Tue, 15 Nov 2011 06:25:46 +0000 (11:55 +0530)]
mmc: core: Fix power_off_notify during suspend
The eMMC 4.5 devices respond to only RESET and AWAKE command in the
sleep state. Hence the mmc switch command to notify power off state
should be sent before the device enters sleep state.
This patch fixes the same.
Signed-off-by: Girish K S <girish.shivananjappa@linaro.org> Signed-off-by: Chris Ball <cjb@laptop.org>
Girish K S [Fri, 4 Nov 2011 10:52:47 +0000 (16:22 +0530)]
mmc: core: Fix setting power notify state variable for non-eMMC
This patch skips the setting of the power notify state variable
for non eMMC 4.5 devices. Also fixes the problem of omap_hsmmc
noisy/broken for suspend resume reported by Kevin Hilman.
Signed-off-by: Girish K S <girish.shivananjappa@linaro.org> Acked-by: Ulf Hansson <ulf.hansson@stericsson.com> Signed-off-by: Chris Ball <cjb@laptop.org>
Adds a quirk that sets the data read timeout to a fixed value instead
of relying on the information in the CSD. The timeout value chosen
is 300ms since that has proven enough for the problematic cards found,
but could be increased if other cards require this.
This patch also enables this quirk for certain Micron cards known to
have this problem.
Signed-off-by: Stefan Nilsson XK <stefan.xk.nilsson@stericsson.com> Signed-off-by: Ulf Hansson <ulf.hansson@stericsson.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Cc: <stable@kernel.org> Signed-off-by: Chris Ball <cjb@laptop.org>
Sascha Hauer [Fri, 11 Nov 2011 15:28:05 +0000 (16:28 +0100)]
mmc: mxcmmc: fix falling back to PIO
When we can't configure the dma channel we want to fall
back to PIO. We do this by setting host->do_dma to zero.
This does not work as do_dma is used to see whether dma
can be used for the current transfer. Instead, we have
to set host->dma to NULL.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Cc: <stable@kernel.org> Signed-off-by: Chris Ball <cjb@laptop.org>
Matt Fleming [Thu, 8 Dec 2011 12:10:50 +0000 (12:10 +0000)]
x86, efi: Make efi_call_phys_{prelog,epilog} CONFIG_RELOCATABLE-aware
efi_call_phys_prelog() sets up a 1:1 mapping of the physical address
range in swapper_pg_dir. Instead of replacing then restoring entries
in swapper_pg_dir we should be using initial_page_table which already
contains the 1:1 mapping.
It's safe to blindly switch back to swapper_pg_dir in the epilog
because the physical EFI routines are only called before
efi_enter_virtual_mode(), e.g. before any user processes have been
forked. Therefore, we don't need to track which pgd was in %cr3 when
we entered the prelog.
The previous code actually contained a bug because it assumed that the
kernel was loaded at a physical address within the first 8MB of ram,
usually at 0x100000. However, this isn't the case with a
CONFIG_RELOCATABLE=y kernel which could have been loaded anywhere in
the physical address space.
Also delete the ancient (and bogus) comments about the page table
being restored after the lock is released. There is no locking.
Linus Torvalds [Fri, 9 Dec 2011 22:45:44 +0000 (14:45 -0800)]
Merge git://git.samba.org/sfrench/cifs-2.6
* git://git.samba.org/sfrench/cifs-2.6:
cifs: check for NULL last_entry before calling cifs_save_resume_key
cifs: attempt to freeze while looping on a receive attempt
cifs: Fix sparse warning when calling cifs_strtoUCS
CIFS: Add descriptions to the brlock cache functions
Linus Torvalds [Fri, 9 Dec 2011 22:45:12 +0000 (14:45 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, efi: Calling __pa() with an ioremap()ed address is invalid
x86, hpet: Immediately disable HPET timer 1 if rtc irq is masked
x86/intel_mid: Kconfig select fix
x86/intel_mid: Fix the Kconfig for MID selection
Linus Torvalds [Fri, 9 Dec 2011 22:41:50 +0000 (14:41 -0800)]
Merge branch 'spi/for-3.2' of git://git.pengutronix.de/git/wsa/linux-2.6
* 'spi/for-3.2' of git://git.pengutronix.de/git/wsa/linux-2.6:
spi/gpio: fix section mismatch warning
spi/fsl-espi: disable CONFIG_SPI_FSL_ESPI=m build
spi/nuc900: Include linux/module.h
spi/ath79: fix compile error due to missing include
Linus Torvalds [Fri, 9 Dec 2011 16:18:08 +0000 (08:18 -0800)]
Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
md: raid5 crash during degradation
md/raid5: never wait for bad-block acks on failed device.
md: ensure new badblocks are handled promptly.
md: bad blocks shouldn't cause a Blocked status on a Faulty device.
md: take a reference to mddev during sysfs access.
md: refine interpretation of "hold_active == UNTIL_IOCTL".
md/lock: ensure updates to page_attrs are properly locked.
* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
arch/tile: use new generic {enable,disable}_percpu_irq() routines
drivers/net/ethernet/tile: use skb_frag_page() API
asm-generic/unistd.h: support new process_vm_{readv,write} syscalls
arch/tile: fix double-free bug in homecache_free_pages()
arch/tile: add a few #includes and an EXPORT to catch up with kernel changes.
Linus Torvalds [Fri, 9 Dec 2011 16:07:42 +0000 (08:07 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek - Fix lost speaker volume controls
ALSA: hda/realtek - Create "Bass Speaker" for two speaker pins
ALSA: hda/realtek - Don't create extra controls with channel suffix
ALSA: hda - Fix remaining VREF mute-LED NID check in post-3.1 changes
ALSA: hda - Fix GPIO LED setup for IDT 92HD75 codecs
ASoC: Provide a more complete DMA driver stub
ASoC: Remove references to corgi and spitz from machine driver document
ASoC: Make SND_SOC_MX27VIS_AIC32X4 depend on I2C
ASoC: Fix dependency for SND_SOC_RAUMFELD and SND_PXA2XX_SOC_HX4700
ASoC: uda1380: Return proper error in uda1380_modinit failure path
ASoC: kirkwood: Make SND_KIRKWOOD_SOC_OPENRD and SND_KIRKWOOD_SOC_T5325 depend on I2C
ASoC: Mark WM8994 ADC muxes as virtual
ALSA: hda/realtek - Fix Oops in alc_mux_select()
ALSA: sis7019 - give slow codecs more time to reset
Linus Torvalds [Fri, 9 Dec 2011 16:07:24 +0000 (08:07 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Do no try to schedule task events if there are none
lockdep, kmemcheck: Annotate ->lock in lockdep_init_map()
perf header: Use event_name() to get an event name
perf stat: Failure with "Operation not supported"
Modify initialization of PCIe capability registers in Tsi721 mport driver:
- change Completion Timeout value to avoid unexpected data transfer
aborts during intensive traffic.
- replace hardcoded offset of PCIe capability block by making it use the
common function.
This patch is applicable to kernel versions starting from 3.2-rc1.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Matt Porter <mporter@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug fix for Tsi721 RapidIO mport driver: Tsi721 supports four RapidIO
mailboxes (MBOX0 - MBOX3) as defined by RapidIO specification. Mailbox
resources has to be properly reported to allow use of all available
mailboxes (initial version reports only MBOX0).
This patch is applicable to kernel versions staring from 3.2-rc1.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Matt Porter <mporter@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Thu, 8 Dec 2011 22:34:32 +0000 (14:34 -0800)]
procfs: do not overflow get_{idle,iowait}_time for nohz
Since commit a25cac5198d4 ("proc: Consider NO_HZ when printing idle and
iowait times") we are reporting idle/io_wait time also while a CPU is
tickless. We rely on get_{idle,iowait}_time functions to retrieve
proper data.
These functions, however, use usecs_to_cputime to translate micro
seconds time to cputime64_t. This is just an alias to usecs_to_jiffies
which reduces the data type from u64 to unsigned int and also checks
whether the given parameter overflows jiffies_to_usecs(MAX_JIFFY_OFFSET)
and returns MAX_JIFFY_OFFSET in that case.
When we overflow depends on CONFIG_HZ but especially for CONFIG_HZ_300
it is quite low (1431649781) so we are getting MAX_JIFFY_OFFSET for
>3000s! until we overflow unsigned int. Just for reference
CONFIG_HZ_100 has an overflow window around 20s, CONFIG_HZ_250 ~8s and
CONFIG_HZ_1000 ~2s.
This results in a bug when people saw [h]top going mad reporting 100%
CPU usage even though there was basically no CPU load. The reason was
simply that /proc/stat stopped reporting idle/io_wait changes (and
reported MAX_JIFFY_OFFSET) and so the only change happening was for user
system time.
Let's use nsecs_to_jiffies64 instead which doesn't reduce the precision
to 32b type and it is much more appropriate for cumulative time values
(unlike usecs_to_jiffies which intended for timeout calculations).
Signed-off-by: Michal Hocko <mhocko@suse.cz> Tested-by: Artem S. Tashkinov <t.artem@mailcity.com> Cc: Dave Jones <davej@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Thu, 8 Dec 2011 22:34:30 +0000 (14:34 -0800)]
mm: vmalloc: check for page allocation failure before vmlist insertion
Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
/proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
it is fully initialised. Unfortunately, it did not check that
__vmalloc_area_node() successfully populated the area. In the event of
allocation failure, the vmalloc area is freed but the pointer to freed
memory is inserted into the vmlist leading to a a crash later in
get_vmalloc_info().
This patch adds a check for ____vmalloc_area_node() failure within
__vmalloc_node_range. It does not use "goto fail" as in the previous
error path as a warning was already displayed by __vmalloc_area_node()
before it called vfree in its failure path.
Credit goes to Luciano Chavez for doing all the real work of identifying
exactly where the problem was.
Michal Hocko [Thu, 8 Dec 2011 22:34:27 +0000 (14:34 -0800)]
mm: Ensure that pfn_valid() is called once per pageblock when reserving pageblocks
setup_zone_migrate_reserve() expects that zone->start_pfn starts at
pageblock_nr_pages aligned pfn otherwise we could access beyond an
existing memblock resulting in the following panic if
CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:
We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
highstart_pfn = 0x36ffe.
The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
in a pageblock is reserved before marking it MIGRATE_RESERVE").
Make sure that start_pfn is always aligned to pageblock_nr_pages to
ensure that pfn_valid s always called at the start of each pageblock.
Architectures with holes in pageblocks will be correctly handled by
pfn_valid_within in pageblock_is_reserved.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Dang Bo <bdang@vmware.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Arve Hjnnevg <arve@android.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> [3.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Youquan Song [Thu, 8 Dec 2011 22:34:18 +0000 (14:34 -0800)]
thp: set compound tail page _count to zero
Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
page_tail->_count zero at all times. But the current kernel does not
set page_tail->_count to zero if a 1GB page is utilized. So when an
IOMMU 1GB page is used by KVM, it wil result in a kernel oops because a
tail page's _count does not equal zero.
Youquan Song [Thu, 8 Dec 2011 22:34:16 +0000 (14:34 -0800)]
thp: add compound tail page _mapcount when mapped
With the 3.2-rc kernel, IOMMU 2M pages in KVM works. But when I tried
to use IOMMU 1GB pages in KVM, I encountered an oops and the 1GB page
failed to be used.
The root cause is that 1GB page allocation calls gup_huge_pud() while 2M
page calls gup_huge_pmd. If compound pages are used and the page is a
tail page, gup_huge_pmd() increases _mapcount to record tail page are
mapped while gup_huge_pud does not do that.
So when the mapped page is relesed, it will result in kernel oops
because the page is not marked mapped.
This patch add tail process for compound page in 1GB huge page which
keeps the same process as 2M page.
Peter Zijlstra [Thu, 8 Dec 2011 22:34:13 +0000 (14:34 -0800)]
printk: avoid double lock acquire
Commit 4f2a8d3cf5e ("printk: Fix console_sem vs logbuf_lock unlock race")
introduced another silly bug where we would want to acquire an already
held lock. Avoid this.
Reported-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
More players joined to memory cgroup developments and Johannes' great work
changed internal design of memory cgroup dramatically. And he will do
more works. Michal Hokko did many bug fixes and know memory cgroup very
well. Daisuke Nishimura helped us very much but he seems busy now.
Thanks to his works.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
khugepaged can sometimes cause suspend to fail, requiring that the user
retry the suspend operation.
Use wait_event_freezable_timeout() instead of
schedule_timeout_interruptible() to avoid missing freezer wakeups. A
try_to_freeze() would have been needed in the khugepaged_alloc_hugepage
tight loop too in case of the allocation failing repeatedly, and
wait_event_freezable_timeout will provide it too.
khugepaged would still freeze just fine by trying again the next minute
but it's better if it freezes immediately.
Reported-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Jiri Slaby <jslaby@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A shrinker function can return -1, means that it cannot do anything
without a risk of deadlock. For example prune_super() does this if it
cannot grab a superblock refrence, even if nr_to_scan=0. Currently we
interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan'
according to this. So the next time around this shrinker can cause
really big pressure. Let's skip such shrinkers instead.
Also make total_scan signed, otherwise the check (total_scan < 0) below
never works.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Fleming [Fri, 18 Nov 2011 13:09:11 +0000 (13:09 +0000)]
x86, efi: Calling __pa() with an ioremap()ed address is invalid
If we encounter an efi_memory_desc_t without EFI_MEMORY_WB set
in ->attribute we currently call set_memory_uc(), which in turn
calls __pa() on a potentially ioremap'd address.
On CONFIG_X86_32 this is invalid, resulting in the following
oops on some machines:
BUG: unable to handle kernel paging request at f7f22280
IP: [<c10257b9>] reserve_ram_pages_type+0x89/0x210
[...]
A better approach to this problem is to map the memory region
with the correct attributes from the start, instead of modifying
it after the fact. The uncached case can be handled by
ioremap_nocache() and the cached by ioremap_cache().
Despite first impressions, it's not possible to use
ioremap_cache() to map all cached memory regions on
CONFIG_X86_64 because EFI_RUNTIME_SERVICES_DATA regions really
don't like being mapped into the vmalloc space, as detailed in
the following bug report,
Therefore, we need to ensure that any EFI_RUNTIME_SERVICES_DATA
regions are covered by the direct kernel mapping table on
CONFIG_X86_64. To accomplish this we now map E820_RESERVED_EFI
regions via the direct kernel mapping with the initial call to
init_memory_mapping() in setup_arch(), whereas previously these
regions wouldn't be mapped if they were after the last E820_RAM
region until efi_ioremap() was called. Doing it this way allows
us to delete efi_ioremap() completely.
Signed-off-by: Matt Fleming <matt.fleming@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg@redhat.com> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Huang Ying <huang.ying.caritas@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console-pimps.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jeff Layton [Fri, 2 Dec 2011 01:23:34 +0000 (20:23 -0500)]
cifs: check for NULL last_entry before calling cifs_save_resume_key
Prior to commit eaf35b1, cifs_save_resume_key had some NULL pointer
checks at the top. It turns out that at least one of those NULL
pointer checks is needed after all.
When the LastNameOffset in a FIND reply appears to be beyond the end of
the buffer, CIFSFindFirst and CIFSFindNext will set srch_inf.last_entry
to NULL. Since eaf35b1, the code will now oops in this situation.
Fix this by having the callers check for a NULL last entry pointer
before calling cifs_save_resume_key. No change is needed for the
call site in cifs_readdir as it's not reachable with a NULL
current_entry pointer.
Cc: stable@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Reported-by: Adam G. Metzler <adamgmetzler@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
Jeff Layton [Fri, 2 Dec 2011 01:22:41 +0000 (20:22 -0500)]
cifs: attempt to freeze while looping on a receive attempt
In the recent overhaul of the demultiplex thread receive path, I
neglected to ensure that we attempt to freeze on each pass through the
receive loop.
Reported-and-Tested-by: Woody Suwalski <terraluna977@gmail.com> Reported-and-Tested-by: Adam Williamson <awilliam@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
Linus Torvalds [Thu, 8 Dec 2011 21:18:59 +0000 (13:18 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: drop spin lock when memory alloc fails
Btrfs: check if the to-be-added device is writable
Btrfs: try cluster but don't advance in search list
Btrfs: try to allocate from cluster even at LOOP_NO_EMPTY_SIZE
Tetsuo Handa [Thu, 8 Dec 2011 12:24:06 +0000 (21:24 +0900)]
TOMOYO: Fix pathname handling of disconnected paths.
Current tomoyo_realpath_from_path() implementation returns strange pathname
when calculating pathname of a file which belongs to lazy unmounted tree.
Use local pathname rather than strange absolute pathname in that case.
Also, this patch fixes a regression by commit 02125a82 "fix apparmor
dereferencing potentially freed dentry, sanitize __d_path() API".