Peter Zijlstra [Fri, 27 Sep 2013 15:30:03 +0000 (17:30 +0200)]
sched: Revert need_resched() to look at TIF_NEED_RESCHED
Yuanhan reported a serious throughput regression in his pigz
benchmark. Using the ftrace patch I found that several idle
paths need more TLC before we can switch the generic
need_resched() over to preempt_need_resched.
The preemption paths benefit most from preempt_need_resched and
do indeed use it; all other need_resched() users don't really
care that much so reverting need_resched() back to
tif_need_resched() is the simple and safe solution.
Peter Zijlstra [Wed, 14 Aug 2013 12:51:00 +0000 (14:51 +0200)]
sched, x86: Optimize the preempt_schedule() call
Remove the bloat of the C calling convention out of the
preempt_enable() sites by creating an ASM wrapper which allows us to
do an asm("call ___preempt_schedule") instead.
Peter Zijlstra [Wed, 14 Aug 2013 12:51:00 +0000 (14:51 +0200)]
sched, x86: Provide a per-cpu preempt_count implementation
Convert x86 to use a per-cpu preemption count. The reason for doing so
is that accessing per-cpu variables is a lot cheaper than accessing
thread_info variables.
We still need to save/restore the actual preemption count due to
PREEMPT_ACTIVE so we place the per-cpu __preempt_count variable in the
same cache-line as the other hot __switch_to() variables such as
current_task.
NOTE: this save/restore is required even for !PREEMPT kernels as
cond_resched() also relies on preempt_count's PREEMPT_ACTIVE to ignore
task_struct::state.
Also rename thread_info::preempt_count to ensure nobody is
'accidentally' still poking at it.
Peter Zijlstra [Mon, 23 Sep 2013 17:04:26 +0000 (19:04 +0200)]
sched: Prepare for per-cpu preempt_count
When using per-cpu preempt_count variables we need to save/restore the
preempt_count on context switch (into per task storage; for instance
the old thread_info::preempt_count variable) because of
PREEMPT_ACTIVE.
However, this means that on fork() the preempt_count value of the last
context switch gets copied and if we had a PREEMPT_ACTIVE switch right
before cloning a child task the child task will now too have
PREEMPT_ACTIVE set and start its life with an extra PREEMPT_ACTIVE
count.
Therefore we need to make init_task_preempt_count() unconditional;
this resets whatever preempt_count we inherited from our parent
process.
Doing so for !per-cpu implementations is harmless.
For !PREEMPT_COUNT kernels we need to be careful not to start life
with an increased preempt_count.
Peter Zijlstra [Wed, 14 Aug 2013 12:55:46 +0000 (14:55 +0200)]
sched: Create more preempt_count accessors
We need a few special preempt_count accessors:
- task_preempt_count() for when we're interested in the preemption
count of another (non-running) task.
- init_task_preempt_count() for properly initializing the preemption
count.
- init_idle_preempt_count() a special case of the above for the idle
threads.
With these no generic code ever touches thread_info::preempt_count
anymore and architectures could choose to remove it.
Peter Zijlstra [Wed, 14 Aug 2013 12:55:31 +0000 (14:55 +0200)]
sched: Add NEED_RESCHED to the preempt_count
In order to combine the preemption and need_resched test we need to
fold the need_resched information into the preempt_count value.
Since the NEED_RESCHED flag is set across CPUs this needs to be an
atomic operation, however we very much want to avoid making
preempt_count atomic, therefore we keep the existing TIF_NEED_RESCHED
infrastructure in place but at 3 sites test it and fold its value into
preempt_count; namely:
- resched_task() when setting TIF_NEED_RESCHED on the current task
- scheduler_ipi() when resched_task() sets TIF_NEED_RESCHED on a
remote task it follows it up with a reschedule IPI
and we can modify the cpu local preempt_count from
there.
- cpu_idle_loop() for when resched_task() found tsk_is_polling().
We use an inverted bitmask to indicate need_resched so that a 0 means
both need_resched and !atomic.
Also remove the barrier() in preempt_enable() between
preempt_enable_no_resched() and preempt_check_resched() to avoid
having to reload the preemption value and allow the compiler to use
the flags of the previuos decrement. I couldn't come up with any sane
reason for this barrier() to be there as preempt_enable_no_resched()
already has a barrier() before doing the decrement.
Peter Zijlstra [Wed, 11 Sep 2013 10:43:13 +0000 (12:43 +0200)]
sched, idle: Fix the idle polling state logic
Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
regressed several workloads and caused excessive reschedule
interrupts.
The patch in question failed to notice that the x86 code had an
inverted sense of the polling state versus the new generic code (x86:
default polling, generic: default !polling).
Fix the two prominent x86 mwait based idle drivers and introduce a few
new generic polling helpers (fixing the wrong smp_mb__after_clear_bit
usage).
Also switch the idle routines to using tif_need_resched() which is an
immediate TIF_NEED_RESCHED test as opposed to need_resched which will
end up being slightly different.
Peter Zijlstra [Tue, 17 Sep 2013 07:30:55 +0000 (09:30 +0200)]
sched, rcu: Make RCU use resched_cpu()
We're going to deprecate and remove set_need_resched() for it will do
the wrong thing. Make an exception for RCU and allow it to use
resched_cpu() which will do the right thing.
Peter Zijlstra [Wed, 11 Sep 2013 13:19:24 +0000 (15:19 +0200)]
x86: Use asm goto to implement better modify_and_test() functions
Linus suggested using asm goto to get rid of the typical SETcc + TEST
instruction pair -- which also clobbers an extra register -- for our
typical modify_and_test() functions.
Because asm goto doesn't allow output fields it has to include an
unconditinal memory clobber when it changes a memory variable to force
a reload.
Luckily all atomic ops already imply a compiler barrier to go along
with their memory barrier semantics.
Jason Low [Fri, 13 Sep 2013 18:26:53 +0000 (11:26 -0700)]
sched/balancing: Periodically decay max cost of idle balance
This patch builds on patch 2 and periodically decays that max value to
do idle balancing per sched domain by approximately 1% per second. Also
decay the rq's max_idle_balance_cost value.
Jason Low [Fri, 13 Sep 2013 18:26:52 +0000 (11:26 -0700)]
sched/balancing: Consider max cost of idle balance per sched domain
In this patch, we keep track of the max cost we spend doing idle load balancing
for each sched domain. If the avg time the CPU remains idle is less then the
time we have already spent on idle balancing + the max cost of idle balancing
in the sched domain, then we don't continue to attempt the balance. We also
keep a per rq variable, max_idle_balance_cost, which keeps track of the max
time spent on newidle load balances throughout all its domains so that we can
determine the avg_idle's max value.
By using the max, we avoid overrunning the average. This further reduces the
chance we attempt balancing when the CPU is not idle for longer than the cost
to balance.
Jason Low [Fri, 13 Sep 2013 18:26:51 +0000 (11:26 -0700)]
sched: Reduce overestimating rq->avg_idle
When updating avg_idle, if the delta exceeds some max value, then avg_idle
gets set to the max, regardless of what the previous avg was. This can cause
avg_idle to often be overestimated.
This patch modifies the way we update avg_idle by always updating it with the
function call to update_avg() first. Then, if avg_idle exceeds the max, we set
it to the max.
Signed-off-by: Jason Low <jason.low2@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379096813-3032-2-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Patch a003a2 (sched: Consider runnable load average in move_tasks())
sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is
always 0. This mistype leads to all tasks having weight 0 when load
balancing in a cpu-cgroup enabled setup. There obviously should be sum
of weights of all runnable tasks there instead. Fix it.
Vladimir Davydov [Sun, 15 Sep 2013 13:49:14 +0000 (17:49 +0400)]
sched/balancing: Fix 'local->avg_load > busiest->avg_load' case in fix_small_imbalance()
In busiest->group_imb case we can come to fix_small_imbalance() with
local->avg_load > busiest->avg_load. This can result in wrong imbalance
fix-up, because there is the following check there where all the
members are unsigned:
As a result we can end up constantly bouncing tasks from one cpu to
another if there are pinned tasks.
Fix it by substituting the subtraction with an equivalent addition in
the check.
[ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
belonging to different cores on an HT-enabled machine with N logical
cpus: just look at se.nr_migrations growth. ]
Vladimir Davydov [Sun, 15 Sep 2013 13:49:13 +0000 (17:49 +0400)]
sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()
In busiest->group_imb case we can come to calculate_imbalance() with
local->avg_load >= busiest->avg_load >= sds->avg_load. This can result
in imbalance overflow, because it is calculated as follows
As a result we can end up constantly bouncing tasks from one cpu to
another if there are pinned tasks.
Fix this by skipping the assignment and assuming imbalance=0 in case
local->avg_load > sds->avg_load.
[ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
belonging to different cores on an HT-enabled machine with N logical
cpus: just look at se.nr_migrations growth. ]
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Misc fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix comment for sched_info_depart
sched/Documentation: Update sched-design-CFS.txt documentation
sched/debug: Take PID namespace into account
sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones
Merge branch 'fixes' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Gleb Natapov.
* 'fixes' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: VMX: set "blocked by NMI" flag if EPT violation happens during IRET from NMI
kvm: free resources after canceling async_pf
KVM: nEPT: reset PDPTR register cache on nested vmentry emulation
KVM: mmu: allow page tables to be in read-only slots
KVM: x86 emulator: emulate RETF imm
Starting from v3.10 (probably commit f91e2590410b: "tty: Signal
foreground group processes in hangup") disassociate_ctty() sends SIGCONT
if tty && on_exit. This breaks LSB test-suite, in particular test8 in
_exit.c and test40 in sigcon5.c.
Put the "!on_exit" check back to restore the old behaviour.
Review by Peter Hurley:
"Yes, this regression was introduced by me in that commit. The effect
of the regression is that ptys will receive a SIGCONT when, in similar
circumstances, ttys would not.
The fact that two test vectors accidentally tripped over this
regression suggests that some other apps may as well.
Thanks for catching this"
Cc: stable@vger.kernel.org # v3.10+ Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Karel Srot <ksrot@redhat.com> Reviewed-by: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull more tile architecture updates from Chris Metcalf:
"This second batch of changes is just cleanup of various kinds from
doing some tidying work in the sources.
Some dead code is removed, comment typos fixed, whitespace and style
issues cleaned up, and some header updates from our internal
"upstream" architecture team"
* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
tile: remove stray blank space
tile: <arch/> header updates from upstream
tile: improve gxio iorpc autogenerated code style
tile: double default VMALLOC space
tile: remove stale arch/tile/kernel/futex_64.S
tile: remove HUGE_VMAP dead code
tile: use pmd_pfn() instead of casting via pte_t
tile: fix typos in comment in arch/tile/kernel/unaligned.c
When we cancel 'async_pf_execute()', we should behave as if the work was
never scheduled in 'kvm_setup_async_pf()'.
Fixes a bug when we can't unload module because the vm wasn't destroyed.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: nEPT: reset PDPTR register cache on nested vmentry emulation
After nested vmentry stale cache can be used to reload L2 PDPTR pointers
which will cause L2 guest to fail. Fix it by invalidating cache on nested
vmentry emulation.
https://bugzilla.kernel.org/show_bug.cgi?id=60830
Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 9 Sep 2013 11:52:33 +0000 (13:52 +0200)]
KVM: mmu: allow page tables to be in read-only slots
Page tables in a read-only memory slot will currently cause a triple
fault because the page walker uses gfn_to_hva and it fails on such a slot.
OVMF uses such a page table; however, real hardware seems to be fine with
that as long as the accessed/dirty bits are set. Save whether the slot
is readonly, and later check it when updating the accessed and dirty bits.
Merge branch 'timers/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer code update from Thomas Gleixner:
- armada SoC clocksource overhaul with a trivial merge conflict
- Minor improvements to various SoC clocksource drivers
* 'timers/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: armada-370-xp: Add detailed clock requirements in devicetree binding
clocksource: armada-370-xp: Get reference fixed-clock by name
clocksource: armada-370-xp: Replace WARN_ON with BUG_ON
clocksource: armada-370-xp: Fix device-tree binding
clocksource: armada-370-xp: Introduce new compatibles
clocksource: armada-370-xp: Use CLOCKSOURCE_OF_DECLARE
clocksource: armada-370-xp: Simplify TIMER_CTRL register access
clocksource: armada-370-xp: Use BIT()
ARM: timer-sp: Set dynamic irq affinity
ARM: nomadik: add dynamic irq flag to the timer
clocksource: sh_cmt: 32-bit control register support
clocksource: em_sti: Convert to devm_* managed helpers
Chris Metcalf [Mon, 16 Sep 2013 18:18:21 +0000 (14:18 -0400)]
tile: <arch/> header updates from upstream
The hardware architecture descriptor headers have been updated, in
particular to reflect some larger MMIO fields on the mPIPE shims for
controlling the network hardware, from the recent Gx72 release.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Chris Metcalf [Mon, 16 Sep 2013 17:02:57 +0000 (13:02 -0400)]
tile: double default VMALLOC space
With per-cpu data as well as loaded kernel modules coming from
the vmalloc arena, we get close to the line all the time and
occasionally need more than we had, so just double it up by default.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
"Two minor cifs fixes and a minor documentation cleanup for cifs.txt"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update cifs.txt and remove some outdated infos
cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache
cifs: Do not take a reference to the page in cifs_readpage_worker()
Merge branch 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86
Pull x86 platform updates from Matthew Garrett:
"Nothing amazing here, almost entirely cleanups and minor bugfixes and
one bit of hardware enablement in the amilo-rfkill driver"
* 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86:
platform/x86: panasonic-laptop: reuse module_acpi_driver
samsung-laptop: fix config build error
platform: x86: remove unnecessary platform_set_drvdata()
amilo-rfkill: Enable using amilo-rfkill with the FSC Amilo L1310.
wmi: parse_wdg() should return kernel error codes
hp_wmi: Fix unregister order in hp_wmi_rfkill_setup()
platform: replace strict_strto*() with kstrto*()
x86: irst: use module_acpi_driver to simplify the code
x86: smartconnect: use module_acpi_driver to simplify the code
platform samsung-q10: use ACPI instead of direct EC calls
thinkpad_acpi: add the ability setting TPACPI_LED_NONE by quirk
thinkpad_acpi: return -NODEV while operating uninitialized LEDs
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull misc SCSI driver updates from James Bottomley:
"This patch set is a set of driver updates (megaraid_sas, fnic, lpfc,
ufs, hpsa) we also have a couple of bug fixes (sd out of bounds and
ibmvfc error handling) and the first round of esas2r checker fixes and
finally the much anticipated big endian additions for megaraid_sas"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (47 commits)
[SCSI] fnic: fnic Driver Tuneables Exposed through CLI
[SCSI] fnic: Kernel panic while running sh/nosh with max lun cfg
[SCSI] fnic: Hitting BUG_ON(io_req->abts_done) in fnic_rport_exch_reset
[SCSI] fnic: Remove QUEUE_FULL handling code
[SCSI] fnic: On system with >1.1TB RAM, VIC fails multipath after boot up
[SCSI] fnic: FC stat param seconds_since_last_reset not getting updated
[SCSI] sd: Fix potential out-of-bounds access
[SCSI] lpfc 8.3.42: Update lpfc version to driver version 8.3.42
[SCSI] lpfc 8.3.42: Fixed issue of task management commands having a fixed timeout
[SCSI] lpfc 8.3.42: Fixed inconsistent spin lock usage.
[SCSI] lpfc 8.3.42: Fix driver's abort loop functionality to skip IOs already getting aborted
[SCSI] lpfc 8.3.42: Fixed failure to allocate SCSI buffer on PPC64 platform for SLI4 devices
[SCSI] lpfc 8.3.42: Fix WARN_ON when driver unloads
[SCSI] lpfc 8.3.42: Avoided making pci bar ioremap call during dual-chute WQ/RQ pci bar selection
[SCSI] lpfc 8.3.42: Fixed driver iocbq structure's iocb_flag field running out of space
[SCSI] lpfc 8.3.42: Fix crash on driver load due to cpu affinity logic
[SCSI] lpfc 8.3.42: Fixed logging format of setting driver sysfs attributes hard to interpret
[SCSI] lpfc 8.3.42: Fixed back to back RSCNs discovery failure.
[SCSI] lpfc 8.3.42: Fixed race condition between BSG I/O dispatch and timeout handling
[SCSI] lpfc 8.3.42: Fixed function mode field defined too small for not recognizing dual-chute mode
...
Merge branch 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux
Pull SLAB update from Pekka Enberg:
"Nothing terribly exciting here apart from Christoph's kmalloc
unification patches that brings sl[aou]b implementations closer to
each other"
* 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
slab: Use correct GFP_DMA constant
slub: remove verify_mem_not_deleted()
mm/sl[aou]b: Move kmallocXXX functions to common code
mm, slab_common: add 'unlikely' to size check of kmalloc_slab()
mm/slub.c: beautify code for removing redundancy 'break' statement.
slub: Remove unnecessary page NULL check
slub: don't use cpu partial pages on UP
mm/slub: beautify code for 80 column limitation and tab alignment
mm/slub: remove 'per_cpu' which is useless variable
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input update from Dmitry Torokhov:
"The only change is David Hermann's new EVIOCREVOKE evdev ioctl that
allows safely passing file descriptors to input devices to session
processes and later being able to stop delivery of events through
these fds so that inactive sessions will no longer receive user input
that does not belong to them"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: evdev - add EVIOCREVOKE ioctl
Matt found that commit 27a7c642174e ("partitions/efi: account for pmbr
size in lba") caused his GPT formatted eMMC device not to boot. The
reason is that this commit enforced Linux to always check the lesser of
the whole disk or 2Tib for the pMBR size in LBA. While most disk
partitioning tools out there create a pMBR with these characteristics,
Microsoft does not, as it always sets the entry to the maximum 32-bit
limitation - even though a drive may be smaller than that[1].
Loosen this check and only verify that the size is either the whole disk
or 0xFFFFFFFF. No tool in its right mind would set it to any value
other than these.
Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
Pull writeback fix from Wu Fengguang:
"A trivial writeback fix"
* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Do not sort b_io list only because of block device inode
vfs: fix dentry LRU list handling and nr_dentry_unused accounting
The LRU list changes interacted badly with our nr_dentry_unused
accounting, and even worse with the new DCACHE_LRU_LIST bit logic.
This introduces helper functions to make sure everything follows the
proper dcache d_lru list rules: the dentry cache is complicated by the
fact that some of the hotpaths don't even want to look at the LRU list
at all, and the fact that we use the same list entry in the dentry for
both the LRU list and for our temporary shrinking lists when removing
things from the LRU.
The helper functions temporarily have some extra sanity checking for the
flag bits that have to match the current LRU state of the dentry. We'll
remove that before the final 3.12 release, but considering how easy it
is to get wrong, this first cleanup version has some very particular
sanity checking.
Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache
When reading a single page with cifs_readpage(), we make a call to
fscache_read_or_alloc_page() which once done, asynchronously calls
the completion function cifs_readpage_from_fscache_complete(). This
completion function unlocks the page once it has been populated from
cache. The module then attempts to unlock the page a second time in
cifs_readpage() which leads to warning messages.
In case of a successful call to fscache_read_or_alloc_page() we should skip
the second unlock_page() since this will be called by the
cifs_readpage_from_fscache_complete() once the page has been populated by
fscache.
With the modifications to cifs_readpage_worker(), we will need to re-grab the
page lock in cifs_write_begin().
The problem was first noticed when testing new fscache patches for cifs.
https://bugzilla.redhat.com/show_bug.cgi?id=1005737
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
cifs: Do not take a reference to the page in cifs_readpage_worker()
We do not need to take a reference to the pagecache in
cifs_readpage_worker() since the calling function will have already
taken one before passing the pointer to the page as an argument to the
function.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
Merge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
"Some more low risk cleanup patches:
- Remove unnecessary pci_set_drvdata in k10temp driver from Jingoo Han
- Fix return values in several drivers from Sachin Kamat
- Remove redundant break in amc6821 driver from Sachin Kamat"
* tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (k10temp) remove unnecessary pci_set_drvdata()
hwmon: (tmp421) Fix return value
hwmon: (amc6821) Remove redundant break
hwmon: (amc6821) Fix return value
hwmon: (ibmaem) Fix return value
hwmon: (emc2103) Fix return value
Pull aio changes from Ben LaHaise:
"First off, sorry for this pull request being late in the merge window.
Al had raised a couple of concerns about 2 items in the series below.
I addressed the first issue (the race introduced by Gu's use of
mm_populate()), but he has not provided any further details on how he
wants to rework the anon_inode.c changes (which were sent out months
ago but have yet to be commented on).
The bulk of the changes have been sitting in the -next tree for a few
months, with all the issues raised being addressed"
* git://git.kvack.org/~bcrl/aio-next: (22 commits)
aio: rcu_read_lock protection for new rcu_dereference calls
aio: fix race in ring buffer page lookup introduced by page migration support
aio: fix rcu sparse warnings introduced by ioctx table lookup patch
aio: remove unnecessary debugging from aio_free_ring()
aio: table lookup: verify ctx pointer
staging/lustre: kiocb->ki_left is removed
aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
aio: convert the ioctx list to table lookup v3
aio: double aio_max_nr in calculations
aio: Kill ki_dtor
aio: Kill ki_users
aio: Kill unneeded kiocb members
aio: Kill aio_rw_vect_retry()
aio: Don't use ctx->tail unnecessarily
aio: io_cancel() no longer returns the io_event
aio: percpu ioctx refcount
aio: percpu reqs_available
aio: reqs_active -> reqs_available
aio: fix build when migration is disabled
...
Chris Metcalf [Wed, 11 Sep 2013 17:57:15 +0000 (13:57 -0400)]
tile: remove HUGE_VMAP dead code
A config option to allow a variant vmap() using huge pages that was never
upstreamed had some bits of code related to it scattered around the tile
architecture; the config option was removed downstream and this commit
cleans up the scattered evidence of it from the upstream as well.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kconfig fix from Michal Marek:
"This is a fix for a regression caused by my previous pull request.
A sed command in scripts/config that used colons as separator was
accidentally changed to use slashes, which fails when you use slashes
in a value. Changing it back to colons is of course not a proper fix,
but at least it will be broken in the same way it had been for four
years. A proper fix is pending"
* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
scripts/config: fix variable substitution command
Merge tag 'blackfin-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux
Pull blackfin updates from Steven Miao.
* tag 'blackfin-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux:
blackfin: Ignore generated uImages
blackfin: Add STMMAC platform data to enable dwmac1000 driver on BF60x.
bf609: adv7343: add S-Video and Component output support
bf609: add adv7343 video encoder support
clock: add stmmac clock for ethernet driver
blackfin: scb: Add SCB1 to SCB9 config options and data.
blackfin: scb: Add system crossbar init code.
Pull crypto fixes from Herbert Xu:
"This fixes a 7+ year race condition in the crypto API that causes
sporadic crashes when multiple threads load the same algorithm.
It also fixes the crct10dif algorithm again to prevent boot failures
on systems where the initramfs tool ignores module softdeps"
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: crct10dif - Add fallback for broken initrds
crypto: api - Fix race condition in larval lookup
When working on report indexes, always validate that they are in bounds.
Without this, a HID device could report a malicious feature report that
could trick the driver into a heap overflow:
[ 634.885003] usb 1-1: New USB device found, idVendor=0596, idProduct=0500
...
[ 676.469629] BUG kmalloc-192 (Tainted: G W ): Redzone overwritten
Note that we need to change the indexes from s8 to s16 as they can
be between -1 and 255.
A HID device could send a malicious output report that would cause the
logitech-dj HID driver to leak kernel memory contents to the device, or
trigger a NULL dereference during initialization:
[ 304.424553] usb 1-1: New USB device found, idVendor=046d, idProduct=c52b
...
[ 304.780467] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[ 304.781409] IP: [<ffffffff815d50aa>] logi_dj_recv_send_report.isra.11+0x1a/0x90
When dealing with usage_index, be sure to properly use unsigned instead of
int to avoid overflows.
When working on report fields, always validate that their report_counts are
in bounds.
Without this, a HID device could report a malicious feature report that
could trick the driver into a heap overflow:
[ 634.885003] usb 1-1: New USB device found, idVendor=0596, idProduct=0500
...
[ 676.469629] BUG kmalloc-192 (Tainted: G W ): Redzone overwritten
A HID device could send a malicious output report that would cause the
lenovo-tpkbd HID driver to write just beyond the output report allocation
during initialization, causing a heap overflow:
[ 76.109807] usb 1-1: New USB device found, idVendor=17ef, idProduct=6009
...
[ 80.462540] BUG kmalloc-192 (Tainted: G W ): Redzone overwritten
broke the build on MIPS since vpe_attrs should be an array
of 'struct device_attribute' pointers.
Fixes the following build problem:
arch/mips/kernel/vpe.c:1372:2: error: missing braces around initializer
[-Werror=missing-braces]
arch/mips/kernel/vpe.c:1372:2: error: (near initialization for 'vpe_attrs[0]')
[-Werror=missing-braces]
A HID device could send a malicious output report that would cause the
lg, lg3, and lg4 HID drivers to write beyond the output report allocation
during an event, causing a heap overflow:
[ 325.245240] usb 1-1: New USB device found, idVendor=046d, idProduct=c287
...
[ 414.518960] BUG kmalloc-4096 (Not tainted): Redzone overwritten
Additionally, while lg2 did correctly validate the report details, it was
cleaned up and shortened.
A HID device could send a malicious output report that would cause the
steelseries HID driver to write beyond the output report allocation
during initialization, causing a heap overflow:
[ 167.981534] usb 1-1: New USB device found, idVendor=1038, idProduct=1410
...
[ 182.050547] BUG kmalloc-256 (Tainted: G W ): Redzone overwritten
This driver must validate the availability of the HID output report and
its size before it can write LED states via buzz_set_leds(). This stops
a heap overflow that is possible if a device provides a malicious HID
output report:
[ 108.171280] usb 1-1: New USB device found, idVendor=054c, idProduct=0002
...
[ 117.507877] BUG kmalloc-192 (Not tainted): Redzone overwritten
The zeroplus HID driver was not checking the size of allocated values
in fields it used. A HID device could send a malicious output report
that would cause the driver to write beyond the output report allocation
during initialization, causing a heap overflow:
[ 1442.728680] usb 1-1: New USB device found, idVendor=0c12, idProduct=0005
...
[ 1466.243173] BUG kmalloc-192 (Tainted: G W ): Redzone overwritten
Many drivers need to validate the characteristics of their HID report
during initialization to avoid misusing the reports. This adds a common
helper to perform validation of the report exisitng, the field existing,
and the expected number of values within the field.
After the last architecture switched to generic hard irqs the config
options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
for !CONFIG_GENERIC_HARDIRQS can be removed.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Leonid Yegoshin [Wed, 11 Sep 2013 19:17:47 +0000 (14:17 -0500)]
MIPS: Fix SMP core calculations when using MT support.
The TCBIND register is only available if the core has MT support. It
should not be read otherwise. Secondly, the number of TCs (siblings)
are calculated differently depending on if the kernel is configured
as SMVP or SMTC.
Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com> Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com> Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/5822/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
This change complements commit d0da7c002f7b2a93582187a9e3f73891a01d8ee4
and brings clear_ioasic_irq back, renaming it to clear_ioasic_dma_irq at
the same time, to make I/O ASIC DMA interrupts functional.
Unlike ordinary I/O ASIC interrupts DMA interrupts need to be deasserted
by software by writing 0 to the respective bit in I/O ASIC's System
Interrupt Register (SIR), similarly to how CP0.Cause.IP0 and CP0.Cause.IP1
bits are handled in the CPU (the difference is SIR DMA interrupt bits are
R/W0C so there's no need for an RMW cycle). Otherwise the handler is
reentered over and over again.
The only current user is the DEC LANCE Ethernet driver and its extremely
uncommon DMA memory error handler that does not care when exactly the
interrupt is cleared. Anticipating the use of DMA interrupts by the Zilog
SCC driver this change however exports clear_ioasic_dma_irq for device
drivers to choose the right application-specific sequence to clear the
request explicitly rather than calling it implicitly in the .irq_eoi
handler of `struct irq_chip'. Previously these interrupts were cleared in
the .end handler of the said structure, before it was removed.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/5826/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Not all I/O ASIC versions have the free-running counter implemented, an
early revision used in the 5000/1xx models aka 3MIN and 4MIN did not have
it. Therefore we cannot unconditionally use it as a clock source.
Fortunately if not implemented its register slot has a fixed value so it
is enough if we check for the value at the end of the calibration period
being the same as at the beginning.
This also means we need to look for another high-precision clock source on
the systems affected. The 5000/1xx can have an R4000SC processor
installed where the CP0 Count register can be used as a clock source.
Unfortunately all the R4k DECstations suffer from the missed timer
interrupt on CP0 Count reads erratum, so we cannot use the CP0 timer as a
clock source and a clock event both at a time. However we never need an
R4k clock event device because all DECstations have a DS1287A RTC chip
whose periodic interrupt can be used as a clock source.
This gives us the following four configuration possibilities for I/O ASIC
DECstations:
1. No I/O ASIC counter and no CP0 timer, e.g. R3k 5000/1xx (3MIN).
2. No I/O ASIC counter but the CP0 timer, i.e. R4k 5000/150 (4MIN).
3. The I/O ASIC counter but no CP0 timer, e.g. R3k 5000/240 (3MAX+).
4. The I/O ASIC counter and the CP0 timer, e.g. R4k 5000/260 (4MAX+).
For #1 and #2 this change stops the I/O ASIC free-running counter from
being installed as a clock source of a 0Hz frequency. For #2 it also
arranges for the CP0 timer to be used as a clock source rather than a
clock event device, because having an accurate wall clock is more
important than a high-precision interval timer. For #3 there is no
change. For #4 the change makes the I/O ASIC free-running counter
installed as a clock source so that the CP0 timer can be used as a clock
event device.
Unfortunately the use of the CP0 timer as a clock event device relies on a
succesful completion of c0_compare_interrupt. That never happens, because
while waiting for a CP0 Compare interrupt to happen the function spins in
a loop reading the CP0 Count register. This makes the CP0 Count erratum
trigger reliably causing the interrupt waited for to be lost in all cases.
As a result #4 resorts to using the CP0 timer as a clock source as well,
just as #2. However we want to keep this separate arrangement in case
(hope) c0_compare_interrupt is eventually rewritten such that it avoids
the erratum.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/5825/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS updates from Ralf Baechle:
"This has been sitting in -next for a while with no objections and all
MIPS defconfigs except one are building fine; that one platform got
broken by another patch in your tree and I'm going to submit a patch
separately.
- a handful of fixes that didn't make 3.11
- a few bits of Octeon 3 support with more to come for a later
release
- platform enhancements for Octeon, ath79, Lantiq, Netlogic and
Ralink SOCs
- a GPIO driver for the Octeon
- some dusting off of the DECstation code
- the usual dose of cleanups"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (65 commits)
MIPS: DMA: Fix BUG due to smp_processor_id() in preemptible code
MIPS: kexec: Fix random crashes while loading crashkernel
MIPS: kdump: Skip walking indirection page for crashkernels
MIPS: DECstation HRT calibration bug fixes
MIPS: Export copy_from_user_page() (needed by lustre)
MIPS: Add driver for the built-in PCI controller of the RT3883 SoC
MIPS: DMA: For BMIPS5000 cores flush region just like non-coherent R10000
MIPS: ralink: Add support for reset-controller API
MIPS: ralink: mt7620: Add cpu-feature-override header
MIPS: ralink: mt7620: Add spi clock definition
MIPS: ralink: mt7620: Add wdt clock definition
MIPS: ralink: mt7620: Improve clock frequency detection
MIPS: ralink: mt7620: This SoC has EHCI and OHCI hosts
MIPS: ralink: mt7620: Add verbose ram info
MIPS: ralink: Probe clocksources from OF
MIPS: ralink: Add support for systick timer found on newer ralink SoC
MIPS: ralink: Add support for periodic timer irq
MIPS: Netlogic: Built-in DTB for XLP2xx SoC boards
MIPS: Netlogic: Add support for USB on XLP2xx
MIPS: Netlogic: XLP2xx update for I2C controller
...
Merge tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs
Pull xfs update #2 from Ben Myers:
"Here we have defrag support for v5 superblock, a number of bugfixes
and a cleanup or two.
- defrag support for CRC filesystems
- fix endian worning in xlog_recover_get_buf_lsn
- fixes for sparse warnings
- fix for assert in xfs_dir3_leaf_hdr_from_disk
- fix for log recovery of remote symlinks
- fix for log recovery of btree root splits
- fixes formemory allocation failures with ACLs
- fix for assert in xfs_buf_item_relse
- fix for assert in xfs_inode_buf_verify
- fix an assignment in an assert that should be a test in
xfs_bmbt_change_owner
- remove dead code in xlog_recover_inode_pass2"
* tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs:
xfs: remove dead code from xlog_recover_inode_pass2
xfs: = vs == typo in ASSERT()
xfs: don't assert fail on bad inode numbers
xfs: aborted buf items can be in the AIL.
xfs: factor all the kmalloc-or-vmalloc fallback allocations
xfs: fix memory allocation failures with ACLs
xfs: ensure we copy buffer type in da btree root splits
xfs: set remote symlink buffer type for recovery
xfs: recovery of swap extents operations for CRC filesystems
xfs: swap extents operations for CRC filesystems
xfs: check magic numbers in dir3 leaf verifier first
xfs: fix some minor sparse warnings
xfs: fix endian warning in xlog_recover_get_buf_lsn()
Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
Pull SCSI target updates from Nicholas Bellinger:
"Lots of activity again this round for I/O performance optimizations
(per-cpu IDA pre-allocation for vhost + iscsi/target), and the
addition of new fabric independent features to target-core
(COMPARE_AND_WRITE + EXTENDED_COPY).
The main highlights include:
- Support for iscsi-target login multiplexing across individual
network portals
- Generic Per-cpu IDA logic (kent + akpm + clameter)
- Conversion of vhost to use per-cpu IDA pre-allocation for
descriptors, SGLs and userspace page pointer list
- Conversion of iscsi-target + iser-target to use per-cpu IDA
pre-allocation for descriptors
- Add support for generic COMPARE_AND_WRITE (AtomicTestandSet)
emulation for virtual backend drivers
- Add support for generic EXTENDED_COPY (CopyOffload) emulation for
virtual backend drivers.
- Add support for fast memory registration mode to iser-target (Vu)
The patches to add COMPARE_AND_WRITE and EXTENDED_COPY support are of
particular significance, which make us the first and only open source
target to support the full set of VAAI primitives.
Currently Linux clients are lacking upstream support to actually
utilize these primitives. However, with server side support now in
place for folks like MKP + ZAB working on the client, this logic once
reserved for the highest end of storage arrays, can now be run in VMs
on their laptops"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (50 commits)
target/iscsi: Bump versions to v4.1.0
target: Update copyright ownership/year information to 2013
iscsi-target: Bump default TCP listen backlog to 256
target: Fix >= v3.9+ regression in PR APTPL + ALUA metadata write-out
iscsi-target; Bump default CmdSN Depth to 64
iscsi-target: Remove unnecessary wait_for_completion in iscsi_get_thread_set
iscsi-target: Add thread_set->ts_activate_sem + use common deallocate
iscsi-target: Fix race with thread_pre_handler flush_signals + ISCSI_THREAD_SET_DIE
target: remove unused including <linux/version.h>
iser-target: introduce fast memory registration mode (FRWR)
iser-target: generalize rdma memory registration and cleanup
iser-target: move rdma wr processing to a shared function
target: Enable global EXTENDED_COPY setup/release
target: Add Third Party Copy (3PC) bit in INQUIRY response
target: Enable EXTENDED_COPY setup in spc_parse_cdb
target: Add support for EXTENDED_COPY copy offload emulation
target: Avoid non-existent tg_pt_gp_mem in target_alua_state_check
target: Add global device list for EXTENDED_COPY
target: Make helpers non static for EXTENDED_COPY command setup
target: Make spc_parse_naa_6h_vendor_specific non static
...
Merge more patches from Andrew Morton:
"The rest of MM. Plus one misc cleanup"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
mm/Kconfig: add MMU dependency for MIGRATION.
kernel: replace strict_strto*() with kstrto*()
mm, thp: count thp_fault_fallback anytime thp fault fails
thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
thp: do_huge_pmd_anonymous_page() cleanup
thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
mm: cleanup add_to_page_cache_locked()
thp: account anon transparent huge pages into NR_ANON_PAGES
truncate: drop 'oldsize' truncate_pagecache() parameter
mm: make lru_add_drain_all() selective
memcg: document cgroup dirty/writeback memory statistics
memcg: add per cgroup writeback pages accounting
memcg: check for proper lock held in mem_cgroup_update_page_stat
memcg: remove MEMCG_NR_FILE_MAPPED
memcg: reduce function dereference
memcg: avoid overflow caused by PAGE_ALIGN
memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
memcg: correct RESOURCE_MAX to ULLONG_MAX
mm: memcg: do not trap chargers with full callstack on OOM
mm: memcg: rework and document OOM waiting and wakeup
...
Chen Gang [Thu, 12 Sep 2013 22:14:08 +0000 (15:14 -0700)]
mm/Kconfig: add MMU dependency for MIGRATION.
MIGRATION must depend on MMU, or allmodconfig for the nommu sh
architecture fails to build:
CC mm/migrate.o
mm/migrate.c: In function 'remove_migration_pte':
mm/migrate.c:134:3: error: implicit declaration of function 'pmd_trans_huge' [-Werror=implicit-function-declaration]
if (pmd_trans_huge(*pmd))
^
mm/migrate.c:149:2: error: implicit declaration of function 'is_swap_pte' [-Werror=implicit-function-declaration]
if (!is_swap_pte(pte))
^
...
Also let CMA depend on MMU, or when NOMMU, if we select CMA, it will
select MIGRATION by force.
Signed-off-by: Chen Gang <gang.chen@asianux.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Thu, 12 Sep 2013 22:14:06 +0000 (15:14 -0700)]
mm, thp: count thp_fault_fallback anytime thp fault fails
Currently, thp_fault_fallback in vmstat only gets incremented if a
hugepage allocation fails. If current's memcg hits its limit or the page
fault handler returns an error, it is incorrectly accounted as a
successful thp_fault_alloc.
Count thp_fault_fallback anytime the page fault handler falls back to
using regular pages and only count thp_fault_alloc when a hugepage has
actually been faulted.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>