Michael Ellerman [Wed, 25 Nov 2015 03:25:16 +0000 (14:25 +1100)]
powerpc/kernel: Drop HMT_MEDIUM_PPR_DISCARD
HMT_MEDIUM_PPR_DISCARD is a macro which is present at the start of most
of our first level exception handlers. It conditionally executes a
HMT_MEDIUM instruction, which sets the processor priority to medium.
On on modern systems, ie. Power7 and later, it is nop'ed out at boot.
All it does is make the exception vectors more cramped, and consume 4
bytes of icache.
On old systems it has the effect of boosting the processor priority at
the start of exception processing. If we were previously in the idle
loop for example, we may be at low or very low priority. This is
desirable as we want to process the exception as fast as possible.
However looking closely at the generated code, we see that in all cases
we execute another HMT_MEDIUM just four instructions later. With code
patching applied, the final code on an old (Power6) system will look
like, eg:
So I suggest that the added code complexity of HMT_MEDIUM_PPR_DISCARD is
not justified by the benefit of boosting the processor priority for the
duration of four instructions, and therefore we drop it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Tue, 24 Nov 2015 11:26:12 +0000 (22:26 +1100)]
powerpc/rtas: Make enter_rtas() private
There are no longer any users of enter_rtas() outside of rtas.c, so make
it "private", by moving the declaration inside rtas.c. Hopefully this
will encourage people to use one of the wrappers which takes the sharp
edges off the RTAS calling sequence.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Tue, 24 Nov 2015 11:26:11 +0000 (22:26 +1100)]
powerpc/rtas: Use rtas_call_unlocked() in call_rtas_display_status()
Although call_rtas_display_status() does actually want to use the
regular RTAS locking, it doesn't want the extra logic that is in
rtas_call(), so currently it open codes the logic.
Instead we can use rtas_call_unlocked(), after taking the RTAS lock.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Wed, 16 Dec 2015 10:01:42 +0000 (21:01 +1100)]
powerpc/rtas: Add rtas_call_unlocked()
Most users of RTAS (Run-Time Abstraction Services) use rtas_call(),
which deals with locking as well as endian handling.
However we have two users outside of rtas.c that can't use rtas_call()
because they have different locking requirements.
The hotplug CPU code can't take the RTAS lock because the CPU would go
offline with the lock held and no other CPUs would be able to call RTAS
until the CPU came back online.
The xmon code doesn't want to take the lock because it would risk dead
locking when we are trying to recover from a crash.
Both sites required multiple patches when we added little endian
support, proving that programmers can't do endian right.
Although that ship has sailed, we can still clean the code up by
providing an unlocked version of rtas_call() which avoids the need to
open code the logic elsewhere.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Stewart Smith [Wed, 9 Dec 2015 06:18:20 +0000 (17:18 +1100)]
powerpc/powernv: remove FW_FEATURE_OPALv3 and just use FW_FEATURE_OPAL
Long ago, only in the lab, there was OPALv1 and OPALv2. Now there is
just OPALv3, with nobody ever expecting anything on pre-OPALv3 to
be cared about or supported by mainline kernels.
So, let's remove FW_FEATURE_OPALv3 and instead use FW_FEATURE_OPAL
exclusively.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Stewart Smith [Wed, 9 Dec 2015 06:18:19 +0000 (17:18 +1100)]
powerpc/powernv: Remove OPALv2 firmware define and references
OPALv2 only ever existed in the lab and didn't escape to the world.
All OPAL systems in the wild are OPALv3.
The probability of there being an OPALv2 system still powered on
anywhere inside IBM is approximately zero, let alone anyone
expecting to run mainline kernels.
So, start to remove references to OPALv2.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Stewart Smith [Wed, 9 Dec 2015 06:18:18 +0000 (17:18 +1100)]
powerpc/powernv: panic() on OPAL < V3
The OpenPower Abstraction Layer firmware went through a couple
of iterations in the lab before being released. What we now know
as OPAL advertises itself as OPALv3.
OPALv2 and OPALv1 never made it outside the lab, and the possibility
of anyone at all ever building a mainline kernel today and expecting
it to boot on such hardware is zero.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Sun, 6 Dec 2015 23:50:51 +0000 (10:50 +1100)]
selftests/powerpc: Add script to test HMI functionality
HMIs (Hypervisor Management|Maintenance Interrupts) are a class of interrupt
on POWER systems.
HMI support has traditionally been exceptionally difficult to test, however
Skiboot ships a tool that, with the correct magic numbers, will inject them.
This, therefore, is a first pass at a script to inject HMIs and monitor
Linux's response. It injects an HMI on each core on every chip in turn
It then watches dmesg to see if it's acknowledged by Linux.
On a Tuletta, I observed that we see 8 (or sometimes 9 or more) events per
injection, regardless of SMT setting, so we wait for 8 before progressing.
It sits in a new scripts/ directory in selftests/powerpc, because it's not
designed to be run as part of the regular make selftests process. In
particular, it is quite possibly going to end up garding lots of your CPUs,
so it should only be run if you know how to undo that.
CC: Mahesh J Salgaonkar <mahesh.salgaonkar@in.ibm.com> Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Tue, 15 Dec 2015 07:09:14 +0000 (18:09 +1100)]
powerpc: Remove broken GregorianDay()
GregorianDay() is supposed to calculate the day of the week
(tm->tm_wday) for a given day/month/year. In that calcuation it
indexed into an array called MonthOffset using tm->tm_mon-1. However
tm_mon is zero-based, not one-based, so this is off-by-one. It also
means that every January, GregoiranDay() will access element -1 of
the MonthOffset array.
It also doesn't appear to be a correct algorithm either: see in
contrast kernel/time/timeconv.c's time_to_tm function.
It's been broken forever, which suggests no-one in userland uses
this. It looks like no-one in the kernel uses tm->tm_wday either
(see e.g. drivers/rtc/rtc-ds1305.c:319).
tm->tm_wday is conventionally set to -1 when not available in
hardware so we can simply set it to -1 and drop the function.
(There are over a dozen other drivers in drivers/rtc that do
this.)
Found using UBSAN.
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andrew Morton <akpm@linux-foundation.org> # as an example of what UBSan finds. Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com> Cc: rtc-linux@googlegroups.com Signed-off-by: Daniel Axtens <dja@axtens.net> Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rashmica Gupta [Thu, 10 Dec 2015 09:49:33 +0000 (20:49 +1100)]
selftests/powerpc: Add test to check if VSRs are corrupted
When a transaction is aborted, VSR values should rollback to the
checkpointed values before the transaction began. VSRs used elsewhere in
the kernel during a transaction, or while the transaction is suspended
should not affect the checkpointed values.
Prior to the bug fix in commit d31626f70b61 ("powerpc: Don't corrupt
transactional state when using FP/VMX in kernel") when VMX was requested
by the kernel the .vr_state (which held the checkpointed state of VSRs
before the transaction) was overwritten with the current state from
outside the transation. Thus if the transaction did not complete, the
VSR values would be "rolled back" to potentially incorrect values.
Signed-off-by: Rashmica Gupta <rashmicy@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rashmica Gupta [Wed, 25 Nov 2015 02:46:25 +0000 (13:46 +1100)]
powerpc/xmon: Append linux_banner to exception information in xmon.
Currently if you are in xmon without an oops etc. to view the kernel
version you have to type "d $linux_banner" - not necessarily obvious. As
this is useful information, append to the output of "e" command.
Kernel prints respective warnings about various EPOW events for
user information/action after parsing EPOW interrupts. At times
below EPOW reset event warning is seen to be flooding kernel log
over a period of time.
May 25 03:46:34 alp kernel: Non critical power or cooling issue cleared
May 25 03:46:52 alp kernel: Non critical power or cooling issue cleared
May 25 03:53:48 alp kernel: Non critical power or cooling issue cleared
May 25 03:55:46 alp kernel: Non critical power or cooling issue cleared
May 25 03:56:34 alp kernel: Non critical power or cooling issue cleared
May 25 03:59:04 alp kernel: Non critical power or cooling issue cleared
May 25 04:02:01 alp kernel: Non critical power or cooling issue cleared
These EPOW reset events are spurious in nature and are triggered by
firmware without an actual EPOW event being reset. This patch avoids these
multiple EPOW reset warnings by using a counter variable. This variable
is incremented every time an EPOW event is reported. Upon receiving a EPOW
reset event the same variable is checked to filter out spurious events and
decremented accordingly.
This patch also improves log messages to better describe EPOW event being
reported. Merged adjacent log messages into single one to reduce number of
lines printed per event.
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Fri, 20 Nov 2015 04:15:34 +0000 (15:15 +1100)]
selftests/powerpc: Add TM signal with invalid stack test
Test the kernels signal generation code to ensure it can handle an
invalid stack pointer when transactional.
Signed-off-by: Michael Neuling <mikey@neuling.org> Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[mpe: Skip if we don't have TM] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Fri, 20 Nov 2015 04:15:33 +0000 (15:15 +1100)]
selftests/powerpc: Add TM signal return test
Test the kernel's signal return code to ensure that it doesn't crash
when both the transactional and suspend MSR bits are set in the signal
context.
Signed-off-by: Michael Neuling <mikey@neuling.org> Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[mpe: Skip if we don't have TM] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
You get the TM[] only if at least one TM MSR bit is set. Inside the
TM[], E means Enabled (bit 32), S means Suspended (bit 33), and T
means Transactional (bit 34)
If no bits are set, you get no TM[] output.
Include rework of printbits() to handle this case.
Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Boqun Feng [Mon, 2 Nov 2015 01:30:32 +0000 (09:30 +0800)]
powerpc: Make {cmp}xchg* and their atomic_ versions fully ordered
According to memory-barriers.txt, xchg*, cmpxchg* and their atomic_
versions all need to be fully ordered, however they are now just
RELEASE+ACQUIRE, which are not fully ordered.
So also replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
__{cmp,}xchg_{u32,u64} respectively to guarantee fully ordered semantics
of atomic{,64}_{cmp,}xchg() and {cmp,}xchg(), as a complement of commit b97021f85517 ("powerpc: Fix atomic_xxx_return barrier semantics")
This patch depends on patch "powerpc: Make value-returning atomics fully
ordered" for PPC_ATOMIC_ENTRY_BARRIER definition.
Cc: stable@vger.kernel.org # 3.2+ Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Boqun Feng [Mon, 2 Nov 2015 01:30:31 +0000 (09:30 +0800)]
powerpc: Make value-returning atomics fully ordered
According to memory-barriers.txt:
> Any atomic operation that modifies some state in memory and returns
> information about the state (old or new) implies an SMP-conditional
> general memory barrier (smp_mb()) on each side of the actual
> operation ...
Which mean these operations should be fully ordered. However on PPC,
PPC_ATOMIC_ENTRY_BARRIER is the barrier before the actual operation,
which is currently "lwsync" if SMP=y. The leading "lwsync" can not
guarantee fully ordered atomics, according to Paul Mckenney:
https://lkml.org/lkml/2015/10/14/970
To fix this, we define PPC_ATOMIC_ENTRY_BARRIER as "sync" to guarantee
the fully-ordered semantics.
This also makes futex atomics fully ordered, which can avoid possible
memory ordering problems if userspace code relies on futex system call
for fully ordered semantics.
Fixes: b97021f85517 ("powerpc: Fix atomic_xxx_return barrier semantics") Cc: stable@vger.kernel.org # 3.2+ Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The slot information of base page size hash pte is stored in the
pgtable_t w.r.t transparent hugepage. We need to make sure we don't
index beyond pgtable_t size.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
pte and pmd table size are dependent on config items. Don't
hard code the same. This make sure we use the right value
when masking pmd entries and also while checking pmd_bad
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For a pte entry we will have _PAGE_PTE set. Our pte page
address have a minimum alignment requirement of HUGEPD_SHIFT_MASK + 1.
We use the lower 7 bits to indicate hugepd. ie.
For pmd and pgd we can find:
1) _PAGE_PTE set pte -> indicate PTE
2) bits [2..6] non zero -> indicate hugepd.
They also encode the size. We skip bit 1 (_PAGE_PRESENT).
3) othewise pointer to next table.
Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
W.r.t hugetlb, we support two format for pmd. With book3s_64 and
64K linux page size, we can have pte at the pmd level. Hence we
don't need to support hugepd there. For everything else hugepd
is supported and pmd_huge is (0).
Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Add helper for converting pte bit to hpte bits
Instead of open coding it in multiple code paths, export the helper
and add more documentation. Also make sure we don't make assumption
regarding pte bit position
Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Don't track subpage valid bit in pte_t
This free up 11 bits in pte_t. In the later patch we also change
the pte_t format so that we can start supporting migration pte
at pmd level. We now track 4k subpage valid bit as below
If we have _PAGE_COMBO set, we override the _PAGE_F_GIX_SHIFT
and _PAGE_F_SECOND. Together we have 4 bits, each of them
used to indicate whether any of the 4 4k subpage in that group
is valid. ie,
[ group 1 bit ] [ group 2 bit ] ..... [ group 4 ]
[ subpage 1 - 4] [ subpage 5- 8] ..... [ subpage 13 - 16]
We still track each 4k subpage slot number and secondary hash
information in the second half of pgtable_t. Removing the subpage
tracking have some significant overhead on aim9 and ebizzy benchmark and
to support THP with 4K subpage, we do need a pgtable_t of 4096 bytes.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We copy only needed PTE bits define from pte-common.h to respective
hash related header. This should greatly simply later patches in which
we are going to change the pte format for hash config
Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Don't have generic headers introduce functions touching pte bits
We are going to drop pte_common.h in the later patch. The idea is to
enable hash code not require to define all PTE bits. Having PTE bits
defined in pte_common.h made the code unnecessarily complex.
Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Fix infinite loop in hash fault with 4K page size
This is the same bug we fixed as part of 09567e7fd44291bfc08accfdd67ad8f467842332
("powerpc/mm: Check paca psize is up to date for huge mappings"). Please
check that for details. The difference here is that faults were
happening on a 4K page at an address previously mapped by hugetlb.
Anton Blanchard [Wed, 9 Dec 2015 09:11:47 +0000 (20:11 +1100)]
powerpc: Fix DSCR inheritance over fork()
Two DSCR tests have a hack in them:
/*
* XXX: Force a context switch out so that DSCR
* current value is copied into the thread struct
* which is required for the child to inherit the
* changed value.
*/
sleep(1);
We should not be working around this in the testcase, it is a kernel bug.
Fix it by copying the current DSCR to the child, instead of what we
had in the thread struct at last context switch.
Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Thu, 10 Dec 2015 09:44:39 +0000 (20:44 +1100)]
powerpc: Call restore_sprs() before _switch()
commit 152d523e6307 ("powerpc: Create context switch helpers save_sprs()
and restore_sprs()") moved the restore of SPRs after the call to _switch().
There is an issue with this approach - new tasks do not return through
_switch(), they are set up by copy_thread() to directly return through
ret_from_fork() or ret_from_kernel_thread(). This means restore_sprs() is
not getting called for new tasks.
Fix this by moving restore_sprs() before _switch().
Fixes: 152d523e6307 ("powerpc: Create context switch helpers save_sprs() and restore_sprs()") Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Thu, 10 Dec 2015 09:04:05 +0000 (20:04 +1100)]
powerpc: Call check_if_tm_restore_required() in enable_kernel_*()
Commit a0e72cf12b1a ("powerpc: Create msr_check_and_{set,clear}()")
removed a call to check_if_tm_restore_required() in the
enable_kernel_*() functions. Add them back in.
Fixes: a0e72cf12b1a ("powerpc: Create msr_check_and_{set,clear}()") Reported-by: Rashmica Gupta <rashmicy@gmail.com> Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The enable_kernel_*() functions leave the relevant MSR bits enabled
until we exit the kernel sometime later. Create disable versions
that wrap the kernel use of FP, Altivec VSX or SPE.
While we don't want to disable it normally for performance reasons
(MSR writes are slow), it will be used for a debug boot option that
does this and catches bad uses in other areas of the kernel.
Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Thu, 29 Oct 2015 00:44:04 +0000 (11:44 +1100)]
powerpc: Create msr_check_and_{set,clear}()
Create helper functions to set and clear MSR bits after first
checking if they are already set. Grouping them will make it
easy to avoid the MSR writes in a subsequent optimisation.
Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Thu, 29 Oct 2015 00:44:02 +0000 (11:44 +1100)]
powerpc: Move part of giveup_vsx into c
Move the MSR modification into c. Removing it from the assembly
function will allow us to avoid costly MSR writes by batching them
up.
Check the FP and VMX bits before calling the relevant giveup_*()
function. This makes giveup_vsx() and flush_vsx_to_thread() perform
more like their sister functions, and allows us to use
flush_vsx_to_thread() in the signal code.
Move the check_if_tm_restore_required() check in.
Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Thu, 29 Oct 2015 00:44:00 +0000 (11:44 +1100)]
powerpc: Remove NULL task struct pointer checks in FP and vector code
We used to allow giveup_*() to be called with a NULL task struct
pointer. Now those cases are handled in the caller we can remove
the checks. We can also remove giveup_altivec_notask() which is also
unused.
Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Thu, 29 Oct 2015 00:43:58 +0000 (11:43 +1100)]
powerpc: Simplify TM restore checks
Instead of having multiple giveup_*_maybe_transactional() functions,
separate out the TM check into a new function called
check_if_tm_restore_required().
This will make it easier to optimise the giveup_*() functions in a
subsequent patch.
Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Thu, 29 Oct 2015 00:43:57 +0000 (11:43 +1100)]
powerpc: Remove UP only lazy floating point and vector optimisations
The UP only lazy floating point and vector optimisations were written
back when SMP was not common, and neither glibc nor gcc used vector
instructions. Now SMP is very common, glibc aggressively uses vector
instructions and gcc autovectorises.
We want to add new optimisations that apply to both UP and SMP, but
in preparation for that remove these UP only optimisations.
Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Thu, 29 Oct 2015 00:43:55 +0000 (11:43 +1100)]
powerpc: Create context switch helpers save_sprs() and restore_sprs()
Move all our context switch SPR save and restore code into two
helpers. We do a few optimisations:
- Group all mfsprs and all mtsprs. In many cases an mtspr sets a
scoreboarding bit that an mfspr waits on, so the current practise of
mfspr A; mtspr A; mfpsr B; mtspr B is the worst scheduling we can
do.
- SPR writes are slow, so check that the value is changing before
writing it.
Anton Blanchard [Thu, 29 Oct 2015 00:43:53 +0000 (11:43 +1100)]
powerpc: Don't disable kernel FP/VMX/VSX MSR bits on context switch
Writing the MSR is slow, so we want to avoid it whenever possible.
A subsequent patch will add a debug option that strictly manages the
FP/VMX/VSX unavailable bits. For now just remove it, matching what
we do in other areas of the kernel (eg enable_kernel_altivec()).
Paul Mackerras [Thu, 12 Nov 2015 05:44:42 +0000 (16:44 +1100)]
powerpc/64: Include KVM guest test in all interrupt vectors
Currently, if HV KVM is configured but PR KVM isn't, we don't include
a test to see whether we were interrupted in KVM guest context for the
set of interrupts which get delivered directly to the guest by hardware
if they occur in the guest. This includes things like program
interrupts.
However, the recent bug where userspace could set the MSR for a VCPU
to have an illegal value in the TS field, and thus cause a TM Bad Thing
type of program interrupt on the hrfid that enters the guest, showed that
we can never be completely sure that these interrupts can never occur
in the guest entry/exit code. If one of these interrupts does happen
and we have HV KVM configured but not PR KVM, then we end up trying to
run the handler in the host with the MMU set to the guest MMU context,
which generally ends badly.
Thus, for robustness it is better to have the test in every interrupt
vector, so that if some way is found to trigger some interrupt in the
guest entry/exit path, we can handle it without immediately crashing
the host.
This means that the distinction between KVMTEST and KVMTEST_PR goes
away. Thus we delete KVMTEST_PR and associated macros and use KVMTEST
everywhere that we previously used either KVMTEST_PR or KVMTEST. It
also means that SOFTEN_TEST_HV_201 becomes the same as SOFTEN_TEST_PR,
so we deleted SOFTEN_TEST_HV_201 and use SOFTEN_TEST_PR instead.
Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rashmica Gupta [Thu, 19 Nov 2015 06:04:53 +0000 (17:04 +1100)]
powerpc: Standardise on NR_syscalls rather than __NR_syscalls.
Most architectures use NR_syscalls as the #define for the number of syscalls.
We use __NR_syscalls, and then define NR_syscalls as __NR_syscalls.
__NR_syscalls is not used outside arch code, whereas NR_syscalls is. So as
NR_syscalls must be defined and __NR_syscalls does not, replace __NR_syscalls
with NR_syscalls.
Signed-off-by: Rashmica Gupta <rashmicy@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
John Ogness [Wed, 11 Nov 2015 13:48:50 +0000 (14:48 +0100)]
powerpc/powermac: set IRQF_NO_THREAD for xmon/cascade handlers
The xmon and cascade irq handlers must not run as threads.
pmac_pic_lock is already a raw_spinlock, but the irq flag
IRQF_NO_THREAD needs to be set as well.
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
cxl: use correct operator when writing pcie config space values
When writing a value to config space, cxl_pcie_write_config() calls
cxl_pcie_config_info() to obtain a mask and shift value, shifts the new
value accordingly, then uses the mask to combine the shifted value with the
existing value at the address as part of a read-modify-write pattern.
Currently, we use a logical OR operator rather than a bitwise OR operator,
which means any use of this function results in an incorrect value being
written. Replace the logical OR operator with a bitwise OR operator so the
value is written correctly.
Reported-by: Michael Ellerman <mpe@ellerman.id.au> Cc: stable@vger.kernel.org Fixes: 6f7f0b3df6d4 ("cxl: Add AFU virtual PHB and kernel API") Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Vaibhav Jain [Mon, 16 Nov 2015 04:03:45 +0000 (09:33 +0530)]
cxl: Fix possible idr warning when contexts are released
An idr warning is reported when a context is release after the capi card
is unbound from the cxl driver via sysfs. Below are the steps to
reproduce:
1. Create multiple afu contexts in an user-space application using libcxl.
2. Unbind capi card from cxl using command of form
echo <capi-card-pci-addr> > /sys/bus/pci/drivers/cxl-pci/unbind
3. Exit/kill the application owning afu contexts.
After above steps a warning message is usually seen in the kernel logs
of the form "idr_remove called for id=<context-id> which is not
allocated."
This is caused by the function cxl_release_afu which destroys the
contexts_idr table. So when a context is release no entry for context pe
is found in the contexts_idr table and idr code prints this warning.
This patch fixes this issue by increasing & decreasing the ref-count on
the afu device when a context is initialized or when its freed
respectively. This prevents the afu from being released until all the
afu contexts have been released. The patch introduces two new functions
namely cxl_afu_get/put that manage the ref-count on the afu device.
Also the patch removes code inside cxl_dev_context_init that increases ref
on the afu device as its guaranteed to be alive during this function.
Reported-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Thu, 19 Nov 2015 04:44:45 +0000 (15:44 +1100)]
powerpc/tm: Check for already reclaimed tasks
Currently we can hit a scenario where we'll tm_reclaim() twice. This
results in a TM bad thing exception because the second reclaim occurs
when not in suspend mode.
The scenario in which this can happen is the following. We attempt to
deliver a signal to userspace. To do this we need obtain the stack
pointer to write the signal context. To get this stack pointer we
must tm_reclaim() in case we need to use the checkpointed stack
pointer (see get_tm_stackpointer()). Normally we'd then return
directly to userspace to deliver the signal without going through
__switch_to().
Unfortunatley, if at this point we get an error (such as a bad
userspace stack pointer), we need to exit the process. The exit will
result in a __switch_to(). __switch_to() will attempt to save the
process state which results in another tm_reclaim(). This
tm_reclaim() now causes a TM Bad Thing exception as this state has
already been saved and the processor is no longer in TM suspend mode.
Whee!
This patch checks the state of the MSR to ensure we are TM suspended
before we attempt the tm_reclaim(). If we've already saved the state
away, we should no longer be in TM suspend mode. This has the
additional advantage of checking for a potential TM Bad Thing
exception.
Found using syscall fuzzer.
Fixes: fb09692e71f1 ("powerpc: Add reclaim and recheckpoint functions for context switching transactional memory processes") Cc: stable@vger.kernel.org # v3.9+ Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Thu, 19 Nov 2015 04:44:44 +0000 (15:44 +1100)]
powerpc/tm: Block signal return setting invalid MSR state
Currently we allow both the MSR T and S bits to be set by userspace on
a signal return. Unfortunately this is a reserved configuration and
will cause a TM Bad Thing exception if attempted (via rfid).
This patch checks for this case in both the 32 and 64 bit signals
code. If both T and S are set, we mark the context as invalid.
Found using a syscall fuzzer.
Fixes: 2b0a576d15e0 ("powerpc: Add new transactional memory state to the signal context") Cc: stable@vger.kernel.org # v3.9+ Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Linus Torvalds [Sun, 22 Nov 2015 23:21:40 +0000 (15:21 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge slub bulk allocator updates from Andrew Morton:
"This missed the merge window because I was waiting for some repairs to
come in. Nothing actually uses the bulk allocator yet and the changes
to other code paths are pretty small. And the net guys are waiting
for this so they can start merging the client code"
More comments from Jesper Dangaard Brouer:
"The kmem_cache_alloc_bulk() call, in mm/slub.c, were included in
previous kernel. The present version contains a bug. Vladimir
Davydov noticed it contained a bug, when kernel is compiled with
CONFIG_MEMCG_KMEM (see commit 03ec0ed57ffc: "slub: fix kmem cgroup
bug in kmem_cache_alloc_bulk"). Plus the mem cgroup counterpart in
kmem_cache_free_bulk() were missing (see commit 033745189b1b "slub:
add missing kmem cgroup support to kmem_cache_free_bulk").
I don't consider the fix stable-material because there are no in-tree
users of the API.
But with known bugs (for memcg) I cannot start using the API in the
net-tree"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
slab/slub: adjust kmem_cache_alloc_bulk API
slub: add missing kmem cgroup support to kmem_cache_free_bulk
slub: fix kmem cgroup bug in kmem_cache_alloc_bulk
slub: optimize bulk slowpath free by detached freelist
slub: support for bulk free with SLUB freelists
Linus Torvalds [Sun, 22 Nov 2015 23:10:57 +0000 (15:10 -0800)]
Merge tag 'tty-4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are a few small tty/serial driver fixes for 4.4-rc2 that resolve
some reported problems.
All have been in linux-next, full details are in the shortlog below"
* tag 'tty-4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: export fsl8250_handle_irq
serial: 8250_mid: Add missing dependency
tty: audit: Fix audit source
serial: etraxfs-uart: Fix crash
serial: fsl_lpuart: Fix earlycon support
bcm63xx_uart: Use the device name when registering an interrupt
tty: Fix direct use of tty buffer work
tty: Fix tty_send_xchar() lock order inversion