powerpc/mm/hash64: Fix subpage protection with 4K HPTE config
With Linux page size of 64K and hardware only supporting 4K HPTE, if we
use subpage protection, we always fail for the subpage 0 as shown
below (using the selftest subpage_prot test):
520175565: (4520111850): Failed at 0x3fffad4b0000 (p=13,sp=0,w=0), want=fault, got=pass ! 4520890210: (4520826495): Failed at 0x3fffad5b0000 (p=29,sp=0,w=0), want=fault, got=pass ! 4521574251: (4521510536): Failed at 0x3fffad6b0000 (p=45,sp=0,w=0), want=fault, got=pass ! 4522258324: (4522194609): Failed at 0x3fffad7b0000 (p=61,sp=0,w=0), want=fault, got=pass !
This is because hash preload wrongly inserts the HPTE entry for subpage
0 without looking at the subpage protection information.
Fix it by teaching should_hash_preload() not to preload if we have
subpage protection configured for that range.
It appears this has been broken since it was introduced in 2008.
Fixes: fa28237cfcc5 ("[POWERPC] Provide a way to protect 4k subpages when using 64k pages") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Rework into should_hash_preload() to avoid build fails w/SLICES=n] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/hash64: Factor out hash preload psize check
Currently we have a check in hash_preload() against the psize, which is
only included when CONFIG_PPC_MM_SLICES is enabled. We want to expand
this check in a subsequent patch, so factor it out to allow that. As a
bonus it removes the #ifdef in the C code.
Unfortunately we can't put this in the existing CONFIG_PPC_MM_SLICES
block because it would require a forward declaration.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc: Update of_remove_property() call sites to remove null checking
After obtaining a property from of_find_property() and before calling
of_remove_property() most code checks to ensure that the property
returned from of_find_property() is not null. The previous patch moved
this check to the start of the function of_remove_property() in order to
avoid the case where this check isn't done and a null value is passed.
This ensures the check is always conducted before taking locks and
attempting to remove the property. Thus it is no longer necessary to
perform a check for null values before invoking of_remove_property().
Update of_remove_property() call sites in order to remove redundant
checking for null property value as check is now performed within the
of_remove_property function().
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
[mpe: Unbreak some lines which are just >80 chars for readability] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
drivers/of: Add check for null property in of_remove_property()
The validity of the property input argument to of_remove_property() is
never checked within the function and thus it is possible to pass a null
value. It happens that this will be picked up in __of_remove_property()
as no matching property of the device node will be found and thus an
error will be returned, however once again there is no explicit check
for a null value. By the time this is detected 2 locks have already been
acquired which is completely unnecessary if the property to remove is
null.
Add an explicit check in the function of_remove_property() for a null
property value and return -ENODEV in this case, this is consistent with
what the previous return value would have been when the null value was
not detected and passed to __of_remove_property().
By moving an explicit check for the property paramenter into the
of_remove_property() function, this will remove the need to perform this
check in calling code before invocation of the of_remove_property()
function.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/pseries: Add null property check to pseries_discover_pic()
The return value of of_get_property() isn't checked before it is passed
to the strstr() function, if it happens that the return value is null
then this will result in a null pointer being dereferenced.
Add a check to see if the return value of of_get_property() is null and
if it is continue straight on to the next node.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: Chris Smart <chris@distroguy.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/powernv/pci: Fix cfg_dbg() & replace with pr_devel()
When cfg_dbg() is enabled (i.e. mapped to printk()), gcc produces
errors as the __func__ parameter is missing (pnv_pci_cfg_read() has one);
this adds the missing parameter.
cfg_dbg() is just an inferior version of pr_devel() so use the latter
instead.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Chris Smart [Mon, 2 May 2016 03:51:38 +0000 (13:51 +1000)]
selftests/powerpc: Test cp_abort during context switch
Test that performing a copy paste sequence in userspace on P9 does not
result in a leak of the copy into the paste of another process.
This is based on Anton Blanchard's context_switch benchmarking code. It
sets up two processes tied to the same CPU, one which copies and one
which pastes.
The paste should never succeed and the test fails if it does.
This is a test for commit, "8a64904 powerpc: Add support for userspace
P9 copy paste."
Patch created with much assistance from Michael Neuling
<mikey@neuling.org>
Signed-off-by: Chris Smart <chris@distroguy.com> Reviewed-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Chris Smart [Mon, 2 May 2016 06:00:58 +0000 (16:00 +1000)]
powerpc: Remove unnecessary CONFIG_SMP #ifdefs
The code in machine_restart/power_off/halt() includes #ifdefs around
calls to smp_send_stop(), however these are not required as
include/linux/smp.h includes an empty version of this function for
CONFIG_SMP=n builds.
Signed-off-by: Chris Smart <chris@distroguy.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Support for the A2 cpu was removed in commit fb5a515704d7 ("powerpc:
Remove platforms/wsp and associated pieces"), and the externs:
__setup_cpu_a2 and __restore_cpu_a2 are still around and unused, so
remove them.
Signed-off-by: Rashmica Gupta <rashmicy@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The usage in mm mmu_context_nohash.c is bogus, because we set the
context.id value to MMU_NO_CONTEXT 4 lines previously in the same
function, meaning slice_mm_new_context() will always be true.
The book3s 64 usage was removed in the previous commit. So remove it as
unused.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/subpage: Initialise user psize correctly
As part of the radix support we switched Book3s64 to use a value of ~0
for MMU_NO_CONTEXT. That is because id 0 is special on radix.
However that broke the logic in init_new_context(). The code there needs
to differentiate between a newly allocated context and one inherited via
fork. Previously it worked because a newly allocated context has an id
of zero (because it was just memset() to zero), which used to match
MMU_NO_CONTEXT, and therefore slice_mm_new_context() did the right
thing.
Instead check against a context.id value of zero instead of using
slice_mm_new_context().
Without this patch we never call slice_set_user_psize(), and end up with
a slice psize value of zero and we always end up using 4K HPTE.
powerpc/mm/radix: Use firmware feature to enable Radix MMU
We use the existing "ibm,pa-features" device-tree property to enable
Radix MMU mode. This means we default to hash mode unless firmware tells
us it's OK to start using Radix mode.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/radix: Add THP support for 4K linux page size
This adds THP support for 4K Linux page size config with radix. We still
don't do THP with 4K Linux page size and hash page table. Hash page
table needs a 16MB hugepage and we can't do THP with 16MM hugepage and
4K Linux page size.
We add missing functions to 4K hash config to get it to build and
hash__has_transparent_hugepage() makes sure we don't enable THP for 4K
hash config. To catch wrong usage of THP related with 4K config, we add
BUG() in those dummy functions we added to get it compile.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The deposited pgtable_t is a pte fragment hence we cannot use page->lru
for linking then together. We use the first two 64 bits for pte fragment
as list_head type to link all deposited fragments together. On withdraw
we properly zero then out.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In this patch we make the number of pte fragments per level 4 page table
page a variable. Radix level 4 table size is 256 bytes and hence we can
have 256 fragments per level 4 page. We don't update the fragment count
in this patch. We need to do performance measurements to find the right
value for fragment count.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: vmalloc abstraction in preparation for radix
The vmalloc range differs between hash and radix config. Hence make
VMALLOC_START and related constants a variable which will be runtime
initialized depending on whether hash or radix mode is active.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Fix missing init of ioremap_bot in pgtable_64.c for ppc64e] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Make 4K and 64K use pte_t for pgtable_t
This patch switches 4K Linux page size config to use pte_t * type
instead of struct page * for pgtable_t. This simplifies the code a lot
and helps in consolidating both 64K and 4K page allocator routines. The
changes should not have any impact, because we already store physical
address in the upper level page table tree and that implies we already
do struct page * to physical address conversion.
One change to note here is we move the pgtable_page_dtor() call for
nohash to pte_fragment_free_mm(). The nohash related change is due to
the related changes in pgtable_64.c.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Revert changes made to nohash pgalloc-64.h
This reverts pgalloc related changes WRT implementing 4-level page
table for 64K Linux page size and storing of physical address in higher
level page tables since they are only applicable to book3s64 variant
and we now have a separate copy for book3s64. This helps to keep these
headers simpler.
Cc: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Make a copy of pgalloc.h for 32 and 64 book3s
This patch start to make a book3s variant for pgalloc headers. We have
multiple book3s specific changes such as:
* 4 level page table
* store physical address in higher level table
* use pte_t * for pgtable_t
Having a book3s64 specific variant helps to keep code simpler and remove
lots of #ifdef around code.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/radix: Pick the address layout for radix config
Hash needs special get_unmapped_area() handling because of limitations
around base page size, so we have to set HAVE_ARCH_UNMAPPED_AREA.
With radix we don't have such restrictions, so we could use the generic
code. But because we've set HAVE_ARCH_UNMAPPED_AREA (for hash), we have
to re-implement the same logic as the generic code.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We are going to add asm changes in the follow up patches. Add the
feature bit now so that we can get it all build.
mpe: When CONFIG_PPC_RADIX_MMU=n we omit MMU_FTR_RADIX from the
MMU_FTRS_POSSIBLE mask. This allows the compiler to work out that those
checks will always be false and so the code can be elided completely.
Note we do *not* define MMU_FTR_RADIX to 0 in the RADIX_MMU=n case,
because that doesn't work with the ASM_FTR patching. In particular an
IF_SET section will result in a mask and value of zero, which is always
true, meaning the section *won't* be patched, which is the opposite of
what we want.
Michael Ellerman [Wed, 11 May 2016 05:30:47 +0000 (15:30 +1000)]
powerpc/mm: Add mask of possible MMU features
Follow the example of the cpu feature code, and add a mask of possible
MMU features, MMU_FTRS_POSSIBLE.
This is used in mmu_has_feature(), which allows the possible mask to act
as a shortcut for any features that are not possible, but still allows
the feature bit itself to be defined.
We will use this in the next commit to allow MMU_FTR_RADIX checks to be
elided when MMU_FTR_RADIX is not possible.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Core kernel doesn't track the page size of the VA range that we are
invalidating. Hence we end up flushing TLB for the entire mm here. Later
patches will improve this.
We also don't flush page walk cache separetly instead use RIC=2 when
flushing TLB, because we do a MMU gather flush after freeing page table.
MMU_NO_CONTEXT is updated for hash.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Abstraction for vmemmap and map_kernel_page()
For hash we create vmemmap mapping using bolted hash page table entries.
For radix we fill the radix page table. The next patch will add the
radix details for creating vmemmap mappings.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/radix: Add radix callbacks for early init routines
This adds routines for early setup for radix. We use device tree
property "ibm,processor-radix-AP-encodings" to find supported page
sizes. If we don't find the above we consider 64K and 4K as supported
page sizes.
We do map vmemap using 2M page size if we can. The linear mapping is
done such that we use required page size for that range. For example
memory of 3.5G is mapped such that we use 1G mapping till 3G range and
use 2M mapping for the rest.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/radix: Add radix callback for pmd accessors
This only does 64K Linux page support for now. 64K hash Linux config
THP needs to differentiate it from hugetlb huge page because with THP we
need to track hash pte slot information with respect to each subpage.
This is not needed with hugetlb hugepage, because we don't do MPSS with
hugetlb.
Radix doesn't have any such restrictions.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Move hugetlb and THP related pmd accessors to pgtable.h
Here we create pgtable-64/4k.h and move pmd accessors that are common
between hash and radix there. We can't do much sharing with 4K Linux
page size because 4K Linux page size with hash config doesn't support
THP. So for now it is empty. In later patches we will add functions that
does conditional hash/radix accessors there.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For those pte accessors, that operate on a different set of pte bits
between hash/radix, we add a generic variant that does a conditional
to hash linux or radix variant.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In this patch we add the radix Kconfig and conditional check.
radix_enabled() is written to always return 0 here. Once we have all
needed radix changes added, we will update this to an mmu_feature check.
We need to add this early so that we can get it all build in the early
stage.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds Power ISA 3.0 specific pte defines. We share most of the
details with hash Linux page table format. This patch indicates only
things where we differ.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Radix and hash MMU models support different page table sizes. Make
the #defines a variable so that existing code can work with variable
sizes.
Slice related code is only used by hash, so use hash constants there. We
will replicate some of the boundary conditions with resepct to TASK_SIZE
using radix values too. Right now we do boundary condition check using
hash constants.
Swapper pgdir size is initialized in asm code. We select the max pgd
size to keep it simple. For now we select hash pgdir. When adding radix
we will switch that to radix pgdir which is 64K.
BUILD_BUG_ON check which is removed is already done in hugepage_init()
using MAYBE_BUILD_BUG_ON().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/book3s: Rename hash specific PTE bits to carry H_ prefix
This helps to make following hash only pte bits easier.
We have kept _PAGE_CHG_MASK, _HPAGE_CHG_MASK and _PAGE_PROT_BITS as it
is in this patch eventhough they use hash specific bits. Using them in
radix as it is should be ok, because with radix we expect those bit
positions to be zero.
Only renames in this patch, no change in functionality.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PowerISA 3.0 adds a parition table indexed by LPID. Parition table
allows us to specify the MMU model that will be used for guest and host
translation.
This patch adds support with SLB based hash model (UPRT = 0). What is
required with this model is to support the new hash page table entry
format and also setup partition table such that we use hash table for
address translation.
We don't have segment table support yet.
In order to make sure we don't load KVM module on Power9 (since we don't
have kvm support yet) this patch also disables KVM on Power9.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Use generic version of ptep_clear_flush_young()
The radix variant is going to require a flush_tlb_range(). With
flush_tlb_range() added, ptep_clear_flush_young() is the same as the
generic version. So drop the powerpc specific variant.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Use generic version of pmdp_clear_flush_young()
The radix variant is going to require a flush_pmd_tlb_range(). With
flush_pmd_tlb_range() added, pmdp_clear_flush_young() is the same as the
generic version. So drop the powerpc specific variant.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PowerISA 3.0 introduces two pte bits with the below meaning for radix:
00 -> Normal Memory
01 -> Strong Access Order (SAO)
10 -> Non idempotent I/O (Cache inhibited and guarded)
11 -> Tolerant I/O (Cache inhibited)
We drop the existing WIMG bits in the Linux page table in favour of the
above constants. We loose _PAGE_WRITETHRU with this conversion. We only
use writethru via pgprot_cached_wthru() which is used by
fbdev/controlfb.c which is Apple control display and also PPC32.
With respect to _PAGE_COHERENCE, we have been marking hpte always
coherent for some time now. htab_convert_pte_flags() always added
HPTE_R_M.
NOTE: KVM changes need closer review.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PS3 had used a PPP bit hack to implement a read only mapping in the
kernel area. Since we are bolting the ioremap area, it used the pte
flags _PAGE_PRESENT | _PAGE_USER to get a PPP value of 0x3 there by
resulting in a read only mapping. This means the area can be accessed by
user space, but kernel will never return such an address to user space.
But we can do better by implementing a read only kernel mapping using
PPP bits 0b110.
This also allows us to do read only kernel mapping for radix in later
patches.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PTE_RPN_SHIFT is actually page size dependent. Even though PowerISA 3.0
expects only the lower 12 bits to be zero, we will always find the pages
to be PAGE_SHIFT aligned. In case of hash config, this also allows us to
use the additional 3 bits to track pte specific information. We need
to make sure we use these bits only for hash specific pte flags.
For both 4K and 64K config, pte now can hold 57 bits address.
Inorder to keep things simpler, drop PTE_RPN_SHIFT and PTE_RPN_SIZE and
specify the 57 bit detail explicitly.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Replace _PAGE_USER with _PAGE_PRIVILEGED
_PAGE_PRIVILEGED means the page can be accessed only by the kernel. This
is done to keep pte bits similar to PowerISA 3.0 Radix PTE format. User
pages are now marked by clearing _PAGE_PRIVILEGED bit.
Previously we allowed the kernel to have a privileged page in the lower
address range (USER_REGION). With this patch such access is denied.
We also prevent a kernel access to a non-privileged page in higher
address range (ie, REGION_ID != 0).
Both the above access scenarios should never happen.
Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jeremy Kerr <jk@ozlabs.org> Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We have a common declaration in pte-common.h Add a book3s specific one
and switch to pte_user() in callchain.c. In a subsequent patch we will
switch _PAGE_USER to _PAGE_PRIVILEGED in the book3s version only.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Fri, 29 Apr 2016 13:25:32 +0000 (23:25 +1000)]
powerpc/mm: Convert pte_user() to static inline
In a subsequent patch we want to add a second definition of pte_user().
Before we do that, make the signature clear, ie. it takes a pte_t and
returns bool.
We move it up inside the existing #ifndef __ASSEMBLY__ block, but
otherwise it's a straight conversion.
Convert the call in settlbcam(), which passes an unsigned long, to pass
a pte_t.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/subpage: Clear RWX bit to indicate no access
Subpage protection used to depend on the _PAGE_USER bit to implement no
access mode. This patch switches that to use _PAGE_RWX. We clear Read,
Write and Execute access from the pte instead of clearing _PAGE_USER
now. This was done so that we can switch to _PAGE_PRIVILEGED in a later
patch.
subpage_protection() returns pte bits that need to be cleared. Instead
of updating the interface to handle no-access in a separate way, it
appears simpler to clear RWX acecss to indicate no access.
We still don't insert hash ptes for no access implied by !_PAGE_RWX.
Hence we should not get PROT_FAULT with change.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Use _PAGE_READ to indicate Read access
This splits the _PAGE_RW bit into _PAGE_READ and _PAGE_WRITE. It also
removes the dependency on _PAGE_USER for implying read only. Few things
to note here is that, we have read implied with write and execute
permission. Hence we should always find _PAGE_READ set on hash pte
fault.
We still can't switch PROT_NONE to !(_PAGE_RWX). Auto numa depends on
marking a prot none pte _PAGE_WRITE. (For more details look at b191f9b106ea "mm: numa: preserve PTE write permissions across a NUMA
hinting fault")
Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jeremy Kerr <jk@ozlabs.org> Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Fri, 29 Apr 2016 13:25:29 +0000 (23:25 +1000)]
powerpc/mm: Use pte_raw() in pte_same()/pmd_same()
We can avoid doing endian conversions by using pte_raw() in pxx_same().
The swap of the constant (_PAGE_HPTEFLAGS) should be done at compile
time by the compiler.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Use big endian Linux page tables for book3s 64
Traditionally Power server machines have used the Hashed Page Table MMU
mode. In this mode Linux manages its own tree of nested page tables,
aka. "the Linux page tables", which are not used by the hardware
directly, and software loads translations into the hash page table for
use by the hardware.
Power ISA 3.0 defines a new MMU mode, known as Radix Tree Translation,
where the hardware can directly operate on the Linux page tables.
However the hardware requires that the page tables be in big endian
format.
To accommodate this, switch the pgtable types to __be64 and add
appropriate endian conversions.
Because we will be supporting a single kernel binary that boots using
either radix or hash mode, we always store the Linux page tables big
endian, even in hash mode where they are not actually used by the
hardware.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Fix sparse errors, flesh out change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Fri, 29 Apr 2016 13:25:27 +0000 (23:25 +1000)]
powerpc/mm: Add pte_xchg() helper
We have five locations in 64-bit hash MMU code that do a cmpxchg() of a
PTE. Currently doing it inline OK, but in a future patch we will be
converting the PTEs to __be64 in some configs. In that case we will need
casts at every cmpxchg() site in order to keep sparse happy.
So move the logic into a helper, this is a reasonably nice cleanup on
its own.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Fri, 29 Apr 2016 13:25:25 +0000 (23:25 +1000)]
powerpc/mm: Always use STRICT_MM_TYPECHECKS
Testing done by Paul Mackerras has shown that with a modern compiler
there is no negative effect on code generation from enabling
STRICT_MM_TYPECHECKS.
So remove the option, and always use the strict type definitions.
Acked-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
IB/qib: Use cache inhibitted and guarded mapping on powerpc
The driver was requesting for a writethrough mapping. But with those
flags we will end up with an SAO mapping because we now have memory
conherence always enabled. ie, the existing mapping will end up with a
WIMG value 0b1110 which is Strong Access Order.
Update this to use cache inhibitted guarded mapping.
Cc: Doug Ledford <dledford@redhat.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: linux-rdma@vger.kernel.org Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
ftrace: Match dot symbols when searching functions on ppc64
In the ppc64 big endian ABI, function symbols point to function
descriptors. The symbols which point to the function entry points
have a dot in front of the function name. Consequently, when the
ftrace filter mechanism searches for the symbol corresponding to
an entry point address, it gets the dot symbol.
As a result, ftrace filter users have to be aware of this ABI detail on
ppc64 and prepend a dot to the function name when setting the filter.
The perf probe command insulates the user from this by ignoring the dot
in front of the symbol name when matching function names to symbols,
but the sysfs interface does not. This patch makes the ftrace filter
mechanism do the same when searching symbols.
Fixes the following failure in ftracetest's kprobe_ftrace.tc:
.../kprobe_ftrace.tc: line 9: echo: write error: Invalid argument
That failure is on this line of kprobe_ftrace.tc:
echo _do_fork > set_ftrace_filter
This is because there's no _do_fork entry in the functions list:
Daniel Axtens [Tue, 26 Apr 2016 13:49:09 +0000 (23:49 +1000)]
powerpc: rework sparse for lib/xor_vmx.c
Sparse doesn't seem to be passing -maltivec around properly, leading
to lots of errors:
.../include/altivec.h:34:2: error: Use the "-maltivec" flag to enable PowerPC AltiVec support
arch/powerpc/lib/xor_vmx.c:27:16: error: Expected ; at end of declaration
arch/powerpc/lib/xor_vmx.c:27:16: error: got signed
arch/powerpc/lib/xor_vmx.c:60:9: error: No right hand side of '*'-expression
arch/powerpc/lib/xor_vmx.c:60:9: error: Expected ; at end of statement
arch/powerpc/lib/xor_vmx.c:60:9: error: got v1_in
...
arch/powerpc/lib/xor_vmx.c:87:9: error: too many errors
Only include the altivec.h header for non-__CHECKER__ builds.
For builds with __CHECKER__, make up some stubs instead, as
suggested by Balbir. (The vector size of 16 is arbitrary.)
Suggested-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Daniel Axtens <dja@axtens.net> Tested-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Chris Smart [Tue, 26 Apr 2016 00:28:50 +0000 (10:28 +1000)]
powerpc: Add support for userspace P9 copy paste
The copy paste facility introduced in POWER9 provides an optimised
mechanism for a userspace application to copy a cacheline. This is
provided by a pair of instructions, copy and paste, while a third,
cp_abort (copy paste abort), provides a clean up of the state in case of
a failure.
The copy instruction will read a 128 byte cacheline and store it in an
internal buffer. The subsequent paste instruction will store this
internal buffer to memory and set a CR field if the paste succeeds.
Since the state of the copy paste buffer is internal (and not
architecturally visible), in the unlikely event of a context switch, the
state cannot be stored and the paste should therefore fail.
The cp_abort instruction exists to fail and clean up any such
interrupted copy paste sequence and is to be called by the kernel as
part of the context switch. Doing so prevents data from a preceding copy
in one process leaking into the paste of another.
This code enables use of the cp_abort instruction if a supported
processor is detected.
NOTE: this is for userspace only, not in kernel, and does not deal
with KVM guests.
Patch created with much assistance from Michael Neuling
<mikey@neuling.org>
Signed-off-by: Chris Smart <chris@distroguy.com> Reviewed-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
mpic_init_sys() currently doesn't check whether
subsys_system_register() succeeded or not. Check the return code of
subsys_system_register() and clean up if there's an error.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Andrew Donnellan [Tue, 26 Apr 2016 05:02:50 +0000 (15:02 +1000)]
powerpc/eeh: fix misleading indentation
Found by smatch.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
cxl: Fix DAR check & use REGION_ID instead of opencoding
The current code will set _PAGE_USER to the access flags for any
fault address, because the ~ operation will be true for all address we
take a fault on. But setting _PAGE_USER also means that the fault will
be handled only if the page table have _PAGE_USER set. Hence there is
no security hole with the current code.
Now if it is an user space access, then the change in this patch really
don't have an impact because we have (!ctx->kernel) set true
and we take the if condition true.
Now kernel context created fault on an address in the kernel range
will result in a fault loop because we will not insert the
hash pte due to access and pte permission mismatch. This patch fix
the above issue.
Fixes: f204e0b8cedd ("cxl: Driver code for powernv PCIe based cards for userspace access") Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
cxl: Increase timeout for detection of AFU mmio hang
PSL designers recommend a larger value for the mmio hang pulse, 256 us
instead of 1 us. The CAIA architecture states that it needs to be
smaller than 1/2 of the RTOS timeout set in the PHB for outbound
non-posted transactions, which is still (easily) the case here.
Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Tested-by: Frank Haverkamp <haver@linux.vnet.ibm.com> Tested-by: Manoj Kumar <manoj@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Frederic Barrat [Mon, 21 Mar 2016 19:32:48 +0000 (14:32 -0500)]
cxl: Allow initialization on timebase sync failures
Failure to synchronize the PSL timebase currently prevents the
initialization of the cxl card, thus rendering the card useless. This
is too extreme for a feature which is rarely used, if at all. No
hardware AFUs or software is currently using PSL timebase.
This patch still tries to synchronize the PSL timebase when the card
is initialized, but ignores the error if it can't. Instead, it reports
a status via /sys.
Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anju T [Sat, 20 Feb 2016 05:02:47 +0000 (10:32 +0530)]
tools/perf: Map the ID values with register names
Map ID values with corresponding register names. These names are then
displayed when user issues perf record with the -I option
followed by perf report/script with -D option.
To test this patchset, Eg:
$ perf record -I ls # record machine state at interrupt
$ perf script -D # read the perf.data file
Sample output obtained for this patch / output looks like as follows:
Signed-off-by: Anju T <anju@linux.vnet.ibm.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anju T [Sat, 20 Feb 2016 05:02:46 +0000 (10:32 +0530)]
powerpc/perf: Add support for sampling interrupt register state
The perf infrastructure uses a bit mask to find out valid registers to
display. Define a register mask for supported registers defined in
uapi/asm/perf_regs.h. The bit positions also correspond to register IDs
which is used by perf infrastructure to fetch the register values.
CONFIG_HAVE_PERF_REGS enables sampling of the interrupted machine state.
Signed-off-by: Anju T <anju@linux.vnet.ibm.com>
[mpe: Add license, use CONFIG_PPC64, fix 32-bit build] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anju T [Sat, 20 Feb 2016 05:02:45 +0000 (10:32 +0530)]
powerpc/perf: Assign an id to each powerpc register
The enum definition assigns an 'id' to each register in "struct pt_regs"
of arch/powerpc. The order of these values in the enum definition are
based on the order of members in pt_regs.
Signed-off-by: Anju T <anju@linux.vnet.ibm.com>
[mpe: Rename LNK to LINK, use _UAPI_ASM for include guards] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Hari Bathini [Thu, 7 Apr 2016 22:00:34 +0000 (03:30 +0530)]
powerpc/book3s64: Remove __end_handlers marker
The __end_handlers marker was intended to mark down upto code that gets
called from exception prologs. But that hasn't kept pace with code
changes. Case in point, slb_miss_realmode being called from exception
prolog code but isn't below __end_handlers marker. So, __end_handlers
marker is as good as a comment but could be misleading at times if it
isn't in sync with the code, as is the case now. So, let us avoid this
confusion by having a better comment and removing __end_handlers marker
altogether.
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Hari Bathini [Fri, 15 Apr 2016 12:48:02 +0000 (22:48 +1000)]
powerpc/book3s64: Fix branching to OOL handlers in relocatable kernel
Some of the interrupt vectors on 64-bit POWER server processors are only
32 bytes long (8 instructions), which is not enough for the full
first-level interrupt handler. For these we need to branch to an
out-of-line (OOL) handler. But when we are running a relocatable kernel,
interrupt vectors till __end_interrupts marker are copied down to real
address 0x100. So, branching to labels (ie. OOL handlers) outside this
section must be handled differently (see LOAD_HANDLER()), considering
relocatable kernel, which would need at least 4 instructions.
However, branching from interrupt vector means that we corrupt the
CFAR (come-from address register) on POWER7 and later processors as
mentioned in commit 1707dd16. So, EXCEPTION_PROLOG_0 (6 instructions)
that contains the part up to the point where the CFAR is saved in the
PACA should be part of the short interrupt vectors before we branch out
to OOL handlers.
But as mentioned already, there are interrupt vectors on 64-bit POWER
server processors that are only 32 bytes long (like vectors 0x4f00,
0x4f20, etc.), which cannot accomodate the above two cases at the same
time owing to space constraint. Currently, in these interrupt vectors,
we simply branch out to OOL handlers, without using LOAD_HANDLER(),
which leaves us vulnerable when running a relocatable kernel (eg. kdump
case). While this has been the case for sometime now and kdump is used
widely, we were fortunate not to see any problems so far, for three
reasons:
1. In almost all cases, production kernel (relocatable) is used for
kdump as well, which would mean that crashed kernel's OOL handler
would be at the same place where we end up branching to, from short
interrupt vector of kdump kernel.
2. Also, OOL handler was unlikely the reason for crash in almost all
the kdump scenarios, which meant we had a sane OOL handler from
crashed kernel that we branched to.
3. On most 64-bit POWER server processors, page size is large enough
that marking interrupt vector code as executable (see commit 429d2e83) leads to marking OOL handler code from crashed kernel,
that sits right below interrupt vector code from kdump kernel, as
executable as well.
Let us fix this by moving the __end_interrupts marker down past OOL
handlers to make sure that we also copy OOL handlers to real address
0x100 when running a relocatable kernel.
This fix has been tested successfully in kdump scenario, on an LPAR with
4K page size by using different default/production kernel and kdump
kernel.
Also tested by manually corrupting the OOL handlers in the first kernel
and then kdump'ing, and then causing the OOL handlers to fire - mpe.
Fixes: c1fb6816fb1b ("powerpc: Add relocation on exception vector handlers") Cc: stable@vger.kernel.org Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>