8xx uses a two level page table with two different linux page size
support (4k and 16k). 8xx also support two different hugepage sizes
512k and 8M. In order to support them on linux we define two different
page table layout.
The size of pages is in the PGD entry, using PS field (bits 28-29):
00 : Small pages (4k or 16k)
01 : 512k pages
10 : reserved
11 : 8M pages
For 512K hugepage size a pgd entry have the below format
[<hugepte address >0101] . The hugepte table allocated will contain 8
entries pointing to 512K huge pte in 4k pages mode and 64 entries in
16k pages mode.
For 8M in 16k mode, a pgd entry have the below format
[<hugepte address >1101] . The hugepte table allocated will contain 8
entries pointing to 8M huge pte.
For 8M in 4k mode, multiple pgd entries point to the same hugepte
address and pgd entry will have the below format
[<hugepte address>1101]. The hugepte table allocated will only have one
entry.
For the time being, we do not support CPU15 ERRATA when HUGETLB is
selected
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> (v3, for the generic bits) Signed-off-by: Scott Wood <oss@buserror.net>
Today there are two implementations of hugetlbpages which are managed
by exclusive #ifdefs:
* FSL_BOOKE: several directory entries points to the same single hugepage
* BOOK3S: one upper level directory entry points to a table of hugepages
In preparation of implementation of hugepage support on the 8xx, we
need a mix of the two above solutions, because the 8xx needs both cases
depending on the size of pages:
* In 4k page size mode, each PGD entry covers a 4M bytes area. It means
that 2 PGD entries will be necessary to cover an 8M hugepage while a
single PGD entry will cover 8x 512k hugepages.
* In 16 page size mode, each PGD entry covers a 64M bytes area. It means
that 8x 8M hugepages will be covered by one PGD entry and 64x 512k
hugepages will be covers by one PGD entry.
This patch:
* removes #ifdefs in favor of if/else based on the range sizes
* merges the two huge_pte_alloc() functions as they are pretty similar
* merges the two hugetlbpage_init() functions as they are pretty similar
Today powerpc64 uses a set of pgtable_caches while powerpc32 uses
standard pages when using 4k pages and a single pgtable_cache
if using other size pages.
In preparation of implementing huge pages on the 8xx, this patch
replaces the specific powerpc32 handling by the 64 bits approach.
This is done by:
* moving 64 bits pgtable_cache_add() and pgtable_cache_init()
in a new file called init-common.c
* modifying pgtable_cache_init() to also handle the case
without PMD
* removing the 32 bits version of pgtable_cache_add() and
pgtable_cache_init()
* copying related header contents from 64 bits into both the
book3s/32 and nohash/32 header files
On the 8xx, the following cache sizes will be used:
* 4k pages mode:
- PGT_CACHE(10) for PGD
- PGT_CACHE(3) for 512k hugepage tables
* 16k pages mode:
- PGT_CACHE(6) for PGD
- PGT_CACHE(7) for 512k hugepage tables
- PGT_CACHE(3) for 8M hugepage tables
Nicholas Piggin [Mon, 28 Nov 2016 01:42:26 +0000 (12:42 +1100)]
powerpc/boot: Request no dynamic linker for boot wrapper
The boot wrapper performs its own relocations and does not require
PT_INTERP segment. However currently we don't tell the linker that.
Prior to binutils 2.28 that works OK. But since binutils commit 1a9ccd70f9a7 ("Fix the linker so that it will not silently generate ELF
binaries with invalid program headers. Fix readelf to report such
invalid binaries.") binutils tries to create a program header segment
due to PT_INTERP, and the link fails because there is no space for it:
ld: arch/powerpc/boot/zImage.pseries: Not enough room for program headers, try linking with -N
ld: final link failed: Bad value
So tell the linker not to do that, by passing --no-dynamic-linker.
Cc: stable@vger.kernel.org Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Drop dependency on ld-version.sh and massage change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Libin [Sun, 6 Dec 2015 02:02:56 +0000 (10:02 +0800)]
powerpc/ftrace: Fix the comments for ftrace_modify_code
There is no need to worry about module and __init text disappearing
case, because that ftrace has a module notifier that is called when a
module is being unloaded and before the text goes away and this code
grabs the ftrace_lock mutex and removes the module functions from the
ftrace list, such that it will no longer do any modifications to that
module's text, the update to make functions be traced or not is done
under the ftrace_lock mutex as well. And by now, __init section codes
should not been modified by ftrace, because it is black listed in
recordmcount.c and ignored by ftrace.
Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Li Bin <huawei.libin@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Patch to add macros and contants to support the power9 raw
event encoding format. Couple of functions added since some of the
bits fields like PMCxCOMB and THRESH_CMP has different width and location
within MMCR* in power9.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/perf: update attribute_group data structure
Rename the power_pmu and attribute_group variables that
support PowerISA v2.07. Add a cpu feature flag check to pick
the PowerISA v2.07 format structures to support.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
At the moment the userspace tool is expected to request pinning of
the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction which happens on the very last mmdrop().
This approach has a problem that a MM of the userspace process
may live longer than the userspace process itself as kernel threads
use userspace process MMs which was runnning on a CPU where
the kernel thread was scheduled to. If this happened, the MM remains
referenced until this exact kernel thread wakes up again
and releases the very last reference to the MM, on an idle system this
can take even hours.
This moves preregistered regions tracking from MM to VFIO; insteads of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so each container releases regions which it has pre-registered.
This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However it should not
have any practical effect as the only userspace tool available now
does register memory region once per container anyway.
As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In some situations the userspace memory context may live longer than
the userspace process itself so if we need to do proper memory context
cleanup, we better have tce_container take a reference to mm_struct and
use it later when the process is gone (@current or @current->mm is NULL).
This references mm and stores the pointer in the container; this is done
in a new helper - tce_iommu_mm_set() - when one of the following happens:
- a container is enabled (IOMMU v1);
- a first attempt to pre-register memory is made (IOMMU v2);
- a DMA window is created (IOMMU v2).
The @mm stays referenced till the container is destroyed.
This replaces current->mm with container->mm everywhere except debug
prints.
This adds a check that current->mm is the same as the one stored in
the container to prevent userspace from making changes to a memory
context of other processes.
DMA map/unmap ioctls() do not check for @mm as they already check
for @enabled which is set after tce_iommu_mm_set() is called.
This does not reference a task as multiple threads within the same mm
are allowed to ioctl() to vfio and supposedly they will have same limits
and capabilities and if they do not, we'll just fail with no harm made.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We are going to allow the userspace to configure container in
one memory context and pass container fd to another so
we are postponing memory allocations accounted against
the locked memory limit. One of previous patches took care of
it_userspace.
At the moment we create the default DMA window when the first group is
attached to a container; this is done for the userspace which is not
DDW-aware but familiar with the SPAPR TCE IOMMU v2 in the part of memory
pre-registration - such client expects the default DMA window to exist.
This postpones the default DMA window allocation till one of
the folliwing happens:
1. first map/unmap request arrives;
2. new window is requested;
This adds noop for the case when the userspace requested removal
of the default window which has not been created yet.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
vfio/spapr: Add a helper to create default DMA window
There is already a helper to create a DMA window which does allocate
a table and programs it to the IOMMU group. However
tce_iommu_take_ownership_ddw() did not use it and did these 2 calls
itself to simplify error path.
Since we are going to delay the default window creation till
the default window is accessed/removed or new window is added,
we need a helper to create a default window from all these cases.
This adds tce_iommu_create_default_window(). Since it relies on
a VFIO container to have at least one IOMMU group (for future use),
this changes tce_iommu_attach_group() to add a group to the container
first and then call the new helper.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
vfio/spapr: Postpone allocation of userspace version of TCE table
The iommu_table struct manages a hardware TCE table and a vmalloc'd
table with corresponding userspace addresses. Both are allocated when
the default DMA window is created and this happens when the very first
group is attached to a container.
As we are going to allow the userspace to configure container in one
memory context and pas container fd to another, we have to postpones
such allocations till a container fd is passed to the destination
user process so we would account locked memory limit against the actual
container user constrainsts.
This postpones the it_userspace array allocation till it is used first
time for mapping. The unmapping patch already checks if the array is
allocated.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/iommu: Stop using @current in mm_iommu_xxx
This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current which in some situations may
not have a valid reference to mm.
This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.
This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/iommu: Pass mm_struct to init/cleanup helpers
We are going to get rid of @current references in mmu_context_boos3s64.c
and cache mm_struct in the VFIO container. Since mm_context_t does not
have reference counting, we will be using mm_struct which does have
the reference counter.
This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather
than mm_context_t (which is embedded into mm).
This should not cause any behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Tue, 15 Nov 2016 10:59:38 +0000 (21:59 +1100)]
powerpc/64: Define ILLEGAL_POINTER_VALUE for 64-bit
This is used in poison.h to offset poison values so that they don't
point directly into user space.
The value we choose sits roughly between user and kernel space, which
means on their own the poison values don't point anywhere useful. If an
attacker can cause an access at some offset from the poison value then
we may still be in trouble, but by putting the poison values between
user and kernel space we maximise the required size of that offset.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Balbir Singh [Wed, 30 Nov 2016 06:45:09 +0000 (17:45 +1100)]
powerpc Don't print misleading facility name in facility unavailable exception
The current facility_strings[] are correct when the trap address is
0xf80 (hypervisor facility unavailable). When the trap address is
0xf60 (facility unavailable) IC (Interruption Cause) a.k.a status in the
code is undefined for values 0 and 1.
Add a check to prevent printing the (misleading) facility name for IC 0
and 1 when we came in via 0xf60. In all cases, print the actual IC
value, to avoid any confusion.
This hasn't been seen on real hardware, on only qemu which was
misreporting an exception.
SPU_FS selects MEMORY_HOTPLUG, which is problematic because
MEMORY_HOTPLUG is user selectable, meaning we can end up with a broken
.config where MEMORY_HOTPLUG is enabled but its dependencies are not,
leading to build breakages.
The select of MEMORY_HOTPLUG for SPU_FS was added back in 2006, in
commit 4da30d15b6d5 ("[POWERPC] spufs: fix memory hotplug dependency").
However we reworked the spufs code and removed the dependency on memory
hotplug in 2007 in commit 78bde53e351b ("[POWERPC] spufs: remove need
for struct page for SPEs").
So drop the select as it's no longer needed and causes problems.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Fontenot [Mon, 28 Nov 2016 16:50:45 +0000 (11:50 -0500)]
powerpc/pseries: Use lmb_is_removable() to check removability
We should be using lmb_is_removable() to validate that enough LMBs
are available to remove when doing a remove by count. This will check
that the LMB is owned by the system and it is considered removable.
This patch also adds a pr_info() notification to report the LMB count
to remove was not satisfied.
What we do now is just check that there are enough LMBs owned by the
system when validating there are enough LMBs to remove. This can
lead to situations where there are enough LMBs owned by the system
but not enough that are considered removable. This results in having
to bail out of the remove operation instead of just failing the request
that we should have known wouldn't succeed.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Wed, 30 Nov 2016 08:41:02 +0000 (19:41 +1100)]
powerpc/mm: Fix page table dump build on non-Book3S
In the recent commit 1515ab932156 ("powerpc/mm: Dump hash table") we
added code to dump the hage page table. Currently this can be selected
to build on any platform. However it breaks the build if we're building
for a non-Book3S platform, because none of the hash page table related
defines and so on exist. So restrict it to building only on Book3S.
Similarly in commit 8eb07b187000 ("powerpc/mm: Dump linux pagetables")
we added code to dump the Linux page tables, which uses some constants
which are only defined on Book3S - so guard those with an #ifdef.
Fixes: 1515ab932156 ("powerpc/mm: Dump hash table") Fixes: 8eb07b187000 ("powerpc/mm: Dump linux pagetables") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Geoff Levand [Tue, 29 Nov 2016 18:47:32 +0000 (10:47 -0800)]
powerpc/ps3: Fix system hang with GCC 5 builds
GCC 5 generates different code for this bootwrapper null check that
causes the PS3 to hang very early in its bootup. This check is of
limited value, so just get rid of it.
Cc: stable@vger.kernel.org Signed-off-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Fri, 18 Nov 2016 12:15:42 +0000 (23:15 +1100)]
powerpc/prom: Switch to using structs for ibm_architecture_vec
Now that we've defined structures to describe each of the client
architecture vectors, we can use those to construct the value we pass to
firmware.
This avoids the tricks we previously played with the W() macro, allows
us to properly endian annotate fields, and should help to avoid bugs
introduced by failing to have the correct number of zero pad bytes
between fields.
It also means we can avoid hard coding IBM_ARCH_VEC_NRCORES_OFFSET in
order to update the max_cpus value and instead just set it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Fri, 18 Nov 2016 12:15:41 +0000 (23:15 +1100)]
powerpc/prom: Define structs for client architecture vectors
The "client architecture vectors" are a series of structures we pass to
firmware to define various things, such as what processors we support
and many other options.
Each structure is entirely different so we have to define a different
struct for each one, but that's OK.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Generate a system reset NMI on the threads indicated by target.
Values for target:
-1 = target all online threads including the caller
-2 = target all online threads except for the caller
All other negative values: reserved
Positive values: The thread to be targeted, obtained from the value
of the "ibm,ppc-interrupt-server#s" property of the CPU in the OF
device tree.
Semantics:
- Invalid target: return H_Parameter.
- Otherwise: Generate a system reset NMI on target thread(s),
return H_Success.
This will be used by crash/debug code to get stuck CPUs into a known
state.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds the support code needed for implementing
kexec_file_load() on powerpc.
This consists of functions to load the ELF kernel, either big or little
endian, and setup the purgatory enviroment which switches from the first
kernel to the second kernel.
None of this code is built yet, as it depends on CONFIG_KEXEC_FILE which
we have not yet defined. Although we could define CONFIG_KEXEC_FILE in
this patch, we'd then have a window in history where the kconfig symbol
is present but the syscall is not, which would be awkward.
powerpc: Change places using CONFIG_KEXEC to use CONFIG_KEXEC_CORE instead.
Commit 2965faa5e03d ("kexec: split kexec_load syscall from kexec core
code") introduced CONFIG_KEXEC_CORE so that CONFIG_KEXEC means whether
the kexec_load system call should be compiled-in and CONFIG_KEXEC_FILE
means whether the kexec_file_load system call should be compiled-in.
These options can be set independently from each other.
Since until now powerpc only supported kexec_load, CONFIG_KEXEC and
CONFIG_KEXEC_CORE were synonyms. That is not the case anymore, so we
need to make a distinction. Almost all places where CONFIG_KEXEC was
being used should be using CONFIG_KEXEC_CORE instead, since
kexec_file_load also needs that code compiled in.
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kexec_file: Factor out kexec_locate_mem_hole from kexec_add_buffer.
kexec_locate_mem_hole will be used by the PowerPC kexec_file_load
implementation to find free memory for the purgatory stack.
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kexec_file: Change kexec_add_buffer to take kexec_buf as argument.
This is done to simplify the kexec_add_buffer argument list.
Adapt all callers to set up a kexec_buf to pass to kexec_add_buffer.
In addition, change the type of kexec_buf.buffer from char * to void *.
There is no particular reason for it to be a char *, and the change
allows us to get rid of 3 existing casts to char * in the code.
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kexec_file: Allow arch-specific memory walking for kexec_add_buffer
Allow architectures to specify a different memory walking function for
kexec_add_buffer. x86 uses iomem to track reserved memory ranges, but
PowerPC uses the memblock subsystem.
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Balbir Singh [Wed, 30 Nov 2016 00:35:36 +0000 (11:35 +1100)]
powerpc/mm: Fix no execute fault handling on pre-POWER5
Aneesh/Ben reported that the change to do_page_fault() we made in commit 1d18ad026844 ("powerpc/mm: Detect instruction fetch denied and report")
needs to handle the case where CPU_FTR_COHERENT_ICACHE is missing but we
have CPU_FTR_NOEXECUTE. In those cases the check added for
SRR1_ISI_N_OR_G might trigger a false positive.
This patch adds a check for CPU_FTR_COHERENT_ICACHE in addition to the
MSR value.
Fixes: 1d18ad026844 ("powerpc/mm: Detect instruction fetch denied and report") Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Mon, 21 Nov 2016 10:14:35 +0000 (21:14 +1100)]
powerpc/boot: Fix rebuild when changing kernel endian
Now that we don't set ARCH incorrectly when calling the boot Makefile,
we can use the generic cpp_lds_S rule for converting our zImage.lds.S
into zImage.lds.
The main advantage of using the generic rule is that it correctly uses
if_changed, which means we correctly regenerate the linker script when
switching endian. Fixing that means we are finally able to build one
endian and then rebuild the other endian without requiring to clean
between builds.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Mon, 21 Nov 2016 10:14:33 +0000 (21:14 +1100)]
powerpc: Stop passing ARCH=ppc64 to boot Makefile
Back in 2005 when the ppc/ppc64 merge started, we used to build the
kernel code in arch/powerpc but use the boot code from arch/ppc or
arch/ppc64 depending on whether we were building for 32 or 64-bit.
Originally we called the boot Makefile passing ARCH=$(OLDARCH), where
OLDARCH was ppc or ppc64.
In commit 20f629549b30 ("powerpc: Make building the boot image work for
both 32-bit and 64-bit") (2005-10-11) we split the call for 32/64-bit
using an ifeq check, because the two Makefiles took different targets,
and explicitly passed ARCH=ppc64 for the 64-bit case and ARCH=ppc for
the 32-bit case.
Then in commit 94b212c29f68 ("powerpc: Move ppc64 boot wrapper code over
to arch/powerpc") (2005-11-16) we moved the boot code into arch/powerpc
and dropped the ppc case, but kept passing ARCH=ppc64 to
arch/powerpc/boot/Makefile.
Since then there have been several more boot targets added, all of which
have copied the ARCH=ppc64 setting, such that now we have four targets
using it.
Currently it seems that nothing actually uses the ARCH value, but that's
basically just luck, and in particular it prevents us from using the
generic cpp_lds_S rule. It's also clearly wrong, ARCH=ppc64 is dead,
buried and cremated.
Fix it by dropping the setting of ARCH completely, the correct value is
exported by the top level Makefile.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: update radix__pte_update to not do full mm tlb flush
When we are updating a pte, we just need to flush the tlb mapping
that pte. Right now we do a full mm flush because we don't track page
size. Now that we have page size details in pte use that to do the
optimized flush
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: update radix__ptep_set_access_flag to not do full mm tlb flush
When we are updating a pte, we just need to flush the tlb mapping
that pte. Right now we do a full mm flush because we don't track the page
size. Now that we have page size details in pte use that to do the
optimized flush
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds a new software defined pte bit. We use the reserved
fields of ISA 3.0 pte definition since we will only be using this on DD1
code paths. We can possibly look at removing this code later.
The software bit will be used to differentiate between 64K/4K and 2M
ptes. This helps in finding the page size mapping by a pte so that we
can do efficient tlb flush.
We don't support 1G hugetlb pages yet. So we add a DEBUG WARN_ON to
catch wrong usage.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Wed, 23 Nov 2016 13:02:07 +0000 (00:02 +1100)]
powerpc/64e: Convert cmpi to cmpwi in head_64.S
From 80f23935cadb ("powerpc: Convert cmp to cmpd in idle enter sequence"):
PowerPC's "cmp" instruction has four operands. Normally people write
"cmpw" or "cmpd" for the second cmp operand 0 or 1. But, frequently
people forget, and write "cmp" with just three operands.
With older binutils this is silently accepted as if this was "cmpw",
while often "cmpd" is wanted. With newer binutils GAS will complain
about this for 64-bit code. For 32-bit code it still silently assumes
"cmpw" is what is meant.
In this case, cmpwi is called for, so this is just a build fix for
new toolchains.
Cc: stable@vger.kernel.org # v3.0+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Balbir Singh [Tue, 15 Nov 2016 06:56:16 +0000 (17:56 +1100)]
powerpc/mm/radix: Prevent kernel execution of user space
ISA 3 defines new encoded access authority that allows instruction
access prevention in privileged mode and allows normal access
to problem state. This patch just enables IAMR (Instruction Authority
Mask Register), enabling AMR would require more work.
I've tested this with a buggy driver and a simple payload. The payload
is specific to the build I've tested.
mpe: Also tested with LKDTM:
# echo EXEC_USERSPACE > /sys/kernel/debug/provoke-crash/DIRECT
lkdtm: Performing direct entry EXEC_USERSPACE
lkdtm: attempting ok execution at c0000000005bf560
lkdtm: attempting bad execution at 00003fff8d940000
Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x3fff8d940000
Oops: Kernel access of bad area, sig: 11 [#1]
NIP: 00003fff8d940000 LR: c0000000005bfa58 CTR: 00003fff8d940000
REGS: c0000000f1fcf900 TRAP: 0400 Not tainted (4.9.0-rc5-compiler_gcc-6.2.0-00109-g956dbc06232a)
MSR: 9000000010009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 48002222 XER: 00000000
...
Call Trace:
lkdtm_EXEC_USERSPACE+0x104/0x120 (unreliable)
lkdtm_do_action+0x3c/0x80
direct_entry+0x100/0x1b0
full_proxy_write+0x94/0x100
__vfs_write+0x3c/0x1b0
vfs_write+0xcc/0x230
SyS_write+0x60/0x110
system_call+0x38/0xfc
Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Balbir Singh [Tue, 15 Nov 2016 06:56:15 +0000 (17:56 +1100)]
powerpc/mm: Detect instruction fetch denied and report
ISA 3 allows for prevention of instruction fetch and execution
of user mode pages. If such an error occurs, SRR1 bit 35 reports the
error. We catch and report the error in do_page_fault().
Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powernv: Clear SPRN_PSSCR when a POWER9 CPU comes online
Ensure that PSSCR is set to a safe value corresponding to no
state-loss each time a POWER9 CPU comes online.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Acked-By: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/xmon: Add 'dt' command to dump trace buffers
There is a nice interface for asking ftrace to dump all its tracing
buffers. The only down side for use in xmon is that it uses printk.
Depending on circumstances printk may not work when in xmon, but it also
may, so add a 'dt' command which dumps the ftrace buffers, and add a
note to the help to mentiont that it uses printk.
Calling this routine also disables tracing, which is problematic if you
return from xmon and expect the system to keep operating normally. So
after we do the dump turn tracing back on.
Both functions already have nop versions defined for when ftrace is not
enabled, so we don't need any extra #ifdefs.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Claudiu Manoil [Wed, 16 Nov 2016 14:40:30 +0000 (16:40 +0200)]
soc/qman: Handle endianness of h/w descriptors
The hardware descriptors have big endian (BE) format.
Provide proper endianness handling for the remaining
descriptor fields, to ensure they are correctly
accessed by non-BE CPUs too.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Scott Wood <oss@buserror.net>
Andrew Donnellan [Tue, 22 Nov 2016 10:13:27 +0000 (21:13 +1100)]
cxl: Fix coccinelle warnings
Fix the following coccinelle warnings:
drivers/misc/cxl/debugfs.c:46:0-23: WARNING: fops_io_x64 should be
defined with DEFINE_DEBUGFS_ATTRIBUTE
drivers/misc/cxl/guest.c:890:5-26: WARNING: Comparison to bool
drivers/misc/cxl/irq.c:107:3-23: WARNING: Assignment of bool to 0/1
drivers/misc/cxl/native.c:57:2-3: Unneeded semicolon
drivers/misc/cxl/native.c:170:2-3: Unneeded semicolon
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Tue, 22 Nov 2016 10:49:32 +0000 (11:49 +0100)]
powerpc/32: Change the stack protector canary value per task
Partially copied from commit df0698be14c66 ("ARM: stack protector:
change the canary value per task")
A new random value for the canary is stored in the task struct whenever
a new task is forked. This is meant to allow for different canary values
per task. On powerpc, GCC expects the canary value to be found in a global
variable called __stack_chk_guard. So this variable has to be updated
with the value stored in the task struct whenever a task switch occurs.
Because the variable GCC expects is global, this cannot work on SMP
unfortunately. So, on SMP, the same initial canary value is kept
throughout, making this feature a bit less effective although it is still
useful.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is the very basic stuff without the changing canary upon
task switch yet. Just the Kconfig option and a constant canary
value initialized at boot time.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/pseries/ibmebus: Remove legacy suspend/resume support
There are no ibmebus driver that make use of legacy suspend/resume. This
patch removes the support for it from ibmebus framework, new ibmebus
driver (as unlikely as they are) wanting to use suspend/resume should
use dev_pm_ops.
Since there aren't any special bus specific things to do during
suspend/resume and since the PM core will automatically fallback
directly to using the device's PM ops if no bus PM ops are specified
there is no need to have any special ibmebus PM ops at all.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Naveen N. Rao [Mon, 21 Nov 2016 17:06:41 +0000 (22:36 +0530)]
powerpc/kprobes: Invoke handlers directly
Invoke the kprobe handlers directly rather than through notify_die(), to
reduce path taken for handling kprobes. Similar to commit 6f6343f53d13
("kprobes/x86: Call exception handlers directly from do_int3/do_debug").
While at it, rename post_kprobe_handler() to kprobe_post_handler() for
more uniform naming.
Reported-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Naveen N. Rao [Mon, 21 Nov 2016 17:06:40 +0000 (22:36 +0530)]
powerpc: Remove extraneous header from asm-prototypes.h
Commit 03465f899bda ("powerpc: Use kprobe blacklist for exception
handlers") removed __kprobes annotation from some of the prototypes,
but left the kprobes header include directive unchanged. Remove it as it
is no longer needed.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Claudiu Manoil [Wed, 16 Nov 2016 14:40:28 +0000 (16:40 +0200)]
soc/qman: Change remaining contextB into context_b
There are multiple occurences of both contextB and context_b
in different h/w descriptors, referring to the same descriptor
field known as "Context B". Stick with the "context_b" naming,
for obvious reasons including consistency (see also context_a).
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Scott Wood <oss@buserror.net>
Claudiu Manoil [Wed, 16 Nov 2016 14:40:25 +0000 (16:40 +0200)]
soc/qman: Fix accesses to fqid, cleanup
Preventively mask every access to the 'fqid' h/w field,
since it is defined as a 24-bit field, for every h/w
descriptor. Add generic accessors for this field to
ensure correct access.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Scott Wood <oss@buserror.net>
Claudiu Manoil [Wed, 16 Nov 2016 14:40:24 +0000 (16:40 +0200)]
soc/qman: Remove unused struct qm_mcc* layouts
1. qm_mcc_querywq layout not used for now, so drop it;
2. queryfq, queryfq_np and alterfq are used only for accesses to
the 'fqid' field, so replace these with a generic 'fq' layout.
As a consequence, 'querycgr' turns into 'cgr' following the
same reasoning above and for consistent naming.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Scott Wood <oss@buserror.net>
Claudiu Manoil [Wed, 16 Nov 2016 14:40:22 +0000 (16:40 +0200)]
soc/qman: test: Don't use dummy platform device for dma mapping
Replace dummy platform device hack with a reference to a portal's
platform device, in order to dma map the test frame for this
small unit test. The 2 qman symbols need to be exported because
this self test is a kernel module.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Scott Wood <oss@buserror.net>
Claudiu Manoil [Wed, 16 Nov 2016 14:40:21 +0000 (16:40 +0200)]
soc/qman: Don't add a new platform device for dma mapping
The qman portals are platform devices themselves, so they should
handle dma mappings. Creating a dummy platform device in order to
support dma mapping operations is not justified (and not portable).
Instead, do the mapping against the first portal that has been
initialised.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Scott Wood <oss@buserror.net>
Claudiu Manoil [Wed, 16 Nov 2016 14:40:20 +0000 (16:40 +0200)]
soc/qman: test: Fix implementation of fd_cmp()
This function must only return the truth value of whether
two frame descriptors are different or not.
It does NOT have to compute some obscure difference between
fd fields and return it as an int, making sparse complain
about type conversions in the process.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Scott Wood <oss@buserror.net>
qman_query_fq*() may return other error codes apart from
-ERANGE, in which cases the error handling done by the
resource cleanup callers would be wrong. The patch
fixes the handling of those cases, and cleans up related
code inside the resource cleanup & release handlers (i.e.
replace hardcoded fqid value with corresponding define).
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Scott Wood <oss@buserror.net>
Heiner Kallweit [Sat, 29 Oct 2016 14:29:04 +0000 (16:29 +0200)]
powerpc/fsl_soc: improve and simplify get_baudrate
Use of_property_read_u32 instead of the generic of_get_property to
simplify the code. In addition move the declaration of fs_baudrate
into get_baudrate because it's private to this function.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Scott Wood <oss@buserror.net>
Heiner Kallweit [Sat, 29 Oct 2016 14:28:25 +0000 (16:28 +0200)]
powerpc/fsl_soc: improve and simplify get_brgfreq
Use of_property_read_u32 instead of the generic of_get_property to
simplify the code. In addition move the declaration of brgfreq
into get_brgfreq because it's private to this function.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
[scottwood: minor whitespace fixes] Signed-off-by: Scott Wood <oss@buserror.net>
Heiner Kallweit [Sat, 29 Oct 2016 14:27:14 +0000 (16:27 +0200)]
powerpc/fsl_soc: improve and simplify fsl_get_sys_freq
Use of_property_read_u32 instead of the generic of_get_property to
simplify the code. In addition move the declaration of sysfreq
into fsl_get_sys_freq because it's private to this function.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Scott Wood <oss@buserror.net>
David Engraf [Tue, 15 Nov 2016 08:24:27 +0000 (09:24 +0100)]
powerpc/85xx/qemu: Enable CONFIG_E500 and CONFIG_PPC_E500MC
The QEMU e500 board needs to enable CONFIG_E500 to correctly boot. QEMU
for ppc64 uses e5500/e6500 emulation, thus CONFIG_PPC_E500MC is required
as well.
Signed-off-by: David Engraf <david.engraf@sysgo.com> Signed-off-by: Scott Wood <oss@buserror.net>
Michael Ellerman [Tue, 22 Nov 2016 03:50:39 +0000 (14:50 +1100)]
powerpc/reg: Add definition for LPCR_PECE_HVEE
ISA 3.0 defines a new PECE (Power-saving mode Exit Cause Enable) field
in the LPCR (Logical Partitioning Control Register), called
LPCR_PECE_HVEE (Hypervisor Virtualization Exit Enable).
KVM code will need to know about this bit, so add a definition for it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>