Gavin says: I missed the fact that it affects the PCI passthrou path as
reported by Alexey: When passing GPU (0003:01:00.0) which seats behind
the root port, the reset request is routed to skiboot in original code.
In skiboot, the link bouncing events are masked during the reset. So we
don't see EEH (freeze all) error even link bouncing happens. With the
changes included, the reset is done by kernel and the link bouncing
events aren't masked by altering content of PHB3 (or P7IOC) specific
hardware registers which are invisible to kernel (skiboot hides the
hardware specific). It means the link bouncing is seen by the root port
and it causes a EEH (freeze all) error. The PCI passthrough on GPU
device cannot work.
IBM POWER8 NVlink systems come with Tesla K40-ish GPUs each of which
also has a couple of fast speed links (NVLink). The interface to links
is exposed as an emulated PCI bridge which is included into the same
IOMMU group as the corresponding GPU.
In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a PE
which behave pretty much as the standard IODA2 PHB except NPU PHB has
just a single TVE in the hardware which means it can have either
32bit window or 64bit window or DMA bypass but never two of these.
In order to make these links work when GPU is passed to the guest,
these bridges need to be passed as well; otherwise performance will
degrade.
This implements and exports API to manage NPU state in regard to VFIO;
it replicates iommu_table_group_ops.
This defines a new pnv_pci_ioda2_npu_ops which is assigned to
the IODA2 bridge if there are NPUs for a GPU on the bridge.
The new callbacks call the default IODA2 callbacks plus new NPU API.
This adds a gpe_table_group_to_npe() helper to find NPU PE for the IODA2
table_group, it is not expected to fail as the helper is only called
from the pnv_pci_ioda2_npu_ops.
This does not define NPU-specific .release_ownership() so after
VFIO is finished, DMA on NPU is disabled which is ok as the nvidia
driver sets DMA mask when probing which enable 32 or 64bit DMA on NPU.
This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to
the GPU group if any found. The helper uses helpers to look for
the "ibm,gpu" property in the device tree which is a phandle of
the corresponding GPU.
This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main
loop skips NPU PEs as they do not have 32bit DMA segments.
As pnv_npu_set_window() and pnv_npu_unset_window() are started being used
by the new IODA2-NPU IOMMU group, this makes the helpers public and
adds the DMA window number parameter.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-By: Alistair Popple <alistair@popple.id.au>
[mpe: Add pnv_pci_ioda_setup_iommu_api() to fix build with IOMMU_API=n] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The pnv_ioda_pe struct keeps an array of peers. At the moment it is only
used to link GPU and NPU for 2 purposes:
1. Access NPU quickly when configuring DMA for GPU - this was addressed
in the previos patch by removing use of it as DMA setup is not what
the kernel would constantly do.
2. Invalidate TCE cache for NPU when it is invalidated for GPU.
GPU and NPU are in different PE. There is already a mechanism to
attach multiple iommu_table_group to the same iommu_table (used for VFIO),
we can reuse it here so does this patch.
This gets rid of peers[] array and PNV_IODA_PE_PEER flag as they are
not needed anymore.
While we are here, add TCE cache invalidation after enabling bypass.
The upcoming NVLink passthrough support will require NPU code to cope
with two DMA windows.
This adds a pnv_npu_set_window() helper which programs 32bit window to
the hardware. This also adds multilevel TCE support.
This adds a pnv_npu_unset_window() helper which removes the DMA window
from the hardware. This does not make difference now as the caller -
pnv_npu_dma_set_bypass() - enables bypass in the hardware but the next
patch will use it to manage TCE table lists for TCE Kill handling.
NPU devices are emulated in firmware and mainly used for NPU NVLink
training; one NPU device is per a hardware link. Their DMA/TCE setup
must match the GPU which is connected via PCIe and NVLink so any changes
to the DMA/TCE setup on the GPU PCIe device need to be propagated to
the NVLink device as this is what device drivers expect and it doesn't
make much sense to do anything else.
This makes NPU DMA setup explicit.
pnv_npu_ioda_controller_ops::pnv_npu_dma_set_mask is moved to pci-ioda,
made static and prints warning as dma_set_mask() should never be called
on this function as in any case it will not configure GPU; so we make
this explicit.
Instead of using PNV_IODA_PE_PEER and peers[] (which the next patch will
remove), we test every PCI device if there are corresponding NVLink
devices. If there are any, we propagate bypass mode to just found NPU
devices by calling the setup helper directly (which takes @bypass) and
avoid guessing (i.e. calculating from DMA mask) whether we need bypass
or not on NPU devices. Since DMA setup happens in very rare occasion,
this will not slow down booting or VFIO start/stop much.
This renames pnv_npu_disable_bypass to pnv_npu_dma_set_32 to make it
more clear what the function really does which is programming 32bit
table address to the TVT ("disabling bypass" means writing zeroes to
the TVT).
This removes pnv_npu_dma_set_bypass() from pnv_npu_ioda_fixup() as
the DMA configuration on NPU does not matter until dma_set_mask() is
called on GPU and that will do the NPU DMA configuration.
This removes phb->dma_dev_setup initialization for NPU as
pnv_pci_ioda_dma_dev_setup is no-op for it anyway.
This stops using npe->tce_bypass_base as it never changes and values
other than zero are not supported.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
NPU PHB TCE Kill register is exactly the same as in the rest of POWER8
so let's reuse the existing code for NPU. The only bit missing is
a helper to reset the entire TCE cache so this moves such a helper
from NPU code and renames it.
Since pnv_npu_tce_invalidate() does really invalidate the entire cache,
this uses pnv_pci_ioda2_tce_invalidate_entire() directly for NPU.
This adds an explicit comment for workaround for invalidating NPU TCE
cache.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As in fact pnv_pci_ioda2_tce_invalidate_entire() invalidates TCEs for
the specific PE rather than the entire cache, rename it to
pnv_pci_ioda2_tce_invalidate_pe(). In later patches we will add
a proper pnv_pci_ioda2_tce_invalidate_entire().
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We are going to have multiple different types of PHB on the same system
with POWER8 + NVLink and PHBs will have different IOMMU ops. However
we only really care about one callback - create_table - so we can
relax the compatibility check here.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Gavin Shan [Tue, 3 May 2016 05:41:45 +0000 (15:41 +1000)]
powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()
The function pnv_pci_reset_secondary_bus() is called like below.
It's impossible for call the function on root bus. So it's safe
to remove the root bus case in the function. No functional changes
introduced.
Gavin Shan [Tue, 3 May 2016 05:41:44 +0000 (15:41 +1000)]
powerpc/powernv: Simplify pnv_eeh_reset()
This drops unnecessary nested if statements in pnv_eeh_reset() to
improve the code readability. After the changes, the unused local
variable "ret" is dropped as well. No logical changes introduced.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Gavin Shan [Tue, 3 May 2016 05:41:43 +0000 (15:41 +1000)]
powerpc/pci: Don't scan empty slot
In hotplug case, function pci_add_pci_devices() is called to rescan
the specified PCI bus, which might not have any child devices. Access
to the PCI bus's child device node will cause kernel crash without
exception.
This adds one more check to skip scanning PCI bus that doesn't have
any subordinate devices from device-tree, in order to avoid kernel
crash.
Gavin Shan [Tue, 3 May 2016 05:41:42 +0000 (15:41 +1000)]
powerpc/pci: Export pci_traverse_device_nodes()
This renames traverse_pci_devices() to pci_traverse_device_nodes().
The function traverses all subordinate device nodes of the specified
one. Also, below cleanup applied to the function. No logical changes
introduced.
* Rename "pre" to "fn".
* Avoid assignment in if condition reported from checkpatch.pl.
This implements and exports pci_remove_device_node_info(). It's
used to remove the pdn (struct pci_dn) for the indicated device
node. The function is going to be used by PowerNV PCI hotplug
driver.
Gavin Shan [Tue, 3 May 2016 05:41:40 +0000 (15:41 +1000)]
powerpc/pci: Export pci_add_device_node_info()
This renames update_dn_pci_info() to pci_add_device_node_info()
with corresponding adjustment on the parameter type and exports it.
The function is used to create pdn (struct pci_dn) for the indicated
device node. Another function add_pdn(), almost wrapper of
pci_add_device_node_info(), to be used in traverse_pci_devices(). No
logical changes introduced.
Gavin Shan [Tue, 3 May 2016 05:41:39 +0000 (15:41 +1000)]
powerpc/pci: Move pci_find_bus_by_node() around
This moves pci_find_bus_by_node() from arch/powerpc/platforms/
pseries/pci_dlpar.c to arch/powerpc/kernel/pci-hotplug.c so that
the function can be used by pSeries and PowerNV platform at the
same time. Also, below cleanup applied. No functional changes
introduced.
* Remove variable "busdn" in find_bus_among_children()
* Use PCI_DN() to convert device node to pci_dn
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Gavin Shan [Tue, 3 May 2016 05:41:38 +0000 (15:41 +1000)]
powerpc/pci: Rename pcibios_find_pci_bus()
This renames pcibios_find_pci_bus() to pci_find_bus_by_node() to
avoid conflicts with those PCI subsystem weak function names, which
have prefix "pcibios". No logical changes introduced.
This renames pcibios_{add,remove}_pci_devices() to avoid conflicts
with names of the weak functions in PCI subsystem, which have the
prefix "pcibios". No logical changes introduced.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-By: Alistair Popple <alistair@popple.id.au> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Gavin Shan [Tue, 3 May 2016 05:41:36 +0000 (15:41 +1000)]
powerpc/powernv: Use PE instead of number during setup and release
In current implementation, the PEs that are allocated or picked
from the reserved list are identified by PE number. The PE instance
has to be picked according to the PE number eventually. We have
same issue when PE is released.
For pnv_ioda_pick_m64_pe() and pnv_ioda_alloc_pe(), this returns
PE instance so that pnv_ioda_setup_bus_PE() can use the allocated
or reserved PE instance directly. Also, pnv_ioda_setup_bus_PE()
returns the reserved/allocated PE instance to be used in subsequent
patches. On the other hand, pnv_ioda_free_pe() uses PE instance
(not number) as its argument. No logical changes introduced.
In current implementation, the DMA32 segments required by one specific
PE isn't calculated with the information hold in the PE independently.
It conflicts with the PCI hotplug design: PE centralized, meaning the
PE's DMA32 segments should be calculated from the information hold in
the PE independently.
This introduces an array (@dma32_segmap) for every PHB to track the
DMA32 segmeng usage. Besides, this moves the logic calculating PE's
consumed DMA32 segments to pnv_pci_ioda1_setup_dma_pe() so that PE's
DMA32 segments are calculated/allocated from the information hold in
the PE (DMA32 weight). Also the logic is improved: we try to allocate
as much DMA32 segments as we can. It's acceptable that number of DMA32
segments less than the expected number are allocated.
Gavin Shan [Tue, 3 May 2016 05:41:34 +0000 (15:41 +1000)]
powerpc/powernv: Remove DMA32 PE list
PEs are put into PHB DMA32 list (phb->ioda.pe_dma_list) according
to their DMA32 weight. The PEs on the list are iterated to setup
their TCE32 tables at system booting time. The list is used for
once at boot time and no need to keep it.
This moves the logic calculating DMA32 weight of PHB and PE to
pnv_ioda_setup_dma() to drop PHB's DMA32 list. Also, every PE
traces the consumed DMA32 segment by @tce32_seg and @tce32_segcount
are useless and they're removed.
Currently, there is one macro (TCE32_TABLE_SIZE) representing the
TCE table size for one DMA32 segment. The constant representing
the DMA32 segment size (1 << 28) is still used in the code.
This defines PNV_IODA1_DMA32_SEGSIZE representing one DMA32
segment size. the TCE table size can be calcualted when the page
has fixed 4KB size. So all the related calculation depends on one
macro (PNV_IODA1_DMA32_SEGSIZE). No logical changes introduced.
This renames pnv_pci_ioda_setup_dma_pe() to pnv_pci_ioda1_setup_dma_pe()
as it's the counter-part of IODA2's pnv_pci_ioda2_setup_dma_pe().
No logical changes introduced.
Gavin Shan [Thu, 5 May 2016 02:02:13 +0000 (12:02 +1000)]
powerpc/powernv/ioda1: M64 support on P7IOC
This enables M64 window on P7IOC, which has been enabled on PHB3.
Different from PHB3 where 16 M64 BARs are supported and each of
them can be owned by one particular PE# exclusively or divided
evenly to 256 segments, every P7IOC PHB has 16 M64 BARs and each
of them are divided to 8 segments. So every P7IOC PHB supports
128 M64 segments in total. P7IOC has M64DT, which helps mapping
one particular M64 segment# to arbitrary PE#. PHB3 doesn't have
M64DT, indicating that one M64 segment can only be pinned to the
fixed PE#.
In order to unified M64 support M64 on P7IOC and PHB3, we just
provide 128 M64 segments on every P7IOC PHB and each of them is
pinned to the fixed PE# by bypassing the function of M64DT. In
turn, we just need different phb->init_m64() for P7IOC and PHB3
and maps M64 segment in pnv_ioda_reserve_m64_pe() for P7IOC, most
of the code are shared by them.
Gavin Shan [Tue, 3 May 2016 05:41:30 +0000 (15:41 +1000)]
powerpc/powernv: Rename M64 related functions
This renames those functions picking PE number based on consumed
M64 segments, mapping M64 segments to PEs as those functions are
going to be shared by IODA1/IODA2 in next patch. No logical changes
introduced.
Gavin Shan [Tue, 3 May 2016 05:41:29 +0000 (15:41 +1000)]
powerpc/powernv: Track M64 segment consumption
When unplugging PCI devices, their parent PEs might be offline.
The consumed M64 resource by the PEs should be released at that
time. As we track M32 segment consumption, this introduces an
array to the PHB to track the mapping between M64 segment and
PE number.
Note: M64 mapping isn't covered by pnv_ioda_setup_pe_seg() as
IODA2 doesn't support the mapping explicitly while it's supported
on IODA1. Until now, no M64 is supported on IODA1 in software.
Gavin Shan [Tue, 3 May 2016 05:41:28 +0000 (15:41 +1000)]
powerpc/powernv: IO and M32 mapping based on PCI device resources
Currently, the IO and M32 segments are mapped to the corresponding
PE based on the windows of the parent bridge of PE's primary bus.
It's not going to work when the windows of root port or upstream
port of the PCIe switch behind root port are extended to PHB's
apertures in order to support hotplug in subsequent patch.
This fixes the issue by mapping IO and M32 segments based on the
resources of the PCI devices included in the PE, instead of the
windows of the parent bridge of the PE's primary bus.
Gavin Shan [Tue, 3 May 2016 05:41:27 +0000 (15:41 +1000)]
powerpc/powernv: Simplify pnv_ioda_setup_pe_seg()
pnv_ioda_setup_pe_seg() associates the IO and M32 segments with the
owner PE. The code mapping segments should be fixed and immune from
logic changes introduced to pnv_ioda_setup_pe_seg().
This moves the code mapping segments to helper pnv_ioda_setup_pe_res().
The data type for @rc is changed to "int64_t". Also, argument @hose is
removed from pnv_ioda_setup_pe() as it can be got from @pe. No functional
changes introduced.
Gavin Shan [Tue, 3 May 2016 05:41:26 +0000 (15:41 +1000)]
powerpc/powernv: Fix initial IO and M32 segmap
There are two arrays for IO and M32 segment maps on every PHB.
The index of the arrays are segment number and the value stored
in the corresponding element is PE number, indicating the segment
is assigned to the PE. Initially, all elements in those two arrays
are zeroes, meaning all segments are assigned to PE#0. It's wrong.
This fixes the initial values in the elements of those two arrays
to IODA_INVALID_PE, meaning all segments aren't assigned to any
PE.
Gavin Shan [Tue, 3 May 2016 05:41:25 +0000 (15:41 +1000)]
powerpc/powernv: Data type unsigned int for PE number
This changes the data type of PE number from "int" to "unsigned int"
in order to match the fact PE number is never negative:
* The number of PE to which the specified PCI device is attached.
* The PE number map for SRIOV VFs.
* The returned PE number from pnv_ioda_alloc_pe().
* The returned PE number from pnv_ioda2_pick_m64_pe().
Gavin Shan [Tue, 3 May 2016 05:41:24 +0000 (15:41 +1000)]
powerpc/powernv: Rename PE# fields in struct pnv_phb
This renames the fields related to PE number in "struct pnv_phb"
for better reflecting of their usages as Alexey suggested. No
logical changes introduced.
Gavin Shan [Tue, 3 May 2016 05:41:21 +0000 (15:41 +1000)]
powerpc/powernv: Cleanup on pci_controller_ops instances
This cleans up on below data struct instances to use tab instead of
space indent of statement to avoid complains from scripts/checkpatch.pl.
No logical changes introduced.
Gavin Shan [Tue, 3 May 2016 05:41:20 +0000 (15:41 +1000)]
powerpc/pci: Cleanup on struct pci_controller_ops
Each PHB has one instance of "struct pci_controller_ops" that includes
various callbacks called by PCI subsystem. In the definition of this
struct, some callbacks have explicit names for its arguments, but the
left don't have.
This adds all explicit names of the arguments to the callbacks in
"struct pci_controller_ops" so that the code looks consistent. Also,
argument name @dev is replaced by @pdev as the later one is the
preferred name for PCI device.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Daniel Axtens <dja@axtens.net> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rashmica Gupta [Wed, 23 Dec 2015 05:49:54 +0000 (16:49 +1100)]
selftests/powerpc: Add test to check if TM SPRs are corrupted
Testing that the TM SPRs are behaving the way they should. Uses more
threads than cpus to see if the following register values persist with
context switching:
- the FS (failure summary) flag in TEXASR
- TFIAR and TFHAR
Signed-off-by: Rashmica Gupta <rashmicy@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rashmica Gupta [Wed, 23 Dec 2015 05:49:53 +0000 (16:49 +1100)]
selftests/powerpc: Add TM test to check if TAR is corrupted
If the transaction is aborted, the TAR should be rolled back to the
checkpointed value before the transaction began. The value written to the
TAR when the transaction is suspended should only remain there if the
transaction completes successfully.
Signed-off-by: Rashmica Gupta <rashmicy@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rashmica Gupta [Wed, 23 Dec 2015 05:49:50 +0000 (16:49 +1100)]
selftests/powerpc: Make reg.h common to all powerpc selftests
Currently there is a reg.h in pmu/ebb that has defines that are useful
in other powerpc selftests so move this up into selftests/powerpc
folder. Also include in utils.h - as this is often used in self tests.
Add in some other useful register defines.
Signed-off-by: Rashmica Gupta <rashmicy@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/powernv: Rename machine_check_pSeries_early() to powernv
The routine machine_check_pSeries_early() is only used on powernv, not
pseries. Hence rename machine_check_pSeries_early() to
machine_check_powernv_early().
Reported-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
cxl: Check periodically the coherent platform function's state
In the PowerVM environment, the PHYP CoherentAccel component manages
the state of the Coherent Accelerator Processor Interface adapter and
virtualizes CAPI resources, handles CAPP, PSL, PSL Slice errors - and
interrupts - and provides a new set of hcalls for the OS APIs to utilize
Accelerator Function Unit (AFU).
During the course of operation, a coherent platform function can
encounter errors. Some possible reason for errors are:
• Hardware recoverable and unrecoverable errors
• Transient and over-threshold correctable errors
PHYP implements its own state model for the coherent platform function.
The state of the AFU is available through a hcall.
The current implementation of the cxl driver, for the PowerVM
environment, checks this state of the AFU only when an action is
requested - open a device, ioctl command, memory map, attach/detach a
process - from an external driver - cxlflash, libcxl. If an error is
detected the cxl driver handles the error according the content of the
Power Architecture Platform Requirements document.
But in case of low-level troubles (or error injection), the PHYP
component may reset the card and change the AFU state. The PHYP
interface doesn't provide any way to be notified when that happens thus
implies that the cxl driver:
• cannot handle immediatly the state change of the AFU.
• cannot notify other drivers (cxlflash, ...)
The purpose of this patch is to wake up the cpu periodically to check
the current state of each AFU and to see if we need to enter an error
recovery path.
Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Fri, 6 May 2016 07:46:36 +0000 (17:46 +1000)]
cxl: Add kernel API to allow a context to operate with relocate disabled
cxl devices typically access memory using an MMU in much the same way as
the CPU, and each context includes a state register much like the MSR in
the CPU. Like the CPU, the state register includes a bit to enable
relocation, which we currently always enable.
In some cases, it may be desirable to allow a device to access memory
using real addresses instead of effective addresses, so this adds a new
API, cxl_set_translation_mode, that can be used to disable relocation
on a given kernel context. This can allow for the creation of a special
privileged context that the device can use if it needs relocation
disabled, and can use regular contexts at times when it needs relocation
enabled.
This interface is only available to users of the kernel API for obvious
reasons, and will never be supported in a virtualised environment.
This will be used by the upcoming cxl support in the mlx5 driver.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Wed, 4 May 2016 04:52:58 +0000 (14:52 +1000)]
cxl: Ensure PSL interrupt is configured for contexts with no AFU IRQs
In the cxl kernel API, it is possible to create a context and start it
without allocating any interrupts. Since we assign or allocate the PSL
interrupt when allocating AFU interrupts this will lead to a situation
where we start the context with no means to take any faults.
The user API is not affected as it always goes through the cxl interrupt
allocation code paths and will have the PSL interrupt allocated or
assigned, even if no AFU interrupts were requested.
This checks that at least one interrupt is configured at the time of
attach, and if not it will assign the multiplexed PSL interrupt for
powernv, or allocate a single interrupt for PowerVM.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Wed, 4 May 2016 04:48:32 +0000 (14:48 +1000)]
cxl: Remove duplicate #defines
These defines are not used, but other equivalent definitions
(CXL_SPA_SW_CMD_*) are used. Remove the unused defines.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Wed, 4 May 2016 04:46:30 +0000 (14:46 +1000)]
cxl: Handle num_of_processes larger than can fit in the SPA
num_of_process is a 16 bit field, theoretically allowing an AFU to
support 16K processes, however the scheduled process area currently has
a maximum size of 1MB, which limits the maximum number of processes to
7704.
Some AFUs may not necessarily care what the limit is and just want to be
able to use the maximum by setting the field to 16K. To allow these to
work, detect this situation and use the maximum size for the SPA.
Downgrade the WARN_ON to a dev_warn.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Gavin Shan [Fri, 26 Feb 2016 00:26:26 +0000 (11:26 +1100)]
powerpc/mm: Improve readability of update_mmu_cache()
The function is used to update the MMU with software PTE. It can
be called by data access exception handler (0x300) or instruction
access exception handler (0x400). If the function is called by
0x400 handler, the local variable @access is set to _PAGE_EXEC
to indicate the software PTE should have that flag set. When the
function is called by 0x300 handler, @access is set to zero.
This improves the readability of the function by replacing if
statements with switch. No logical changes introduced.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The zone that contains the top of memory will be either ZONE_NORMAL
or ZONE_HIGHMEM depending on the kernel config. There are two functions
that require this information and both of them use an #ifdef to set
a local variable (top_zone). This is a little silly so lets just make it
a constant.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Cc: linux-mm@kvack.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is a switch fallthough in instr_analyze() which can cause an
invalid instruction to be emulated as a different, valid, instruction.
The rld* (opcode 30) case extracts a sub-opcode from bits 3:1 of the
instruction word. However, the only valid values of this field are 001
and 000. These cases are correctly handled, but the others are not which
causes execution to fall through into case 31.
Breaking out of the switch causes the instruction to be marked as
unknown and allows the caller to deal with the invalid instruction in a
manner consistent with other invalid instructions.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit be96f63375a1 ("powerpc: Split out instruction analysis part of
emulate_step()") introduced ldarx and stdcx into the instructions in
sstep.c, which are not accepted by the assembler on powerpcspe, but does
seem to be accepted by the normal powerpc assembler even in 32 bit mode.
Wrap these two instructions in a __powerpc64__ check like it is
everywhere else in the file.
Fixes: be96f63375a1 ("powerpc: Split out instruction analysis part of emulate_step()") Signed-off-by: Len Sorensen <lsorense@csclub.uwaterloo.ca> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Wed, 13 Apr 2016 11:31:24 +0000 (21:31 +1000)]
powerpc/xmon: Fix SPR read/write commands and add command to dump SPRs
xmon has commands for reading and writing SPRs, but they don't work
currently for several reasons. They attempt to synthesize a small
function containing an mfspr or mtspr instruction and call it. However,
the instructions are on the stack, which is usually not executable.
Also, for 64-bit we set up a procedure descriptor, which is fine for the
big-endian ABIv1, but not correct for ABIv2. Finally, the code uses the
infrastructure for catching memory errors, but that only catches data
storage interrupts and machine check interrupts, but a failed
mfspr/mtspr can generate a program interrupt or a hypervisor emulation
assist interrupt, or be a no-op.
Instead of trying to synthesize a function on the fly, this adds two new
functions, xmon_mfspr() and xmon_mtspr(), which take an SPR number as an
argument and read or write the SPR. Because there is no Power ISA
instruction which takes an SPR number in a register, we have to generate
one of each possible mfspr and mtspr instruction, for all 1024 possible
SPRs. Thus we get just over 8k bytes of code for each of xmon_mfspr()
and xmon_mtspr(). However, this 16kB of code pales in comparison to the
> 130kB of PPC opcode tables used by the xmon disassembler.
To catch interrupts caused by the mfspr/mtspr instructions, we add a new
'catch_spr_faults' flag. If an interrupt occurs while it is set, we come
back into xmon() via program_check_interrupt(), _exception() and die(),
see that catch_spr_faults is set and do a longjmp to bus_error_jmp, back
into read_spr() or write_spr().
This adds a couple of other nice features: first, a "Sa" command that
attempts to read and print out the value of all 1024 SPRs. If any mfspr
instruction acts as a no-op, then the SPR is not implemented and not
printed.
Secondly, the Sr and Sw commands detect when an SPR is not
implemented (i.e. mfspr is a no-op) and print a message to that effect
rather than printing a bogus value.
Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On some architectures (powerpc in particular), the number of registers
exceeds what can be represented in an integer bitmask. Ensure we
generate the proper bitmask on such platforms.
Fixes: 71ad0f5e4 ("perf tools: Support for DWARF CFI unwinding on post processing") Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With perf regs support enabled for powerpc, in commit ed4a4ef85cf5
("powerpc/perf: Add support for sampling interrupt register state"),
the support for obtaining perf user stack dump is already enabled. This
patch declares the support for same and also updates documentation to
mark the support for perf-regs and perf-stackdump.
Signed-off-by: Chandan Kumar <chandan.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/hash64: Fix subpage protection with 4K HPTE config
With Linux page size of 64K and hardware only supporting 4K HPTE, if we
use subpage protection, we always fail for the subpage 0 as shown
below (using the selftest subpage_prot test):
520175565: (4520111850): Failed at 0x3fffad4b0000 (p=13,sp=0,w=0), want=fault, got=pass ! 4520890210: (4520826495): Failed at 0x3fffad5b0000 (p=29,sp=0,w=0), want=fault, got=pass ! 4521574251: (4521510536): Failed at 0x3fffad6b0000 (p=45,sp=0,w=0), want=fault, got=pass ! 4522258324: (4522194609): Failed at 0x3fffad7b0000 (p=61,sp=0,w=0), want=fault, got=pass !
This is because hash preload wrongly inserts the HPTE entry for subpage
0 without looking at the subpage protection information.
Fix it by teaching should_hash_preload() not to preload if we have
subpage protection configured for that range.
It appears this has been broken since it was introduced in 2008.
Fixes: fa28237cfcc5 ("[POWERPC] Provide a way to protect 4k subpages when using 64k pages") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Rework into should_hash_preload() to avoid build fails w/SLICES=n] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/hash64: Factor out hash preload psize check
Currently we have a check in hash_preload() against the psize, which is
only included when CONFIG_PPC_MM_SLICES is enabled. We want to expand
this check in a subsequent patch, so factor it out to allow that. As a
bonus it removes the #ifdef in the C code.
Unfortunately we can't put this in the existing CONFIG_PPC_MM_SLICES
block because it would require a forward declaration.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc: Update of_remove_property() call sites to remove null checking
After obtaining a property from of_find_property() and before calling
of_remove_property() most code checks to ensure that the property
returned from of_find_property() is not null. The previous patch moved
this check to the start of the function of_remove_property() in order to
avoid the case where this check isn't done and a null value is passed.
This ensures the check is always conducted before taking locks and
attempting to remove the property. Thus it is no longer necessary to
perform a check for null values before invoking of_remove_property().
Update of_remove_property() call sites in order to remove redundant
checking for null property value as check is now performed within the
of_remove_property function().
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
[mpe: Unbreak some lines which are just >80 chars for readability] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
drivers/of: Add check for null property in of_remove_property()
The validity of the property input argument to of_remove_property() is
never checked within the function and thus it is possible to pass a null
value. It happens that this will be picked up in __of_remove_property()
as no matching property of the device node will be found and thus an
error will be returned, however once again there is no explicit check
for a null value. By the time this is detected 2 locks have already been
acquired which is completely unnecessary if the property to remove is
null.
Add an explicit check in the function of_remove_property() for a null
property value and return -ENODEV in this case, this is consistent with
what the previous return value would have been when the null value was
not detected and passed to __of_remove_property().
By moving an explicit check for the property paramenter into the
of_remove_property() function, this will remove the need to perform this
check in calling code before invocation of the of_remove_property()
function.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/pseries: Add null property check to pseries_discover_pic()
The return value of of_get_property() isn't checked before it is passed
to the strstr() function, if it happens that the return value is null
then this will result in a null pointer being dereferenced.
Add a check to see if the return value of of_get_property() is null and
if it is continue straight on to the next node.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: Chris Smart <chris@distroguy.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/powernv/pci: Fix cfg_dbg() & replace with pr_devel()
When cfg_dbg() is enabled (i.e. mapped to printk()), gcc produces
errors as the __func__ parameter is missing (pnv_pci_cfg_read() has one);
this adds the missing parameter.
cfg_dbg() is just an inferior version of pr_devel() so use the latter
instead.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Chris Smart [Mon, 2 May 2016 03:51:38 +0000 (13:51 +1000)]
selftests/powerpc: Test cp_abort during context switch
Test that performing a copy paste sequence in userspace on P9 does not
result in a leak of the copy into the paste of another process.
This is based on Anton Blanchard's context_switch benchmarking code. It
sets up two processes tied to the same CPU, one which copies and one
which pastes.
The paste should never succeed and the test fails if it does.
This is a test for commit, "8a64904 powerpc: Add support for userspace
P9 copy paste."
Patch created with much assistance from Michael Neuling
<mikey@neuling.org>
Signed-off-by: Chris Smart <chris@distroguy.com> Reviewed-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Chris Smart [Mon, 2 May 2016 06:00:58 +0000 (16:00 +1000)]
powerpc: Remove unnecessary CONFIG_SMP #ifdefs
The code in machine_restart/power_off/halt() includes #ifdefs around
calls to smp_send_stop(), however these are not required as
include/linux/smp.h includes an empty version of this function for
CONFIG_SMP=n builds.
Signed-off-by: Chris Smart <chris@distroguy.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Support for the A2 cpu was removed in commit fb5a515704d7 ("powerpc:
Remove platforms/wsp and associated pieces"), and the externs:
__setup_cpu_a2 and __restore_cpu_a2 are still around and unused, so
remove them.
Signed-off-by: Rashmica Gupta <rashmicy@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The usage in mm mmu_context_nohash.c is bogus, because we set the
context.id value to MMU_NO_CONTEXT 4 lines previously in the same
function, meaning slice_mm_new_context() will always be true.
The book3s 64 usage was removed in the previous commit. So remove it as
unused.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/subpage: Initialise user psize correctly
As part of the radix support we switched Book3s64 to use a value of ~0
for MMU_NO_CONTEXT. That is because id 0 is special on radix.
However that broke the logic in init_new_context(). The code there needs
to differentiate between a newly allocated context and one inherited via
fork. Previously it worked because a newly allocated context has an id
of zero (because it was just memset() to zero), which used to match
MMU_NO_CONTEXT, and therefore slice_mm_new_context() did the right
thing.
Instead check against a context.id value of zero instead of using
slice_mm_new_context().
Without this patch we never call slice_set_user_psize(), and end up with
a slice psize value of zero and we always end up using 4K HPTE.
powerpc/mm/radix: Use firmware feature to enable Radix MMU
We use the existing "ibm,pa-features" device-tree property to enable
Radix MMU mode. This means we default to hash mode unless firmware tells
us it's OK to start using Radix mode.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/radix: Add THP support for 4K linux page size
This adds THP support for 4K Linux page size config with radix. We still
don't do THP with 4K Linux page size and hash page table. Hash page
table needs a 16MB hugepage and we can't do THP with 16MM hugepage and
4K Linux page size.
We add missing functions to 4K hash config to get it to build and
hash__has_transparent_hugepage() makes sure we don't enable THP for 4K
hash config. To catch wrong usage of THP related with 4K config, we add
BUG() in those dummy functions we added to get it compile.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The deposited pgtable_t is a pte fragment hence we cannot use page->lru
for linking then together. We use the first two 64 bits for pte fragment
as list_head type to link all deposited fragments together. On withdraw
we properly zero then out.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In this patch we make the number of pte fragments per level 4 page table
page a variable. Radix level 4 table size is 256 bytes and hence we can
have 256 fragments per level 4 page. We don't update the fragment count
in this patch. We need to do performance measurements to find the right
value for fragment count.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: vmalloc abstraction in preparation for radix
The vmalloc range differs between hash and radix config. Hence make
VMALLOC_START and related constants a variable which will be runtime
initialized depending on whether hash or radix mode is active.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Fix missing init of ioremap_bot in pgtable_64.c for ppc64e] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Make 4K and 64K use pte_t for pgtable_t
This patch switches 4K Linux page size config to use pte_t * type
instead of struct page * for pgtable_t. This simplifies the code a lot
and helps in consolidating both 64K and 4K page allocator routines. The
changes should not have any impact, because we already store physical
address in the upper level page table tree and that implies we already
do struct page * to physical address conversion.
One change to note here is we move the pgtable_page_dtor() call for
nohash to pte_fragment_free_mm(). The nohash related change is due to
the related changes in pgtable_64.c.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Revert changes made to nohash pgalloc-64.h
This reverts pgalloc related changes WRT implementing 4-level page
table for 64K Linux page size and storing of physical address in higher
level page tables since they are only applicable to book3s64 variant
and we now have a separate copy for book3s64. This helps to keep these
headers simpler.
Cc: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Make a copy of pgalloc.h for 32 and 64 book3s
This patch start to make a book3s variant for pgalloc headers. We have
multiple book3s specific changes such as:
* 4 level page table
* store physical address in higher level table
* use pte_t * for pgtable_t
Having a book3s64 specific variant helps to keep code simpler and remove
lots of #ifdef around code.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/radix: Pick the address layout for radix config
Hash needs special get_unmapped_area() handling because of limitations
around base page size, so we have to set HAVE_ARCH_UNMAPPED_AREA.
With radix we don't have such restrictions, so we could use the generic
code. But because we've set HAVE_ARCH_UNMAPPED_AREA (for hash), we have
to re-implement the same logic as the generic code.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We are going to add asm changes in the follow up patches. Add the
feature bit now so that we can get it all build.
mpe: When CONFIG_PPC_RADIX_MMU=n we omit MMU_FTR_RADIX from the
MMU_FTRS_POSSIBLE mask. This allows the compiler to work out that those
checks will always be false and so the code can be elided completely.
Note we do *not* define MMU_FTR_RADIX to 0 in the RADIX_MMU=n case,
because that doesn't work with the ASM_FTR patching. In particular an
IF_SET section will result in a mask and value of zero, which is always
true, meaning the section *won't* be patched, which is the opposite of
what we want.
Michael Ellerman [Wed, 11 May 2016 05:30:47 +0000 (15:30 +1000)]
powerpc/mm: Add mask of possible MMU features
Follow the example of the cpu feature code, and add a mask of possible
MMU features, MMU_FTRS_POSSIBLE.
This is used in mmu_has_feature(), which allows the possible mask to act
as a shortcut for any features that are not possible, but still allows
the feature bit itself to be defined.
We will use this in the next commit to allow MMU_FTR_RADIX checks to be
elided when MMU_FTR_RADIX is not possible.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Core kernel doesn't track the page size of the VA range that we are
invalidating. Hence we end up flushing TLB for the entire mm here. Later
patches will improve this.
We also don't flush page walk cache separetly instead use RIC=2 when
flushing TLB, because we do a MMU gather flush after freeing page table.
MMU_NO_CONTEXT is updated for hash.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>