powerpc/mm: Add memory barrier in __hugepte_alloc()
__hugepte_alloc() uses kmem_cache_zalloc() to allocate a zeroed PTE
and proceeds to use the newly allocated PTE. Add a memory barrier to
make sure that the other CPUs see a properly initialized PTE.
Based on a fix suggested by James Dykman.
Reported-by: James Dykman <jdykman@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Tested-by: James Dykman <jdykman@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Tue, 19 Jul 2016 04:48:31 +0000 (14:48 +1000)]
powerpc/modules: Never restore r2 for a mprofile-kernel style mcount() call
In the module loader we process relocations, and for long jumps we
generate trampolines (aka stubs). At the call site for one of these
trampolines we usually need to generate a load instruction to restore
the TOC pointer into r2.
There is one exception however, which is calls to mcount() using the
mprofile-kernel ABI, they handle the TOC inside the stub, and so for
them we do not generate a TOC load.
The bug is in how the code in restore_r2() decides if it needs to
generate the TOC load. It does so by looking for a nop following the
branch, and if it sees a nop, it replaces it with the load. In general
the compiler has no reason to generate a nop following the mcount()
call and so that check works OK.
However if we combine a jump label at the start of a function, with an
early return, such that GCC applies the shrink-wrapping optimisation, we
can then end up with an mcount call followed immediately by a nop.
However the nop is not there for a TOC load, it is for the jump label.
That confuses restore_r2() into replacing the jump label nop with a TOC
load, which in turn confuses ftrace into replacing the mcount call with
a b +8 (fixed in the previous commit). The end result is we jump over
the jump label, which if it was supposed to return means we incorrectly
run the body of the function.
We have seen this in practice with some yet-to-be-merged patches that
use jump labels more extensively.
The fix is relatively simple, in restore_r2() we check for an
mprofile-kernel style mcount() call first, before looking for the
presence of a nop.
Fixes: 153086644fd1 ("powerpc/ftrace: Add support for -mprofile-kernel ftrace ABI") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Tue, 19 Jul 2016 04:48:30 +0000 (14:48 +1000)]
powerpc/ftrace: Separate the heuristics for checking call sites
In __ftrace_make_nop() (the 64-bit version), we have code to deal with
two ftrace ABIs. There is the original ABI, which looks mostly like a
function call, and then the mprofile-kernel ABI which is just a branch.
The code tries to handle both cases, by looking for the presence of a
load to restore the TOC pointer (PPC_INST_LD_TOC). If we detect the TOC
load, we assume the call site is for an mcount() call using the old ABI.
That means we patch the mcount() call with a b +8, to branch over the
TOC load.
However if the kernel was built with mprofile-kernel, then there will
never be a call site using the original ftrace ABI. If for some reason
we do see a TOC load, then it's there for a good reason, and we should
not jump over it.
So split the code, using the existing CC_USING_MPROFILE_KERNEL. Kernels
built with mprofile-kernel will only look for, and expect, the new ABI,
and similarly for the original ABI.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
mpe: Add a/p/k/setup.h to contain the prototypes and empty versions of
functions we need, rather than using weak functions. Add a few other
empty versions to avoid as many #ifdefs as possible in the code.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc: Re-order the call to smp_setup_cpu_maps()
It makes more sense to do it before intializing xmon() as xmon might
use the info in there. We do want to register the console early
though in case we want some functioning printk's in the cpu map setup.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc: Move 32-bit probe() machine to later in the boot process
This converts all the 32-bit platforms to use the expanded device-tree
which is a pretty mechanical change. Unlike 64-bit, the 32-bit kernel
didn't rely on platform initializations to setup the MMU since it
sets it up entirely before probe_machine() so the move has comparatively
less consequences though it's a bigger patch.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/64: Move 64-bit probe_machine() to later in the boot process
We no long need the machine type that early, so we can move probe_machine()
to after the device-tree has been expanded. This will allow further
consolidation.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Move hash table ops to a separate structure
Moving probe_machine() to after mmu init will cause the ppc_md
fields relative to the hash table management to be overwritten.
Since we have essentially disconnected the machine type from
the hash backend ops, finish the job by moving them to a different
structure.
The only callback that didn't quite fix is update_partition_table
since this is not specific to hash, so I moved it to a standalone
variable for now. We can revisit later if needed.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Fix ppc64e build failure in kexec] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
pmac_declare_of_platform_devices() is already a machine initcall, thus
it won't be called on a non-powermac machine. Testing for chrp there
is pointless.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/hash64: Don't test for machine type to detect HEA special case
Instead, check for FW_FEATURE_SPLPAR. This should be roughtly equivalent
as all pseries machiens that can have an HEA also support SPLPAR and
no other machine type does.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/pmac: Remove early allocation of the SMU command buffer
The SMU command buffer needs to be allocated below 2G using memblock.
In the past, this had to be done very early from the arch code as
memblock wasn't available past that point. That is no longer the
case though, smu_init() is called from setup_arch() when memblock
is still functional these days. So move the allocation to the
SMU driver itself.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc: Put exception configuration in a common place
The various calls to establish exception endianness and AIL are
now done from a single point using already established CPU and FW
feature bits to decide what to do.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc: Move FW feature probing out of pseries probe()
We move the function itself to pseries/firmware.c and call it along
with almost all other flat device-tree parsers from early_init_devtree()
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Move #ifdefs into the header by providing pseries_probe_fw_features()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Instead of punching a hole in the linear mapping, just use normal
cachable memory, and apply the flush sequence documented in the
CPC625 (aka U3) user manual.
This allows us to remove quite a bit of code related to the early
allocation of the DART and the hole in the linear mapping. We can
also get rid of the copy of the DART for suspend/resume as the
original memory can just be saved/restored now, as long as we
properly sync the caches.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Integrate dart_init() fix to return ENODEV when DART disabled] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Make it part of early_setup() as we really want the feature fixups
to be applied before we turn on the MMU since they can have an impact
on the various assembly path related to MMU management and interrupts.
This makes 64-bit match what 32-bit does.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This provides an equivalent of of_fdt_match() for non-flat trees.
This is more practical than matching an array of of_device_id structs
when converting a bunch of existing users of of_fdt_match().
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PCI: rpaphp: Fix slot registration for multiple slots under a PHB
The underlying slot hotplug registration code assumed multiple slots, but
the actual implementation is broken for multiple slots.
This went unnoticed for years do to the fact that PowerVM seems to only
ever provide a single hotplug slot per PHB.
Under qemu/kvm the hotplug slot model aligns more with x86 where
multiple slots are presented under a single PHB. As seen in the
following each additional slot after the first fails to register due to
each slot always being compared against the first child node of the PHB
in the device tree.
rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
rpaphp: Slot [Slot 0] registered
rpaphp: pci_hp_register failed with error -16
rpaphp: pci_hp_register failed with error -16
rpaphp: pci_hp_register failed with error -16
rpaphp: pci_hp_register failed with error -16
The registration logic is fixed so that each slot is compared
against the existing child devices of the PHB in the device tree to
determine present slots vs empty slots.
Kevin Hao [Wed, 13 Jul 2016 01:14:40 +0000 (09:14 +0800)]
powerpc/32: Remove RELOCATABLE_PPC32
It is seldom used in the kernel code and can be easily replaced by
either RELOCATABLE or PPC32. So there is no reason to keep a separate
kernel option for this.
Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Kevin Hao [Wed, 13 Jul 2016 01:14:38 +0000 (09:14 +0800)]
powerpc/32/booke: Fix the build error when CRASH_DUMP is enabled
In the current code, the RELOCATABLE will be forcedly enabled when
enabling CRASH_DUMP. But for ppc32, the RELOCABLE also depend on
ADVANCED_OPTIONS and select NONSTATIC_KERNEL. This will cause the
following build error when CRASH_DUMP=y && ADVANCED_OPTIONS=n because
the select of NONSTATIC_KERNEL doesn't take effect.
arch/powerpc/include/asm/io.h: In function 'virt_to_phys':
arch/powerpc/include/asm/page.h:113:26: error: 'virt_phys_offset' undeclared (first use in this function)
#define VIRT_PHYS_OFFSET virt_phys_offset
^
It doesn't have any strong reasons to make the RELOCATABLE depend on
ADVANCED_OPTIONS. So remove this dependency to fix this issue.
Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
John Allen [Thu, 7 Jul 2016 15:05:52 +0000 (10:05 -0500)]
powerpc/pseries: Use kernel hotplug queue for PowerVM hotplug events
The sysfs interface used to handle PowerVM hotplug events should use the
hotplug queue as well. PRRN events will soon be placing many hotplug
events on the queue at once and we will need ordinary hotplug events to
use the queue as well in order to ensure these events will still be handled
and that proper serialization is maintained during the PRRN event.
Signed-off-by: John Allen <jallen@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
John Allen [Thu, 7 Jul 2016 15:03:44 +0000 (10:03 -0500)]
powerpc/pseries: Add support for hotplug interrupt source
Add handler for new hotplug interrupt. For memory and CPU hotplug events,
we will add the hotplug errorlog to the hotplug workqueue. Since PCI
hotplug is not currently supported in the kernel, PCI hotplug events are
written to the rtas_log_bug and are handled by rtas_errd.
Signed-off-by: John Allen <jallen@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
John Allen [Thu, 7 Jul 2016 15:00:34 +0000 (10:00 -0500)]
powerpc/pseries: Add pseries hotplug workqueue
In support of PAPR changes to add a new hotplug interrupt, introduce a
hotplug workqueue to avoid processing hotplug events in interrupt context.
We will also take advantage of the queue on PowerVM to ensure hotplug
events initiated from different sources (HMC and PRRN events) are handled
and serialized properly.
Signed-off-by: John Allen <jallen@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Andrew Donnellan [Fri, 15 Jul 2016 07:20:36 +0000 (17:20 +1000)]
cxl: fix potential NULL dereference in free_adapter()
If kzalloc() fails when allocating adapter->guest in
cxl_guest_init_adapter(), we call free_adapter() before erroring out.
free_adapter() in turn attempts to dereference adapter->guest, which in
this case is NULL.
In free_adapter(), skip the adapter->guest cleanup if adapter->guest is
NULL.
Fixes: 14baf4d9c739 ("cxl: Add guest-specific code") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Andrew Donnellan [Mon, 18 Jul 2016 04:52:57 +0000 (14:52 +1000)]
cxl: remove dead Kconfig options
Remove the CXL_KERNEL_API and CXL_EEH Kconfig options, as they were only
needed to coordinate the merging of the cxlflash driver. Also remove the
stub implementation of cxl_perst_reloads_same_image() in cxlflash which is
only used if CXL_EEH isn't defined (i.e. never).
Suggested-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Acked-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Tue, 19 Jul 2016 02:33:35 +0000 (12:33 +1000)]
powerpc/powernv: Fix pci-cxl.c build when CONFIG_MODULES=n
pnv_cxl_enable_phb_kernel_api() grabs a reference to the cxl module to
prevent it from being unloaded after the PHB has been switched to CX4
mode. This breaks the build when CONFIG_MODULES=n as module_mutex
doesn't exist.
However, if we don't have modules, we don't need to protect against the
case of the cxl module being unloaded. As such, split the relevant code
out into a function surrounded with #if IS_MODULE(CXL) so we don't try
to compile it if cxl isn't being compiled as a module.
Fixes: 5918dbc9b4ec ("powerpc/powernv: Add support for the cxl kernel api on the real phb") Reported-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/radix: Add a kernel command line to disable radix
This patch adds the kernel command line disable_radix which disable
the radix MMU mode even if firmware indicates radix support via
ibm,pa-features device tree node.
This helps in testing different MMU mode easily.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm/radix: Update machine call back to support new HCALL.
This update the machine dep callback such that we can use the same
callback to register process table. The interface is updated such that
we can easily call H_REGISTER_PROC_TBL hcall. The HCALL itself is
introduced in a later patch.
No functionality change introduced by this patch.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Update the PID switch as per ISA doc. slbia is needed in radix to
invalidate any implementation specific lookaside information.
We use the .long format due to build errors with the below compiler
version.
gcc (Ubuntu 5.3.1-14ubuntu2.1) 5.3.1 20160413
GNU assembler (GNU Binutils for Ubuntu) 2.26
CC arch/powerpc/mm//mmu_context_book3s64.o
{standard input}: Assembler messages:
{standard input}:506: Error: junk at end of line: `0x7'
scripts/Makefile.build:291: recipe for target 'arch/powerpc/mm//mmu_context_book3s64.o' failed
make[1]: *** [arch/powerpc/mm//mmu_context_book3s64.o] Error 1 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Clear top 16 bits of va only on older cpus
As per ISA, we need to do this only for architecture version 2.02 and
earlier. This continued to work even for 2.07. But let's not do this for
anything after 2.02. ISA 3.0 requires these top bits to be not cleared.
powerpc/mm: Compile out radix related functions if RADIX_MMU is disabled
Currently we depend on mmu_has_feature to evalute to zero based on
MMU_FTRS_POSSIBLE mask. In a later patch, we want to update
radix_enabled() to runtime update the conditional operation to a jump
instruction. This implies we cannot depend on MMU_FTRS_POSSIBLE mask.
Instead define radix_enabled to return 0 if RADIX_MMU is not enabled.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: use _raw variant of page table accessors
This switch few of the page table accessor to use the __raw variant
and does the cpu to big endian conversion of constants. This helps in
generating better code.
For ex: a pgd_none(pgd) check with and without fix is listed below
PowerISA 3.0 requires the MMU mode (radix vs. hash) of the hypervisor
to be mirrored in the LPCR register, in addition to the partition table.
This is done to avoid fetching from the table when deciding, among other
things, how to perform transitions to HV mode on some interrupts.
So let's set it up appropriately
powerpc/mm: Fix .long's in tlb-radix.c to more meaningful
The .longs with the shifts are harder to read, use more meaningful names
for the opcodes. PPC_TLBIE_5 is introduced for the 5 opcode variation of
the instruction due to an existing op-code for the 2 opcode variant.
powerpc/pci: Don't try to allocate resources that will be reassigned
When we know we will reassign all resources, trying (and failing)
to allocate them initially is fairly pointless and leads to a lot
of scary messages in the kernel log
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/powernv/pci: Check status of a PHB before using it
If the firmware encounters an error (internal or HW) during initialization
of a PHB, it might leave the device-node in the tree but mark it disabled
using the "status" property. We should check it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/powernv/pci: Use the device-tree to get available range of M64's
M64's are the configurable 64-bit windows that cover the 64-bit MMIO
space. We used to hard code 16 windows. Newer chips might have a
variable number and might need to reserve some as well (for example
on PHB4/POWER9, M32 and M64 are actually unified and we use M64#0
to map the 32-bit space).
So newer OPALs will provide a property we can use to know what range
of windows is available. The property is named so that it can
eventually support multiple ranges but we only use the first one for
now.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/powernv/pci: Rework accessing the TCE invalidate register
It's architected, always in a known place, so there is no need
to keep a separate pointer to it, we use the existing "regs",
and we complement it with a real mode variant.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
# Conflicts:
# arch/powerpc/platforms/powernv/pci-ioda.c
# arch/powerpc/platforms/powernv/pci.h Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/powernv/pci: Remove SWINV constants and obsolete TCE code
We have some obsolete code in pnv_pci_p7ioc_tce_invalidate()
to handle some internal lab tools that have stopped being
useful a long time ago. Remove that along with the definition
and test for the TCE_PCI_SWINV_* flags whose value is basically
always the same.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The TCE invalidation functions are fairly implementation specific,
and while the IODA specs more/less describe the register, in practice
various implementation workarounds may be required. So name the
functions after the target PHB.
Note today and for the foreseeable future, there's a 1:1 relationship
between an IODA version and a PHB implementation. There exist another
variant of IODA1 (Torrent) but we never supported in with OPAL and
never will.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/irq: Add mechanism to force a replay of interrupts
Calling this function with interrupts soft-disabled will cause
a replay of the external interrupt vector when they are re-enabled.
This will be used by the OPAL XICS backend (and latter by the native
XIVE code) to handle EOI signaling that there are more interrupts to
fetch from the hardware since the hardware won't issue another HW
interrupt in that case.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/irq: Add support for HV virtualization interrupts
This will be delivering external interrupts from the XIVE to the
Hypervisor. We treat it as a normal external interrupt for the
lazy irq disable code (so it will be replayed as a 0x500) and
route it to do_IRQ.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/book64s: Move a few exception common handlers to make room
This moves the CBE RAS and facility unavailable "common" handlers
down to after the FWNMI page.
This frees up some space in the very demanded spaces before the
relocation-on vectors and before the FWNMI page. They are still
within 64K of __start, so CONFIG_RELOCATABLE should still work.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
OPAL provides an emulated XICS interrupt controller to
use as a fallback on newer processors that don't have a
XICS. It's meant as a way to provide backward compatibility
with future processors. Add the corresponding interfaces.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/powernv: Use deepest stop state when cpu is offlined
If hardware supports stop state, use the deepest stop state when
the cpu is offlined.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
cpuidle/powernv: Add support for POWER ISA v3 idle states
POWER ISA v3 defines a new idle processor core mechanism. In summary,
a) new instruction named stop is added.
b) new per thread SPR named PSSCR is added which controls the behavior
of stop instruction.
Supported idle states and value to be written to PSSCR register to enter
any idle state is exposed via ibm,cpu-idle-state-names and
ibm,cpu-idle-state-psscr respectively. To enter an idle state,
platform provided power_stop() needs to be invoked with the appropriate
PSSCR value.
This patch adds support for this new mechanism in cpuidle powernv driver.
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Lorenzo Pieralisi <Lorenzo.Pieralisi@arm.com> Cc: linux-pm@vger.kernel.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: linuxppc-dev@lists.ozlabs.org Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- Use stack instead of kzalloc'ed memory for variables while probing
device tree for idle states.
- Set cap for number of idle states that can be added to
cpuidle_state_table
- Minor change in way we check of_property_read_u32_array for error
for sake of consistency
- Drop unnecessary "&" while assigning function pointer
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: linux-pm@vger.kernel.org Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
cpuidle/powernv: Use CPUIDLE_STATE_MAX instead of MAX_POWERNV_IDLE_STATES
Use cpuidle's CPUIDLE_STATE_MAX macro instead of powernv specific
MAX_POWERNV_IDLE_STATES.
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: linux-pm@vger.kernel.org Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/powernv: Add platform support for stop instruction
POWER ISA v3 defines a new idle processor core mechanism. In summary,
a) new instruction named stop is added. This instruction replaces
instructions like nap, sleep, rvwinkle.
b) new per thread SPR named Processor Stop Status and Control Register
(PSSCR) is added which controls the behavior of stop instruction.
PSSCR key fields:
Bits 0:3 - Power-Saving Level Status. This field indicates the lowest
power-saving state the thread entered since stop instruction was last
executed.
Bit 42 - Enable State Loss
0 - No state is lost irrespective of other fields
1 - Allows state loss
Bits 44:47 - Power-Saving Level Limit
This limits the power-saving level that can be entered into.
Bits 60:63 - Requested Level
Used to specify which power-saving level must be entered on executing
stop instruction
This patch adds support for stop instruction and PSSCR handling.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/powernv: abstraction for saving SPRs before entering deep idle states
Create a function for saving SPRs before entering deep idle states.
This function can be reused for POWER9 deep idle states.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/powernv: Make pnv_powersave_common more generic
pnv_powersave_common does common steps needed before entering idle
state and eventually changes MSR to MSR_IDLE and does rfid to
pnv_enter_arch207_idle_mode.
Move the updation of HSTATE_HWTHREAD_STATE to pnv_powersave_common
from pnv_enter_arch207_idle_mode and make it more generic by passing the
rfid address as a function parameter.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/powernv: Rename reusable idle functions to hardware agnostic names
Functions like power7_wakeup_loss, power7_wakeup_noloss,
power7_wakeup_tb_loss are used by POWER7 and POWER8 hardware. They can
also be used by POWER9. Hence rename these functions hardware agnostic
names.
Suggested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/powernv: Rename idle_power7.S to idle_book3s.S
idle_power7.S handles idle entry/exit for POWER7, POWER8 and in next
patch for POWER9. Rename the file to a non-hardware specific
name.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/kvm: make hypervisor state restore a function
In the current code, when the thread wakes up in reset vector, some
of the state restore code and check for whether a thread needs to
branch to kvm is duplicated. Reorder the code such that this
duplication is avoided.
At a higher level this is what the change looks like-
Before this patch -
power7_wakeup_tb_loss:
restore hypervisor state
if (thread needed by kvm)
goto kvm_start_guest
restore nvgprs, cr, pc
rfid to process context
power7_wakeup_loss:
restore nvgprs, cr, pc
rfid to process context
reset vector:
if (waking from deep idle states)
goto power7_wakeup_tb_loss
else
if (thread needed by kvm)
goto kvm_start_guest
goto power7_wakeup_loss
After this patch -
power7_wakeup_tb_loss:
restore hypervisor state
return
power7_restore_hyp_resource():
if (waking from deep idle states)
goto power7_wakeup_tb_loss
return
power7_wakeup_loss:
restore nvgprs, cr, pc
rfid to process context
reset vector:
power7_restore_hyp_resource()
if (thread needed by kvm)
goto kvm_start_guest
goto power7_wakeup_loss
Reviewed-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Fontenot [Wed, 29 Jun 2016 17:20:30 +0000 (12:20 -0500)]
powerpc/pseries: Remove call to memblock_add()
The call to memblock_add is not needed, this is already done by
memory_add(). This patch removes this call which shrinks
dlpar_add_lmb_memory() enough that it can be merged into dlpar_add_lmb().
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Fontenot [Wed, 29 Jun 2016 17:19:14 +0000 (12:19 -0500)]
powerpc/pseries: Auto-online hotplugged memory
A recent update (commit id 31bc3858ea3) allows for automatically
onlining memory that is added. This patch sets the config option
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y for pseries and updates the
pseries memory hotplug code so that DLPAR added memory can be
automatically onlined instead of explicitly onlining the memory.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Fontenot [Mon, 20 Jun 2016 14:01:38 +0000 (09:01 -0500)]
powerpc/pseries: Dynamic add entires to associativity lookup array
Dynamically add entries to the associativity lookup array
The ibm,associativity-lookup-arrays property may only contain
associativity arrays for LMBs present at boot time. When hotplug
adding a LMB its associativity array may not be in the associativity
lookup array, this patch adds the ability to add new entries to the
associativity lookup array.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Fontenot [Mon, 20 Jun 2016 14:00:39 +0000 (09:00 -0500)]
powerpc/pseries: Move property cloning into its own routine
Move property cloning code into its own routine
Split the pieces of dlpar_clone_drconf_property() that create a copy of
the property struct into its own routine. This allows for creating
clones of more than just the ibm,dynamic-memory property used in memory
hotplug.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Tue, 28 Jun 2016 05:02:46 +0000 (15:02 +1000)]
powerpc/pseries: HVC early debug options should depend on HVC_CONSOLE
The pseries HVC early debug options, CONFIG_PPC_EARLY_DEBUG_LPAR and
CONFIG_PPC_EARLY_DEBUG_LPAR_HVSI both require code that is part of the
hvc driver. If we turn them on but not CONFIG_HVC_CONSOLE then we get:
arch/powerpc/kernel/built-in.o: In function `.udbg_early_init':
arch/powerpc/kernel/built-in.o:(.debug_addr+0x9a00): undefined reference to `udbg_init_debug_lpar'
Similarly for HVSI. So make them both depend on CONFIG_HVC_CONSOLE.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 6 Jul 2016 04:58:06 +0000 (14:58 +1000)]
powerpc/tm: Fix stack pointer corruption in __tm_recheckpoint()
At the start of __tm_recheckpoint() we save the kernel stack pointer
(r1) in SPRG SCRATCH0 (SPRG2) so that we can restore it after the
trecheckpoint.
Unfortunately, the same SPRG is used in the SLB miss handler. If an
SLB miss is taken between the save and restore of r1 to the SPRG, the
SPRG is changed and hence r1 is also corrupted. We can end up with
the following crash when we start using r1 again after the restore
from the SPRG:
Daniel Axtens [Tue, 12 Jul 2016 00:54:52 +0000 (10:54 +1000)]
powerpc: Make ppc_md.{halt, restart} __noreturn
powernv marks it's halt and restart calls as __noreturn. However,
ppc_md does not have this annotation. Add the annotation to ppc_md,
and then to every halt/restart function that is missing it.
Additionally, I have verified that all of these functions do not
return. Occasionally I have added a spin loop to be sure.
Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Tue, 12 Jul 2016 00:54:51 +0000 (10:54 +1000)]
powerpc/sparse: Pass endianness to sparse
Explicitly give sparse an endianness in the Makefile, so that it
doesn't get confused.
Normally we have #ifdef one and #else the other, so it doesn't usually
matter, but we have been bitten by it before, and indeed this patch
fixes a number of sparse errors.
Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Tue, 12 Jul 2016 00:54:48 +0000 (10:54 +1000)]
powerpc/kvm: Clarify __user annotations
kvmppc_h_put_tce_indirect labels a u64 pointer as __user. It also
labelled the u64 where get_user puts the result as __user. This isn't
a pointer and so doesn't need to be labelled __user.
Split the u64 value definition onto a new line to make it clear that
it doesn't get the annotation.
Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The FROZEN transitions are used when a CPU suspends/resumes. In case
of a suspend/resume, only the up prepare (CPU_UP_PREPARE_FROZEN) is
handled. The error handling transition CPU_UP_CANCELED_FROZEN as well
as the CPU_ONLINE_FROZEN transition are not handled.
Masking the switch case action argument with ~CPU_TASKS_FROZEN, to
handle all FROZEN tasks the same way than the corresponding non frozen
tasks.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Andrew Donnellan [Wed, 13 Jul 2016 21:17:14 +0000 (07:17 +1000)]
cxl: Add cxl_check_and_switch_mode() API to switch bi-modal cards
Add a new API, cxl_check_and_switch_mode() to allow for switching of
bi-modal CAPI cards, such as the Mellanox CX-4 network card.
When a driver requests to switch a card to CAPI mode, use PCI hotplug
infrastructure to remove all PCI devices underneath the slot. We then write
an updated mode control register to the CAPI VSEC, hot reset the card, and
reprobe the card.
As the card may present a different set of PCI devices after the mode
switch, use the infrastructure provided by the pnv_php driver and the OPAL
PCI slot management facilities to ensure that:
* the old devices are removed from both the OPAL and Linux device trees
* the new devices are probed by OPAL and added to the OPAL device tree
* the new devices are added to the Linux device tree and probed through
the regular PCI device probe path
As such, introduce a new option, CONFIG_CXL_BIMODAL, with a dependency on
the pnv_php driver.
Refactor existing code that touches the mode control register in the
regular single mode case into a new function, setup_cxl_protocol_area().
Co-authored-by: Ian Munsie <imunsie@au1.ibm.com> Cc: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Andrew Donnellan [Wed, 13 Jul 2016 21:17:13 +0000 (07:17 +1000)]
PCI/hotplug: pnv_php: handle OPAL_PCI_SLOT_OFFLINE power state
When calling pnv_php_set_slot_power_state() with state ==
OPAL_PCI_SLOT_OFFLINE, remove devices from the device tree as if we're
dealing with OPAL_PCI_SLOT_POWER_OFF.
Cc: Gavin Shan <gwshan@linux.vnet.ibm.com> Cc: linux-pci@vger.kernel.org Cc: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Andrew Donnellan [Wed, 13 Jul 2016 21:17:12 +0000 (07:17 +1000)]
PCI/hotplug: pnv_php: export symbols and move struct types needed by cxl
The cxl driver will use infrastructure from pnv_php to handle device tree
updates when switching bi-modal CAPI cards into CAPI mode.
To enable this, export pnv_php_find_slot() and
pnv_php_set_slot_power_state(), and add corresponding declarations, as well
as the definition of struct pnv_php_slot, to asm/pnv-pci.h.
Cc: Gavin Shan <gwshan@linux.vnet.ibm.com> Cc: linux-pci@vger.kernel.org Cc: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since the kernel API allocates a default context very early during
device init that will almost certainly get Process Element ID 0 there is
no easy way for us to extend the API to allow the Mellanox to inform us
of this limitation ahead of time.
Instead, work around the issue by extending the XSL structure to include
a minimum PE to allocate. Although the bug is not in the XSL, it is the
easiest place to work around this limitation given that the CX4 is
currently the only card that uses an XSL.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Wed, 13 Jul 2016 21:17:10 +0000 (07:17 +1000)]
cxl: Add support for interrupts on the Mellanox CX4
The Mellanox CX4 in cxl mode uses a hybrid interrupt model, where
interrupts are routed from the networking hardware to the XSL using the
MSIX table, and from there will be transformed back into an MSIX
interrupt using the cxl style interrupts (i.e. using IVTE entries and
ranges to map a PE and AFU interrupt number to an MSIX address).
We want to hide the implementation details of cxl interrupts as much as
possible. To this end, we use a special version of the MSI setup &
teardown routines in the PHB while in cxl mode to allocate the cxl
interrupts and configure the IVTE entries in the process element.
This function does not configure the MSIX table - the CX4 card uses a
custom format in that table and it would not be appropriate to fill that
out in generic code. The rest of the functionality is similar to the
"Full MSI-X mode" described in the CAIA, and this could be easily
extended to support other adapters that use that mode in the future.
The interrupts will be associated with the default context. If the
maximum number of interrupts per context has been limited (e.g. by the
mlx5 driver), it will automatically allocate additional kernel contexts
to associate extra interrupts as required. These contexts will be
started using the same WED that was used to start the default context.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Wed, 13 Jul 2016 21:17:09 +0000 (07:17 +1000)]
cxl: Add preliminary workaround for CX4 interrupt limitation
The Mellanox CX4 has a hardware limitation where only 4 bits of the
AFU interrupt number can be passed to the XSL when sending an interrupt,
limiting it to only 15 interrupts per context (AFU interrupt number 0 is
invalid).
In order to overcome this, we will allocate additional contexts linked
to the default context as extra address space for the extra interrupts -
this will be implemented in the next patch.
This patch adds the preliminary support to allow this, by way of adding
a linked list in the context structure that we use to keep track of the
contexts dedicated to interrupts, and an API to simultaneously iterate
over the related context structures, AFU interrupt numbers and hardware
interrupt numbers. The point of using a single API to iterate these is
to hide some of the details of the iteration from external code, and to
reduce the number of APIs that need to be exported via base.c to allow
built in code to call.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Wed, 13 Jul 2016 21:17:08 +0000 (07:17 +1000)]
cxl: Add kernel APIs to get & set the max irqs per context
These APIs will be used by the Mellanox CX4 support. While they function
standalone to configure existing behaviour, their primary purpose is to
allow the Mellanox driver to inform the cxl driver of a hardware
limitation, which will be used in a future patch.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Wed, 13 Jul 2016 21:17:07 +0000 (07:17 +1000)]
cxl: Add support for using the kernel API with a real PHB
This hooks up support for using the kernel API with a real PHB. After
the AFU initialisation has completed it calls into the PHB code to pass
it the AFU that will be used by other peer physical functions on the
adapter.
The cxl_pci_to_afu API is extended to work with peer PCI devices,
retrieving the peer AFU from the PHB. This API may also now return an
error if it is called on a PCI device that is not associated with either
a cxl vPHB or a peer PCI device to an AFU, and this error is propagated
down.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Wed, 13 Jul 2016 21:17:06 +0000 (07:17 +1000)]
powerpc/powernv: Add support for the cxl kernel api on the real phb
This adds support for the peer model of the cxl kernel api to the
PowerNV PHB, in which physical function 0 represents the cxl function on
the card (an XSL in the case of the CX4), which other physical functions
will use for memory access and interrupt services. It is referred to as
the peer model as these functions are peers of one another, as opposed
to the Virtual PHB model which forms a hierarchy.
This patch exports APIs to enable the peer mode, check if a PCI device
is attached to a PHB in this mode, and to set and get the peer AFU for
this mode.
The cxl driver will enable this mode for supported cards by calling
pnv_cxl_enable_phb_kernel_api(). This will set a flag in the PHB to note
that this mode is enabled, and switch out it's controller_ops for the
cxl version.
The cxl version of the controller_ops struct implements it's own
versions of the enable_device_hook and release_device to handle
refcounting on the peer AFU and to allocate a default context for the
device.
Once enabled, the cxl kernel API may not be disabled on a PHB. Currently
there is no safe way to disable cxl mode short of a reboot, so until
that changes there is no reason to support the disable path.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Wed, 13 Jul 2016 21:17:05 +0000 (07:17 +1000)]
cxl: Do not create vPHB if there are no AFU configuration records
The vPHB model of the cxl kernel API is a hierarchy where the AFU is
represented by the vPHB, and it's AFU configuration records are exposed
as functions under that vPHB. If there are no AFU configuration records
we will create a vPHB with nothing under it, which is a waste of
resources and will opt us into EEH handling despite not having anything
special to handle.
This also does not make sense for cards using the peer model of the cxl
kernel API, where the other functions of the device are exposed via
additional peer physical functions rather than AFU configuration
records. This model will also not work with the existing EEH handling in
the cxl driver, as that is designed around the vPHB model.
Skip creating the vPHB for AFUs without any AFU configuration records,
and opt out of EEH handling for them.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Wed, 13 Jul 2016 21:17:04 +0000 (07:17 +1000)]
cxl: Allow a default context to be associated with an external pci_dev
The cxl kernel API has a concept of a default context associated with
each PCI device under the virtual PHB. The Mellanox CX4 will also use
the cxl kernel API, but it does not use a virtual PHB - rather, the AFU
appears as a physical function as a peer to the networking functions.
In order to allow the kernel API to work with those networking
functions, we will need to associate a default context with them as
well. To this end, refactor the corresponding code to do this in vphb.c
and export it so that it can be called from the PHB code.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>