Keith Busch [Wed, 5 Oct 2016 20:32:45 +0000 (16:32 -0400)]
nvme: don't schedule multiple resets
The queue_work only fails if the work is pending, but not yet running. If
the work is running, the work item would get requeued, triggering a
double reset. If the first reset fails for any reason, the second
reset triggers:
WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)
Hitting that schedules controller deletion for a second time, which
potentially takes a reference on the device that is being deleted.
If the reset occurs at the same time as a hot removal event, this causes
a double-free.
This patch has the reset helper function check if the work is busy
prior to queueing, and changes all places that schedule resets to use
this function. Since most users don't want to sync with that work, the
"flush_work" is moved to the only caller that wants to sync.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg<sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Wed, 12 Oct 2016 15:22:16 +0000 (09:22 -0600)]
nvme: Delete created IO queues on reset
The driver was decrementing the online_queues prior to attempting to
delete those IO queues, so the driver ended up not requesting the
controller delete any. This patch saves the online_queues prior to
suspending them, and adds that parameter for deleting io queues.
Fixes: c21377f8 ("nvme: Suspend all queues before deletion") Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Tue, 11 Oct 2016 17:31:58 +0000 (13:31 -0400)]
nvme: Stop probing a removed device
There is no reason the nvme controller can ever return all 1's from
reading the CSTS register. This patch returns an error if we observe
that status. Without this, we may incorrectly proceed with controller
initialization and unnecessarilly rely on error handling to clean this.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Tomasz Majchrzak [Wed, 12 Oct 2016 10:23:08 +0000 (12:23 +0200)]
badblocks: fix overlapping check for clearing
Current bad block clear implementation assumes the range to clear
overlaps with at least one bad block already stored. If given range to
clear precedes first bad block in a list, the first entry is incorrectly
updated.
Check not only if stored block end is past clear block end but also if
stored block start is before clear block end.
Linus Torvalds [Mon, 10 Oct 2016 00:32:20 +0000 (17:32 -0700)]
Merge branch 'for-4.9/block-smp' of git://git.kernel.dk/linux-block
Pull blk-mq CPU hotplug update from Jens Axboe:
"This is the conversion of blk-mq to the new hotplug state machine"
* 'for-4.9/block-smp' of git://git.kernel.dk/linux-block:
blk-mq: fixup "Convert to new hotplug state machine"
blk-mq: Convert to new hotplug state machine
blk-mq/cpu-notif: Convert to new hotplug state machine
Linus Torvalds [Mon, 10 Oct 2016 00:29:33 +0000 (17:29 -0700)]
Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-block
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
"This is the block-irq topic branch for 4.9-rc. It's mostly from
Christoph, and it allows drivers to specify their own mappings, and
more importantly, to share the blk-mq mappings with the IRQ affinity
mappings. It's a good step towards making this work better out of the
box"
* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
blk_mq: linux/blk-mq.h does not include all the headers it depends on
blk-mq: kill unused blk_mq_create_mq_map()
blk-mq: get rid of the cpumask in struct blk_mq_tags
nvme: remove the post_scan callout
nvme: switch to use pci_alloc_irq_vectors
blk-mq: provide a default queue mapping for PCI device
blk-mq: allow the driver to pass in a queue mapping
blk-mq: remove ->map_queue
blk-mq: only allocate a single mq_map per tag_set
blk-mq: don't redistribute hardware queues on a CPU hotplug event
Linus Torvalds [Mon, 10 Oct 2016 00:16:18 +0000 (17:16 -0700)]
Merge tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- various fixes and cleanups for request-based DM core
- add support for delaying the requeue of requests; used by DM
multipath when all paths have failed and 'queue_if_no_path' is
enabled
- DM cache improvements to speedup the loading metadata and the writing
of the hint array
- fix potential for a dm-crypt crash on device teardown
- remove dm_bufio_cond_resched() and just using cond_resched()
- change DM multipath to return a reservation conflict error
immediately; rather than failing the path and retrying (potentially
indefinitely)
* tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits)
dm mpath: always return reservation conflict without failing over
dm bufio: remove dm_bufio_cond_resched()
dm crypt: fix crash on exit
dm cache metadata: switch to using the new cursor api for loading metadata
dm array: introduce cursor api
dm btree: introduce cursor api
dm cache policy smq: distribute entries to random levels when switching to smq
dm cache: speed up writing of the hint array
dm array: add dm_array_new()
dm mpath: delay the requeue of blk-mq requests while all paths down
dm mpath: use dm_mq_kick_requeue_list()
dm rq: introduce dm_mq_kick_requeue_list()
dm rq: reduce arguments passed to map_request() and dm_requeue_original_request()
dm rq: add DM_MAPIO_DELAY_REQUEUE to delay requeue of blk-mq requests
dm: convert wait loops to use autoremove_wake_function()
dm: use signal_pending_state() in dm_wait_for_completion()
dm: rename task state function arguments
dm: add two lockdep_assert_held() statements
dm rq: simplify dm_old_stop_queue()
dm mpath: check if path's request_queue is dying in activate_path()
...
Linus Torvalds [Mon, 10 Oct 2016 00:04:33 +0000 (17:04 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull main rdma updates from Doug Ledford:
"This is the main pull request for the rdma stack this release. The
code has been through 0day and I had it tagged for linux-next testing
for a couple days.
Summary:
- updates to mlx5
- updates to mlx4 (two conflicts, both minor and easily resolved)
- updates to iw_cxgb4 (one conflict, not so obvious to resolve,
proper resolution is to keep the code in cxgb4_main.c as it is in
Linus' tree as attach_uld was refactored and moved into
cxgb4_uld.c)
- improvements to uAPI (moved vendor specific API elements to uAPI
area)
- add hns-roce driver and hns and hns-roce ACPI reset support
- conversion of all rdma code away from deprecated
create_singlethread_workqueue
- security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
staging)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (75 commits)
staging/lustre: Disable InfiniBand support
iw_cxgb4: add fast-path for small REG_MR operations
cxgb4: advertise support for FR_NSMR_TPTE_WR
IB/core: correctly handle rdma_rw_init_mrs() failure
IB/srp: Fix infinite loop when FMR sg[0].offset != 0
IB/srp: Remove an unused argument
IB/core: Improve ib_map_mr_sg() documentation
IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets
IB/mthca: Move user vendor structures
IB/nes: Move user vendor structures
IB/ocrdma: Move user vendor structures
IB/mlx4: Move user vendor structures
IB/cxgb4: Move user vendor structures
IB/cxgb3: Move user vendor structures
IB/mlx5: Move and decouple user vendor structures
IB/{core,hw}: Add constant for node_desc
ipoib: Make ipoib_warn ratelimited
IB/mlx4/alias_GUID: Remove deprecated create_singlethread_workqueue
IB/ipoib_verbs: Remove deprecated create_singlethread_workqueue
IB/ipoib: Remove deprecated create_singlethread_workqueue
...
Linus Torvalds [Sun, 9 Oct 2016 23:30:03 +0000 (16:30 -0700)]
Merge tag 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull more rdma updates from Doug Ledford:
"Minor updates for rxe driver"
[ Starting to do merge window pulls again - the current -git tree does
appear to have some netfilter use-after-free issues, but I've sent
off the report to the proper channels, and I don't want to delay merge
window activity any more ]
* tag 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/rxe: improved debug prints & code cleanup
rdma_rxe: Ensure rdma_rxe init occurs at correct time
IB/rxe: Properly honor max IRD value for rd/atomic.
IB/{rxe,core,rdmavt}: Fix kernel crash for reg MR
IB/rxe: Fix sending out loopback packet on netdev interface.
IB/rxe: Avoid scheduling tasklet for userspace QP
Linus Torvalds [Sat, 8 Oct 2016 04:38:00 +0000 (21:38 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
- fsnotify updates
- ocfs2 updates
- all of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
console: don't prefer first registered if DT specifies stdout-path
cred: simpler, 1D supplementary groups
CREDITS: update Pavel's information, add GPG key, remove snail mail address
mailmap: add Johan Hovold
.gitattributes: set git diff driver for C source code files
uprobes: remove function declarations from arch/{mips,s390}
spelling.txt: "modeled" is spelt correctly
nmi_backtrace: generate one-line reports for idle cpus
arch/tile: adopt the new nmi_backtrace framework
nmi_backtrace: do a local dump_stack() instead of a self-NMI
nmi_backtrace: add more trigger_*_cpu_backtrace() methods
min/max: remove sparse warnings when they're nested
Documentation/filesystems/proc.txt: add more description for maps/smaps
mm, proc: fix region lost in /proc/self/smaps
proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
proc: add LSM hook checks to /proc/<tid>/timerslack_ns
proc: relax /proc/<tid>/timerslack_ns capability requirements
meminfo: break apart a very long seq_printf with #ifdefs
seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
proc: faster /proc/*/status
...
Linus Torvalds [Sat, 8 Oct 2016 04:34:49 +0000 (21:34 -0700)]
Merge tag 'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC late DT updates from Arnd Bergmann:
"These updates have been kept in a separate branch mostly because they
rely on updates to the respective clk drivers to keep the shared
header files in sync.
- The Renesas r8a7796 (R-Car M3-W) platform gets added, this is an
automotive SoC similar to the ⅹ8a7795 chip we already support, but
the dts changes rely on a clock driver change that has been merged
for v4.9 through the clk tree.
- The Amlogic meson-gxbb (S905) platform gains support for a few
drivers merged through our tree, in particular the network and usb
driver changes are required and included here, and also the clk
tree changes.
- The Allwinner platforms have seen a large-scale change to their clk
drivers and the dts file updates must come after that. This
includes the newly added Nextthing GR8 platform, which is derived
from sun5i/A13.
- Some integrator (arm32) changes rely on clk driver changes.
- A single patch for lpc32xx has no such dependency but wasn't added
until just before the merge window"
* tag 'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (99 commits)
ARM: dts: lpc32xx: add device node for IRAM on-chip memory
ARM: dts: sun8i: Add accelerometer to polaroid-mid2407pxe03
ARM: dts: sun8i: enable UART1 for iNet D978 Rev2 board
ARM: dts: sun8i: add pinmux for UART1 at PG
dts: sun8i-h3: add I2C0-2 peripherals to H3 SOC
dts: sun8i-h3: add pinmux definitions for I2C0-2
dts: sun8i-h3: associate exposed UARTs on Orange Pi Boards
dts: sun8i-h3: split off RTS/CTS for UART1 in seperate pinmux
dts: sun8i-h3: add pinmux definitions for UART2-3
ARM: dts: sun9i: a80-optimus: Disable EHCI1
ARM: dts: sun9i: cubieboard4: Add AXP806 PMIC device node and regulators
ARM: dts: sun9i: a80-optimus: Add AXP806 PMIC device node and regulators
ARM: dts: sun9i: cubieboard4: Declare AXP809 SW regulator as unused
ARM: dts: sun9i: a80-optimus: Declare AXP809 SW regulator as unused
ARM: dts: sun8i: Add touchscreen node for sun8i-a33-ga10h
ARM: dts: sun8i: Add touchscreen node for sun8i-a23-polaroid-mid2809pxe04
ARM: dts: sun8i: Add touchscreen node for sun8i-a23-polaroid-mid2407pxe03
ARM: dts: sun8i: Add touchscreen node for sun8i-a23-inet86dz
ARM: dts: sun8i: Add touchscreen node for sun8i-a23-gt90h
ARM64: dts: meson-gxbb-vega-s95: Enable USB Nodes
...
Linus Torvalds [Sat, 8 Oct 2016 04:32:39 +0000 (21:32 -0700)]
Merge tag 'armsoc-dt64' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM 64-bit DT updates from Arnd Bergmann:
"The 64-bit DT changes are surprisingly small this time, we only add
two SoC platforms: the ZTE ZX296718 Set-top-box SoC and the SocioNext
UniPhier LD11 TV SoC, each with their reference boards.
There are three new machines added for existing SoC platforms:
- The Marvell Armada 8040 development board is an impressive
quad-core Cortex-A72 machine with three 10gbit ethernet interfaces
- Qualcomms DragonBoard 820c single-board computer is their current
high-end phone platform in the 96boards form factor
- Rockchip: Tronsmart Orion r86 set-top-box is a popular mid-range
Android box based on the 8-core rk3368 SoC"
* tag 'armsoc-dt64' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (91 commits)
arm64: dts: berlin4ct: Add L2 cache topology
arm64: dts: berlin4ct: enable all wdt nodes unconditionally
arm64: dts: berlin4ct: switch to Cortex-A53 specific pmu nodes
arm64: dts: Add ZTE ZX296718 SoC dts and Makefile
arm64: dts: apm: Add DT node for APM X-Gene 2 CPU clocks
arm64: dts: apm: Add X-Gene SoC hwmon to device tree
arm64: dts: apm: Fix interrupt polarity for X-Gene PCIe legacy interrupts
arm64: dts: apm: Add APM X-Gene v2 SoC PMU DTS entries
arm64: dts: apm: Add APM X-Gene SoC PMU DTS entries
arm64: dts: marvell: enable MSI for PCIe on Armada 7K/8K
arm64: dts: ls2080a: Add 'dma-coherent' for ls2080a PCI nodes
arm64: dts: rockchip: add Type-C phy for RK3399
arm64: dts: rockchip: enable the gmac for rk3399 evb board
arm64: dts: rockchip: add the gmac needed node for rk3399
arm64: dts: rockchip: support the pmu node for rk3399
arm64: dts: rockchip: change all interrupts cells to 4 on rk3399 SoCs
arm64: dts: rockchip: add the tcpc for rk3399 power domain
arm64: dts: rockchip: add efuse0 device node for rk3399
arm64: dts: rockchip: configure PCIe support for rk3399-evb
arm64: dts: rockchip: add the PCIe controller support for RK3399
...
Linus Torvalds [Sat, 8 Oct 2016 04:29:04 +0000 (21:29 -0700)]
Merge tag 'armsoc-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM DT updates from Arnd Bergmann:
"These are as usual a very large number of mostly boring updates to
enable devices in existing machines, or to fix minor bugs. Notably, an
ongoing treewide effort to fix warnings caused by an update to the
device tree compiler. These are enabled with "make W=1" at the moment
but can hopefully become the default once all issues have been
addressed.
No new SoC platform is added this time around (Armada 395 and Orion
mv88f5181 are slight variations of existing ones), but a significant
number of new dts files are added, which I list by platform:
- Allwinner: Empire Electronix M712 and iNet d978 Rev2 tablets,
Orange Pi PC Plus, Orange Pi 2, Orange Pi Plus 2E, Orange Pi Lite,
Olimex A33-Olinuxino, and Nano Pi Neo single-board computers
- ARM Realview: all supported machines (ported from board files)
- Broadcom: BCM958525er, BCM958522er, BCM988312hr, BCM958623hr and
BCM958622hr reference boards for Northstar platform, Raspberry Pi
Zero single-board computer
- Marvell EBU: Netgear WNR854T router (ported from board file),
Armada 395 SoC platform and GP board Armada 390 DB development
board
- NXP i.MX: imx7s Warp7 reference board, Gateworks Ventana GW553x
single-board computer, Technologic Systems TS-4900 and Engicam
IMX6UL GEA M6UL computer-on-module, Inverse Path USB armory board
- Qualcomm: LG Nexus 5 Phone
- Renesas: r8a7792/wheat and r7s72100/rskrza1 development boards
- ST Microelectronics STi: B2260 (96boards) single-board computer
- TI Davinci: OMAP-L138 LCDK Development kit
- TI OMAP: beagleboard-x15 rev B1 single-board computer"
* tag 'armsoc-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (390 commits)
ARM: dts: sony-nsz-gs7: add missing unit name to /memory node
ARM: dts: chromecast: add missing unit name to /memory node
ARM: dts: berlin2q-marvell-dmp: add missing unit name to /memory node
ARM: dts: berlin2: Add missing unit name to /soc node
ARM: dts: berlin2cd: Add missing unit name to /soc node
ARM: dts: berlin2q: Add missing unit name to /soc node
ARM: dts: berlin2: Remove skeleton.dtsi inclusion
ARM: dts: berlin2cd: Remove skeleton.dtsi inclusion
ARM: dts: berlin2q: Remove skeleton.dtsi inclusion
arm: dts: berlin2q: enable all wdt nodes unconditionally
arm: dts: berlin2: enable all wdt nodes unconditionally
ARM: dts: omap5-igep0050.dts: Use tabs for indentation
ARM: dts: Fix igepv5 power button GPIO direction
ARM: dts: am335x-evmsk: Add blue-and-red-wiring -property to lcdc node
ARM: dts: am335x-evmsk: Whitespace cleanup of lcdc related nodes
ARM: dts: am335x-evm: Add blue-and-red-wiring -property to lcdc node
ARM: dts: s3c64xx: Use macros for pinctrl configuration
ARM: dts: s3c2416: Use macros for pinctrl configuration
ARM: dts: s5pv210: Use macros for pinctrl configuration
ARM: dts: s3c64xx: Use common macros for pinctrl configuration
...
Linus Torvalds [Sat, 8 Oct 2016 04:23:40 +0000 (21:23 -0700)]
Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC driver updates from Arnd Bergmann:
"Driver updates for ARM SoCs, including a couple of newly added
drivers:
- The Qualcomm external bus interface 2 (EBI2), used in some of their
mobile phone chips for connecting flash memory, LCD displays or
other peripherals
- Secure monitor firmware for Amlogic SoCs, and an NVMEM driver for
the EFUSE based on that firmware interface.
- Perf support for the AppliedMicro X-Gene performance monitor unit
- Reset driver for STMicroelectronics STM32
- Reset driver for SocioNext UniPhier SoCs
Aside from these, there are minor updates to SoC-specific bus,
clocksource, firmware, pinctrl, reset, rtc and pmic drivers"
* tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (50 commits)
bus: qcom-ebi2: depend on HAS_IOMEM
pinctrl: mvebu: orion5x: Generalise mv88f5181l support for 88f5181
clk: mvebu: Add clk support for the orion5x SoC mv88f5181
dt-bindings: EXYNOS: Add Exynos5433 PMU compatible
clocksource: exynos_mct: Add the support for ARM64
perf: xgene: Add APM X-Gene SoC Performance Monitoring Unit driver
Documentation: Add documentation for APM X-Gene SoC PMU DTS binding
MAINTAINERS: Add entry for APM X-Gene SoC PMU driver
bus: qcom: add EBI2 driver
bus: qcom: add EBI2 device tree bindings
rtc: rtc-pm8xxx: Add support for pm8018 rtc
nvmem: amlogic: Add Amlogic Meson EFUSE driver
firmware: Amlogic: Add secure monitor driver
soc: qcom: smd: Reset rx tail rather than tx
memory: atmel-sdramc: fix a possible NULL dereference
reset: hi6220: allow to compile test driver on other architectures
reset: zynq: add driver Kconfig option
reset: sunxi: add driver Kconfig option
reset: stm32: add driver Kconfig option
reset: socfpga: add driver Kconfig option
...
Linus Torvalds [Sat, 8 Oct 2016 04:22:01 +0000 (21:22 -0700)]
Merge tag 'armsoc-arm64' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC 64-bit updates from Arnd Bergmann:
"Changes to platform code for 64-bit ARM platforms.
Nearly all of these are defconfig updates to enable new drivers or old
drivers still used on these 64-bit platforms.
Aside from that, we gain initial support for two set-top-box
platforms, both of which already have 32-bit support in arch/arm:
- Broadcom adds abstract support for the bcm7xxx/brcmstb platform,
presumably the respective dts files and more information will
follow at a later point.
- The ZTE ZX296718 SoC for set-top-boxes, a relative of the 32-bit
ZX296702 SoC that we already support"
* tag 'armsoc-arm64' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
arm64: add ZTE ZX SoC family
arm64: defconfig: enable ZTE ZX related config
arm64: defconfig: enable common modules for power management
arm64: defconfig: enable meson I2C
arm64: defconfig: enable meson SPI as module
arm64: defconfig: enable meson WDT as modules
arm64: defconfig: enable HW random as module
arm64: defconfig: Enable SDHI and GPIO_REGULATOR
arm64: configs: enable PCIe driver for Aardvark
Kconfig: ARCH_HISI: Add PINCTRL to HISI platform
arm64: defconfig: enable bluetooth supports as modules
arm64: defconfig: enable CONFIG_INPUT_HISI_POWERKEY for HiKey
arm64: defconfig: Enable HiSilicon kirin drm, adv7533 for HiKey
arm64: defconfig: Enable Hisi SAS and HNS
arm64: defconfig: Enable QDF2432 config options
arm64: sunxi: Kconfig: add essential pinctrl driver
arm64: defconfig: Add Renesas R-Car HSUSB driver support as module
arm64: Add Broadcom Set Top Box Kconfig entry point
arm64: defconfig: enable xhci-platform
Linus Torvalds [Sat, 8 Oct 2016 04:20:33 +0000 (21:20 -0700)]
Merge tag 'armsoc-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC defconfig updates from Arnd Bergmann:
"Defconfig additions, removals, etc. Most of these are small changes
adding the options for newly upstreamed drivers, or drivers needed for
new board support. Nothing specifically sticks out this time"
* tag 'armsoc-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (25 commits)
ARM: multi_v7_defconfig: enable CONFIG_EFI
ARM: multi_v7_defconfig: Build Atmel maXTouch driver as a module
ARM: defconfig: update the Integrator defconfig
ARM: keystone: defconfig: Fix USB configuration
ARM: imx_v6_v7_defconfig: Select the wm8960 codec driver
ARM: omap2plus_defconfig: switch to the IIO BMP085 driver
ARM: mvebu_v5_defconfig: use MV88E6XXX
ARM: davinci_all_defconfig: Enable some UBI modules
ARM: davinci_all_defconfig: Enable AEMIF as a module
ARM: multi_v7_defconfig: Enable SECCOMP
ARM: exynos_defconfig: Enable SECCOMP
ARM: imx_v6_v7_defconfig: Add CONFIG_MPL3115
ARM: imx_v6_v7_defconfig: Enable GPU support
ARM: s3c2410_defconfig: Remove CONFIG_IPV6_PRIVACY
ARM: exynos_defconfig: Enable PM_DEBUG
ARM: exynos_defconfig: Enable bus frequency scaling with devfreq
ARM: imx_v6_v7_defconfig: enable more USB configurations
ARM: davinci_all_defconfig: enable SMSC ethernet PHY
ARM: davinci_all_defconfig: enable RTC driver as module
ARM: multi_v7_defconfig: Enable ARM_IMX6Q_CPUFREQ
...
Linus Torvalds [Sat, 8 Oct 2016 04:18:42 +0000 (21:18 -0700)]
Merge tag 'armsoc-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC platform updates from Arnd Bergmann:
"These are updates for platform specific code on 32-bit ARM machines,
essentially anything that can not (yet) be expressed using DT files.
Noteworthy changes include:
- We get support for running in big-endian mode on two platforms:
sunxi (Allwinner) and s3c24xx (old Samsung).
- The recently added Uniphier platform now uses standard PSCI methods
for SMP booting and we remove support for old bootloader versions
that did not support it yet.
- In sunxi, we gain support for the "Nextthing GR8" SoC, which is a
close relative of the Allwinner A13 and R8 chips.
- PXA completes its move over to the generic dmaengine framework and
removes its old private API
- mach-bcm gains support for BCM47189/BCM53573, their first ARM SoC
with integrated 802.11ac wireless networking"
* tag 'armsoc-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (54 commits)
ARM: imx legacy: pca100: move peripheral initialization to .init_late
ARM: imx legacy: mx27ads: move peripheral initialization to .init_late
ARM: imx legacy: mx21ads: move peripheral initialization to .init_late
ARM: imx legacy: pcm043: move peripheral initialization to .init_late
ARM: imx legacy: mx35-3ds: move peripheral initialization to .init_late
ARM: imx legacy: mx27-3ds: move peripheral initialization to .init_late
ARM: imx legacy: imx27-visstrim-m10: move peripheral initialization to .init_late
ARM: imx legacy: vpr200: move peripheral initialization to .init_late
ARM: imx legacy: mx31moboard: move peripheral initialization to .init_late
ARM: imx legacy: armadillo5x0: move peripheral initialization to .init_late
ARM: imx legacy: qong: move peripheral initialization to .init_late
ARM: imx legacy: mx31-3ds: move peripheral initialization to .init_late
ARM: imx legacy: pcm037: move peripheral initialization to .init_late
ARM: imx legacy: mx31lilly: move peripheral initialization to .init_late
ARM: imx legacy: mx31ads: move peripheral initialization to .init_late
ARM: imx legacy: mx31lite: move peripheral initialization to .init_late
ARM: imx legacy: kzm: move peripheral initialization to .init_late
MAINTAINERS: update list of Oxnas maintainers
ARM: orion5x: remove extraneous NO_IRQ
ARM: orion: simplify orion_ge00_switch_init
...
Linus Torvalds [Sat, 8 Oct 2016 04:16:16 +0000 (21:16 -0700)]
Merge tag 'armsoc-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC cleanups from Arnd Bergmann:
"The cleanups for v4.9 are a little larger that usual, but thankfully
that is almost exclusively due to removing a significant number of
files that have become obsolete after the still ongoing conversion of
old board files to devicetree.
- for mach-omap2, which is still the largest platform in arch/arm/,
the conversion to DT is finally complete after the Nokia N900 is
now fully supported there, along with the omap3 LDP, and we can
remove those two board files. If no regressions are found, another
large cleanup for the platform will happen as a follow-up, removing
dead code and restructuring the platform based on being DT-only.
- In mach-imx, similar work is ongoing, but has not come that far.
This time, we remove the obsolete board file for the i.MX1
generation, which like i.MX25, i.MX5, i.MX6, and i.MX7 is now
DT-only. The remaining board files are for i.MX2 and i.MX3 machines
based on old ARM926 or ARM1136 cores that should work with DT in
principle.
- realview has just been converted from board files to DT, and a lot
of code gets removed in the process. This is the last
ARM/Keil/Versatile derived platform that was still using board
files, the other ones being integrator, versatile and vexpress. We
can probably merge the remaining code into a single directory in
the near future.
- clps711x had completed the conversion in v4.8, but we accidentally
left the files in place that should have been deleted then"
* tag 'armsoc-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (21 commits)
ARM: select PCI_DOMAINS config from ARCH_MULTIPLATFORM
ARM: stop *MIGHT_HAVE_PCI* config from being selected redundantly
ARM: imx: (trivial) fix typo and grammar
ARM: clps711x: remove extraneous files
ARM: imx: use IS_ENABLED() instead of checking for built-in or module
ARM: OMAP2+: use IS_ENABLED() instead of checking for built-in or module
ARM: OMAP1: use IS_ENABLED() instead of checking for built-in or module
ARM: imx: remove platform-mxc_rnga
ARM: realview: imply device tree boot
ARM: realview: no need to select SMP_ON_UP explicitly
ARM: realview: delete the RealView board files
ARM: imx: no need to select SMP_ON_UP explicitly
ARM: i.MX: Move SOC_IMX1 into 'Device tree only'
ARM: i.MX: Remove i.MX1 non-DT support
ARM: i.MX: Remove i.MX1 Synertronixx SCB9328 board support
ARM: i.MX: Remove i.MX1 Armadeus APF9328 board support
ARM: mxs: remove obsolete startup code for TX28
ARM: i.MX31 iomux: remove duplicates with alternate name
ARM: i.MX31 iomux: remove plain duplicates
ARM: OMAP2+: Drop legacy board file for LDP
...
Linus Torvalds [Sat, 8 Oct 2016 03:50:37 +0000 (20:50 -0700)]
Merge branch 'parisc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:
"Changes include:
- Fix boot of 32bit SMP kernel (initial kernel mapping was too small)
- Added hardened usercopy checks
- Drop bootmem and switch to memblock and NO_BOOTMEM implementation
- Drop the BROKEN_RODATA config option (and thus remove the relevant
code from the generic headers and files because parisc was the last
architecture which used this config option)
- Improve segfault reporting by printing human readable error strings
- Various smaller changes, e.g. dwarf debug support for assembly
code, update comments regarding copy_user_page_asm, switch to
kmalloc_array()"
* 'parisc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Increase KERNEL_INITIAL_SIZE for 32-bit SMP kernels
parisc: Drop bootmem and switch to memblock
parisc: Add hardened usercopy feature
parisc: Add cfi_startproc and cfi_endproc to assembly code
parisc: Move hpmc stack into page aligned bss section
parisc: Fix self-detected CPU stall warnings on Mako machines
parisc: Report trap type as human readable string
parisc: Update comment regarding implementation of copy_user_page_asm
parisc: Use kmalloc_array() in add_system_map_addresses()
parisc: Check return value of smp_boot_one_cpu()
parisc: Drop BROKEN_RODATA config option
Linus Torvalds [Sat, 8 Oct 2016 03:48:45 +0000 (20:48 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32
Pull avr32 update from Hans-Christian Noren Egtvedt.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32:
avr32: migrate exception table users off module.h and onto extable.h
Linus Torvalds [Sat, 8 Oct 2016 03:19:31 +0000 (20:19 -0700)]
Merge tag 'powerpc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights:
- Major rework of Book3S 64-bit exception vectors (Nicholas Piggin)
- Use gas sections for arranging exception vectors et. al.
- Large set of TM cleanups and selftests (Cyril Bur)
- Enable transactional memory (TM) lazily for userspace (Cyril Bur)
- Support for XZ compression in the zImage wrapper (Oliver
O'Halloran)
- Add support for bpf constant blinding (Naveen N. Rao)
- Beginnings of upstream support for PA Semi Nemo motherboards
(Darren Stevens)
Fixes:
- Ensure .mem(init|exit).text are within _stext/_etext (Michael
Ellerman)
- xmon: Don't use ld on 32-bit (Michael Ellerman)
- vdso64: Use double word compare on pointers (Anton Blanchard)
- powerpc/nvram: Fix an incorrect partition merge (Pan Xinhui)
- powerpc: Fix usage of _PAGE_RO in hugepage (Christophe Leroy)
- powerpc/mm: Update FORCE_MAX_ZONEORDER range to allow hugetlb w/4K
(Aneesh Kumar K.V)
- Fix memory leak in queue_hotplug_event() error path (Andrew
Donnellan)
- Replay hypervisor maintenance interrupt first (Nicholas Piggin)
Various performance optimisations (Anton Blanchard):
- Align hot loops of memset() and backwards_memcpy()
- During context switch, check before setting mm_cpumask
- Remove static branch prediction in atomic{, 64}_add_unless
- Only disable HAVE_EFFICIENT_UNALIGNED_ACCESS on POWER7 little
endian
- Set default CPU type to POWER8 for little endian builds
Cleanups & features:
- Sparse fixes/cleanups (Daniel Axtens)
- Preserve CFAR value on SLB miss caused by access to bogus address
(Paul Mackerras)
- Radix MMU fixups for POWER9 (Aneesh Kumar K.V)
- Support for setting used_(vsr|vr|spe) in sigreturn path (for CRIU)
(Simon Guo)
- Optimise syscall entry for virtual, relocatable case (Nicholas
Piggin)
- Optimise MSR handling in exception handling (Nicholas Piggin)
- Support for kexec with Radix MMU (Benjamin Herrenschmidt)
- powernv EEH fixes (Russell Currey)
- Suprise PCI hotplug support for powernv (Gavin Shan)
- Endian/sparse fixes for powernv PCI (Gavin Shan)
- Defconfig updates (Anton Blanchard)
- KVM: PPC: Book3S HV: Migrate pinned pages out of CMA (Balbir Singh)
- cxl: Flush PSL cache before resetting the adapter (Frederic Barrat)
- cxl: replace loop with for_each_child_of_node(), remove unneeded
of_node_put() (Andrew Donnellan)
- Fix HV facility unavailable to use correct handler (Nicholas
Piggin)
- Remove unnecessary syscall trampoline (Nicholas Piggin)
- fadump: Fix build break when CONFIG_PROC_VMCORE=n (Michael
Ellerman)
- Quieten EEH message when no adapters are found (Anton Blanchard)
- powernv: Add PHB register dump debugfs handle (Russell Currey)
- Use kprobe blacklist for exception handlers & asm functions
(Nicholas Piggin)
- Document the syscall ABI (Nicholas Piggin)
- MAINTAINERS: Update cxl maintainers (Michael Neuling)
- powerpc: Remove all usages of NO_IRQ (Michael Ellerman)
Minor cleanups:
- Andrew Donnellan, Christophe Leroy, Colin Ian King, Cyril Bur,
Frederic Barrat, Pan Xinhui, PrasannaKumar Muralidharan, Rui Teng,
Simon Guo"
* tag 'powerpc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (156 commits)
powerpc/bpf: Add support for bpf constant blinding
powerpc/bpf: Implement support for tail calls
powerpc/bpf: Introduce accessors for using the tmp local stack space
powerpc/fadump: Fix build break when CONFIG_PROC_VMCORE=n
powerpc: tm: Enable transactional memory (TM) lazily for userspace
powerpc/tm: Add TM Unavailable Exception
powerpc: Remove do_load_up_transact_{fpu,altivec}
powerpc: tm: Rename transct_(*) to ck(\1)_state
powerpc: tm: Always use fp_state and vr_state to store live registers
selftests/powerpc: Add checks for transactional VSXs in signal contexts
selftests/powerpc: Add checks for transactional VMXs in signal contexts
selftests/powerpc: Add checks for transactional FPUs in signal contexts
selftests/powerpc: Add checks for transactional GPRs in signal contexts
selftests/powerpc: Check that signals always get delivered
selftests/powerpc: Add TM tcheck helpers in C
selftests/powerpc: Allow tests to extend their kill timeout
selftests/powerpc: Introduce GPR asm helper header file
selftests/powerpc: Move VMX stack frame macros to header file
selftests/powerpc: Rework FPU stack placement macros and move to header file
selftests/powerpc: Check for VSX preservation across userspace preemption
...
Paul Burton [Sat, 8 Oct 2016 00:03:15 +0000 (17:03 -0700)]
console: don't prefer first registered if DT specifies stdout-path
If a device tree specifies a preferred device for kernel console output
via the stdout-path or linux,stdout-path chosen node properties or the
stdout alias then the kernel ought to honor it & output the kernel
console to that device. As it stands, this isn't the case. Whilst we
parse the stdout-path properties & set an of_stdout variable from
of_alias_scan(), and use that from of_console_check() to determine
whether to add a console device as a preferred console whilst
registering it, we also prefer the first registered console if no other
has been selected at the time of its registration.
This means that if a console other than the one the device tree selects
via stdout-path is registered first, we will switch to using it & when
the stdout-path console is later registered the call to
add_preferred_console() via of_console_check() is too late to do
anything useful. In practice this seems to mean that we switch to the
dummy console device fairly early & see no further console output:
Fix this by not automatically preferring the first registered console if
one is specified by the device tree. This allows consoles to be
registered but not enabled, and once the driver for the console selected
by stdout-path calls of_console_check() the driver will be added to the
list of preferred consoles before any other console has been enabled.
When that console is then registered via register_console() it will be
enabled as expected.
Link: http://lkml.kernel.org/r/20160809151937.26118-1-paul.burton@imgtec.com Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Tejun Heo <tj@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Ivan Delalande <colona@arista.com> Cc: Thierry Reding <treding@nvidia.com> Cc: Borislav Petkov <bp@suse.de> Cc: Jan Kara <jack@suse.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Frank Rowand <frowand.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Sat, 8 Oct 2016 00:03:12 +0000 (17:03 -0700)]
cred: simpler, 1D supplementary groups
Current supplementary groups code can massively overallocate memory and
is implemented in a way so that access to individual gid is done via 2D
array.
If number of gids is <= 32, memory allocation is more or less tolerable
(140/148 bytes). But if it is not, code allocates full page (!)
regardless and, what's even more fun, doesn't reuse small 32-entry
array.
2D array means dependent shifts, loads and LEAs without possibility to
optimize them (gid is never known at compile time).
All of the above is unnecessary. Switch to the usual
trailing-zero-len-array scheme. Memory is allocated with
kmalloc/vmalloc() and only as much as needed. Accesses become simpler
(LEA 8(gi,idx,4) or even without displacement).
Maximum number of gids is 65536 which translates to 256KB+8 bytes. I
think kernel can handle such allocation.
On my usual desktop system with whole 9 (nine) aux groups, struct
group_info shrinks from 148 bytes to 44 bytes, yay!
Nice side effects:
- "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,
- fix little mess in net/ipv4/ping.c
should have been using GROUP_AT macro but this point becomes moot,
- aux group allocation is persistent and should be accounted as such.
Jean Delvare [Sat, 8 Oct 2016 00:03:04 +0000 (17:03 -0700)]
.gitattributes: set git diff driver for C source code files
Git can be told to apply language-specific rules when generating diffs.
Enable this for C source code files (*.c and *.h) so that function names
are printed right. Specifically, doing so prevents "git diff" from
mistakenly considering unindented goto labels as function names.
Link: http://lkml.kernel.org/r/20160907143403.1449324f@endymion Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
uprobes: remove function declarations from arch/{mips,s390}
The declarations of arch-specific functions have been moved to a common
header in commit 3820b4d2789f ('uprobes: Move function declarations out
of arch'), but MIPS and S390 has added them to their own trees later.
Remove the unnecessary duplicates.
Link: http://lkml.kernel.org/r/1472804384-17830-1-git-send-email-marcin.nowakowski@imgtec.com Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Metcalf [Sat, 8 Oct 2016 00:02:55 +0000 (17:02 -0700)]
nmi_backtrace: generate one-line reports for idle cpus
When doing an nmi backtrace of many cores, most of which are idle, the
output is a little overwhelming and very uninformative. Suppress
messages for cpus that are idling when they are interrupted and just
emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".
We do this by grouping all the cpuidle code together into a new
.cpuidle.text section, and then checking the address of the interrupted
PC to see if it lies within that section.
This commit suitably tags x86 and tile idle routines, and only adds in
the minimal framework for other architectures.
Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm] Tested-by: Petr Mladek <pmladek@suse.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Metcalf [Sat, 8 Oct 2016 00:02:52 +0000 (17:02 -0700)]
arch/tile: adopt the new nmi_backtrace framework
Previously tile was rolling its own method of capturing backtrace data
in the NMI handlers, but it was relying on running printk() from the NMI
handler, which is not always safe. So adopt the nmi_backtrace model
(with the new cpumask extension) instead.
So we can call the nmi_backtrace code directly from the nmi handler,
move the nmi_enter()/exit() into the top-level tile NMI handler.
The semantics of the routine change slightly since it is now synchronous
with the remote cores completing the backtraces. Previously it was
asynchronous, but with protection to avoid starting a new remote
backtrace if the old one was still in progress.
Link: http://lkml.kernel.org/r/1472487169-14923-4-git-send-email-cmetcalf@mellanox.com Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> Cc: Daniel Thompson <daniel.thompson@linaro.org> [arm] Cc: Petr Mladek <pmladek@suse.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Metcalf [Sat, 8 Oct 2016 00:02:49 +0000 (17:02 -0700)]
nmi_backtrace: do a local dump_stack() instead of a self-NMI
Currently on arm there is code that checks whether it should call
dump_stack() explicitly, to avoid trying to raise an NMI when the
current context is not preemptible by the backtrace IPI. Similarly, the
forthcoming arch/tile support uses an IPI mechanism that does not
support generating an NMI to self.
Accordingly, move the code that guards this case into the generic
mechanism, and invoke it unconditionally whenever we want a backtrace of
the current cpu. It seems plausible that in all cases, dump_stack()
will generate better information than generating a stack from the NMI
handler. The register state will be missing, but that state is likely
not particularly helpful in any case.
Or, if we think it is helpful, we should be capturing and emitting the
current register state in all cases when regs == NULL is passed to
nmi_cpu_backtrace().
Link: http://lkml.kernel.org/r/1472487169-14923-3-git-send-email-cmetcalf@mellanox.com Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm] Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Aaron Tomlin <atomlin@redhat.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Metcalf [Sat, 8 Oct 2016 00:02:45 +0000 (17:02 -0700)]
nmi_backtrace: add more trigger_*_cpu_backtrace() methods
Patch series "improvements to the nmi_backtrace code" v9.
This patch series modifies the trigger_xxx_backtrace() NMI-based remote
backtracing code to make it more flexible, and makes a few small
improvements along the way.
The motivation comes from the task isolation code, where there are
scenarios where we want to be able to diagnose a case where some cpu is
about to interrupt a task-isolated cpu. It can be helpful to see both
where the interrupting cpu is, and also an approximation of where the
cpu that is being interrupted is. The nmi_backtrace framework allows us
to discover the stack of the interrupted cpu.
I've tested that the change works as desired on tile, and build-tested
x86, arm, mips, and sparc64. For x86 I confirmed that the generic
cpuidle stuff as well as the architecture-specific routines are in the
new cpuidle section. For arm, mips, and sparc I just build-tested it
and made sure the generic cpuidle routines were in the new cpuidle
section, but I didn't attempt to figure out which the platform-specific
idle routines might be. That might be more usefully done by someone
with platform experience in follow-up patches.
This patch (of 4):
Currently you can only request a backtrace of either all cpus, or all
cpus but yourself. It can also be helpful to request a remote backtrace
of a single cpu, and since we want that, the logical extension is to
support a cpumask as the underlying primitive.
This change modifies the existing lib/nmi_backtrace.c code to take a
cpumask as its basic primitive, and modifies the linux/nmi.h code to use
the new "cpumask" method instead.
The existing clients of nmi_backtrace (arm and x86) are converted to
using the new cpumask approach in this change.
The other users of the backtracing API (sparc64 and mips) are converted
to use the cpumask approach rather than the all/allbutself approach.
The mips code ignored the "include_self" boolean but with this change it
will now also dump a local backtrace if requested.
Link: http://lkml.kernel.org/r/1472487169-14923-2-git-send-email-cmetcalf@mellanox.com Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm] Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Berg [Sat, 8 Oct 2016 00:02:42 +0000 (17:02 -0700)]
min/max: remove sparse warnings when they're nested
Currently, when min/max are nested within themselves, sparse will warn:
warning: symbol '_min1' shadows an earlier one
originally declared here
warning: symbol '_min1' shadows an earlier one
originally declared here
warning: symbol '_min2' shadows an earlier one
originally declared here
This also immediately happens when min3() or max3() are used.
Since sparse implements __COUNTER__, we can use __UNIQUE_ID() to
generate unique variable names, avoiding this.
Robert Ho [Sat, 8 Oct 2016 00:02:39 +0000 (17:02 -0700)]
Documentation/filesystems/proc.txt: add more description for maps/smaps
Add some more description on the limitations for smaps/maps readings, as
well as some guaruntees we can make.
Link: http://lkml.kernel.org/r/1475296958-27652-2-git-send-email-robert.hu@intel.com Signed-off-by: Robert Ho <robert.hu@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: Robert Hu <robert.hu@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Gleb Natapov <gleb@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Actually, this bug is not only for NVDIMM/DAX but also for any other
file systems. This simple test case abstracted from nvml can easily
reproduce this bug in common environment:
int retval = 0; /* assume false until proven otherwise */
char line[PROCMAXLEN]; /* for fgets() */
char *lo = NULL; /* beginning of current range in smaps file */
char *hi = NULL; /* end of current range in smaps file */
int needmm = 0; /* looking for mm flag for current range */
while (fgets(line, PROCMAXLEN, fp) != NULL) {
static const char vmflags[] = "VmFlags:";
static const char mm[] = " wr";
/* check for range line */
if (sscanf(line, "%p-%p", &lo, &hi) == 2) {
if (needmm) {
/* last range matched, but no mm flag found */
printf("never found mm flag.\n");
break;
} else if (caddr < lo) {
/* never found the range for caddr */
printf("#######no match for addr %p.\n", caddr);
break;
} else if (caddr < hi) {
/* start address is in this range */
size_t rangelen = (size_t)(hi - caddr);
/* remember that matching has started */
needmm = 1;
/* calculate remaining range to search for */
if (len > rangelen) {
len -= rangelen;
caddr += rangelen;
printf("matched %zu bytes in range "
"%p-%p, %zu left over.\n",
rangelen, lo, hi, len);
} else {
len = 0;
printf("matched all bytes in range "
"%p-%p.\n", lo, hi);
}
}
} else if (needmm && strncmp(line, vmflags,
sizeof(vmflags) - 1) == 0) {
if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) {
printf("mm flag found.\n");
if (len == 0) {
/* entire range matched */
retval = 1;
break;
}
needmm = 0; /* saw what was needed */
} else {
/* mm flag not set for some or all of range */
printf("range has no mm flag.\n");
break;
}
}
}
/* kick off NTHREAD threads */
for (int i = 0; i < NTHREAD; i++)
pthread_create(&threads[i], NULL, worker, &ret[i]);
/* wait for all the threads to complete */
for (int i = 0; i < NTHREAD; i++)
pthread_join(threads[i], NULL);
/* verify that all the threads return the same value */
for (int i = 1; i < NTHREAD; i++) {
if (ret[0] != ret[i]) {
printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
ret[0], ret[i]);
}
}
printf("%d", ret[0]);
return 0;
}
It failed as some threads can not find the memory region in
"/proc/self/smaps" which is allocated in the main process
It is caused by proc fs which uses 'file->version' to indicate the VMA that
is the last one has already been handled by read() system call. When the
next read() issues, it uses the 'version' to find the VMA, then the next
VMA is what we want to handle, the related code is as follows:
if (last_addr) {
vma = find_vma(mm, last_addr);
if (vma && (vma = m_next_vma(priv, vma)))
return vma;
}
However, VMA will be lost if the last VMA is gone, e.g:
The process VMA list is A->B->C->D
CPU 0 CPU 1
read() system call
handle VMA B
version = B
return to userspace
unmap VMA B
issue read() again to continue to get
the region info
find_vma(version) will get VMA C
m_next_vma(C) will get VMA D
handle D
!!! VMA C is lost !!!
In order to fix this bug, we make 'file->version' indicate the end address
of the current VMA. m_start will then look up a vma which with vma_start
< last_vm_end and moves on to the next vma if we found the same or an
overlapping vma. This will guarantee that we will not miss an exclusive
vma but we can still miss one if the previous vma was shrunk. This is
acceptable because guaranteeing "never miss a vma" is simply not feasible.
User has to cope with some inconsistencies if the file is not read in one
go.
[mhocko@suse.com: changelog fixes] Link: http://lkml.kernel.org/r/1475296958-27652-1-git-send-email-robert.hu@intel.com Acked-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Robert Hu <robert.hu@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Gleb Natapov <gleb@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
John Stultz [Sat, 8 Oct 2016 00:02:33 +0000 (17:02 -0700)]
proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
In changing from checking ptrace_may_access(p, PTRACE_MODE_ATTACH_FSCREDS)
to capable(CAP_SYS_NICE), I missed that ptrace_my_access succeeds when p
== current, but the CAP_SYS_NICE doesn't.
Thus while the previous commit was intended to loosen the needed
privileges to modify a processes timerslack, it needlessly restricted a
task modifying its own timerslack via the proc/<tid>/timerslack_ns
(which is permitted also via the PR_SET_TIMERSLACK method).
This patch corrects this by checking if p == current before checking the
CAP_SYS_NICE value.
This patch applies on top of my two previous patches currently in -mm
Link: http://lkml.kernel.org/r/1471906870-28624-1-git-send-email-john.stultz@linaro.org Signed-off-by: John Stultz <john.stultz@linaro.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Todd Kjos <tkjos@google.com> Cc: Colin Cross <ccross@android.com> Cc: Nick Kralevich <nnk@google.com> Cc: Dmitry Shmidt <dimitrysh@google.com> Cc: Elliott Hughes <enh@google.com> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
John Stultz [Sat, 8 Oct 2016 00:02:29 +0000 (17:02 -0700)]
proc: add LSM hook checks to /proc/<tid>/timerslack_ns
As requested, this patch checks the existing LSM hooks
task_getscheduler/task_setscheduler when reading or modifying the task's
timerslack value.
Previous versions added new get/settimerslack LSM hooks, but since they
checked the same PROCESS__SET/GETSCHED values as existing hooks, it was
suggested we just use the existing ones.
Link: http://lkml.kernel.org/r/1469132667-17377-2-git-send-email-john.stultz@linaro.org Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Todd Kjos <tkjos@google.com> Cc: Colin Cross <ccross@android.com> Cc: Nick Kralevich <nnk@google.com> Cc: Dmitry Shmidt <dimitrysh@google.com> Cc: Elliott Hughes <enh@google.com> Cc: James Morris <jmorris@namei.org> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When an interface to allow a task to change another tasks timerslack was
first proposed, it was suggested that something greater then
CAP_SYS_NICE would be needed, as a task could be delayed further then
what normally could be done with nice adjustments.
So CAP_SYS_PTRACE was adopted instead for what became the
/proc/<tid>/timerslack_ns interface. However, for Android (where this
feature originates), giving the system_server CAP_SYS_PTRACE would allow
it to observe and modify all tasks memory. This is considered too high
a privilege level for only needing to change the timerslack.
After some discussion, it was realized that a CAP_SYS_NICE process can
set a task as SCHED_FIFO, so they could fork some spinning processes and
set them all SCHED_FIFO 99, in effect delaying all other tasks for an
infinite amount of time.
So as a CAP_SYS_NICE task can already cause trouble for other tasks,
using it as a required capability for accessing and modifying
/proc/<tid>/timerslack_ns seems sufficient.
Thus, this patch loosens the capability requirements to CAP_SYS_NICE and
removes CAP_SYS_PTRACE, simplifying some of the code flow as well.
This is technically an ABI change, but as the feature just landed in
4.6, I suspect no one is yet using it.
Link: http://lkml.kernel.org/r/1469132667-17377-1-git-send-email-john.stultz@linaro.org Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Nick Kralevich <nnk@google.com> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Todd Kjos <tkjos@google.com> Cc: Colin Cross <ccross@android.com> Cc: Nick Kralevich <nnk@google.com> Cc: Dmitry Shmidt <dimitrysh@google.com> Cc: Elliott Hughes <enh@google.com> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vineet Gupta [Sat, 8 Oct 2016 00:02:10 +0000 (17:02 -0700)]
atomic64: no need for CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
This came to light when implementing native 64-bit atomics for ARCv2.
The atomic64 self-test code uses CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
to check whether atomic64_dec_if_positive() is available. It seems it
was needed when not every arch defined it. However as of current code
the Kconfig option seems needless
- for CONFIG_GENERIC_ATOMIC64 it is auto-enabled in lib/Kconfig and a
generic definition of API is present lib/atomic64.c
- arches with native 64-bit atomics select it in arch/*/Kconfig and
define the API in their headers
The macro PAGE_ALIGNED() is prone to cause error because it doesn't
follow convention to parenthesize parameter @addr within macro body, for
example unsigned long *ptr = kmalloc(...); PAGE_ALIGNED(ptr + 16); for
the left parameter of macro IS_ALIGNED(), (unsigned long)(ptr + 16) is
desired but the actual one is (unsigned long)ptr + 16.
It is fixed by simply canonicalizing macro PAGE_ALIGNED() definition.
zhong jiang [Sat, 8 Oct 2016 00:02:01 +0000 (17:02 -0700)]
mm: remove unnecessary condition in remove_inode_hugepages
When the huge page is added to the page cahce (huge_add_to_page_cache),
the page private flag will be cleared. since this code
(remove_inode_hugepages) will only be called for pages in the page
cahce, PagePrivate(page) will always be false.
The patch remove the code without any functional change.
Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Sat, 8 Oct 2016 00:01:58 +0000 (17:01 -0700)]
mm: warn about allocations which stall for too long
Currently we do warn only about allocation failures but small
allocations are basically nofail and they might loop in the page
allocator for a long time. Especially when the reclaim cannot make any
progress - e.g. GFP_NOFS cannot invoke the oom killer and rely on a
different context to make a forward progress in case there is a lot
memory used by filesystems.
Give us at least a clue when something like this happens and warn about
allocations which take more than 10s. Print the basic allocation
context information along with the cumulative time spent in the
allocation as well as the allocation stack. Repeat the warning after
every 10 seconds so that we know that the problem is permanent rather
than ephemeral.
Link: http://lkml.kernel.org/r/20160929084407.7004-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Sat, 8 Oct 2016 00:01:55 +0000 (17:01 -0700)]
mm: consolidate warn_alloc_failed users
warn_alloc_failed is currently used from the page and vmalloc
allocators. This is a good reuse of the code except that vmalloc would
appreciate a slightly different warning message. This is already
handled by the fmt parameter except that
is printed anyway. This might be quite misleading because it might be a
vmalloc failure which leads to the warning while the page allocator is
not the culprit here. Fix this by always using the fmt string and only
print the context that makes sense for the particular context (e.g.
order makes only very little sense for the vmalloc context).
Rename the function to not miss any user and also because a later patch
will reuse it also for !failure cases.
Link: http://lkml.kernel.org/r/20160929084407.7004-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The filesystem used in this case is ubifs, but it can be triggered on
many other filesystems.
When preadv64() is called with offset=0xffffffff000, a page with
index=0xffffffff will be added to the radix tree of ->mapping. Then
this page can be found in ->mapping with pagevec_lookup(). After that,
truncate_inode_pages_range(), which is called in ftruncate(), will fall
into an infinite loop:
- find a page with index=0xffffffff, since index>=end, this page won't
be truncated
- index++, and index become 0
- the page with index=0xffffffff will be found again
The data type of index is unsigned long, so index won't overflow to 0 on
64 bits architecture in this case, and the dead loop won't happen.
Since truncate_inode_pages_range() is executed with holding lock of
inode->i_rwsem, any operation related with this lock will be blocked,
and a hung task will happen, e.g.:
INFO: task truncate_test:3364 blocked for more than 120 seconds.
...
call_rwsem_down_write_failed+0x17/0x30
generic_file_write_iter+0x32/0x1c0
ubifs_write_iter+0xcc/0x170
__vfs_write+0xc4/0x120
vfs_write+0xb2/0x1b0
SyS_write+0x46/0xa0
The page with index=0xffffffff added to ->mapping is useless. Fix this
by checking the read position before allocating pages.
Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com Signed-off-by: Wei Fang <fangwei1@huawei.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yisheng Xie [Sat, 8 Oct 2016 00:01:49 +0000 (17:01 -0700)]
arm64 Kconfig: select gigantic page
Arm64 supports gigantic pages after commit 084bd29810a5 ("ARM64: mm:
HugeTLB support.") however, it can only be allocated at boottime and
can't be freed.
This patch selects ARCH_HAS_GIGANTIC_PAGE to make gigantic pages can be
allocated and freed at runtime for arch arm64.
Link: http://lkml.kernel.org/r/1475227569-63446-3-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Rob Herring <robh+dt@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
So all nodes but Node 0 have a lot of free memory which should suggest
that there is an available memory especially when mems_allowed=0-7. One
could speculate that a massive process has managed to terminate and free
up a lot of memory while racing with the above allocation request.
Although this is highly unlikely it cannot be ruled out.
A further debugging, however shown that the faulting process had
mempolicy (not cpuset) to bind to Node 0. We cannot see that
information from the report though. mems_allowed turned out to be more
confusing than really helpful.
Fix this by always priting the nodemask. It is either mempolicy mask
(and non-null) or the one defined by the cpusets. The new output for
the above oom report would be
This patch doesn't touch show_mem and the node filtering based on the
cpuset node mask because mempolicy is always a subset of cpusets and
seeing the full cpuset oom context might be helpful for tunning more
specific mempolicies inside cpusets (e.g. when they turn out to be too
restrictive). To prevent from ugly ifdefs the mask is printed even for
!NUMA configurations but this should be OK (a single node will be
printed).
Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: vma_merge: correct false positive from __vma_unlink->validate_mm_rb
The old code was always doing:
vma->vm_end = next->vm_end
vma_rb_erase(next) // in __vma_unlink
vma->vm_next = next->vm_next // in __vma_unlink
next = vma->vm_next
vma_gap_update(next)
The new code still does the above for remove_next == 1 and 2, but for
remove_next == 3 it has been changed and it does:
next->vm_start = vma->vm_start
vma_rb_erase(vma) // in __vma_unlink
vma_gap_update(next)
In the latter case, while unlinking "vma", validate_mm_rb() is told to
ignore "vma" that is being removed, but next->vm_start was reduced
instead. So for the new case, to avoid the false positive from
validate_mm_rb, it should be "next" that is ignored when "vma" is
being unlinked.
"vma" and "next" in the above comment, considered pre-swap().
Link: http://lkml.kernel.org/r/1474492522-2261-4-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Shaun Tancheff <shaun.tancheff@seagate.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Vorlicek <janvorli@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk
The rmap_walk can access vm_page_prot (and potentially vm_flags in the
pte/pmd manipulations). So it's not safe to wait the caller to update
the vm_page_prot/vm_flags after vma_merge returned potentially removing
the "next" vma and extending the "current" vma over the
next->vm_start,vm_end range, but still with the "current" vma
vm_page_prot, after releasing the rmap locks.
The vm_page_prot/vm_flags must be transferred from the "next" vma to the
current vma while vma_merge still holds the rmap locks.
The side effect of this race condition is pte corruption during migrate
as remove_migration_ptes when run on a address of the "next" vma that
got removed, used the vm_page_prot of the current vma.
migrate mprotect
------------ -------------
migrating in "next" vma
vma_merge() # removes "next" vma and
# extends "current" vma
# current vma is not with
# vm_page_prot updated
remove_migration_ptes
read vm_page_prot of current "vma"
establish pte with wrong permissions
vm_set_page_prot(vma) # too late!
change_protection in the old vma range
only, next range is not updated
This caused segmentation faults and potentially memory corruption in
heavy mprotect loads with some light page migration caused by compaction
in the background.
Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge
which confirms the case 8 is only buggy one where the race can trigger,
in all other vma_merge cases the above cannot happen.
This fix removes the oddness factor from case 8 and it converts it from:
AAAA
PPPPNNNNXXXX -> PPPPNNNNNNNN
to:
AAAA
PPPPNNNNXXXX -> PPPPXXXXXXXX
XXXX has the right vma properties for the whole merged vma returned by
vma_adjust, so it solves the problem fully. It has the added benefits
that the callers could stop updating vma properties when vma_merge
succeeds however the callers are not updated by this patch (there are
bits like VM_SOFTDIRTY that still need special care for the whole range,
as the vma merging ignores them, but as long as they're not processed by
rmap walks and instead they're accessed with the mmap_sem at least for
reading, they are fine not to be updated within vma_adjust before
releasing the rmap_locks).
Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Aditya Mandaleeka <adityam@microsoft.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Vorlicek <janvorli@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: vma_adjust: remove superfluous confusing update in remove_next == 1 case
mm->highest_vm_end doesn't need any update.
After finally removing the oddness from vma_merge case 8 that was
causing:
1) constant risk of trouble whenever anybody would check vma fields
from rmap_walks, like it happened when page migration was
introduced and it read the vma->vm_page_prot from a rmap_walk
2) the callers of vma_merge to re-initialize any value different from
the current vma, instead of vma_merge() more reliably returning a
vma that already matches all fields passed as parameter
.. it is also worth to take the opportunity of cleaning up superfluous
code in vma_adjust(), that if not removed adds up to the hard
readability of the function.
Link: http://lkml.kernel.org/r/1474492522-2261-5-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Vorlicek <janvorli@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zhong jiang [Sat, 8 Oct 2016 00:01:19 +0000 (17:01 -0700)]
mm,ksm: add __GFP_HIGH to the allocation in alloc_stable_node()
According to Hugh's suggestion, alloc_stable_node() with GFP_KERNEL can
in rare cases cause a hung task warning.
At present, if alloc_stable_node() allocation fails, two break_cows may
want to allocate a couple of pages, and the issue will come up when free
memory is under pressure.
We fix it by adding __GFP_HIGH to GFP, to grant access to memory
reserves, increasing the likelihood of allocation success.
Gerald Schaefer [Sat, 8 Oct 2016 00:01:13 +0000 (17:01 -0700)]
mm/hugetlb: improve locking in dissolve_free_huge_pages()
For every pfn aligned to minimum_order, dissolve_free_huge_pages() will
call dissolve_free_huge_page() which takes the hugetlb spinlock, even if
the page is not huge at all or a hugepage that is in-use.
Improve this by doing the PageHuge() and page_count() checks already in
dissolve_free_huge_pages() before calling dissolve_free_huge_page(). In
dissolve_free_huge_page(), when holding the spinlock, those checks need
to be revalidated.
Link: http://lkml.kernel.org/r/20160926172811.94033-4-gerald.schaefer@de.ibm.com Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rui Teng <rui.teng@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gerald Schaefer [Sat, 8 Oct 2016 00:01:10 +0000 (17:01 -0700)]
mm/hugetlb: check for reserved hugepages during memory offline
In dissolve_free_huge_pages(), free hugepages will be dissolved without
making sure that there are enough of them left to satisfy hugepage
reservations.
Fix this by adding a return value to dissolve_free_huge_pages() and
checking h->free_huge_pages vs. h->resv_huge_pages. Note that this may
lead to the situation where dissolve_free_huge_page() returns an error
and all free hugepages that were dissolved before that error are lost,
while the memory block still cannot be set offline.
Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rui Teng <rui.teng@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/hugetlb: memory offline issues with hugepages", v4.
This addresses several issues with hugepages and memory offline. While
the first patch fixes a panic, and is therefore rather important, the
last patch is just a performance optimization.
The second patch fixes a theoretical issue with reserved hugepages,
while still leaving some ugly usability issue, see description.
This patch (of 3):
dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a
list corruption and addressing exception when trying to set a memory
block offline that is part (but not the first part) of a "gigantic"
hugetlb page with a size > memory block size.
When no other smaller hugetlb page sizes are present, the VM_BUG_ON()
will trigger directly. In the other case we will run into an addressing
exception later, because dissolve_free_huge_page() will not work on the
head page of the compound hugetlb page which will result in a NULL
hstate from page_hstate().
To fix this, first remove the VM_BUG_ON() because it is wrong, and then
use the compound head page in dissolve_free_huge_page(). This means
that an unused pre-allocated gigantic page that has any part of itself
inside the memory block that is going offline will be dissolved
completely. Losing an unused gigantic hugepage is preferable to failing
the memory offline, for example in the situation where a (possibly
faulty) memory DIMM needs to go offline.
Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Link: http://lkml.kernel.org/r/20160926172811.94033-2-gerald.schaefer@de.ibm.com Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rui Teng <rui.teng@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanlong Gao [Sat, 8 Oct 2016 00:01:04 +0000 (17:01 -0700)]
mm: nobootmem: move the comment of free_all_bootmem
Commit b4def3509d18 ("mm, nobootmem: clean-up of free_low_memory_core_early()")
removed the unnecessary nodeid argument, after that, this comment
becomes more confused. We should move it to the right place.
Fixes: b4def3509d18c1db9 ("mm, nobootmem: clean-up of free_low_memory_core_early()") Link: http://lkml.kernel.org/r/1473996082-14603-1-git-send-email-wanlong.gao@gmail.com Signed-off-by: Wanlong Gao <wanlong.gao@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cgroup core and the memory controller need to track socket ownership
for different purposes, but the tracking sites being entirely different
is kind of ugly.
Be a better citizen and rename the memory controller callbacks to match
the cgroup core callbacks, then move them to the same place.
[akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Baoyou Xie [Sat, 8 Oct 2016 00:00:55 +0000 (17:00 -0700)]
mm: move phys_mem_access_prot_allowed() declaration to pgtable.h
We get 1 warning when building kernel with W=1:
drivers/char/mem.c:220:12: warning: no previous prototype for 'phys_mem_access_prot_allowed' [-Wmissing-prototypes]
int __weak phys_mem_access_prot_allowed(struct file *file,
In fact, its declaration is spreading to several header files in
different architecture, but need to be declare in common header file.
So this patch moves phys_mem_access_prot_allowed() to pgtable.h.
Tetsuo Handa [Sat, 8 Oct 2016 00:00:49 +0000 (17:00 -0700)]
mm: don't emit warning from pagefault_out_of_memory()
Commit c32b3cbe0d06 ("oom, PM: make OOM detection in the freezer path
raceless") inserted a WARN_ON() into pagefault_out_of_memory() in order
to warn when we raced with disabling the OOM killer.
Now, patch "oom, suspend: fix oom_killer_disable vs. pm suspend
properly" introduced a timeout for oom_killer_disable(). Even if we
raced with disabling the OOM killer and the system is OOM livelocked,
the OOM killer will be enabled eventually (in 20 seconds by default) and
the OOM livelock will be solved. Therefore, we no longer need to warn
when we raced with disabling the OOM killer.
Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Sat, 8 Oct 2016 00:00:46 +0000 (17:00 -0700)]
mm, compaction: restrict fragindex to costly orders
Fragmentation index and the vm.extfrag_threshold sysctl is meant as a
heuristic to prevent excessive compaction for costly orders (i.e. THP).
It's unlikely to make any difference for non-costly orders, especially
with the default threshold. But we cannot afford any uncertainty for
the non-costly orders where the only alternative to successful
reclaim/compaction is OOM. After the recent patches we are guaranteed
maximum effort without heuristics from compaction before deciding OOM,
and fragindex is the last remaining heuristic. Therefore skip fragindex
altogether for non-costly orders.
Suggested-by: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20160926162025.21555-5-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Sat, 8 Oct 2016 00:00:43 +0000 (17:00 -0700)]
mm, compaction: ignore fragindex from compaction_zonelist_suitable()
The compaction_zonelist_suitable() function tries to determine if
compaction will be able to proceed after sufficient reclaim, i.e.
whether there are enough reclaimable pages to provide enough order-0
freepages for compaction.
This addition of reclaimable pages to the free pages works well for the
order-0 watermark check, but in the fragmentation index check we only
consider truly free pages. Thus we can get fragindex value close to 0
which indicates failure do to lack of memory, and wrongly decide that
compaction won't be suitable even after reclaim.
Instead of trying to somehow adjust fragindex for reclaimable pages,
let's just skip it from compaction_zonelist_suitable().
Link: http://lkml.kernel.org/r/20160926162025.21555-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Sat, 8 Oct 2016 00:00:40 +0000 (17:00 -0700)]
mm, page_alloc: pull no_progress_loops update to should_reclaim_retry()
The should_reclaim_retry() makes decisions based on no_progress_loops,
so it makes sense to also update the counter there. It will be also
consistent with should_compact_retry() and compaction_retries. No
functional change.
[hillf.zj@alibaba-inc.com: fix missing pointer dereferences] Link: http://lkml.kernel.org/r/20160926162025.21555-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Sat, 8 Oct 2016 00:00:37 +0000 (17:00 -0700)]
mm, compaction: make full priority ignore pageblock suitability
Several people have reported premature OOMs for order-2 allocations
(stack) due to OOM rework in 4.7. In the scenario (parallel kernel
build and dd writing to two drives) many pageblocks get marked as
Unmovable and compaction free scanner struggles to isolate free pages.
Joonsoo Kim pointed out that the free scanner skips pageblocks that are
not movable to prevent filling them and forcing non-movable allocations
to fallback to other pageblocks. Such heuristic makes sense to help
prevent long-term fragmentation, but premature OOMs are relatively more
urgent problem. As a compromise, this patch disables the heuristic only
for the ultimate compaction priority.
Link: http://lkml.kernel.org/r/20160906135258.18335-5-vbabka@suse.cz Reported-by: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com> Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com> Reported-by: Olaf Hering <olaf@aepfle.de> Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Sat, 8 Oct 2016 00:00:34 +0000 (17:00 -0700)]
mm, compaction: restrict full priority to non-costly orders
The new ultimate compaction priority disables some heuristics, which may
result in excessive cost. This is fine for non-costly orders where we
want to try hard before resulting for OOM, but might be disruptive for
costly orders which do not trigger OOM and should generally have some
fallback. Thus, we disable the full priority for costly orders.
Suggested-by: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20160906135258.18335-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Sat, 8 Oct 2016 00:00:31 +0000 (17:00 -0700)]
mm, compaction: more reliably increase direct compaction priority
During reclaim/compaction loop, compaction priority can be increased by
the should_compact_retry() function, but the current code is not
optimal. Priority is only increased when compaction_failed() is true,
which means that compaction has scanned the whole zone. This may not
happen even after multiple attempts with a lower priority due to
parallel activity, so we might needlessly struggle on the lower
priorities and possibly run out of compaction retry attempts in the
process.
After this patch we are guaranteed at least one attempt at the highest
compaction priority even if we exhaust all retries at the lower
priorities.
Link: http://lkml.kernel.org/r/20160906135258.18335-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Sat, 8 Oct 2016 00:00:28 +0000 (17:00 -0700)]
Revert "mm, oom: prevent premature OOM killer invocation for high order request"
Patch series "reintroduce compaction feedback for OOM decisions".
After several people reported OOM's for order-2 allocations in 4.7 due
to Michal Hocko's OOM rework, he reverted the part that considered
compaction feedback [1] in the decisions to retry reclaim/compaction.
This was to provide a fix quickly for 4.8 rc and 4.7 stable series,
while mmotm had an almost complete solution that instead improved
compaction reliability.
This series completes the mmotm solution and reintroduces the compaction
feedback into OOM decisions. The first two patches restore the state of
mmotm before the temporary solution was merged, the last patch should be
the missing piece for reliability. The third patch restricts the
hardened compaction to non-costly orders, since costly orders don't
result in OOMs in the first place.
Commit 6b4e3181d7bd ("mm, oom: prevent premature OOM killer invocation
for high order request") was intended as a quick fix of OOM regressions
for 4.8 and stable 4.7.x kernels. For a better long-term solution, we
still want to consider compaction feedback, which should be possible
after some more improvements in the following patches.
Huang Ying [Sat, 8 Oct 2016 00:00:24 +0000 (17:00 -0700)]
mm: remove page_file_index
After using the offset of the swap entry as the key of the swap cache,
the page_index() becomes exactly same as page_file_index(). So the
page_file_index() is removed and the callers are changed to use
page_index() instead.
Link: http://lkml.kernel.org/r/1473270649-27229-2-git-send-email-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huang Ying [Sat, 8 Oct 2016 00:00:21 +0000 (17:00 -0700)]
mm, swap: use offset of swap entry as key of swap cache
This patch is to improve the performance of swap cache operations when
the type of the swap device is not 0. Originally, the whole swap entry
value is used as the key of the swap cache, even though there is one
radix tree for each swap device. If the type of the swap device is not
0, the height of the radix tree of the swap cache will be increased
unnecessary, especially on 64bit architecture. For example, for a 1GB
swap device on the x86_64 architecture, the height of the radix tree of
the swap cache is 11. But if the offset of the swap entry is used as
the key of the swap cache, the height of the radix tree of the swap
cache is 4. The increased height causes unnecessary radix tree
descending and increased cache footprint.
This patch reduces the height of the radix tree of the swap cache via
using the offset of the swap entry instead of the whole swap entry value
as the key of the swap cache. In 32 processes sequential swap out test
case on a Xeon E5 v3 system with RAM disk as swap, the lock contention
for the spinlock of the swap cache is reduced from 20.15% to 12.19%,
when the type of the swap device is 1.
Dan Williams [Sat, 8 Oct 2016 00:00:18 +0000 (17:00 -0700)]
mm: fix cache mode tracking in vm_insert_mixed()
vm_insert_mixed() unlike vm_insert_pfn_prot() and vmf_insert_pfn_pmd(),
fails to check the pgprot_t it uses for the mapping against the one
recorded in the memtype tracking tree. Add the missing call to
track_pfn_insert() to preclude cases where incompatible aliased mappings
are established for a given physical address range.
James Morse [Sat, 8 Oct 2016 00:00:12 +0000 (17:00 -0700)]
mm/memcontrol.c: make the walk_page_range() limit obvious
mem_cgroup_count_precharge() and mem_cgroup_move_charge() both call
walk_page_range() on the range 0 to ~0UL, neither provide a pte_hole
callback, which causes the current implementation to skip non-vma
regions. This is all fine but follow up changes would like to make
walk_page_range more generic so it is better to be explicit about which
range to traverse so let's use highest_vm_end to explicitly traverse
only user mmaped memory.
[mhocko@kernel.org: rewrote changelog] Link: http://lkml.kernel.org/r/1472655897-22532-1-git-send-email-james.morse@arm.com Signed-off-by: James Morse <james.morse@arm.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aaron Lu [Sat, 8 Oct 2016 00:00:08 +0000 (17:00 -0700)]
thp: reduce usage of huge zero page's atomic counter
The global zero page is used to satisfy an anonymous read fault. If
THP(Transparent HugePage) is enabled then the global huge zero page is
used. The global huge zero page uses an atomic counter for reference
counting and is allocated/freed dynamically according to its counter
value.
CPU time spent on that counter will greatly increase if there are a lot
of processes doing anonymous read faults. This patch proposes a way to
reduce the access to the global counter so that the CPU load can be
reduced accordingly.
To do this, a new flag of the mm_struct is introduced:
MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only need to touch
the global counter in two cases:
1 The first time it uses the global huge zero page;
2 The time when mm_user of its mm_struct reaches zero.
Note that right now, the huge zero page is eligible to be freed as soon
as its last use goes away. With this patch, the page will not be
eligible to be freed until the exit of the last process from which it
was ever used.
And with the use of mm_user, the kthread is not eligible to use huge
zero page either. Since no kthread is using huge zero page today, there
is no difference after applying this patch. But if that is not desired,
I can change it to when mm_count reaches zero.
Case used for test on Haswell EP:
usemem -n 72 --readonly -j 0x200000 100G
Which spawns 72 processes and each will mmap 100G anonymous space and
then do read only access to that space sequentially with a step of 2MB.
CPU cycles from perf report for base commit:
54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
CPU cycles from perf report for this commit:
0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page
Performance(throughput) of the workload for base commit: 1784430792
Performance(throughput) of the workload for this commit: 4726928591
164% increase.
Runtime of the workload for base commit: 707592 us
Runtime of the workload for this commit: 303970 us
50% drop.
Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
James Morse [Sat, 8 Oct 2016 00:00:06 +0000 (17:00 -0700)]
fs/proc/task_mmu.c: make the task_mmu walk_page_range() limit in clear_refs_write() obvious
Trying to walk all of virtual memory requires architecture specific
knowledge. On x86_64, addresses must be sign extended from bit 48,
whereas on arm64 the top VA_BITS of address space have their own set of
page tables.
clear_refs_write() calls walk_page_range() on the range 0 to ~0UL, it
provides a test_walk() callback that only expects to be walking over
VMAs. Currently walk_pmd_range() will skip memory regions that don't
have a VMA, reporting them as a hole.
As this call only expects to walk user address space, make it walk 0 to
'highest_vm_end'.
Tim Chen [Sat, 8 Oct 2016 00:00:02 +0000 (17:00 -0700)]
cpu: fix node state for whether it contains CPU
In current kernel code, we only call node_set_state(cpu_to_node(cpu),
N_CPU) when a cpu is hot plugged. But we do not set the node state for
N_CPU when the cpus are brought online during boot.
So this could lead to failure when we check to see if a node contains
cpu with node_state(node_id, N_CPU).
One use case is in the node_reclaime function:
/*
* Only run node reclaim on the local node or on nodes that do
* not
* have associated processors. This will favor the local
* processor
* over remote processors and spread off node memory allocations
* as wide as possible.
*/
if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id !=
numa_node_id())
return NODE_RECLAIM_NOSCAN;
I instrumented the kernel to call this function after boot and it always
returns 0 on a x86 desktop machine until I apply the attached patch.
int num_cpu_node(void)
{
int i, nr_cpu_nodes = 0;
for_each_node(i) {
if (node_state(i, N_CPU))
++ nr_cpu_nodes;
}
return nr_cpu_nodes;
}
Fix this by checking each node for online CPU when we initialize
vmstat that's responsible for maintaining node state.
Link: http://lkml.kernel.org/r/20160829175922.GA21775@linux.intel.com Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: <Huang@linux.intel.com> Cc: Ying <ying.huang@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Toshi Kani [Fri, 7 Oct 2016 23:59:59 +0000 (16:59 -0700)]
ext2/4, xfs: call thp_get_unmapped_area() for pmd mappings
To support DAX pmd mappings with unmodified applications, filesystems
need to align an mmap address by the pmd size.
Call thp_get_unmapped_area() from f_op->get_unmapped_area.
Note, there is no change in behavior for a non-DAX file.
Link: http://lkml.kernel.org/r/1472497881-9323-3-git-send-email-toshi.kani@hpe.com Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Toshi Kani [Fri, 7 Oct 2016 23:59:56 +0000 (16:59 -0700)]
thp, dax: add thp_get_unmapped_area for pmd mappings
When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using pmd page size.
This feature relies on both mmap virtual address and FS block (i.e.
physical address) to be aligned by the pmd page size. Users can use
mkfs options to specify FS to align block allocations. However,
aligning mmap address requires code changes to existing applications for
providing a pmd-aligned address to mmap().
For instance, fio with "ioengine=mmap" performs I/Os with mmap() [1].
It calls mmap() with a NULL address, which needs to be changed to
provide a pmd-aligned address for testing with DAX pmd mappings.
Changing all applications that call mmap() with NULL is undesirable.
Add thp_get_unmapped_area(), which can be called by filesystem's
get_unmapped_area to align an mmap address by the pmd size for a DAX
file. It calls the default handler, mm->get_unmapped_area(), to find a
range and then aligns it for a DAX file.
The patch is based on Matthew Wilcox's change that allows adding support
of the pud page size easily.
[1]: https://github.com/axboe/fio/blob/master/engines/mmap.c Link: http://lkml.kernel.org/r/1472497881-9323-2-git-send-email-toshi.kani@hpe.com Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simon Guo [Fri, 7 Oct 2016 23:59:40 +0000 (16:59 -0700)]
mm: mlock: avoid increase mm->locked_vm on mlock() when already mlock2(,MLOCK_ONFAULT)
When one vma was with flag VM_LOCKED|VM_LOCKONFAULT (by invoking
mlock2(,MLOCK_ONFAULT)), it can again be populated with mlock() with
VM_LOCKED flag only.
There is a hole in mlock_fixup() which increase mm->locked_vm twice even
the two operations are on the same vma and both with VM_LOCKED flags.
Then check the increase VmLck field in /proc/pid/status(to 128k).
When vma is set with different vm_flags, and the new vm_flags is with
VM_LOCKED, it is not necessarily be a "new locked" vma. This patch
corrects this bug by prevent mm->locked_vm from increment when old
vm_flags is already VM_LOCKED.
Link: http://lkml.kernel.org/r/1472554781-9835-3-git-send-email-wei.guo.simon@gmail.com Signed-off-by: Simon Guo <wei.guo.simon@gmail.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alexey Klimov <klimov.linux@gmail.com> Cc: Eric B Munson <emunson@akamai.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Simon Guo <wei.guo.simon@gmail.com> Cc: Thierry Reding <treding@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simon Guo [Fri, 7 Oct 2016 23:59:36 +0000 (16:59 -0700)]
mm: mlock: check against vma for actual mlock() size
In do_mlock(), the check against locked memory limitation has a hole
which will fail following cases at step 3):
1) User has a memory chunk from addressA with 50k, and user mem lock
rlimit is 64k.
2) mlock(addressA, 30k)
3) mlock(addressA, 40k)
The 3rd step should have been allowed since the 40k request is
intersected with the previous 30k at step 2), and the 3rd step is
actually for mlock on the extra 10k memory.
This patch checks vma to caculate the actual "new" mlock size, if
necessary, and ajust the logic to fix this issue.
[akpm@linux-foundation.org: clean up comment layout]
[wei.guo.simon@gmail.com: correct a typo in count_mm_mlocked_page_nr()] Link: http://lkml.kernel.org/r/1473325970-11393-2-git-send-email-wei.guo.simon@gmail.com Link: http://lkml.kernel.org/r/1472554781-9835-2-git-send-email-wei.guo.simon@gmail.com Signed-off-by: Simon Guo <wei.guo.simon@gmail.com> Cc: Alexey Klimov <klimov.linux@gmail.com> Cc: Eric B Munson <emunson@akamai.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Simon Guo <wei.guo.simon@gmail.com> Cc: Thierry Reding <treding@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 7 Oct 2016 23:59:33 +0000 (16:59 -0700)]
oom: warn if we go OOM for higher order and compaction is disabled
Since the lumpy reclaim is gone there is no source of higher order pages
if CONFIG_COMPACTION=n except for the order-0 pages reclaim which is
unreliable for that purpose to say the least. Hitting an OOM for
!costly higher order requests is therefore all not that hard to imagine.
We are trying hard to not invoke OOM killer as much as possible but
there is simply no reliable way to detect whether more reclaim retries
make sense.
Disabling COMPACTION is not widespread but it seems that some users
might have disable the feature without realizing full consequences
(mostly along with disabling THP because compaction used to be THP
mainly thing). This patch just adds a note if the OOM killer was
triggered by higher order request with compaction disabled. This will
help us identifying possible misconfiguration right from the oom report
which is easier than to always keep in mind that somebody might have
disabled COMPACTION without a good reason.
Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huang Ying [Fri, 7 Oct 2016 23:59:30 +0000 (16:59 -0700)]
mm: don't use radix tree writeback tags for pages in swap cache
File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
etc.) to accelerate finding the pages with a specific tag in the radix
tree during inode writeback. But for anonymous pages in the swap cache,
there is no inode writeback. So there is no need to find the pages with
some writeback tags in the radix tree. It is not necessary to touch
radix tree writeback tags for pages in the swap cache.
Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
introduced for address spaces which don't need to update the writeback
tags. The flag is set for swap caches. It may be used for DAX file
systems, etc.
With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s to
~1.48GBps) in the vm-scalability swap-w-seq test case with 8 processes.
The test is done on a Xeon E5 v3 system. The swap device used is a RAM
simulated PMEM (persistent memory) device. The improvement comes from
the reduced contention on the swap cache radix tree lock. To test
sequential swapping out, the test case uses 8 processes, which
sequentially allocate and write to the anonymous pages until RAM and
part of the swap device is used up.
zijun_hu [Fri, 7 Oct 2016 23:59:27 +0000 (16:59 -0700)]
mm/bootmem.c: replace kzalloc() by kzalloc_node()
In ___alloc_bootmem_node_nopanic(), replace kzalloc() by kzalloc_node()
in order to allocate memory within given node preferentially when slab
is available
- the same ARCH_LOW_ADDRESS_LIMIT statements are duplicated between
header and relevant source
- don't ensure ARCH_LOW_ADDRESS_LIMIT perhaps defined by ARCH in
asm/processor.h is preferred over default in linux/bootmem.h
completely since the former header isn't included by the latter
Currently significant amount of memory is reserved only in kernel booted
to capture kernel dump using the fa_dump method.
Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT will initialize
only certain size memory per node. The certain size takes into account
the dentry and inode cache sizes. Currently the cache sizes are
calculated based on the total system memory including the reserved
memory. However such a kernel when booting the same kernel as fadump
kernel will not be able to allocate the required amount of memory to
suffice for the dentry and inode caches. This results in crashes like
Hence only implement arch_reserved_kernel_pages() for CONFIG_FA_DUMP
configurations. The amount reserved will be reduced while calculating
the large caches and will avoid crashes like the below on large systems
such as 32 TB systems.
The total reserved memory in a system is accounted but not available for
use use outside mm/memblock.c. By exposing the total reserved memory,
systems can better calculate the size of large hashes.
Link: http://lkml.kernel.org/r/1472476010-4709-3-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Suggested-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently arch specific code can reserve memory blocks but
alloc_large_system_hash() may not take it into consideration when sizing
the hashes. This can lead to bigger hash than required and lead to no
available memory for other purposes. This is specifically true for
systems with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
One approach to solve this problem would be to walk through the memblock
regions and calculate the available memory and base the size of hash
system on the available memory.
The other approach would be to depend on the architecture to provide the
number of pages that are reserved. This change provides hooks to allow
the architecture to provide the required info.
Link: http://lkml.kernel.org/r/1472476010-4709-2-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Suggested-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>