Revert "mmc: core: Convert mmc_driver to device_driver"
This reverts commit 6685ac62b2f0 ("mmc: core: Convert mmc_driver to
device_driver")
The reverted commit went too far in simplifing the device driver parts
for mmc.
Let's restore the old mmc_driver to enable driver core to sooner
or later to remove the ->probe(), ->remove() and ->shutdown() callbacks
from the struct device_driver.
Note that, the old ->suspend|resume() callbacks in the struct
mmc_driver don't need to be restored, since the mmc block layer has
converted to the modern system PM ops.
mmc: pwrseq: Fix error code propagation in mmc_pwrseq_simple_alloc()
If the struct mmc_pwrseq_match .alloc function used to allocate a
struct mmc_pwrseq fails, the error is propagated to mmc_of_parse().
But instead of returning the error code in pwrseq, host->pwrseq is
returned which will always be 0. So mmc_of_parse() succeeds even if
the pwrseq .alloc function failed and host->pwrseq is NULL.
This makes the SDIO device to not be powered if the power sequencing
.alloc functions wants to be deferred due a missing resource because
the mmc controller driver probe did wrongly succeed.
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull third hunk of vfs changes from Al Viro:
"This contains the ->direct_IO() changes from Omar + saner
generic_write_checks() + dealing with fcntl()/{read,write}() races
(mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of
repeatedly looking at ->f_flags, which can be changed by fcntl(2),
check ->ki_flags - which cannot) + infrastructure bits for dhowells'
d_inode annotations + Christophs switch of /dev/loop to
vfs_iter_write()"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits)
block: loop: switch to VFS ITER_BVEC
configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
VFS: Make pathwalk use d_is_reg() rather than S_ISREG()
VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
NFS: Don't use d_inode as a variable name
VFS: Impose ordering on accesses of d_inode and d_flags
VFS: Add owner-filesystem positive/negative dentry checks
nfs: generic_write_checks() shouldn't be done on swapout...
ocfs2: use __generic_file_write_iter()
mirror O_APPEND and O_DIRECT into iocb->ki_flags
switch generic_write_checks() to iocb and iter
ocfs2: move generic_write_checks() before the alignment checks
ocfs2_file_write_iter: stop messing with ppos
udf_file_write_iter: reorder and simplify
fuse: ->direct_IO() doesn't need generic_write_checks()
ext4_file_write_iter: move generic_write_checks() up
xfs_file_aio_write_checks: switch to iocb/iov_iter
generic_write_checks(): drop isblk argument
blkdev_write_iter: expand generic_file_checks() call in there
...
Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota and udf updates from Jan Kara:
"The pull contains quota changes which complete unification of XFS and
VFS quota interfaces (so tools can use either interface to manipulate
any filesystem). There's also a patch to support project quotas in
VFS quota subsystem from Li Xi.
Finally there's a bunch of UDF fixes and cleanups and tiny cleanup in
reiserfs & ext3"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits)
udf: Update ctime and mtime when directory is modified
udf: return correct errno for udf_update_inode()
ext3: Remove useless condition in if statement.
vfs: Add general support to enforce project quota limits
reiserfs: fix __RASSERT format string
udf: use int for allocated blocks instead of sector_t
udf: remove redundant buffer_head.h includes
udf: remove else after return in __load_block_bitmap()
udf: remove unused variable in udf_table_free_blocks()
quota: Fix maximum quota limit settings
quota: reorder flags in quota state
quota: paranoia: check quota tree root
quota: optimize i_dquot access
quota: Hook up Q_XSETQLIM for id 0 to ->set_info
xfs: Add support for Q_SETINFO
quota: Make ->set_info use structure with neccesary info to VFS and XFS
quota: Remove ->get_xstate and ->get_xstatev callbacks
gfs2: Convert to using ->get_state callback
xfs: Convert to using ->get_state callback
quota: Wire up Q_GETXSTATE and Q_GETXSTATV calls to work with ->get_state
...
Merge branch 'for-4.1/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
"This is the block driver pull request for 4.1. As with the core bits,
this is a relatively slow round. This pull request contains:
- Various fixes and cleanups for NVMe, from Alexey Khoroshilov, Chong
Yuan, myself, Keith Busch, and Murali Iyer.
- Documentation and code cleanups for nbd from Markus Pargmann.
- Change of brd maintainer to me, from Ross Zwisler. At least the
email doesn't bounce anymore then.
- Two xen-blkback fixes from Tao Chen"
* 'for-4.1/drivers' of git://git.kernel.dk/linux-block: (23 commits)
NVMe: Meta data handling through submit io ioctl
NVMe: Add translation for block limits
NVMe: Remove check for null
NVMe: Fix error handling of class_create("nvme")
xen-blkback: define pr_fmt macro to avoid the duplication of DRV_PFX
xen-blkback: enlarge the array size of blkback name
nbd: Return error pointer directly
nbd: Return error code directly
nbd: Remove fixme that was already fixed
nbd: Restructure debugging prints
nbd: Fix device bytesize type
nbd: Replace kthread_create with kthread_run
nbd: Remove kernel internal header
Documentation: nbd: Add list of module parameters
Documentation: nbd: Reformat to allow more documentation
NVMe: increase depth of admin queue
nvme: Fix PRP list calculation for non-4k system page size
NVMe: Fix blk-mq hot cpu notification
NVMe: embedded iod mask cleanup
NVMe: Freeze admin queue on device failure
...
Merge branch 'for-4.1/core' of git://git.kernel.dk/linux-block
Pull block layer core bits from Jens Axboe:
"This is the core pull request for 4.1. Not a lot of stuff in here for
this round, mostly little fixes or optimizations. This pull request
contains:
- An optimization that speeds up queue runs on blk-mq, especially for
the case where there's a large difference between nr_cpu_ids and
the actual mapped software queues on a hardware queue. From Chong
Yuan.
- Honor node local allocations for requests on legacy devices. From
David Rientjes.
- Cleanup of blk_mq_rq_to_pdu() from me.
- exit_aio() fixup from me, greatly speeding up exiting multiple IO
contexts off exit_group(). For my particular test case, fio exit
took ~6 seconds. A typical case of both exposing RCU grace periods
to user space, and serializing exit of them.
- Make blk_mq_queue_enter() honor the gfp mask passed in, so we only
wait if __GFP_WAIT is set. From Keith Busch.
- blk-mq exports and two added helpers from Mike Snitzer, which will
be used by the dm-mq code.
- Cleanups of blk-mq queue init from Wei Fang and Xiaoguang Wang"
* 'for-4.1/core' of git://git.kernel.dk/linux-block:
blk-mq: reduce unnecessary software queue looping
aio: fix serial draining in exit_aio()
blk-mq: cleanup blk_mq_rq_to_pdu()
blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()
block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set()
block: allocate request memory local to request queue
blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set
blk-mq: export blk_mq_run_hw_queues
blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is the usual grab bag of driver updates (lpfc, qla2xxx, storvsc,
aacraid, ipr) plus an assortment of minor updates. There's also a
major update to aic1542 which moves the driver into this millenium"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (106 commits)
change SCSI Maintainer email
sd, mmc, virtio_blk, string_helpers: fix block size units
ufs: add support to allow non standard behaviours (quirks)
ufs-qcom: save controller revision info in internal structure
qla2xxx: Update driver version to 8.07.00.18-k
qla2xxx: Restore physical port WWPN only, when port down detected for FA-WWPN port.
qla2xxx: Fix virtual port configuration, when switch port is disabled/enabled.
qla2xxx: Prevent multiple firmware dump collection for ISP27XX.
qla2xxx: Disable Interrupt handshake for ISP27XX.
qla2xxx: Add debugging info for MBX timeout.
qla2xxx: Add serdes read/write support for ISP27XX
qla2xxx: Add udev notification to save fw dump for ISP27XX
qla2xxx: Add message for sucessful FW dump collected for ISP27XX.
qla2xxx: Add support to load firmware from file for ISP 26XX/27XX.
qla2xxx: Fix beacon blink for ISP27XX.
qla2xxx: Increase the wait time for firmware to be ready for P3P.
qla2xxx: Fix crash due to wrong casting of reg for ISP27XX.
qla2xxx: Fix warnings reported by static checker.
lpfc: Update version to 10.5.0.0 for upstream patch set
lpfc: Update copyright to 2015
...
Merge branch 'mailbox-for-next' of git://git.linaro.org/landing-teams/working/fujitsu/integration
Pull mailbox updates from Jassi Brar.
* 'mailbox-for-next' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
mailbox: arm_mhu: add driver for ARM MHU controller
Mailbox: Restructure and simplify PCC mailbox code
Merge branch 'for-v4.1-rc1' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping
Pull DMA-mapping updates from Marek Szyprowski:
"This contains two patches, which clarify abiguity in the dma-mapping
api"
* 'for-v4.1-rc1' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
include/dma-mapping: Clarify output of dma_map_sg
asm/dma-mapping-common: Clarify output of dma_map_sg_attrs
Merge tag 'stable/for-linus-4.1-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen features and fixes from David Vrabel:
- use a single source list of hypercalls, generating other tables etc.
at build time.
- add a "Xen PV" APIC driver to support >255 VCPUs in PV guests.
- significant performance improve to guest save/restore/migration.
- scsiback/front save/restore support.
- infrastructure for multi-page xenbus rings.
- misc fixes.
* tag 'stable/for-linus-4.1-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/pci: Try harder to get PXM information for Xen
xenbus_client: Extend interface to support multi-page ring
xen-pciback: also support disabling of bus-mastering and memory-write-invalidate
xen: support suspend/resume in pvscsi frontend
xen: scsiback: add LUN of restored domain
xen-scsiback: define a pr_fmt macro with xen-pvscsi
xen/mce: fix up xen_late_init_mcelog() error handling
xen/privcmd: improve performance of MMAPBATCH_V2
xen: unify foreign GFN map/unmap for auto-xlated physmap guests
x86/xen/apic: WARN with details.
x86/xen: Provide a "Xen PV" APIC driver to support >255 VCPUs
xen/pciback: Don't print scary messages when unsupported by hypervisor.
xen: use generated hypercall symbols in arch/x86/xen/xen-head.S
xen: use generated hypervisor symbols in arch/x86/xen/trace.c
xen: synchronize include/xen/interface/xen.h with xen
xen: build infrastructure for generating hypercall depending symbols
xen: balloon: Use static attribute groups for sysfs entries
xen: pcpu: Use static attribute groups for sysfs entry
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"Here are the core arm64 updates for 4.1.
Highlights include a significant rework to head.S (allowing us to boot
on machines with physical memory at a really high address), an AES
performance boost on Cortex-A57 and the ability to run a 32-bit
userspace with 64k pages (although this requires said userspace to be
built with a recent binutils).
The head.S rework spilt over into KVM, so there are some changes under
arch/arm/ which have been acked by Marc Zyngier (KVM co-maintainer).
In particular, the linker script changes caused us some issues in
-next, so there are a few merge commits where we had to apply fixes on
top of a stable branch.
Other changes include:
- AES performance boost for Cortex-A57
- AArch32 (compat) userspace with 64k pages
- Cortex-A53 erratum workaround for #845719
- defconfig updates (new platforms, PCI, ...)"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (39 commits)
arm64: fix midr range for Cortex-A57 erratum 832075
arm64: errata: add workaround for cortex-a53 erratum #845719
arm64: Use bool function return values of true/false not 1/0
arm64: defconfig: updates for 4.1
arm64: Extract feature parsing code from cpu_errata.c
arm64: alternative: Allow immediate branch as alternative instruction
arm64: insn: Add aarch64_insn_decode_immediate
ARM: kvm: round HYP section to page size instead of log2 upper bound
ARM: kvm: assert on HYP section boundaries not actual code size
arm64: head.S: ensure idmap_t0sz is visible
arm64: pmu: add support for interrupt-affinity property
dt: pmu: extend ARM PMU binding to allow for explicit interrupt affinity
arm64: head.S: ensure visibility of page tables
arm64: KVM: use ID map with increased VA range if required
arm64: mm: increase VA range of identity map
ARM: kvm: implement replacement for ld's LOG2CEIL()
arm64: proc: remove unused cpu_get_pgd macro
arm64: enforce x1|x2|x3 == 0 upon kernel entry as per boot protocol
arm64: remove __calc_phys_offset
arm64: merge __enable_mmu and __turn_mmu_on
...
Merge tag 'powerpc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull powerpc updates from Michael Ellerman:
- Numerous minor fixes, cleanups etc.
- More EEH work from Gavin to remove its dependency on device_nodes.
- Memory hotplug implemented entirely in the kernel from Nathan
Fontenot.
- Removal of redundant CONFIG_PPC_OF by Kevin Hao.
- Rewrite of VPHN parsing logic & tests from Greg Kurz.
- A fix from Nish Aravamudan to reduce memory usage by clamping
nodes_possible_map.
- Support for pstore on powernv from Hari Bathini.
- Removal of old powerpc specific byte swap routines by David Gibson.
- Fix from Vasant Hegde to prevent the flash driver telling you it was
flashing your firmware when it wasn't.
- Patch from Ben Herrenschmidt to add an OPAL heartbeat driver.
- Fix for an oops causing get/put_cpu_var() imbalance in perf by Jan
Stancek.
- Some fixes for migration from Tyrel Datwyler.
- A new syscall to switch the cpu endian by Michael Ellerman.
- Large series from Wei Yang to implement SRIOV, reviewed and acked by
Bjorn.
- A fix for the OPAL sensor driver from Cédric Le Goater.
- Fixes to get STRICT_MM_TYPECHECKS building again by Michael Ellerman.
- Large series from Daniel Axtens to make our PCI hooks per PHB rather
than per machine.
- Small patch from Sam Bobroff to explicitly abort non-suspended
transactions on syscalls, plus a test to exercise it.
- Numerous reworks and fixes for the 24x7 PMU from Sukadev Bhattiprolu.
- Small patch to enable the hard lockup detector from Anton Blanchard.
- Fix from Dave Olson for missing L2 cache information on some CPUs.
- Some fixes from Michael Ellerman to get Cell machines booting again.
- Freescale updates from Scott: Highlights include BMan device tree
nodes, an MSI erratum workaround, a couple minor performance
improvements, config updates, and misc fixes/cleanup.
* tag 'powerpc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (196 commits)
powerpc/powermac: Fix build error seen with powermac smp builds
powerpc/pseries: Fix compile of memory hotplug without CONFIG_MEMORY_HOTREMOVE
powerpc: Remove PPC32 code from pseries specific find_and_init_phbs()
powerpc/cell: Fix iommu breakage caused by controller_ops change
powerpc/eeh: Fix crash in eeh_add_device_early() on Cell
powerpc/perf: Cap 64bit userspace backtraces to PERF_MAX_STACK_DEPTH
powerpc/perf/hv-24x7: Fail 24x7 initcall if create_events_from_catalog() fails
powerpc/pseries: Correct memory hotplug locking
powerpc: Fix missing L2 cache size in /sys/devices/system/cpu
powerpc: Add ppc64 hard lockup detector support
oprofile: Disable oprofile NMI timer on ppc64
powerpc/perf/hv-24x7: Add missing put_cpu_var()
powerpc/perf/hv-24x7: Break up single_24x7_request
powerpc/perf/hv-24x7: Define update_event_count()
powerpc/perf/hv-24x7: Whitespace cleanup
powerpc/perf/hv-24x7: Define add_event_to_24x7_request()
powerpc/perf/hv-24x7: Rename hv_24x7_event_update
powerpc/perf/hv-24x7: Move debug prints to separate function
powerpc/perf/hv-24x7: Drop event_24x7_request()
powerpc/perf/hv-24x7: Use pr_devel() to log message
...
Commit 9c521a200bc3 ("crypto: api - remove instance when test failed")
tried to grab a module reference count before the module was even set.
Worse, it then goes on to free the module reference count after it is
set so you quickly end up with a negative module reference count which
prevents people from using any instances belonging to that module.
This patch moves the module initialisation before the reference
count.
* akpm: (114 commits)
parisc: remove use of seq_printf return value
lru_cache: remove use of seq_printf return value
tracing: remove use of seq_printf return value
cgroup: remove use of seq_printf return value
proc: remove use of seq_printf return value
s390: remove use of seq_printf return value
cris fasttimer: remove use of seq_printf return value
cris: remove use of seq_printf return value
openrisc: remove use of seq_printf return value
ARM: plat-pxa: remove use of seq_printf return value
nios2: cpuinfo: remove use of seq_printf return value
microblaze: mb: remove use of seq_printf return value
ipc: remove use of seq_printf return value
rtc: remove use of seq_printf return value
power: wakeup: remove use of seq_printf return value
x86: mtrr: if: remove use of seq_printf return value
linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
MAINTAINERS: CREDITS: remove Stefano Brivio from B43
.mailmap: add Ricardo Ribalda
CREDITS: add Ricardo Ribalda Delgado
...
Joe Perches [Wed, 15 Apr 2015 23:18:22 +0000 (16:18 -0700)]
tracing: remove use of seq_printf return value
The seq_printf return value, because it's frequently misused,
will eventually be converted to void.
See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
seq_has_overflowed() and make public")
Miscellanea:
o Remove unused return value from trace_lookup_stack
Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 15 Apr 2015 23:18:02 +0000 (16:18 -0700)]
ARM: plat-pxa: remove use of seq_printf return value
The seq_printf return value, because it's frequently misused,
(as it is here, it doesn't return # of chars emitted) will
eventually be converted to void.
See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
seq_has_overflowed() and make public")
Signed-off-by: Joe Perches <joe@perches.com> Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 15 Apr 2015 23:17:48 +0000 (16:17 -0700)]
power: wakeup: remove use of seq_printf return value
The seq_printf return value, because it's frequently misused,
will eventually be converted to void.
See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
seq_has_overflowed() and make public")
Signed-off-by: Joe Perches <joe@perches.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <len.brown@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The macro BITMAP_LAST_WORD_MASK can be implemented without a conditional,
which will generally lead to slightly better generated code (221 bytes
saved for allmodconfig-GCOV_KERNEL, ~2k with GCOV_KERNEL). As a small
bonus, this also ensures that the nbits parameter is expanded exactly
once.
In BITMAP_FIRST_WORD_MASK, if start is signed gcc is technically allowed
to assume it is positive (or divisible by BITS_PER_LONG), and hence just
do the simple mask. It doesn't seem to use this, and even on an
architecture like x86 where the shift only depends on the lower 5 or 6
bits, and these bits are not affected by the signedness of the expression,
gcc still generates code to compute the C99 mandated value of start %
BITS_PER_LONG. So just use a mask explicitly, also for consistency with
BITMAP_LAST_WORD_MASK.
Work and Home computer had different settings in the mail client. Some
contributions appear as Ricardo Ribalda, others as Ricardo Ribalda Delgado
(and one as just Ricardo).
lib/string_helpers.c: change semantics of string_escape_mem
The current semantics of string_escape_mem are inadequate for one of its
current users, vsnprintf(). If that is to honour its contract, it must
know how much space would be needed for the entire escaped buffer, and
string_escape_mem provides no way of obtaining that (short of allocating a
large enough buffer (~4 times input string) to let it play with, and
that's definitely a big no-no inside vsnprintf).
So change the semantics for string_escape_mem to be more snprintf-like:
Return the size of the output that would be generated if the destination
buffer was big enough, but of course still only write to the part of dst
it is allowed to, and (contrary to snprintf) don't do '\0'-termination.
It is then up to the caller to detect whether output was truncated and to
append a '\0' if desired. Also, we must output partial escape sequences,
otherwise a call such as snprintf(buf, 3, "%1pE", "\123") would cause
printf to write a \0 to buf[2] but leaving buf[0] and buf[1] with whatever
they previously contained.
This also fixes a bug in the escaped_string() helper function, which used
to unconditionally pass a length of "end-buf" to string_escape_mem();
since the latter doesn't check osz for being insanely large, it would
happily write to dst. For example, kasprintf(GFP_KERNEL, "something and
then %pE", ...); is an easy way to trigger an oops.
In test-string_helpers.c, the -ENOMEM test is replaced with testing for
getting the expected return value even if the buffer is too small. We
also ensure that nothing is written (by relying on a NULL pointer deref)
if the output size is 0 by passing NULL - this has to work for
kasprintf("%pE") to work.
In net/sunrpc/cache.c, I think qword_add still has the same semantics.
Someone should definitely double-check this.
In fs/proc/array.c, I made the minimum possible change, but longer-term it
should stop poking around in seq_file internals.
[andriy.shevchenko@linux.intel.com: simplify qword_add]
[andriy.shevchenko@linux.intel.com: add missed curly braces] Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When printf is given the format specifier %pE, it needs a way of obtaining
the total output size that would be generated if the buffer was large
enough, and string_escape_mem doesn't easily provide that. This is a
refactorization of string_escape_mem in preparation of changing its
external API to provide that information.
The somewhat ugly early returns and subsequent seemingly redundant
conditionals are to make the following patch touch as little as possible
in string_helpers.c while still preserving the current behaviour of never
outputting partial escape sequences. That behaviour must also change for
%pE to work as one expects from every other printf specifier.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lib/vsprintf.c: fix potential NULL deref in hex_string
The helper hex_string() is broken in two ways. First, it doesn't
increment buf regardless of whether there is room to print, so callers
such as kasprintf() that try to probe the correct storage to allocate will
get a too small return value. But even worse, kasprintf() (and likely
anyone else trying to find the size of the result) pass NULL for buf and 0
for size, so we also have end == NULL. But this means that the end-1 in
hex_string() is (char*)-1, so buf < end-1 is true and we get a NULL
pointer deref. I double-checked this with a trivial kernel module that
just did a kasprintf(GFP_KERNEL, "%14ph", "CrashBoomBang").
Nobody seems to be using %ph with kasprintf, but we might as well fix it
before it hits someone.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lib/vsprintf: add %pC{,n,r} format specifiers for clocks
Add format specifiers for printing struct clk:
- '%pC' or '%pCn': name (Common Clock Framework) or address (legacy
clock framework) of the clock,
- '%pCr': rate of the clock.
[akpm@linux-foundation.org: omit code if !CONFIG_HAVE_CLK] Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Turquette <mturquette@linaro.org> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lib/vsprintf: Move integer format types to the top
Move the format types for 64-bit integers and configurable size integers
to the top, so they're next to the other integer format types. While at
it, add the missing format types for s32 and u32.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Turquette <mturquette@linaro.org> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lib/vsprintf: document %p parameters passed by reference
This patch series improves the documentation for printk() formats, and
adds support for printing clocks. The latter has always been a hassle if
you wanted to support both the common and legacy clock frameworks.
- '%pC' and '%pCn' print the name (Common Clock Framework) or address
(legacy clock framework) of a clock,
- '%pCr' prints the current clock rate.
This patch (of 3):
Make sure all %p extensions that take parameters by references are
documented to do so.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Turquette <mturquette@linaro.org> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gcc doesn't merge or overlap const char[] objects with identical contents
(probably language lawyers would also insist that these things have
different addresses), but there's no reason to have the string
"0123456789ABCDEF" occur in multiple places. hex_asc_upper is declared in
kernel.h and defined in lib/hexdump.c, which is unconditionally compiled
in.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At least since the initial git commit, when base was passed as a separate
parameter, number() has only been called with bases 8, 10 and 16. I'm
guessing that 66 was to accommodate 64 0/1, a sign and a '\0', but the
buffer is only used for the actual digits. Octal digits carry 3 bits of
information, so 24 is enough. Spell that 3*sizeof(num) so one less place
needs to be changed should long long ever be 128 bits. Also remove the
commented-out code that would handle an arbitrary base.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since FORMAT_TYPE_INT is simply 1 more than FORMAT_TYPE_UINT, and
similarly for BYTE/UBYTE, SHORT/USHORT, LONG/ULONG, we can eliminate a few
instructions by making SIGN have the value 1 instead of 2, and then use
arithmetic instead of branches for computing the right spec->type. It's a
little hacky, but certainly in the same spirit as SMALL needing to have
the value 0x20. For example for the spec->qualifier == 'l' case, gcc now
generates
Steven Rostedt [Wed, 15 Apr 2015 23:16:59 +0000 (16:16 -0700)]
printk: comment pr_cont() stating it is only to continue a line
KERN_CONT is nicely commented in kern_levels.h, but pr_cont() is now used
more often, and it lacks the comment stating what it is used for. It can
be confused as continuing the log level, but that is not its purpose. Its
purpose is to continue a line that had no newline enclosed. This should
be documented by pr_cont() as well.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joel Stanley [Wed, 15 Apr 2015 23:16:56 +0000 (16:16 -0700)]
powerpc/powernv: reboot when requested by firmware
Use orderly_reboot so userspace will to shut itself down via the reboot
path. This is required for graceful reboot initiated by the BMC, such as
when a user uses ipmitool to issue a 'chassis power cycle' command.
Signed-off-by: Joel Stanley <joel@jms.id.au> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Fabian Frederick <fabf@skynet.be> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Kerr <jk@ozlabs.org> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joel Stanley [Wed, 15 Apr 2015 23:16:53 +0000 (16:16 -0700)]
kernel/reboot.c: add orderly_reboot for graceful reboot
The kernel has orderly_poweroff which allows the kernel to initiate a
graceful shutdown of userspace, by running /sbin/poweroff. This adds
orderly_reboot that will cause userspace to shut itself down by calling
/sbin/reboot.
This will be used for shutdown initiated by a system controller on
platforms that do not use ACPI.
orderly_reboot() should be used when the system wants to allow userspace
to gracefully shut itself down. For cases where the system may imminently
catch on fire, the existing emergency_restart() provides an immediate
reboot without involving userspace.
Signed-off-by: Joel Stanley <joel@jms.id.au> Cc: Fabian Frederick <fabf@skynet.be> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Kerr <jk@ozlabs.org> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joel Stanley [Wed, 15 Apr 2015 23:16:50 +0000 (16:16 -0700)]
drivers/sbus/char/envctrl.c: ignore orderly_poweroff return value
orderly_poweroff() unconditionally returns 0, so remove the dead code that
checks the return value.
A future patch will change the return type to void.
Signed-off-by: Joel Stanley <joel@jms.id.au> Acked-by: David S. Miller <davem@davemloft.net> Cc: Fabian Frederick <fabf@skynet.be> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Kerr <jk@ozlabs.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/hung_task.c: change hung_task.c to use for_each_process_thread()
In check_hung_uninterruptible_tasks() avoid the use of deprecated
while_each_thread().
The "max_count" logic will prevent a livelock - see commit 0c740d0a
("introduce for_each_thread() to replace the buggy while_each_thread()").
Having said this let's use for_each_process_thread().
Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Iulia Manda [Wed, 15 Apr 2015 23:16:41 +0000 (16:16 -0700)]
kernel: conditionally support non-root users, groups and capabilities
There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root. For these systems,
supporting multiple users is not necessary.
This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional. It is enabled
under CONFIG_EXPERT menu.
When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.
The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.
Also, groups.c is compiled out completely.
In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.
This change saves about 25 KB on a defconfig build. The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB. (The 25k goes down a bit with allnoconfig, but not that much.
The kernel was booted in Qemu. All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.
Commit 607ca46e97a1 ("UAPI: (Scripted) Disintegrate include/linux") left
behind some empty conditional blocks. Since they are useless and may
cause a reader to wonder whether something is missing, remove them.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/PID/status: show all sets of pid according to ns
If some issues occurred inside a container guest, host user could not know
which process is in trouble just by guest pid: the users of container
guest only knew the pid inside containers. This will bring obstacle for
trouble shooting.
This patch adds four fields: NStgid, NSpid, NSpgid and NSsid:
a) In init_pid_ns, nothing changed;
b) In one pidns, will tell the pid inside containers:
NStgid: 21776 5 1
NSpid: 21776 5 1
NSpgid: 21776 5 1
NSsid: 21729 1 0
** Process id is 21776 in level 0, 5 in level 1, 1 in level 2.
c) If pidns is nested, it depends on which pidns are you in.
NStgid: 5 1
NSpid: 5 1
NSpgid: 5 1
NSsid: 1 0
** Views from level 1
zsmalloc: fix fatal corruption due to wrong size class selection
There is no point in overriding the size class below. It causes fatal
corruption on the next chunk on the 3264-bytes size class, which is the
last size class that is not huge.
For example, if the requested size was exactly 3264 bytes, current
zsmalloc allocates and returns a chunk from the size class of 3264 bytes,
not 4096. User access to this chunk may overwrite head of the next
adjacent chunk.
Here is the panic log captured when freelist was corrupted due to this:
Minchan Kim [Wed, 15 Apr 2015 23:16:18 +0000 (16:16 -0700)]
zsmalloc: remove unnecessary insertion/removal of zspage in compaction
In putback_zspage, we don't need to insert a zspage into list of zspage
in size_class again to just fix fullness group. We could do directly
without reinsertion so we could save some instuctions.
Reported-by: Heesub Shin <heesub.shin@samsung.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Ganesh Mahendran <opensource.ganesh@gmail.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Gunho Lee <gunho.lee@lge.com> Cc: Juneho Choi <juno.choi@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A micro-optimization. Avoid additional branching and reduce (a bit)
registry pressure (f.e. s_off += size; d_off += size; may be calculated
twise: first for >= PAGE_SIZE check and later for offset update in "else"
clause).
scripts/bloat-o-meter shows some improvement
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
function old new delta
zs_object_copy 550 540 -10
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add Documentation/ABI/obsolete/sysfs-block-zram file and list obsolete and
deprecated attributes there. The patch also adds additional information
to zram documentation and describes the basic strategy:
- the existing RW nodes will be downgraded to WO nodes (in 4.11)
- deprecated RO sysfs nodes will eventually be removed (in 4.11)
Users will be additionally notified about deprecated attr usage by
pr_warn_once() (added to every deprecated attr _show()), as suggested by
Minchan Kim.
User space is advised to use zram<id>/stat, zram<id>/io_stat and
zram<id>/mm_stat files.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per-device `zram<id>/mm_stat' file provides mm statistics of a particular
zram device in a format similar to block layer statistics. The file
consists of a single line and represents the following stats (separated by
whitespace):
Per-device `zram<id>/io_stat' file provides accumulated I/O statistics of
particular zram device in a format similar to block layer statistics. The
file consists of a single line and represents the following stats
(separated by whitespace):
failed_reads
failed_writes
invalid_io
notify_free
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Briefly describe exported device stat attrs in zram documentation. We
will eventually get rid of per-stat sysfs nodes and, thus, clean up
Documentation/ABI/testing/sysfs-block-zram file, which is the only source
of information about device sysfs nodes.
Add `num_migrated' description, since there is no independent
`num_migrated' sysfs node (and no corresponding sysfs-block-zram entry),
it will be exported via zram<id>/mm_stat file.
At this point we can provide minimal description, because sysfs-block-zram
still contains detailed information.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use bio generic_start_io_acct() and generic_end_io_acct() to account
device's block layer statistics. This will let users to monitor zram
activities using sysstat and similar packages/tools.
Apart from the usual per-stat sysfs attr, zram IO stats are now also
available in '/sys/block/zram<id>/stat' and '/proc/diskstats' files.
We will slowly get rid of per-stat sysfs files.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zram: move compact_store() to sysfs functions area
A cosmetic change. We have a new code layout and keep zram per-device
sysfs store and show functions in one place. Move compact_store() to that
handlers block to conform to current layout.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces rework to zram stats. We have per-stat sysfs nodes,
and it makes things a bit hard to use in user space: it doesn't give an
immediate stats 'snapshot', it requires user space to use more syscalls -
open, read, close for every stat file, with appropriate error checks on
every step, etc.
First, zram now accounts block layer statistics, available in
/sys/block/zram<id>/stat and /proc/diskstats files. So some new stats are
available (see Documentation/block/stat.txt), besides, zram's activities
now can be monitored by sysstat's iostat or similar tools.
Second, group currently exported on per-stat basis nodes into two
categories (files):
-- zram<id>/io_stat
accumulates device's IO stats, that are not accounted by block layer,
and contains:
failed_reads
failed_writes
invalid_io
notify_free
per-stat sysfs nodes are now considered to be deprecated and we plan to
remove them (and clean up some of the existing stat code) in two years (as
of now, there is no warning printed to syslog about deprecated stats being
used). User space is advised to use the above mentioned 3 files.
This patch (of 7):
Remove sysfs `num_migrated' attribute. We are moving away from per-stat
device attrs towards 3 stat files that will accumulate io and mm stats in
a format similar to block layer statistics in /sys/block/<dev>/stat. That
will be easier to use in user space, and reduce the number of syscalls
needed to read zram device statistics.
`num_migrated' will return back in zram<id>/mm_stat file.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 15 Apr 2015 23:15:42 +0000 (16:15 -0700)]
zsmalloc: add fullness into stat
During investigating compaction, fullness information of each class is
helpful for investigating how the compaction works well. With that, we
could know how compaction works well more clear on each size class.
Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Juneho Choi <juno.choi@lge.com> Cc: Gunho Lee <gunho.lee@lge.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 15 Apr 2015 23:15:39 +0000 (16:15 -0700)]
zsmalloc: record handle in page->private for huge object
We store handle on header of each allocated object so it increases the
size of each object by sizeof(unsigned long).
If zram stores 4096 bytes to zsmalloc(ie, bad compression), zsmalloc needs
4104B-class to add handle.
However, 4104B-class has 1-pages_per_zspage so wasted size by internal
fragment is 8192 - 4104, which is terrible.
So this patch records the handle in page->private on such huge object(ie,
pages_per_zspage == 1 && maxobj_per_zspage == 1) instead of header of each
object so we could use 4096B-class, not 4104B-class.
Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Juneho Choi <juno.choi@lge.com> Cc: Gunho Lee <gunho.lee@lge.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 15 Apr 2015 23:15:36 +0000 (16:15 -0700)]
zram: support compaction
Now that zsmalloc supports compaction, zram can use it. For the first
step, this patch exports compact knob via sysfs so user can do compaction
via "echo 1 > /sys/block/zram0/compact".
Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Juneho Choi <juno.choi@lge.com> Cc: Gunho Lee <gunho.lee@lge.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 15 Apr 2015 23:15:33 +0000 (16:15 -0700)]
zsmalloc: adjust ZS_ALMOST_FULL
Curretly, zsmalloc regards a zspage as ZS_ALMOST_EMPTY if the zspage has
under 1/4 used objects(ie, fullness_threshold_frac). It could make result
in loose packing since zsmalloc migrates only ZS_ALMOST_EMPTY zspage out.
This patch changes the rule so that zsmalloc makes zspage which has above
3/4 used object ZS_ALMOST_FULL so it could make tight packing.
Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Juneho Choi <juno.choi@lge.com> Cc: Gunho Lee <gunho.lee@lge.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 15 Apr 2015 23:15:30 +0000 (16:15 -0700)]
zsmalloc: support compaction
This patch provides core functions for migration of zsmalloc. Migraion
policy is simple as follows.
for each size class {
while {
src_page = get zs_page from ZS_ALMOST_EMPTY
if (!src_page)
break;
dst_page = get zs_page from ZS_ALMOST_FULL
if (!dst_page)
dst_page = get zs_page from ZS_ALMOST_EMPTY
if (!dst_page)
break;
migrate(from src_page, to dst_page);
}
}
For migration, we need to identify which objects in zspage are allocated
to migrate them out. We could know it by iterating of freed objects in a
zspage because first_page of zspage keeps free objects singly-linked list
but it's not efficient. Instead, this patch adds a tag(ie,
OBJ_ALLOCATED_TAG) in header of each object(ie, handle) so we could check
whether the object is allocated easily.
This patch adds another status bit in handle to synchronize between user
access through zs_map_object and migration. During migration, we cannot
move objects user are using due to data coherency between old object and
new object.
[akpm@linux-foundation.org: zsmalloc.c needs sched.h for cond_resched()] Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Juneho Choi <juno.choi@lge.com> Cc: Gunho Lee <gunho.lee@lge.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Wed, 15 Apr 2015 23:15:23 +0000 (16:15 -0700)]
zsmalloc: decouple handle and object
Recently, we started to use zram heavily and some of issues
popped.
1) external fragmentation
I got a report from Juneho Choi that fork failed although there are plenty
of free pages in the system. His investigation revealed zram is one of
the culprit to make heavy fragmentation so there was no more contiguous
16K page for pgd to fork in the ARM.
2) non-movable pages
Other problem of zram now is that inherently, user want to use zram as
swap in small memory system so they use zRAM with CMA to use memory
efficiently. However, unfortunately, it doesn't work well because zRAM
cannot use CMA's movable pages unless it doesn't support compaction. I
got several reports about that OOM happened with zram although there are
lots of swap space and free space in CMA area.
3) internal fragmentation
zRAM has started support memory limitation feature to limit memory usage
and I sent a patchset(https://lkml.org/lkml/2014/9/21/148) for VM to be
harmonized with zram-swap to stop anonymous page reclaim if zram consumed
memory up to the limit although there are free space on the swap. One
problem for that direction is zram has no way to know any hole in memory
space zsmalloc allocated by internal fragmentation so zram would regard
swap is full although there are free space in zsmalloc. For solving the
issue, zram want to trigger compaction of zsmalloc before it decides full
or not.
This patchset is first step to support above issues. For that, it adds
indirect layer between handle and object location and supports manual
compaction to solve 3th problem first of all.
After this patchset got merged, next step is to make VM aware of zsmalloc
compaction so that generic compaction will move zsmalloced-pages
automatically in runtime.
In my imaginary experiment(ie, high compress ratio data with heavy swap
in/out on 8G zram-swap), data is as follows,
Before =
zram allocated object : 60212066 bytes
zram total used: 140103680 bytes
ratio: 42.98 percent
MemFree: 840192 kB
Compaction
After =
frag ratio after compaction
zram allocated object : 60212066 bytes
zram total used: 76185600 bytes
ratio: 79.03 percent
MemFree: 901932 kB
Juneho reported below in his real platform with small aging.
So, I think the benefit would be bigger in real aging system
for a long time.
- frag_ratio increased 3% (ie, higher is better)
- memfree increased about 6MB
- In buddy info, Normal 2^3: 4, 2^2: 1: 2^1 increased, Highmem: 2^1 21 increased
frag ratio after swap fragment
used : 156677 kbytes
total: 166092 kbytes
frag_ratio : 94
meminfo before compaction
MemFree: 83724 kB
Node 0, zone Normal 13642 1364 57 10 61 17 9 5 4 0 0
Node 0, zone HighMem 425 29 1 0 0 0 0 0 0 0 0
num_migrated : 23630
compaction done
frag ratio after compaction
used : 156673 kbytes
total: 160564 kbytes
frag_ratio : 97
meminfo after compaction
MemFree: 89060 kB
Node 0, zone Normal 14076 1544 67 14 61 17 9 5 4 0 0
Node 0, zone HighMem 863 50 1 0 0 0 0 0 0 0 0
This patchset adds more logics(about 480 lines) in zsmalloc but when I
tested heavy swapin/out program, the regression for swapin/out speed is
marginal because most of overheads were caused by compress/decompress and
other MM reclaim stuff.
This patch (of 7):
Currently, handle of zsmalloc encodes object's location directly so it
makes support of migration hard.
This patch decouples handle and object via adding indirect layer. For
that, it allocates handle dynamically and returns it to user. The handle
is the address allocated by slab allocation so it's unique and we could
keep object's location in the memory space allocated for handle.
With it, we can change object's position without changing handle itself.
Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Juneho Choi <juno.choi@lge.com> Cc: Gunho Lee <gunho.lee@lge.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dax: use pfn_mkwrite to update c/mtime + freeze protection
From: Yigal Korman <yigal@plexistor.com>
[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.
A new xfstest is submitted for testing this (generic/080)
[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.
Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP
This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) to
get notified when access is a write to a read-only PFN.
This can happen if we mmap() a file then first mmap-read from it to
page-in a read-only PFN, than we mmap-write to the same page.
We need this functionality to fix a DAX bug, where in the scenario above
we fail to set ctime/mtime though we modified the file. An xfstest is
attached to this patchset that shows the failure and the fix. (A DAX
patch will follow)
This functionality is extra important for us, because upon dirtying of a
pmem page we also want to RDMA the page to a remote cluster node.
We define a new pfn_mkwrite and do not reuse page_mkwrite because
1 - The name ;-)
2 - But mainly because it would take a very long and tedious
audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
users. To make sure they do not now CRASH. For example current
DAX code (which this is for) would crash.
If we would want to reuse page_mkwrite, We will need to first
patch all users, so to not-crash-on-no-page. Then enable this
patch. But even if I did that I would not sleep so well at night.
Adding a new vector is the safest thing to do, and is not that
expensive. an extra pointer at a static function vector per driver.
Also the new vector is better for performance, because else we
Will call all current Kernel vectors, so to:
check-ha-no-page-do-nothing and return.
No need to call it from do_shared_fault because do_wp_page is called to
change pte permissions anyway.
Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mempools keep allocated objects in reserved for situations when ordinary
allocation may not be possible to satisfy. These objects shouldn't be
accessed before they leave the pool.
This patch poison elements when get into the pool and unpoison when they
leave it. This will let KASan to detect use-after-free of mempool's
elements.
mm: uninline and cleanup page-mapping related helpers
Most-used page->mapping helper -- page_mapping() -- has already uninlined.
Let's uninline also page_rmapping() and page_anon_vma(). It saves us
depending on configuration around 400 bytes in text:
text data bss dec hex filename
660318 99254 410000 1169572 11d8a4 mm/built-in.o-before
659854 99254 410000 1169108 11d6d4 mm/built-in.o
I also tried to make code a bit more clean.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/memblock.c: add debug output for memblock_add()
memblock_reserve() calls memblock_reserve_region() which prints debugging
information if 'memblock=debug' was passed on the command line. This
patch adds the same behaviour, but for memblock_add function().
[akpm@linux-foundation.org: s/memblock_memory/memblock_add/ in message] Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Emil Medve <Emilian.Medve@freescale.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are not safe from calling isolate_huge_page() on a hugepage
concurrently, which can make the victim hugepage in invalid state and
results in BUG_ON().
The root problem of this is that we don't have any information on struct
page (so easily accessible) about hugepages' activeness. Note that
hugepages' activeness means just being linked to
hstate->hugepage_activelist, which is not the same as normal pages'
activeness represented by PageActive flag.
Normal pages are isolated by isolate_lru_page() which prechecks PageLRU
before isolation, so let's do similarly for hugetlb with a new
paeg_huge_active().
set/clear_page_huge_active() should be called within hugetlb_lock. But
hugetlb_cow() and hugetlb_no_page() don't do this, being justified because
in these functions set_page_huge_active() is called right after the
hugepage is allocated and no other thread tries to isolate it.
[akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/, make it return bool]
[fengguang.wu@intel.com: set_page_huge_active() can be static] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__put_compound_page() calls __page_cache_release() to do some freeing
work, but it's obviously for thps, not for hugetlb. We don't care because
PageLRU is always cleared and page->mem_cgroup is always NULL for hugetlb.
But it's not correct and has potential risks, so let's make it
conditional.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 15 Apr 2015 23:14:29 +0000 (16:14 -0700)]
mm, selftests: test return value of munmap for MAP_HUGETLB memory
When MAP_HUGETLB memory is unmapped, the length must be hugepage aligned,
otherwise it fails with -EINVAL.
All tests currently behave correctly, but it's better to explcitly test
the return value for completeness and document the requirement, especially
if users copy map_hugetlb.c as a sample implementation.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joern Engel <joern@logfs.org> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Eric B Munson <emunson@akamai.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 15 Apr 2015 23:14:26 +0000 (16:14 -0700)]
mm, doc: cleanup and clarify munmap behavior for hugetlb memory
munmap(2) of hugetlb memory requires a length that is hugepage aligned,
otherwise it may fail. Add this to the documentation.
This also cleans up the documentation and separates it into logical units:
one part refers to MAP_HUGETLB and another part refers to requirements for
shared memory segments.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joern Engel <joern@logfs.org> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Eric B Munson <emunson@akamai.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
thp: do not adjust zone water marks if khugepaged is not started
set_recommended_min_free_kbytes() adjusts zone water marks to be suitable
for khugepaged. We avoid doing this if khugepaged is disabled, but don't
catch the case when khugepaged is failed to start.
Let's address this by checking khugepaged_thread instead of
khugepaged_enabled() in set_recommended_min_free_kbytes().
It's NULL if the kernel thread is stopped or failed to start.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>