hfsplus: add necessary declarations for POSIX ACLs support
This patchset implements POSIX ACLs support in hfsplus driver.
Mac OS X beginning with version 10.4 ("Tiger") support NFSv4 ACLs, which
are part of the NFSv4 standard. HFS+ stores ACLs in the form of
specially named extended attributes (com.apple.system.Security).
But this patchset doesn't use "com.apple.system.Security" extended
attributes. It implements support of POSIX ACLs in the form of extended
attributes with names "system.posix_acl_access" and
"system.posix_acl_default". These xattrs are treated only under Linux.
POSIX ACLs doesn't mean something under Mac OS X. Thereby, this patch
set provides opportunity to use POSIX ACLs under Linux on HFS+
filesystem.
This patch:
Add CONFIG_HFSPLUS_FS_POSIX_ACL kernel configuration option, DBG_ACL_MOD
debugging flag and acl.h file with declaration of essential functions
for support POSIX ACLs in hfsplus driver.
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hin-Tak Leung <htl10@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert the composition of devm_request_mem_region and devm_ioremap to a
single call to devm_ioremap_resource. The associated call to
platform_get_resource is also simplified and moved next to the new call
to devm_ioremap_resource.
This was done using a combination of the semantic patches
devm_ioremap_resource.cocci and devm_request_and_ioremap.cocci, found in
the scripts/coccinelle/api directory.
In rtc-lpc32xx.c and rtc-mv.c, the local variable size is no longer needed.
In rtc-ds1511.c the size field of the local structure is not useful any
more, and is deleted.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xianglong Du [Wed, 11 Sep 2013 21:24:23 +0000 (14:24 -0700)]
drivers/rtc/rtc-sirfsoc.c: fix kernel warning during wakeup
enable_irq_wake() might fail, if so, we will see kernel warning in resume
entries due to it always calls disable_irq_wake().
WARNING: at kernel/irq/manage.c:529 irq_set_irq_wake+0xc4/0xf0()
Unbalanced IRQ 52 wake disable
Modules linked in: ipv6 libcomposite configfs
CPU: 0 PID: 1591 Comm: ash Tainted: G W 3.10.0-00854-gdbd86d4-dirty #100
(unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14)
(show_stack+0x10/0x14) from (warn_slowpath_common+0x54/0x68)
(warn_slowpath_common+0x54/0x68) from (warn_slowpath_fmt+0x30/0x40)
(warn_slowpath_fmt+0x30/0x40) from (irq_set_irq_wake+0xc4/0xf0)
(irq_set_irq_wake+0xc4/0xf0) from (sirfsoc_rtc_restore+0x30/0x38)
(sirfsoc_rtc_restore+0x30/0x38) from (platform_pm_restore+0x2c/0x50)
(platform_pm_restore+0x2c/0x50) from (dpm_run_callback.clone.6+0x30/0xb0)
(dpm_run_callback.clone.6+0x30/0xb0) from (device_resume+0x88/0x134)
(device_resume+0x88/0x134) from (dpm_resume+0x114/0x230)
(dpm_resume+0x114/0x230) from (hibernation_snapshot+0x178/0x1d0)
(hibernation_snapshot+0x178/0x1d0) from (hibernate+0x130/0x1dc)
(hibernate+0x130/0x1dc) from (state_store+0xb4/0xc0)
(state_store+0xb4/0xc0) from (kobj_attr_store+0x14/0x20)
(kobj_attr_store+0x14/0x20) from (sysfs_write_file+0xfc/0x17c)
(sysfs_write_file+0xfc/0x17c) from (vfs_write+0xc8/0x194)
(vfs_write+0xc8/0x194) from (SyS_write+0x40/0x6c)
(SyS_write+0x40/0x6c) from (ret_fast_syscall+0x0/0x30)
To avoid unbalanced "IRQ wake disable", ensure that disable_irq_wake() is
called only when enable_irq_wake() have been successfully enabled.
Signed-off-by: Xianglong Du <Xianglong.Du@csr.com> Signed-off-by: Barry Song <Baohua.Song@csr.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/rtc/rtc-palmas.c: support for backup battery charging
Palmas series device like TPS65913, TPS80036 supports the backup battery
for powering the RTC when no other energy source is available.
The backup battery is optional, connected to the VBACKUP pin, and can be
nonrechargeable or rechargeable. The rechargeable battery can be charged
from the system supply using the backup battery charger.
Add support for enabling charging of this backup battery. Also add the DT
binding document and the new properties to have this support.
Signed-off-by: Laxman Dewangan <ldewangan@nvidia.com> Reviewed-by: Felipe Balbi <balbi@ti.com> Acked-by: Kumar Gala <galak@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/rtc/rtc-omap.c: add rtc wakeup support to alarm events
On some platforms (like AM33xx), a special register (RTC_IRQWAKEEN) is
available to enable Alarm Wakeup feature. This register needs to be
properly handled for the rtcwake to work properly.
Platforms using such IP should set "ti,am3352-rtc" in rtc device dt
compatibility node.
Signed-off-by: Hebbar Gururaja <gururaja.hebbar@ti.com> Acked-by: Kevin Hilman <khilman@linaro.org> Acked-by: Sekhar Nori <nsekhar@ti.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Rob Landley <rob@landley.net> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jonas Jensen [Wed, 11 Sep 2013 21:24:17 +0000 (14:24 -0700)]
rtc: add MOXA ART RTC driver
Add RTC driver for MOXA ART SoCs.
Signed-off-by: Jonas Jensen <jonas.jensen@gmail.com> Reviewed-by: Mark Brown <broonie@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
s390/kprobes: add support for pc-relative long displacement instructions
With the general-instruction extension facility (z10) a couple of
instructions with a pc-relative long displacement were introduced. The
kprobes support for these instructions however was never implemented.
In result, if anybody ever put a probe on any of these instructions the
result would have been random behaviour after the instruction got executed
within the insn slot.
So lets add the missing handling for these instructions. Since all of the
new instructions have 32 bit signed displacement the easiest solution is
to allocate an insn slot that is within the same 2GB area like the
original instruction and patch the displacement field.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current kpropes insn caches allocate memory areas for insn slots
with module_alloc(). The assumption is that the kernel image and module
area are both within the same +/- 2GB memory area.
This however is not true for s390 where the kernel image resides within
the first 2GB (DMA memory area), but the module area is far away in the
vmalloc area, usually somewhere close below the 4TB area.
For new pc relative instructions s390 needs insn slots that are within
+/- 2GB of each area. That way we can patch displacements of
pc-relative instructions within the insn slots just like x86 and
powerpc.
The module area works already with the normal insn slot allocator,
however there is currently no way to get insn slots that are within the
first 2GB on s390 (aka DMA area).
Therefore this patch set modifies the kprobes insn slot cache code in
order to allow to specify a custom allocator for the insn slot cache
pages. In addition architecure can now have private insn slot caches
withhout the need to modify common code.
Patch 1 unifies and simplifies the current insn and optinsn caches
implementation. This is a preparation which allows to add more
insn caches in a simple way.
Patch 2 adds the possibility to specify a custom allocator.
Patch 3 makes s390 use the new insn slot mechanisms and adds support for
pc-relative instructions with long displacements.
This patch (of 3):
The two insn caches (insn, and optinsn) each have an own mutex and
alloc/free functions (get_[opt]insn_slot() / free_[opt]insn_slot()).
Since there is the need for yet another insn cache which satifies dma
allocations on s390, unify and simplify the current implementation:
- Move the per insn cache mutex into struct kprobe_insn_cache.
- Move the alloc/free functions to kprobe.h so they are simply
wrappers for the generic __get_insn_slot/__free_insn_slot functions.
The implementation is done with a DEFINE_INSN_CACHE_OPS() macro
which provides the alloc/free functions for each cache if needed.
- move the struct kprobe_insn_cache to kprobe.h which allows to generate
architecture specific insn slot caches outside of the core kprobes
code.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jean Delvare [Wed, 11 Sep 2013 21:24:10 +0000 (14:24 -0700)]
firmware/dmi_scan: drop OOM messages
As reported by Joe Perches: OOM messages generally aren't useful.
dmi_alloc is either a trivial front-end to kzalloc, and kzalloc already
does a dump_stack() when OOM, or for x86, dmi_alloc uses extend_brk
which BUGs when unsuccessful.
So we can remove all 6 such log messages in the dmi_scan driver, to
shrink the binary size (by 528 bytes on x86_64.)
Signed-off-by: Jean Delvare <jdelvare@suse.de> Reported-by: Joe Perches <joe@perches.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jean Delvare [Wed, 11 Sep 2013 21:24:09 +0000 (14:24 -0700)]
firmware/dmi_scan: constify strings
Add const to all DMI string pointers where this is possible. This fixes a
checkpatch warning.
Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Joe Perches <joe@perches.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jean Delvare [Wed, 11 Sep 2013 21:24:08 +0000 (14:24 -0700)]
firmware/dmi_scan: fix most checkpatch errors and warnings
Fix all errors and trivial warnings reported by checkpatch for file
drivers/firmware/dmi_scan.c.
Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Joe Perches <joe@perches.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jean Delvare [Wed, 11 Sep 2013 21:24:07 +0000 (14:24 -0700)]
firmware/dmi_scan: drop obsolete comment
This comment predates the introduction of early_ioremap. Since then the
missing calls to dmi_iounmap have been added by Ingo and Yinghai in
commits 0d64484f7ea1 ("x86: fix DMI ioremap leak") and 3212bff370c2
("x86: left over fix for leak of early_ioremp in dmi_scan") . That was
over 5 years ago so it is about time to drop this now misleading
comment.
Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Joe Perches <joe@perches.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Dumazet [Wed, 11 Sep 2013 21:24:06 +0000 (14:24 -0700)]
epoll: add a reschedule point in ep_free()
ep_free() might iterate on a huge set of epitems and hold cpu too long.
Add two cond_resched() in order to yield cpu to other tasks. This is safe
as we only hold mutexes in this function.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Acked-by: Eric Wong <normalperson@yhbt.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 11 Sep 2013 21:24:05 +0000 (14:24 -0700)]
checkpatch: add test for positional misuse of section specifiers like __initdata
As discussed recently on the arm [1] and lm-sensors [2] lists, it is
possible to use section markers on variables in a way which gcc doesn't
understand (or at least not the way the developer intended):
does NOT put exynos4_plls in the .initdata section. The __initdata marker
can be virtually anywhere on the line, EXCEPT right after "struct". The
preferred location is before the "=" sign if there is one, or before the
trailing ";" otherwise.
So, update checkpatch to find these misuses and report an error when it's
immediately after struct or union, and a warning when it's otherwise not
immediately before the ; or =.
A similar patch was suggested by Andi Kleen
https://lkml.org/lkml/2013/8/5/648
Signed-off-by: Joe Perches <joe@perches.com> Suggested-by: Jean Delvare <khali@linux-fr.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 11 Sep 2013 21:24:03 +0000 (14:24 -0700)]
checkpatch: reduce runtime/cpu time used
There are some cases where checkpatch can take a long time to complete.
Reduce the likelihood of this long run-time by adding a new test for lines
with and without comments and eliminating checks on lines with only
comments.
This reduces the number of "ctx_statement_block" calls, and also the
number of tests of $stat, which is now undefined for these blank lines.
One test in particular, the "check for switch/default statements without a
break", could take an extremely long time to parse as it tries to skip
interleaving comments within the ctx_statement_block/$stat and that could
be done multiple times unnecessarily.
A small test case taken from cfg80211.h before this patch would take
1000's of seconds to run, now it's just a couple seconds.
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 11 Sep 2013 21:24:01 +0000 (14:24 -0700)]
checkpatch: better --fix of SPACING errors.
Previous attempt at fixing SPACING errors could make a hash of several
defects.
This patch should make --fix be a lot better at correcting these defects.
Trim left and right sides of these defects appropriately instead of a
somewhat random attempt at it.
Trim left spaces from any following bit of the modified line when only a
single space is required around an operator.
Signed-off-by: Joe Perches <joe@perches.com> Cc: Phil Carmody <phil.carmody@partner.samsung.com> Cc: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 11 Sep 2013 21:24:00 +0000 (14:24 -0700)]
checkpatch: ignore #define TRACE_<foo> macros
The tracing subsystem uses slightly odd #defines to set path/directory
locations for include files.
These #defines can cause false positives for the complex macro tests so
add exclusions for these specific #defines (TRACE_SYSTEM,
TRACE_INCLUDE_FILE, TRACE_INCLUDE_PATH).
Signed-off-by: Joe Perches <joe@perches.com> Cc: Sarah Sharp <sarah.a.sharp@linux.intel.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 11 Sep 2013 21:23:59 +0000 (14:23 -0700)]
checkpatch: add --types option to report only specific message types
Add a --types convenience option to show only specific message types.
Combined with the --fix option, this can produce specific suggested
formatting patches to files.
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Wed, 11 Sep 2013 21:23:56 +0000 (14:23 -0700)]
checkpatch: enforce sane perl version
I got a bug report from a couple of users who said checkpatch.pl was
broken for them. It was erroring out on fairly random lines most commonly
with messages like:
Nested quantifiers in regex; marked by <--HERE in m/(\((?:[^\(\)]++ <-- HERE |(?-1))*\))/ at ./checkpatch.pl line 340.
The bug reporter was running a version of perl 5.8 which was end-of-lifed
in 2008: http://www.cpan.org/src/. Versions of perl this old are at
_best_ quite untested. At worst, they are crusty and known to be
completely broken.
If folks have a system _that_ old, then we should have mercy on them and
give them a half-decent error message rather than fail with nutty error
messages.
This patch enforces that checkpatch.pl is run with perl 5.10, which was
end-of-lifed in 2009. The new --ignore-perl-version command-line switch
will let folks override this if they want.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Joe Perches <joe@perches.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 11 Sep 2013 21:23:55 +0000 (14:23 -0700)]
checkpatch: check CamelCase by word, not by $Lval
$Lval is a test for complete name (ie: foo->bar.Baz[1])
If any of this is CamelCase, then the current test uses the entire $Lval.
This isn't optimal because it can emit messages with foo->bar.Baz and
bar.Baz when Baz is a variable specified in an include file.
So instead, break the $Lval into words and check each word for CamelCase
uses.
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 11 Sep 2013 21:23:54 +0000 (14:23 -0700)]
checkpatch: add a few more --fix corrections
Suggest a few more single-line corrections.
Remove DOS line endings
Simplify removing trailing whitespace
Remove global/static initializations to 0/NULL
Convert pr_warning to pr_warn
Add space after brace
Convert binary constants to hex
Remove whitespace after line continuation
Use inline not __inline or __inline__
Use __printf and __scanf
Use a single ; for statement terminations
Convert __FUNCTION__ to __func__
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lib/decompressors: fix "no limit" output buffer length
When decompressing into memory, the output buffer length is set to some
arbitrarily high value (0x7fffffff) to indicate the output is, virtually,
unlimited in size.
The problem with this is that some platforms have their physical memory at
high physical addresses (0x80000000 or more), and that the output buffer
address and its "unlimited" length cannot be added without overflowing.
An example of this can be found in inflate_fast():
/* next_out is the output buffer address */
out = strm->next_out - OFF;
/* avail_out is the output buffer size. end will overflow if the output
* address is >= 0x80000104 */
end = out + (strm->avail_out - 257);
This has huge consequences on the performance of kernel decompression,
since the following exit condition of inflate_fast() will be always true:
} while (in < last && out < end);
Indeed, "end" has overflowed and is now always lower than "out". As a
result, inflate_fast() will return after processing one single byte of
input data, and will thus need to be called an unreasonably high number of
times. This probably went unnoticed because kernel decompression is fast
enough even with this issue.
Nonetheless, adjusting the output buffer length in such a way that the
above pointer arithmetic never overflows results in a kernel decompression
that is about 3 times faster on affected machines.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com> Tested-by: Jon Medhurst <tixy@linaro.org> Cc: Stephen Warren <swarren@wwwdotorg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 11 Sep 2013 21:23:48 +0000 (14:23 -0700)]
MAINTAINERS: update GRE DEMUX patterns
Commit c50cd357887a ("net: gre: move GSO functions to gre_offload")
renamed and separated the file into multiple files. Update the
patterns.
Signed-off-by: Joe Perches <joe@perches.com> Cc: Dmitry Kozlov <xeb@mail.ru> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 11 Sep 2013 21:23:37 +0000 (14:23 -0700)]
MAINTAINERS: ARM: spear: consolidate sections
Commit a7ed099ffc8e ("ARM: spear: move all files to mach-spear") moved
all the files into a single directory, delete the now unnecessary
duplicate sections and update the pattern.
Signed-off-by: Joe Perches <joe@perches.com> Cc: Arnd Bergmann <arnd@arndb.de> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 11 Sep 2013 21:23:32 +0000 (14:23 -0700)]
MAINTAINERS: EXYNOS: remove board files
Commit ca9143501c30 ("ARM: EXYNOS: Remove unused board files") removed
the files, remove the patterns too.
Signed-off-by: Joe Perches <joe@perches.com> Cc: Tomasz Figa <t.figa@samsung.com> Acked-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Kukjin Kim <kgene.kim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Daney [Wed, 11 Sep 2013 21:23:29 +0000 (14:23 -0700)]
kernel/smp.c: quit unconditionally enabling irqs in on_each_cpu_mask().
As in commit f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in
!SMP version of on_each_cpu()"), we don't want to enable irqs if they
are not already enabled.
I don't know of any bugs currently caused by this unconditional
local_irq_enable(), but I want to use this function in MIPS/OCTEON early
boot (when we have early_boot_irqs_disabled). This also makes this
function have similar semantics to on_each_cpu() which is good in
itself.
Signed-off-by: David Daney <david.daney@cavium.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: Christoph Lameter <cl@linux.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syscalls.h: add forward declarations for inplace syscall wrappers
Unclutter -Wmissing-prototypes warning types (enabled at make W=1)
linux/include/linux/syscalls.h:190:18: warning: no previous prototype for 'SyS_semctl' [-Wmissing-prototypes]
asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
^
linux/include/linux/syscalls.h:183:2: note: in expansion of macro '__SYSCALL_DEFINEx'
__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
^
by adding forward declarations right before definitions.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At least on ARM no-MMU the extable is empty and so there is nothing to
sort. So add a check for the table to be empty which effectively only
changes that the misleading pr_notice is suppressed.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: David Daney <david.daney@cavium.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Daney [Wed, 11 Sep 2013 21:23:26 +0000 (14:23 -0700)]
smp.h: move !SMP version of on_each_cpu() out-of-line
All of the other non-trivial !SMP versions of functions in smp.h are
out-of-line in up.c. Move on_each_cpu() there as well.
This allows us to get rid of the #include <linux/irqflags.h>. The
drawback is that this makes both the x86_64 and i386 defconfig !SMP
kernels about 200 bytes larger each.
Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Daney [Wed, 11 Sep 2013 21:23:25 +0000 (14:23 -0700)]
up.c: use local_irq_{save,restore}() in smp_call_function_single.
The SMP version of this function doesn't unconditionally enable irqs, so
neither should this !SMP version. There are no know problems caused by
this, but we make the change for consistency's sake.
Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Daney [Wed, 11 Sep 2013 21:23:24 +0000 (14:23 -0700)]
smp: quit unconditionally enabling irq in on_each_cpu_mask and on_each_cpu_cond
As in commit f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in
!SMP version of on_each_cpu()"), we don't want to enable irqs if they
are not already enabled. There are currently no known problematical
callers of these functions, but since it is a known failure pattern, we
preemptively fix them.
Since they are not trivial functions, make them non-inline by moving
them to up.c. This also makes it so we don't have to fix #include
dependancies for preempt_{disable,enable}.
Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Will Deacon [Wed, 11 Sep 2013 21:23:23 +0000 (14:23 -0700)]
kernel/spinlock.c: add default arch_*_relax definitions for GENERIC_LOCKBREAK
When running with GENERIC_LOCKBREAK=y, the locking implementations emit
calls to arch_{read,write,spin}_relax when spinning on a contended lock
in order to allow architectures to favour the CPU owning the lock if
possible.
In reality, everybody apart from PowerPC and S390 just does cpu_relax()
here, so make that the default behaviour and allow it to be overridden
if required.
Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lto, watchdog/hpwdt.c: make assembler label global
We cannot assume that the inline assembler code always ends up in the same
file as the original C file. So make any assembler labels that are called
with "extern" by C global
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Wim Van Sebroeck <wim@iguana.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The __put_user() calls in compat_ioctl.c, ptrace compat, signal compat,
since those appear in compat code, we could probably expect the kernel
addresses not to be reachable in the lower 32-bit range, so I think they
might not be exploitable.
For the "__get_user" cases, I don't think those are exploitable: the worse
that can happen is that the kernel will copy kernel memory into in-kernel
buffers, and will fail immediately afterward.
The alpha csum_partial_copy_from_user() seems to be missing the
access_ok() check entirely. The fix is inspired from x86. This could
lead to information leak on alpha. I also noticed that many architectures
map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I
wonder if the latter is performing the access checks on every
architectures.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Wed, 11 Sep 2013 21:23:17 +0000 (14:23 -0700)]
drivers/firmware/google/gsmi.c: replace strict_strtoul() with kstrtoul()
The use of strict_strtoul() is not preferred, because strict_strtoul() is
obsolete. Thus, kstrtoul() should be used.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Tom Gundersen <teg@jklm.no> Cc: Mike Waychison <mikew@google.com> Acked-by: Mike Waychison <mikew@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
platform: convert apple-gmux driver to dev_pm_ops from legacy pm_ops
Convert drivers/platform/x86/apple-gmux to use dev_pm_ops instead of
legacy pm_ops. This patch depends on pnp driver bus ops change to invoke
pnp_driver dev_pm_ops.
Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Cc: Matthew Garrett <matthew.garrett@nebula.com> Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com> Cc: Ashley Lai <ashley@ashleylai.com> Cc: Rajiv Andrade <mail@srajiv.net> Cc: Marcel Selhorst <tpmdd@selhorst.net> Cc: Sirrix AG <tpmdd@sirrix.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Peter Hüwe <PeterHuewe@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tpm: convert tpm_tis driver to use dev_pm_ops from legacy pm_ops
Convert drivers/char/tpm/tpm_tis.c to use dev_pm_ops instead of legacy
pm_ops. This patch depends on pnp driver bus ops change to invoke
pnp_driver dev_pm_ops.
Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Cc: Matthew Garrett <matthew.garrett@nebula.com> Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com> Cc: Ashley Lai <ashley@ashleylai.com> Cc: Rajiv Andrade <mail@srajiv.net> Cc: Marcel Selhorst <tpmdd@selhorst.net> Cc: Sirrix AG <tpmdd@sirrix.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Peter Hüwe <PeterHuewe@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rtc: convert rtc-cmos to dev_pm_ops from legacy pm_ops
Convert drivers/rtc/rtc-cmos to use dev_pm_ops instead of legacy pm_ops.
This patch depends on pnp driver bus ops change to invoke pnp_driver
dev_pm_ops.
Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Cc: Matthew Garrett <matthew.garrett@nebula.com> Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com> Cc: Ashley Lai <ashley@ashleylai.com> Cc: Rajiv Andrade <mail@srajiv.net> Cc: Marcel Selhorst <tpmdd@selhorst.net> Cc: Sirrix AG <tpmdd@sirrix.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Peter Hüwe <PeterHuewe@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pnp: change pnp bus pm_ops to invoke pnp driver dev_pm_ops if specified
pnp_bus_suspend() and pnp_bus_resume() invoke legacy pm_ops from
pnp_driver. Changed pnp_bus_suspend() and pnp_bus_resume() to check if
pnp driver has dev_pm_ops and call. If dev_pm_ops don't exist, then call
use legacy pm_ops. Without this change, pnp_driver dev_pm_ops will not
get called.
In addition to the pnp driver bus pm_ops change to invoke driver
dev_pm_ops, this patch set contains changes to rtc-cmos, tpm_tis, and
apple-gmux pnp drivers to convert from legacy pm_ops to dev_pm_ops.
This patch (of 4):
pnp_bus_suspend() and pnp_bus_resume() invoke legacy pm_ops from
pnp_driver. Changed pnp_bus_suspend() and pnp_bus_resume() to check if
pnp driver has dev_pm_ops and call. If dev_pm_ops don't exist, then call
use legacy pm_ops. Without this change, pnp_driver dev_pm_ops will not
get called.
Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Cc: Matthew Garrett <matthew.garrett@nebula.com> Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com> Cc: Ashley Lai <ashley@ashleylai.com> Cc: Rajiv Andrade <mail@srajiv.net> Cc: Marcel Selhorst <tpmdd@selhorst.net> Cc: Sirrix AG <tpmdd@sirrix.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Peter Hüwe <PeterHuewe@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A memory cgroup with (1) multiple threshold notifications and (2) at least
one threshold >=2G was not reliable. Specifically the notifications would
either not fire or would not fire in the proper order.
The __mem_cgroup_threshold() signaling logic depends on keeping 64 bit
thresholds in sorted order. mem_cgroup_usage_register_event() sorts them
with compare_thresholds(), which returns the difference of two 64 bit
thresholds as an int. If the difference is positive but has bit[31] set,
then sort() treats the difference as negative and breaks sort order.
This fix compares the two arbitrary 64 bit thresholds returning the
classic -1, 0, 1 result.
The test below sets two notifications (at 0x1000 and 0x81001000):
cd /sys/fs/cgroup/memory
mkdir x
for x in 4096 2164264960; do
cgroup_event_listener x/memory.usage_in_bytes $x | sed "s/^/$x listener:/" &
done
echo $$ > x/cgroup.procs
anon_leaker 500M
v3.11-rc7 fails to signal the 4096 event listener:
Leaking...
Done leaking pages.
The fixed bug is old. It appears to date back to the introduction of
memcg threshold notifications in v2.6.34-rc1-116-g2e72b6347c94 "memcg:
implement memory thresholds"
Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Junxiao Bi [Wed, 11 Sep 2013 21:23:04 +0000 (14:23 -0700)]
writeback: fix race that cause writeback hung
There is a race between mark inode dirty and writeback thread, see the
following scenario. In this case, writeback thread will not run though
there is dirty_io.
__mark_inode_dirty() bdi_writeback_workfn()
... ...
spin_lock(&inode->i_lock);
...
if (bdi_cap_writeback_dirty(bdi)) {
<<< assume wb has dirty_io, so wakeup_bdi is false.
<<< the following inode_dirty also have wakeup_bdi false.
if (!wb_has_dirty_io(&bdi->wb))
wakeup_bdi = true;
}
spin_unlock(&inode->i_lock);
<<< assume last dirty_io is removed here.
pages_written = wb_do_writeback(wb);
...
<<< work_list empty and wb has no dirty_io,
<<< delayed_work will not be queued.
if (!list_empty(&bdi->work_list) ||
(wb_has_dirty_io(wb) && dirty_writeback_interval))
queue_delayed_work(bdi_wq, &wb->dwork,
msecs_to_jiffies(dirty_writeback_interval * 10));
spin_lock(&bdi->wb.list_lock);
inode->dirtied_when = jiffies;
<<< new dirty_io is added.
list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
spin_unlock(&bdi->wb.list_lock);
<<< though there is dirty_io, but wakeup_bdi is false,
<<< so writeback thread will not be waked up and
<<< the new dirty_io will not be flushed.
if (wakeup_bdi)
bdi_wakeup_thread_delayed(bdi);
Writeback will run until there is a new flush work queued. This may cause
a lot of dirty pages stay in memory for a long time.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanpeng Li [Wed, 11 Sep 2013 21:23:02 +0000 (14:23 -0700)]
mm/madvise.c: fix return value of madvise_hwpoison()
The return value outside for loop is always zero which means
madvise_hwpoison return success, however, this is not truth for
soft_offline_page w/ failure return value.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
return -1;
munmap(mem, PAGES_TO_TEST * PAGE_SIZE);
return 0;
}
There is one page reference count for default empty zero page,
madvise_hwpoison add another one by get_user_pages_fast. memory_hwpoison
reduce one page reference count since it's a non LRU page.
unpoison_memory release the last page reference count and free empty zero
page to buddy system which is not correct since empty zero page has
PG_reserved flag. This patch fix it by don't reduce the page reference
count under 1 against empty zero page.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanpeng Li [Wed, 11 Sep 2013 21:22:59 +0000 (14:22 -0700)]
mm/hwpoison.c: fix held reference count after unpoisoning empty zero page
madvise hwpoison inject will poison the read-only empty zero page if there
is no write access before poison. Empty zero page reference count will be
increased for hwpoison, subsequent poison zero page will return directly
since page has already been set PG_hwpoison, however, page reference count
is still increased by get_user_pages_fast. The unpoison process will
unpoison the empty zero page and decrease the reference count successfully
for the fist time, however, subsequent unpoison empty zero page will
return directly since page has already been unpoisoned and without
decrease the page reference count of empty zero page.
This patch fixes it by make madvise_hwpoison() put a page and return
immediately (without calling memory_failure() or soft_offline_page()) when
the page is already hwpoisoned.
Wanpeng Li [Wed, 11 Sep 2013 21:22:55 +0000 (14:22 -0700)]
mm/hwpoison: don't set migration type twice to avoid holding heavily contend zone->lock
Set pageblock migration type will hold zone->lock which is heavy contended
in system to avoid race. However, soft offline page will set pageblock
migration type twice during get page if the page is in used, not hugetlbfs
page and not on lru list. There is unnecessary to set the pageblock
migration type and hold heavy contended zone->lock again if the first
round get page have already set the pageblock to right migration type.
The trick here is migration type is MIGRATE_ISOLATE. There are other two
parts can change MIGRATE_ISOLATE except hwpoison. One is memory hoplug,
however, we hold lock_memory_hotplug() which avoid race. The second is
CMA which umovable page allocation requst can't fallback to. So it's safe
here.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanpeng Li [Wed, 11 Sep 2013 21:22:53 +0000 (14:22 -0700)]
mm/hwpoison: fix race against poison thp
There is a race between hwpoison page and unpoison page, memory_failure
set the page hwpoison and increase num_poisoned_pages without hold page
lock, and one page count will be accounted against thp for
num_poisoned_pages. However, unpoison can occur before memory_failure
hold page lock and split transparent hugepage, unpoison will decrease
num_poisoned_pages by 1 << compound_order since memory_failure has not yet
split transparent hugepage with page lock held. That means we account one
page for hwpoison and 1 << compound_order for unpoison. This patch fix it
by inserting a PageTransHuge check before doing TestClearPageHWPoison,
unpoison failed without clearing PageHWPoison and decreasing
num_poisoned_pages.
A B
memory_failue
TestSetPageHWPoison(p);
if (PageHuge(p))
nr_pages = 1 << compound_order(hpage);
else
nr_pages = 1;
atomic_long_add(nr_pages, &num_poisoned_pages);
unpoison_memory
nr_pages = 1<< compound_trans_order(page);
if(TestClearPageHWPoison(p))
atomic_long_sub(nr_pages, &num_poisoned_pages);
lock page
if (!PageHWPoison(p))
unlock page and return
hwpoison_user_mappings
if (PageTransHuge(hpage))
split_huge_page(hpage);
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanpeng Li [Wed, 11 Sep 2013 21:22:52 +0000 (14:22 -0700)]
mm/hwpoison: don't need to hold compound lock for hugetlbfs page
compound lock is introduced by commit e9da73d67("thp: compound_lock."), it
is used to serialize put_page against __split_huge_page_refcount(). In
addition, transparent hugepages will be splitted in hwpoison handler and
just one subpage will be poisoned. There is unnecessary to hold compound
lock for hugetlbfs page. This patch replace compound_trans_order by
compond_order in the place where the page is hugetlbfs page.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanpeng Li [Wed, 11 Sep 2013 21:22:50 +0000 (14:22 -0700)]
mm/hwpoison: fix loss of PG_dirty for errors on mlocked pages
memory_failure() store the page flag of the error page before doing unmap,
and (only) if the first check with page flags at the time decided the
error page is unknown, it do the second check with the stored page flag
since memory_failure() does unmapping of the error pages before doing
page_action(). This unmapping changes the page state, especially
page_remove_rmap() (called from try_to_unmap_one()) clears PG_mlocked, so
page_action() can't catch mlocked pages after that.
However, memory_failure() can't handle memory errors on dirty mlocked
pages correctly. try_to_unmap_one will move the dirty bit from pte to the
physical page, the second check lose it since it check the stored page
flag. This patch fix it by restore PG_dirty flag to stored page flag if
the page is dirty.
hwpoison: always unset MIGRATE_ISOLATE before returning from soft_offline_page()
Soft offline code expects that MIGRATE_ISOLATE is set on the target page
only during soft offlining work. But currenly it doesn't work as expected
when get_any_page() fails and returns negative value. In the result, end
users can have unexpectedly isolated pages. This patch just fixes it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The feature prevents mistrusted filesystems (ie: FUSE mounts created by
unprivileged users) to grow a large number of dirty pages before
throttling. For such filesystems balance_dirty_pages always check bdi
counters against bdi limits. I.e. even if global "nr_dirty" is under
"freerun", it's not allowed to skip bdi checks. The only use case for now
is fuse: it sets bdi max_ratio to 1% by default and system administrators
are supposed to expect that this limit won't be exceeded.
The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag. A
filesystem may set the flag when it initializes its BDI.
The problematic scenario comes from the fact that nobody pays attention to
the NR_WRITEBACK_TEMP counter (i.e. number of pages under fuse
writeback). The implementation of fuse writeback releases original page
(by calling end_page_writeback) almost immediately. A fuse request queued
for real processing bears a copy of original page. Hence, if userspace
fuse daemon doesn't finalize write requests in timely manner, an
aggressive mmap writer can pollute virtually all memory by those temporary
fuse page copies. They are carefully accounted in NR_WRITEBACK_TEMP, but
nobody cares.
To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
problem" as a shortcut for "a possibility of uncontrolled grow of amount
of RAM consumed by temporary pages allocated by kernel fuse to process
writeback".
The problem was very easy to reproduce. There is a trivial example
filesystem implementation in fuse userspace distribution: fusexmp_fh.c. I
added "sleep(1);" to the write methods, then recompiled and mounted it.
Then created a huge file on the mount point and run a simple program which
mmap-ed the file to a memory region, then wrote a data to the region. An
hour later I observed almost all RAM consumed by fuse writeback. Since
then some unrelated changes in kernel fuse made it more difficult to
reproduce, but it is still possible now.
Putting this theoretical happens-in-the-lab thing aside, there is another
thing that really hurts real world (FUSE) users. This is write-through
page cache policy FUSE currently uses. I.e. handling write(2), kernel
fuse populates page cache and flushes user data to the server
synchronously. This is excessively suboptimal. Pavel Emelyanov's patches
("writeback cache policy") solve the problem, but they also make resolving
NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply copying
a huge file to a fuse mount would result in memory starvation. Miklos,
the maintainer of FUSE, believes strictlimit feature the way to go.
And eventually putting FUSE topics aside, there is one more use-case for
strictlimit feature. Using a slow USB stick (mass storage) in a machine
with huge amount of RAM installed is a well-known pain. Let's make simple
computations. Assuming 64GB of RAM installed, existing implementation of
balance_dirty_pages will start throttling only after 9.6GB of RAM becomes
dirty (freerun == 15% of total RAM). So, the command "cp 9GB_file
/media/my-usb-storage/" may return in a few seconds, but subsequent
"umount /media/my-usb-storage/" will take more than two hours if effective
throughput of the storage is, to say, 1MB/sec.
After inclusion of strictlimit feature, it will be trivial to add a knob
(e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand.
Manually or via udev rule. May be I'm wrong, but it seems to be quite a
natural desire to limit the amount of dirty memory for some devices we are
not fully trust (in the sense of sustainable throughput).
[akpm@linux-foundation.org: fix warning in page-writeback.c] Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Cc: Jan Kara <jack@suse.cz> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chen Gang [Wed, 11 Sep 2013 21:22:44 +0000 (14:22 -0700)]
mm/backing-dev.c: check user buffer length before copying data to the related user buffer
'*lenp' may be less than "sizeof(kbuf)" so we must check this before the
next copy_to_user().
pdflush_proc_obsolete() is called by sysctl which 'procname' is
"nr_pdflush_threads", if the user passes buffer length less than
"sizeof(kbuf)", it will cause issue.
Signed-off-by: Chen Gang <gang.chen@asianux.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>