Michal Hocko [Sat, 17 May 2014 13:19:26 +0000 (23:19 +1000)]
memcg: allow setting low_limit
Export memory.low_limit_in_bytes knob with the same rules as the hard
limit represented by limit_in_bytes knob (e.g. no limit to be set for the
root cgroup). There is no memsw alternative for low_limit_in_bytes
because the primary motivation behind this limit is to protect the working
set of the group and so considering swap doesn't make much sense. There
is also no kmem variant exported because we do not have any easy way to
protect kernel allocations now.
Please note that the low limit might exceed the hard limit which basically
means that the group is not reclaimable if there is other reclaim target
in the hierarchy under pressure.
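A minimal userspace sketch of setting the knob described above. The helper name set_low_limit and the cgroup directory passed to it are hypothetical; the knob exists only on a kernel carrying this patch, and where the memory controller is mounted varies between systems.

```python
import os


def set_low_limit(cgroup_dir: str, nbytes: int) -> str:
    """Write nbytes to the group's memory.low_limit_in_bytes knob.

    cgroup_dir is assumed to be the directory of a non-root memory
    cgroup (per the commit message, no limit can be set on the root).
    Returns the path that was written.
    """
    if nbytes < 0:
        raise ValueError("low limit must be non-negative")
    path = os.path.join(cgroup_dir, "memory.low_limit_in_bytes")
    with open(path, "w") as knob:
        knob.write(str(nbytes))
    return path
```

Nothing in this helper compares the value against limit_in_bytes, which mirrors the note above that the low limit may exceed the hard limit.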
Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Sat, 17 May 2014 13:19:25 +0000 (23:19 +1000)]
memcg, mm: introduce lowlimit reclaim
Previous discussions have shown that soft limits cannot be reformed
(http://lwn.net/Articles/555249/). This series introduces an alternative
approach for protecting memory allocated to processes executing within a
memory cgroup controller. It is based on a new tunable that was discussed
with Johannes and Tejun during the kernel summit 2013 and at LSF
2014.
This patchset introduces a low limit that is functionally similar to a
minimum guarantee. Memcgs which are under their lowlimit are not
considered eligible for reclaim (both global and hardlimit) unless all
groups under the reclaimed hierarchy are below the low limit, in which
case all of them are considered eligible.
The previous version of the patchset, posted as an RFC
(http://marc.info/?l=linux-mm&m=138677140628677&w=2), suggested a hard
guarantee without any fallback. More discussions led me to reconsider
the default behavior and come up with a more relaxed one. The hard requirement
can be added later based on a use case which really requires it. It would be
controlled by a memory.reclaim_flags knob which would specify whether to OOM
or fall back (default) when all groups are below the low limit.
The default value of the limit is 0 so all groups are eligible by default
and an interested party has to explicitly set the limit.
The primary use case is to protect an amount of memory allocated to a
workload without it being reclaimed by an unrelated activity. In some
cases this requirement can be fulfilled by mlock but it is not suitable
for many loads and generally requires application awareness. Such
application awareness can be complex. It effectively forbids the use of
memory overcommit as the application must explicitly manage memory
residency.
With the low limit, such workloads can be placed in a memcg with a low
limit that protects the estimated working set.
The hierarchical behavior of the lowlimit is described in the first patch.
The second patch allows setting the lowlimit. The last 2 patches clarify
documentation about the memcg reclaim in general (3rd patch) and low
limit (4th patch).
This patch (of 4)
This patch introduces low limit reclaim. The low_limit acts as a reclaim
protection because groups which are under their low_limit are considered
ineligible for reclaim. While the hardlimit protects from using more memory
than allowed, the lowlimit protects from getting below the memory assigned to the
group due to external memory pressure.
More precisely a group is considered eligible for the reclaim under a
specific hierarchy represented by its root only if the group is above its
low limit and the same applies to all parents up the hierarchy to the
root. Nevertheless, the limit might still be ignored if all groups under
the reclaimed hierarchy are under their low limits. In that case avoiding
an OOM takes precedence over protecting the memory.
Consider the following hierarchy with memory pressure coming from the
group A (hard limit reclaim - l-low_limit_in_bytes, u-usage_in_bytes,
h-limit_in_bytes):
        root_mem_cgroup
               .
         _____/
        /
       A (l=80 u=90 h=90)
      /
     / \_________
    /            \
   B (l=0 u=50)   C (l=50 u=40)
                   \
                    D (l=0 u=30)
A and B are reclaimable but C and D are not (D is protected by C).
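The eligibility rule and the example hierarchy can be checked with a small userspace model. This is only a sketch: the Group class and both function names are hypothetical and do not correspond to kernel structures, and "above its low limit" is read here as usage strictly greater than the limit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Group:
    """Hypothetical stand-in for a memcg; not a kernel structure."""
    name: str
    low_limit: int
    usage: int
    parent: Optional["Group"] = None


def reclaim_eligible(group: Group, root: Group) -> bool:
    """True if group and every ancestor up to root is above its low limit."""
    node = group
    while node is not None:
        if node.usage <= node.low_limit:
            return False  # this level protects everything below it
        if node is root:
            return True
        node = node.parent
    return False  # group is not part of the hierarchy under root


def reclaim_targets(groups: List[Group], root: Group) -> List[Group]:
    """Eligible groups, or all groups if none is eligible (OOM-avoiding fallback)."""
    eligible = [g for g in groups if reclaim_eligible(g, root)]
    return eligible if eligible else list(groups)


# The hierarchy from the commit message, with pressure coming from A.
a = Group("A", low_limit=80, usage=90)
b = Group("B", low_limit=0, usage=50, parent=a)
c = Group("C", low_limit=50, usage=40, parent=a)
d = Group("D", low_limit=0, usage=30, parent=c)

print([g.name for g in reclaim_targets([a, b, c, d], root=a)])  # ['A', 'B']
```

D is above its own limit but is rejected while walking up through C, which is the "D is protected by C" case.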
The low_limit is 0 by default so every group is eligible. This patch
doesn't provide a way to set the limit yet although the core
infrastructure is there already.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Sat, 17 May 2014 13:19:25 +0000 (23:19 +1000)]
sysrq,rcu: suppress RCU stall warnings while sysrq runs
Some sysrq handlers can run for a long time, because they dump a lot of
data onto a serial console. Having RCU stall warnings pop up in the
middle of them only makes the problem worse.
This patch temporarily disables RCU stall warnings while a sysrq request
is handled.
Signed-off-by: Rik van Riel <riel@redhat.com> Suggested-by: Paul McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Madper Xie <cxie@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Sat, 17 May 2014 13:19:25 +0000 (23:19 +1000)]
sysrq: rcu-ify __handle_sysrq
Echoing values into /proc/sysrq-trigger seems to be a popular way to get
information out of the kernel. However, dumping information about
thousands of processes, or hundreds of CPUs to serial console can result
in IRQs being blocked for minutes, resulting in various kinds of cascade
failures.
The most common failure is due to interrupts being blocked for a very long
time. This can lead to things like failed IO requests, and other things
the system cannot easily recover from.
This problem is easily fixable by making __handle_sysrq use RCU instead of
spin_lock_irqsave.
This leaves the warning that RCU grace periods have not elapsed for a long
time, but the system will come back from that automatically.
It also leaves sysrq-from-irq-context when the sysrq keys are pressed, but
that is probably desired since people want that to work in situations
where the system is already hosed.
The callers of register_sysrq_key and unregister_sysrq_key appear to be
capable of sleeping.
Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Madper Xie <cxie@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fabian Frederick [Sat, 17 May 2014 13:19:24 +0000 (23:19 +1000)]
kernel/kprobes.c: convert printk to pr_foo()
Also fixes some checkpatch warnings
-Static initialization
-Lines over 80 characters
Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Lutomirski [Sat, 17 May 2014 13:19:24 +0000 (23:19 +1000)]
x86,vdso: fix an OOPS accessing the hpet mapping w/o an hpet
The oops can be triggered in qemu using -no-hpet (but not nohpet) by
reading a couple of pages past the end of the vdso text. This should send
SIGBUS instead of OOPSing.
The bug was introduced by the commit
"x86, vdso: Add 32 bit VDSO time support for 32 bit kernel",
which is new in 3.15.
This will be fixed separately in 3.15, but that patch will not apply to
tip/x86/vdso. This is the equivalent fix for tip/x86/vdso and,
presumably, 3.16.
Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stefani Seibold <stefani@seibold.net> Cc: <stable@vger.kernel.org> [needs rework for 3.15 and earlier] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Davidlohr Bueso [Sat, 17 May 2014 13:19:24 +0000 (23:19 +1000)]
rwsem: Support optimistic spinning
We have reached the point where our mutexes are quite fine tuned for a
number of situations. This includes the use of heuristics and optimistic
spinning, based on MCS locking techniques.
Exclusive ownership of read-write semaphores is, conceptually, just about
the same as mutexes, making them close cousins. To this end we need to
make them both perform similarly, and right now, rwsems are simply not up
to it. This was discovered by both reverting commit 4fc3f1d6 (mm/rmap,
migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable)
and similarly, converting some other mutexes (ie: i_mmap_mutex) to rwsems.
This creates a situation where users have to choose between a rwsem and
mutex taking into account this important performance difference.
Specifically, the biggest difference between both locks is that when we fail to
acquire a mutex in the fastpath, optimistic spinning comes into play and
we can avoid a large amount of unnecessary sleeping and the overhead of moving
tasks in and out of the wait queue. Rwsems do not have such logic.
This patch, based on the work from Tim Chen and me, adds support for
write-side optimistic spinning when the lock is contended. It also
includes support for the recently added cancelable MCS locking for
adaptive spinning. Note that this is only applicable to the xadd method,
and the spinlock rwsem variant remains intact.
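The spin-before-sleep idea can be illustrated with a toy userspace model. This is not the rwsem implementation: the class name SpinThenSleepLock is hypothetical, and a bounded retry count stands in for the kernel's decision of when to stop spinning. It only shows the ordering of fastpath, spin and sleep.

```python
import threading


class SpinThenSleepLock:
    """Toy exclusive lock: try the fastpath, spin briefly, then block."""

    def __init__(self, max_spins: int = 1000) -> None:
        self._lock = threading.Lock()
        self._max_spins = max_spins  # assumed bound; the kernel uses no fixed count

    def acquire(self) -> str:
        """Take the lock and report which path succeeded."""
        # Fastpath: the lock is free, no waiting at all.
        if self._lock.acquire(blocking=False):
            return "fastpath"
        # Optimistic spinning: retry without going to sleep.
        for _ in range(self._max_spins):
            if self._lock.acquire(blocking=False):
                return "spin"
        # Slowpath: give up spinning and block until the lock is released.
        self._lock.acquire()
        return "sleep"

    def release(self) -> None:
        self._lock.release()


lock = SpinThenSleepLock()
path = lock.acquire()  # uncontended, so the fastpath succeeds
lock.release()
```

A waiter that gets the lock on the "spin" path never enters the blocking call, which is the sleeping and wait queue overhead the paragraph above refers to.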
Allowing optimistic spinning before putting the writer on the wait queue
reduces wait queue contention and provides a greater chance for the rwsem to
get acquired. With these changes, rwsem is on par with mutex. The
performance benefits can be seen on a number of workloads. For instance,
on an 8-socket, 80-core 64-bit Westmere box, aim7 shows the following
improvements in throughput:
There was also improvement on smaller systems, such as a quad-core x86-64
laptop running a 30Gb PostgreSQL (pgbench) workload, with up to +60% in
throughput for over 50 clients. Benefits were also noticed
in exim (mail server) workloads. When comparing against regular
non-blocking rw locks ([q]rwlock_t), this change proves that it can
outperform them, for instance when studying the popular anon-vma lock:
kernel/watchdog.c: In function `watchdog_timer_fn':
kernel/watchdog.c:368:4: warning: `smp_mb__after_clear_bit' is deprecated (declared at include/linux/bitops.h:48) [-Wdeprecated-declarations]
smp_mb__after_clear_bit();
That code was introduced in commit 90e6b763ca8a5eb739e59489f42d45e13431d157
("kernel/watchdog.c: print traces for all cpus on lockup detection") and then
merged with another branch containing commit febdbfe8a91ce0d11939d4940b592eb0dba8d663 ("arch: Prepare for
smp_mb__{before,after}_atomic()") which deprecates the
smp_mb__after_clear_bit() call in favour of smp_mb__after_atomic().
Signed-off-by: Jan Moskyto Matejka <mq@suse.cz> Acked-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aaron Tomlin <atomlin@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Don Zickus <dzickus@redhat.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Aaron Tomlin [Sat, 17 May 2014 13:19:22 +0000 (23:19 +1000)]
kernel/watchdog.c: print traces for all cpus on lockup detection
A 'softlockup' is defined as a bug that causes the kernel to loop in
kernel mode for more than a predefined period of time, without giving
other tasks a chance to run.
Currently, upon detection of this condition by the per-cpu watchdog task,
debug information (including a stack trace) is sent to the system log.
On some occasions, we have observed that the "victim" rather than the
actual "culprit" (i.e. the owner/holder of the contended resource) is
reported to the user. Often this information has proven to be
insufficient to assist debugging efforts.
To avoid loss of useful debug information, for architectures which support
NMI, this patch makes it possible to improve soft lockup reporting. This
is accomplished by issuing an NMI to each cpu to obtain a stack trace.
If NMI is not supported we just fall back to the old method. A sysctl
and a boot-time parameter are available to toggle this feature.
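The reporting decision can be sketched as a userspace model. Both function names, their parameters and the 20 second default threshold are assumptions made for illustration; the commit message does not name the sysctl or the threshold, and this is not the watchdog code itself.

```python
from typing import Iterable, List


def softlockup_duration(now: int, last_touch: int, threshold: int = 20) -> int:
    """Seconds the CPU has been stuck, or 0 if still within the threshold."""
    stalled = now - last_touch
    return stalled if stalled > threshold else 0


def cpus_to_backtrace(current_cpu: int, online_cpus: Iterable[int],
                      all_cpu_backtrace: bool, nmi_supported: bool) -> List[int]:
    """Pick the CPUs whose stack traces are dumped on a soft lockup."""
    if all_cpu_backtrace and nmi_supported:
        # New behavior: an NMI is sent to every CPU, so the culprit is
        # captured as well as the victim.
        return sorted(online_cpus)
    # Old behavior, also the fallback without NMI: only the detecting CPU.
    return [current_cpu]


if softlockup_duration(now=130, last_touch=100):
    print(cpus_to_backtrace(2, {0, 1, 2, 3}, all_cpu_backtrace=True,
                            nmi_supported=True))  # [0, 1, 2, 3]
```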
[dzickus@redhat.com: add CONFIG_SMP in certain areas] Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Davidlohr Bueso [Sat, 17 May 2014 13:19:22 +0000 (23:19 +1000)]
blackfin/ptrace: call find_vma with the mmap_sem held
Performing vma lookups without taking the mm->mmap_sem is asking for
trouble. While doing the search, the vma in question can be modified or
even removed before returning to the caller. Take the lock (shared) in
order to avoid races while iterating through the vmacache and/or rbtree.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Steven Miao <realmz6@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
rtc: s5m: consolidate two device type switch statements
In probe, the driver configuration for different chipsets was done in
two switch (pdata->device_type) statements. Consolidate them into one
switch statement to increase code readability.
Additionally check the return value of regmap_irq_get_virq and exit probe
on error.
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Lee Jones <lee.jones@linaro.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: Samuel Ortiz <sameo@linux.intel.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add support for S2MPS14 to the rtc-s5m driver. Differences in S2MPS14
(in comparison to S5M8767):
- Layout of registers;
- Lack of century support for time and alarms (7 registers used for
storing time/alarm);
- Two buffer control registers: WUDR and RUDR;
- No register for enabling writing time;
- RTC interrupts are reported in main PMIC I2C device;
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Lee Jones <lee.jones@linaro.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: Samuel Ortiz <sameo@linux.intel.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Prepare for adding support for S2MPS14 RTC device to the
rtc-s5m driver:
1. Add a map of registers used by the driver which differ between
the chipsets (S5M876X and S2MPS14).
2. Move the code that checks for a pending alarm to a separate function.
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Lee Jones <lee.jones@linaro.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: Samuel Ortiz <sameo@linux.intel.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Set the time needed for updating alarm and time registers to 0.45 ms.
The default is 7.32 ms which is too long and leads to warnings when
setting alarm or time:
s5m-rtc: waiting for UDR update, reached max number of retries
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Lee Jones <lee.jones@linaro.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: Samuel Ortiz <sameo@linux.intel.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
rtc: s5m: remove undocumented time init on first boot
Remove the code for initializing time if this is the first boot.
The code for detecting the first boot uses the undocumented field RTC_TCON in
the RTC_UDR_CON register. According to S5M8767's datasheet this field is
reserved. On S2MPS14 it is not documented at all. On the device's first boot
the registers will be initialized with the reset value (2000-01-01 00:00:00).
The code might work on S5M8763 but still this does not look like a task
for an RTC driver.
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Lee Jones <lee.jones@linaro.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: Samuel Ortiz <sameo@linux.intel.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Prepare for adding support for S2MPS14 RTC device to the rtc-s5m driver:
1. Rename SEC* symbols to S5M.
2. Add S5M prefix to some of defines which are different between S5M876X
and S2MPS14.
This is only a rename-like patch, new code is not added.
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Acked-by: Lee Jones <lee.jones@linaro.org> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: Samuel Ortiz <sameo@linux.intel.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Sat, 17 May 2014 13:19:15 +0000 (23:19 +1000)]
lib/test_bpf.c: don't use gcc union shortcut
Older gcc's (mine is gcc-4.4.4) make a mess of this.
lib/test_bpf.c:74: error: unknown field 'insns' specified in initializer
lib/test_bpf.c:75: warning: missing braces around initializer
lib/test_bpf.c:75: warning: (near initialization for 'tests[0].<anonymous>.insns[0]')
lib/test_bpf.c:76: error: extra brace group at end of initializer
lib/test_bpf.c:76: error: (near initialization for 'tests[0].<anonymous>')
lib/test_bpf.c:76: warning: excess elements in union initializer
lib/test_bpf.c:76: warning: (near initialization for 'tests[0].<anonymous>')
lib/test_bpf.c:77: error: extra brace group at end of initializer
Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Sat, 17 May 2014 13:19:15 +0000 (23:19 +1000)]
mm/page_io.c: work around gcc bug
gcc-4.4.4 (at least) screws up this initialization.
mm/page_io.c: In function '__swap_writepage':
mm/page_io.c:277: error: unknown field 'bvec' specified in initializer
mm/page_io.c:278: warning: excess elements in struct initializer
mm/page_io.c:278: warning: (near initialization for 'from')