Lai Jiangshan [Sun, 19 Oct 2008 03:28:21 +0000 (20:28 -0700)]
bitmask: remove bitmap_scnprintf_len()
bitmap_scnprintf_len() is not used now, so we remove it.
Otherwise we have to maintain it and make its return
value always equal to bitmap_scnprintf()'s return value.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Sun, 19 Oct 2008 03:28:20 +0000 (20:28 -0700)]
cpuset: use seq_*mask_* to print masks
1) seq_file excepts that m->count == m->size when it's buf is full,
so current code will causes bugs when buf is overflow.
2) There is not too good that cpuset accesses struct seq_file's
fields directly.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
seq_cpumask_list(), seq_nodemask_list() are very like seq_cpumask(),
seq_nodemask(), but they print human readable string.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Sun, 19 Oct 2008 03:28:18 +0000 (20:28 -0700)]
seq_file: don't call bitmap_scnprintf_len()
"m->count + len < m->size" is true commonly, so bitmap_scnprintf()
is commonly called. this fix saves a call to bitmap_scnprintf_len().
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rakib Mullick [Sun, 19 Oct 2008 03:28:18 +0000 (20:28 -0700)]
cpuset.c: remove extra variable
Remove the use of int cpus_nonempty variable from 'update_flag' function.
Signed-off-by: Md.Rakib H. Mullick <rakib.mullick@gmail.com> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG is supported.
Remove page_cgroup pointer reduces the amount of memory by
- 4 bytes per PAGE_SIZE.
- 8 bytes per PAGE_SIZE
if memory controller is disabled. (even if configured.)
On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
On my x86-64 server with 48GB of memory, this saves 96MB of memory.
I think this reduction makes sense.
By pre-allocation, kmalloc/kfree in charge/uncharge are removed.
This means
- we're not necessary to be afraid of kmalloc faiulre.
(this can happen because of gfp_mask type.)
- we can avoid calling kmalloc/kfree.
- we can avoid allocating tons of small objects which can be fragmented.
- we can know what amount of memory will be used for this extra-lru handling.
I added printk message as
"allocated %ld bytes of page_cgroup"
"please try cgroup_disable=memory option if you don't want"
This patch makes page_cgroup->flags to be atomic_ops and define functions
(and macros) to access it.
Before trying to modify memory resource controller, this atomic operation
on flags is necessary. Most of flags in this patch is for LRU and modfied
under mz->lru_lock but we'll add another flags which is not for LRU soon.
For example, we'll place LOCK bit on flags field. We need atomic
operation to modify LRU bit without LOCK.
I found mem_cgroup_charge_statistics() is a little big (in object) and
does unnecessary address calclation. This patch is for optimization to
reduce the size of this function.
There are not-on-LRU pages which can be mapped and they are not worth to
be accounted. (becasue we can't shrink them and need dirty codes to
handle specical case) We'd like to make use of usual objrmap/radix-tree's
protcol and don't want to account out-of-vm's control pages.
When special_mapping_fault() is called, page->mapping is tend to be NULL
and it's charged as Anonymous page. insert_page() also handles some
special pages from drivers.
This patch is for avoiding to account special pages.
This patch tries to make page->mapping to be NULL before
mem_cgroup_uncharge_cache_page() is called.
"page->mapping == NULL" is a good check for "whether the page is still
radix-tree or not". This patch also adds BUG_ON() to
mem_cgroup_uncharge_cache_page();
While page-cache's charge/uncharge is done under page_lock(), swap-cache
isn't. (anonymous page is charged when it's newly allocated.)
This patch moves do_swap_page()'s charge() call under lock. I don't see
any bad problem *now* but this fix will be good for future for avoiding
unnecessary racy state.
Lai Jiangshan [Sun, 19 Oct 2008 03:28:07 +0000 (20:28 -0700)]
devcgroup: remove spin_lock()
Since we introduced rcu for read side, spin_lock is used only for update.
But we always hold cgroup_lock() when update, so spin_lock() is not need.
Additional cleanup:
1) include linux/rcupdate.h explicitly
2) remove unused variable cur_devcgroup in devcgroup_update_access()
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: "Serge E. Hallyn" <serue@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Menage [Sun, 19 Oct 2008 03:28:05 +0000 (20:28 -0700)]
cgroups: fix declaration of cgroup_mm_owner_callbacks
The choice of real/dummy declaration for cgroup_mm_owner_callbacks()
shouldn't be based on CONFIG_MM_OWNER, but on CONFIG_CGROUPS. Otherwise
kernel/exit.c fails to compile when something other than a cgroups
controller selects CONFIG_MM_OWNER
Signed-off-by: Paul Menage <menage@google.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Menage [Sun, 19 Oct 2008 03:28:04 +0000 (20:28 -0700)]
cgroups: convert tasks file to use a seq_file with shared pid array
Rather than pre-generating the entire text for the "tasks" file each
time the file is opened, we instead just generate/update the array of
process ids and use a seq_file to report these to userspace. All open
file handles on the same "tasks" file can share a pid array, which may
be updated any time that no thread is actively reading the array. By
sharing the array, the potential for userspace to DoS the system by
opening many handles on the same "tasks" file is removed.
[Based on a patch by Lai Jiangshan, extended to use seq_file]
Signed-off-by: Paul Menage <menage@google.com> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Sesterhenn [Sun, 19 Oct 2008 03:28:02 +0000 (20:28 -0700)]
hfsplus: fix possible deadlock when handling corrupted extents
A corrupted extent for the extent file itself may try to get an impossible
extent, causing a deadlock if I see it correctly.
Check the inode number after the first_blocks checks and fail if it's the
extent file, as according to the spec the extent file should have no
extent for itself.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Sandeen [Sun, 19 Oct 2008 03:28:00 +0000 (20:28 -0700)]
ext3: avoid printk floods in the face of directory corruption
A very large directory with many read failures (either due to storage
problems, or due to invalid size & blocks from corruption) will generate a
printk storm as the filesystem continues to try to read all the blocks.
This flood of messages can tie up the box until it is complete - which may
be a very long time, especially for very large corrupted values.
This is fixed by only reporting the corruption once each time we try to
read the directory.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Eugene Teo <eugeneteo@kernel.sg> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext3: truncate block allocated on a failed ext3_write_begin
For blocksize < pagesize we need to remove blocks that got allocated in
block_write_begin() if we fail with ENOSPC for later blocks.
block_write_begin() internally does this if it allocated page locally.
This makes sure we don't have blocks outside inode.i_size during ENOSPC.
Hidehiro Kawai [Sun, 19 Oct 2008 03:27:58 +0000 (20:27 -0700)]
jbd: ordered data integrity fix
In ordered mode, if a file data buffer being dirtied exists in the
committing transaction, we write the buffer to the disk, move it from the
committing transaction to the running transaction, then dirty it. But we
don't have to remove the buffer from the committing transaction when the
buffer couldn't be written out, otherwise it would miss the error and the
committing transaction would not abort.
This patch adds an error check before removing the buffer from the
committing transaction.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hidehiro Kawai [Sun, 19 Oct 2008 03:27:57 +0000 (20:27 -0700)]
ext3: add an option to control error handling on file data
If the journal doesn't abort when it gets an IO error in file data blocks,
the file data corruption will spread silently. Because most of
applications and commands do buffered writes without fsync(), they don't
notice the IO error. It's scary for mission critical systems. On the
other hand, if the journal aborts whenever it gets an IO error in file
data blocks, the system will easily become inoperable. So this patch
introduces a filesystem option to determine whether it aborts the journal
or just call printk() when it gets an IO error in file data.
If you mount a ext3 fs with data_err=abort option, it aborts on file data
write error. If you mount it with data_err=ignore, it doesn't abort, just
call printk(). data_err=ignore is the default.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mingming Cao [Sun, 19 Oct 2008 03:27:56 +0000 (20:27 -0700)]
ext3: fix ext3 block reservation early ENOSPC issue
We could run into ENOSPC error on ext3, even when there is free blocks on
the filesystem.
The problem is triggered in the case the goal block group has 0 free
blocks , and the rest block groups are skipped due to the check of
"free_blocks < windowsz/2". Current code could fall back to non
reservation allocation to prevent early ENOSPC after examing all the block
groups with reservation on , but this code was bypassed if the reservation
window is turned off already, which is true in this case.
This patch fixed two issues:
1) We don't need to turn off block reservation if the goal block group has
0 free blocks left and continue search for the rest of block groups.
Current code the intention is to turn off the block reservation if the
goal allocation group has a few (some) free blocks left (not enough for
make the desired reservation window),to try to allocation in the goal
block group, to get better locality. But if the goal blocks have 0 free
blocks, it should leave the block reservation on, and continues search for
the next block groups,rather than turn off block reservation completely.
2) we don't need to check the window size if the block reservation is off.
The problem was originally found and fixed in ext4.
Signed-off-by: Mingming Cao <cmm@us.ibm.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Josef Bacik [Sun, 19 Oct 2008 03:27:55 +0000 (20:27 -0700)]
ext3: don't try to resize if there are no reserved gdt blocks left
When trying to resize a ext3 fs and you run out of reserved gdt blocks,
you get an error that doesn't actually tell you what went wrong, it just
says that the gdb it picked is not correct, which is the case since you
don't have any reserved gdt blocks left. This patch adds a check to make
sure you have reserved gdt blocks to use, and if not prints out a more
relevant error.
Signed-off-by: Josef Bacik <jbacik@redhat.com> Cc: <linux-ext4@vger.kernel.org> Cc: Andreas Dilger <adilger@sun.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hidehiro Kawai [Sun, 19 Oct 2008 03:27:54 +0000 (20:27 -0700)]
jbd: don't dirty original metadata buffer on abort
Currently, original metadata buffers are dirtied when they are unfiled
whether the journal has aborted or not. Eventually these buffers will be
written-back to the filesystem by pdflush. This means some metadata
buffers are written to the filesystem without journaling if the journal
aborts. So if both journal abort and system crash happen at the same
time, the filesystem would become inconsistent state. Additionally,
replaying journaled metadata can overwrite the latest metadata on the
filesystem partly. Because, if the journal aborts, journaled metadata are
preserved and replayed during the next mount not to lose uncheckpointed
metadata. This would also break the consistency of the filesystem.
This patch prevents original metadata buffers from being dirtied on abort
by clearing BH_JBDDirty flag from those buffers. Thus, no metadata
buffers are written to the filesystem without journaling.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hidehiro Kawai [Sun, 19 Oct 2008 03:27:53 +0000 (20:27 -0700)]
jbd: abort when failed to log metadata buffers
If we failed to write metadata buffers to the journal space and succeeded
to write the commit record, stale data can be written back to the
filesystem as metadata in the recovery phase.
To avoid this, when we failed to write out metadata buffers, abort the
journal before writing the commit record.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Richard Holden [Sun, 19 Oct 2008 03:27:53 +0000 (20:27 -0700)]
phonedev: remove BKL
The phone_device array is covered by the phone_lock mutex in all cases and
request_module no longer needs the BKL so we can remove the only remaining
instance of the BKL from phonedev.
Signed-off-by: Richard Holden <aciddeath@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Krzysztof Helt [Sun, 19 Oct 2008 03:27:51 +0000 (20:27 -0700)]
fb: convert lock/unlock_kernel() into local fb mutex
Change lock_kernel()/unlock_kernel() to local fb mutex. Each frame buffer
instance has its own mutex.
The one line try_to_load() function is unrolled to request_module() in two
places for readability.
[righi.andrea@gmail.com: fb: fix NULL pointer BUG dereference in fb_open()] Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl> Signed-off-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alan Cox [Sun, 19 Oct 2008 03:27:51 +0000 (20:27 -0700)]
fb: push down the BKL in the ioctl handler
Framebuffer is heavily BKL dependant at the moment so just wrap the ioctl
handler in the driver as we push down.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Alan Cox <alan@redhat.com> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Cc: "Antonino A. Daplas" <adaplas@pol.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's OK to request the value of *any* GPIO; most GPIOs are bidirectional,
so configuring them as outputs just enables an output driver and doesn't
disable the input logic.
So the problem is that gpio_get_value_cansleep() isn't making the same
sanity check that gpio_get_value() does: making sure this GPIO isn't one
of the atypical "no input logic" cases.
Reported-by: Anton Vorontsov <avorontsov@ru.mvista.com> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Cc: <stable@kernel.org> [2.6.27.x, 2.6.26.x, 2.6.25.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Steven A. Falco [Sun, 19 Oct 2008 03:27:48 +0000 (20:27 -0700)]
gpio: modify sysfs gpio export so that "value" displays as 0 or 1
gpiolib can export GPIOs to userspace via sysfs. This patch modifies the
gpio_value_show() so that any non-zero value is explicitly printed as "1",
rather than whatever numerical value the lower-level driver returns.
Signed-off-by: Steve Falco <sfalco@harris.com> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Brownell [Sun, 19 Oct 2008 03:27:47 +0000 (20:27 -0700)]
rtc-cmos: export second NVRAM bank
Teach rtc-cmos about the second bank of registers found on most modern x86
systems, giving access to 128 bytes more NVRAM.
This version only sees that extra NVRAM when both register banks are
provided as part of *one* PNP resource. Since BIOS on some systems
presents them using two IO resources, and nothing merges them, this can't
always show all the NVRAM. (We're supposed to be able to use PNP id
PNP0b01 too, but BIOS tables doesn't often seem to use that particular
option.)
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The read fail ratio is sensitive to the delay between the first byte
written and the first byte read; apparently the sensors cannot be rushed.
Increasing the minimum wait time, without changing the total wait time,
improves the fail ratio from a 8% chance that any of the sensors fails in
one read, down to 0.4%, on a Macbook Air. On a Macbook Pro 3,1, the
effect is even more apparent. By reducing the number of status polls, the
ratio is further improved to below 0.1%. Finally, increasing the total
wait time brings the fail ratio down to virtually zero.
Signed-off-by: Henrik Rydberg <rydberg@euromail.se> Tested-by: Bob McElrath <bob@mcelrath.org> Cc: Nicolas Boichat <nicolas@boichat.ch> Cc: "Mark M. Hoffman" <mhoffman@lightlink.com> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Henrik Rydberg [Sun, 19 Oct 2008 03:27:42 +0000 (20:27 -0700)]
hwmon: applesmc: Add support for Macbook Pro 3
Add temperature sensor support for Macbook Pro 3.
Signed-off-by: Henrik Rydberg <rydberg@euromail.se> Cc: Nicolas Boichat <nicolas@boichat.ch> Cc: Riki Oktarianto <rkoktarianto@gmail.com> Cc: Mark M. Hoffman <mhoffman@lightlink.com> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Henrik Rydberg [Sun, 19 Oct 2008 03:27:41 +0000 (20:27 -0700)]
hwmon: applesmc: Add support for Macbook Pro 4
Adds temperature sensor support for the Macbook Pro 4.
Signed-off-by: Henrik Rydberg <rydberg@euromail.se> Cc: Nicolas Boichat <nicolas@boichat.ch> Cc: Riki Oktarianto <rkoktarianto@gmail.com> Cc: Mark M. Hoffman <mhoffman@lightlink.com> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Sun, 19 Oct 2008 03:27:41 +0000 (20:27 -0700)]
drivers/hwmon/applesmc.c: remove unneeded casts
dmi_system_id.driver_data is already void*.
Cc: Henrik Rydberg <rydberg@euromail.se> Cc: Nicolas Boichat <nicolas@boichat.ch> Cc: Riki Oktarianto <rkoktarianto@gmail.com> Cc: Mark M. Hoffman <mhoffman@lightlink.com> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Henrik Rydberg [Sun, 19 Oct 2008 03:27:40 +0000 (20:27 -0700)]
hwmon: applesmc: add support for Macbook Air
This patch adds accelerometer, backlight and temperature sensor support
for the Macbook Air.
Signed-off-by: Henrik Rydberg <rydberg@euromail.se> Cc: Nicolas Boichat <nicolas@boichat.ch> Cc: Riki Oktarianto <rkoktarianto@gmail.com> Cc: Mark M. Hoffman <mhoffman@lightlink.com> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Henrik Rydberg [Sun, 19 Oct 2008 03:27:39 +0000 (20:27 -0700)]
hwmon: applesmc: allow for variable ALV0 and ALV1 package length
On some recent Macbooks, the package length for the light sensors ALV0 and
ALV1 has changed from 6 to 10. This patch allows for a variable package
length encompassing both variants.
Signed-off-by: Henrik Rydberg <rydberg@euromail.se> Cc: Nicolas Boichat <nicolas@boichat.ch> Cc: Riki Oktarianto <rkoktarianto@gmail.com> Cc: Mark M. Hoffman <mhoffman@lightlink.com> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Henrik Rydberg [Sun, 19 Oct 2008 03:27:39 +0000 (20:27 -0700)]
hwmon: applesmc: prolong status wait
The time to wait for a status change while reading or writing to the SMC
ports is a balance between read reliability and system performance. The
current setting yields rougly three errors in a thousand when
simultaneously reading three different temperature values on a Macbook
Air. This patch increases the setting to a value yielding roughly one
error in ten thousand, with no noticable system performance degradation.
Signed-off-by: Henrik Rydberg <rydberg@euromail.se> Cc: Nicolas Boichat <nicolas@boichat.ch> Cc: Riki Oktarianto <rkoktarianto@gmail.com> Cc: Mark M. Hoffman <mhoffman@lightlink.com> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Henrik Rydberg [Sun, 19 Oct 2008 03:27:38 +0000 (20:27 -0700)]
hwmon: applesmc: fix the 'wait status failed: c != 8' problem
On many Macbooks since mid 2007, the Pro, C2D and Air models, applesmc
fails to read some or all SMC ports. This problem has various effects,
such as flooded logfiles, malfunctioning temperature sensors,
accelerometers failing to initialize, and difficulties getting backlight
functionality to work properly.
The root of the problem seems to be the command protocol. The current
code sends out a command byte, then repeatedly polls for an ack before
continuing to send or recieve data. From experiments leading to this
patch, it seems the command protocol never quite worked or changed so that
one now sends a command byte, waits a little bit, polls for an ack, and if
it fails, repeats the whole thing by sending the command byte again.
This patch implements a send_command function according to the new
interpretation of the protocol, and should work also for earlier models.
Signed-off-by: Henrik Rydberg <rydberg@euromail.se> Cc: Nicolas Boichat <nicolas@boichat.ch> Cc: Riki Oktarianto <rkoktarianto@gmail.com> Cc: Mark M. Hoffman <mhoffman@lightlink.com> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Henrik Rydberg [Sun, 19 Oct 2008 03:27:35 +0000 (20:27 -0700)]
hwmon: applesmc: specified number of bytes to read should match actual
At one single place in the code, the specified number of bytes to read and
the actual number of bytes read differ by one. This one-liner patch fixes
that inconsistency.
Signed-off-by: Henrik Rydberg <rydberg@euromail.se> Cc: Nicolas Boichat <nicolas@boichat.ch> Cc: Riki Oktarianto <rkoktarianto@gmail.com> Cc: Mark M. Hoffman <mhoffman@lightlink.com> Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jim Cromie [Sun, 19 Oct 2008 03:27:35 +0000 (20:27 -0700)]
hwmon/pc87360 separate alarm files: add therm-min/max/crit-alarms
Adds therm-min/max/crit-alarm callbacks, sensor-device-attribute
declarations, and refs to those new decls in the macro used to initialize
the therm_group (of sysfs files)
The thermistors use voltage channels to measure; so they don't have a
fault-alarm, but unlike the other voltages, they do have an overtemp,
which we call crit (by convention).
[akpm@linux-foundation.org: cleanup] Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Cc: Jean Delvare <khali@linux-fr.org> Cc: "Mark M. Hoffman" <mhoffman@lightlink.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jim Cromie [Sun, 19 Oct 2008 03:27:34 +0000 (20:27 -0700)]
hwmon/pc87360 separate alarm files: add dev_dbg help
temp and vin status register values may be set by chip specifications, set
again by bios, or by this previously loaded driver. Debug output nicely
displays modprobe init=\d actions.
Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Cc: Jean Delvare <khali@linux-fr.org> Cc: "Mark M. Hoffman" <mhoffman@lightlink.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jim Cromie [Sun, 19 Oct 2008 03:27:32 +0000 (20:27 -0700)]
hwmon/pc87360 separate alarm files: add temp-min/max/crit/fault-alarms
Adds temp-min/max/crit/fault-alarm callbacks, sensor-device-attribute
declarations, and refs to those new decls in the macro used to initialize
the temp_group (of sysfs files)
[akpm@linux-foundation.org: cleanups] Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Cc: Jean Delvare <khali@linux-fr.org> Cc: "Mark M. Hoffman" <mhoffman@lightlink.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jim Cromie [Sun, 19 Oct 2008 03:27:32 +0000 (20:27 -0700)]
hwmon/pc87360 separate alarm files: add in-min/max-alarms
Adds vin-min/max-alarm callbacks, sensor-device-attribute declarations,
and refs to those new decls in the macro used to initialize the vin_group
(of sysfs files)
[akpm@linux-foundation.org: cleanups] Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Cc: Jean Delvare <khali@linux-fr.org> Cc: "Mark M. Hoffman" <mhoffman@lightlink.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jim Cromie [Sun, 19 Oct 2008 03:27:30 +0000 (20:27 -0700)]
hwmon/pc87360 separate alarm files: define some constants
Bring hwmon/pc87360 into agreement with
Documentation/hwmon/sysfs-interface.
Patchset adds separate limit alarms for voltages and temps, it also adds
temp[123]_fault files. On my Soekris, temps 1,2 are unused/unconnected,
so temp[123]_fault = 1,1,0 respectively. This agrees with
/usr/bin/sensors, which has always shown them as OPEN. Temps 4,5,6 are
thermistor based, and dont have a fault bit in their status register.
This patch:
2 different kinds of constants added:
- CHAN_ALM_* constants for (later) vin, temp alarm callbacks.
- CHAN_* conversion constants, used in _init_device, partly for RW1C bits
Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Cc: Jean Delvare <khali@linux-fr.org> Cc: "Mark M. Hoffman" <mhoffman@lightlink.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Piel [Sun, 19 Oct 2008 03:27:28 +0000 (20:27 -0700)]
HP-WMI: additional keycode (or typo)
On my HP 2510, pressing the (i) button generates an unknown keycode:
0x213b. So here is a patch adding support for it. However, as it seems
there is already support for a similar button connected to 0x231b as
keycode, I wonder if it could be a typo in the driver?
Signed-off-by: Eric Piel <eric.piel@tremplin-utc.net> Cc: Matthew Garrett <mjg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
WANG Cong [Sun, 19 Oct 2008 03:27:26 +0000 (20:27 -0700)]
uml: fix a compile error
Fix
arch/um/sys-i386/signal.c: In function 'copy_sc_from_user':
arch/um/sys-i386/signal.c:182: warning: dereferencing 'void *' pointer
arch/um/sys-i386/signal.c:182: error: request for member '_fxsr_env' in something not a structure or union
Signed-off-by: WANG Cong <wangcong@zeuux.org> Cc: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Helsley [Sun, 19 Oct 2008 03:27:24 +0000 (20:27 -0700)]
container freezer: document the cgroup freezer subsystem.
Describe why we need the freezer subsystem and how to use it in a
documentation file. Since the cgroups.txt file is focused on the
subsystem-agnostic portions of cgroups make a directory and move the old
cgroups.txt file at the same time.
Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: containers@lists.linux-foundation.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Helsley [Sun, 19 Oct 2008 03:27:22 +0000 (20:27 -0700)]
container freezer: prevent frozen tasks or cgroups from changing
Don't let frozen tasks or cgroups change. This means frozen tasks can't
leave their current cgroup for another cgroup. It also means that tasks
cannot be added to or removed from a cgroup in the FROZEN state. We
enforce these rules by checking for frozen tasks and cgroups in the
can_attach() function.
Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Helsley [Sun, 19 Oct 2008 03:27:22 +0000 (20:27 -0700)]
container freezer: skip frozen cgroups during power management resume
When a system is resumed after a suspend, it will also unfreeze frozen
cgroups.
This patchs modifies the resume sequence to skip the tasks which are part
of a frozen control group.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch implements a new freezer subsystem in the control groups
framework. It provides a way to stop and resume execution of all tasks in
a cgroup by writing in the cgroup filesystem.
The freezer subsystem in the container filesystem defines a file named
freezer.state. Writing "FROZEN" to the state file will freeze all tasks
in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in
the cgroup. Reading will return the current state.
This is the basic mechanism which should do the right thing for user space
task in a simple scenario.
It's important to note that freezing can be incomplete. In that case we
return EBUSY. This means that some tasks in the cgroup are busy doing
something that prevents us from completely freezing the cgroup at this
time. After EBUSY, the cgroup will remain partially frozen -- reflected
by freezer.state reporting "FREEZING" when read. The state will remain
"FREEZING" until one of these things happens:
1) Userspace cancels the freezing operation by writing "RUNNING" to
the freezer.state file
2) Userspace retries the freezing operation by writing "FROZEN" to
the freezer.state file (writing "FREEZING" is not legal
and returns EIO)
3) The tasks that blocked the cgroup from entering the "FROZEN"
state disappear from the cgroup's set of tasks.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: export thaw_process] Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Helsley [Sun, 19 Oct 2008 03:27:19 +0000 (20:27 -0700)]
container freezer: make refrigerator always available
Now that the TIF_FREEZE flag is available in all architectures, extract
the refrigerator() and freeze_task() from kernel/power/process.c and make
it available to all.
The refrigerator() can now be used in a control group subsystem
implementing a control group freezer.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Matt Helsley <matthltc@us.ibm.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Helsley [Sun, 19 Oct 2008 03:27:18 +0000 (20:27 -0700)]
container freezer: add TIF_FREEZE flag to all architectures
This patch series introduces a cgroup subsystem that utilizes the swsusp
freezer to freeze a group of tasks. It's immediately useful for batch job
management scripts. It should also be useful in the future for
implementing container checkpoint/restart.
The freezer subsystem in the container filesystem defines a cgroup file
named freezer.state. Reading freezer.state will return the current state
of the cgroup. Writing "FROZEN" to the state file will freeze all tasks
in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in
the cgroup.
Brice Goglin [Sun, 19 Oct 2008 03:27:17 +0000 (20:27 -0700)]
mm: extract do_pages_move() out of sys_move_pages()
To prepare the chunking, move the sys_move_pages() code that is used when
nodes!=NULL into do_pages_move(). And rename do_move_pages() into
do_move_page_to_node_array().
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Brice Goglin [Sun, 19 Oct 2008 03:27:16 +0000 (20:27 -0700)]
mm: don't vmalloc a huge page_to_node array for do_pages_stat()
do_pages_stat() does not need any page_to_node entry for real. Just pass
the pointers to the user-space page address array and to the user-space
status array, and have do_pages_stat() traverse the former and fill the
latter directly.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Brice Goglin [Sun, 19 Oct 2008 03:27:15 +0000 (20:27 -0700)]
mm: stop returning -ENOENT from sys_move_pages() if nothing got migrated
A patchset reworking sys_move_pages(). It removes the possibly large
vmalloc by using multiple chunks when migrating large buffers. It also
dramatically increases the throughput for large buffers since the lookup
in new_page_node() is now limited to a single chunk, causing the quadratic
complexity to have a much slower impact. There is no need to use any
radix-tree-like structure to improve this lookup.
sys_move_pages() duration on a 4-quadcore-opteron 2347HE (1.9Gz),
migrating between nodes #2 and #3:
Patches #1 and #4 are the important ones:
1) stop returning -ENOENT from sys_move_pages() if nothing got migrated
2) don't vmalloc a huge page_to_node array for do_pages_stat()
3) extract do_pages_move() out of sys_move_pages()
4) rework do_pages_move() to work on page_sized chunks
5) move_pages: no need to set pp->page to ZERO_PAGE(0) by default
This patch:
There is no point in returning -ENOENT from sys_move_pages() if all pages
were already on the right node, while we return 0 if only 1 page was not.
Most application don't know where their pages are allocated, so it's not
an error to try to migrate them anyway.
Just return 0 and let the status array in user-space be checked if the
application needs details.
It will make the upcoming chunked-move_pages() support much easier.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nathan Fontenot [Sun, 19 Oct 2008 03:27:14 +0000 (20:27 -0700)]
memory hotplug: release memory regions in PAGES_PER_SECTION chunks
During hotplug memory remove, memory regions should be released on a
PAGES_PER_SECTION size chunks. This mirrors the code in add_memory where
resources are requested on a PAGES_PER_SECTION size.
Attempting to release the entire memory region fails because there is not
a single resource for the total number of pages being removed. Instead
the resources for the pages are split in PAGES_PER_SECTION size chunks as
requested during memory add.
Andrea Righi [Sun, 19 Oct 2008 03:27:13 +0000 (20:27 -0700)]
documentation: clarify dirty_ratio and dirty_background_ratio description
The current documentation of dirty_ratio and dirty_background_ratio is a
bit misleading.
In the documentation we say that they are "a percentage of total system
memory", but the current page writeback policy, intead, is to apply the
percentages to the dirtyable memory, that means free pages + reclaimable
pages.
Better to be more explicit to clarify this concept.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shaohua Li [Sun, 19 Oct 2008 03:27:12 +0000 (20:27 -0700)]
memory_probe: fix wrong sysfs file attribute
This attribute just has a write operation.
[akpm@linux-foundation.org: use S_IWUSR as suggested by Randy] Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gerald Schaefer [Sun, 19 Oct 2008 03:27:11 +0000 (20:27 -0700)]
setup_per_zone_pages_min(): take zone->lock instead of zone->lru_lock
This replaces zone->lru_lock in setup_per_zone_pages_min() with zone->lock.
There seems to be no need for the lru_lock anymore, but there is a need for
zone->lock instead, because that function may call move_freepages() via
setup_zone_migrate_reserve().
KOSAKI Motohiro [Sun, 19 Oct 2008 03:27:08 +0000 (20:27 -0700)]
coredump_filter: add hugepage dumping
Presently hugepage's vma has a VM_RESERVED flag in order not to be
swapped. But a VM_RESERVED vma isn't core dumped because this flag is
often used for some kernel vmas (e.g. vmalloc, sound related).
Thus hugepages are never dumped and it can't be debugged easily. Many
developers want hugepages to be included into core-dump.
However, We can't read generic VM_RESERVED area because this area is often
IO mapping area. then these area reading may change device state. it is
definitly undesiable side-effect.
So adding a hugepage specific bit to the coredump filter is better. It
will be able to hugepage core dumping and doesn't cause any side-effect to
any i/o devices.
In additional, libhugetlb use hugetlb private mapping pages as anonymous
page. Then, hugepage private mapping pages should be core dumped by
default.
Then, /proc/[pid]/core_dump_filter has two new bits.
- bit 5 mean hugetlb private mapping pages are dumped or not. (default: yes)
- bit 6 mean hugetlb shared mapping pages are dumped or not. (default: no)
Harvey Harrison [Sun, 19 Oct 2008 03:27:06 +0000 (20:27 -0700)]
mm: hugetlb.c make functions static, use NULL rather than 0
mm/hugetlb.c:265:17: warning: symbol 'resv_map_alloc' was not declared. Should it be static?
mm/hugetlb.c:277:6: warning: symbol 'resv_map_release' was not declared. Should it be static?
mm/hugetlb.c:292:9: warning: Using plain integer as NULL pointer
mm/hugetlb.c:1750:5: warning: symbol 'unmap_ref_private' was not declared. Should it be static?
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Sun, 19 Oct 2008 03:27:03 +0000 (20:27 -0700)]
mm: rewrite vmap layer
Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).
The biggest problem with vmap is actually vunmap. Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache. This is all done under a global lock. As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
This gives terrible quadratic scalability characteristics.
Another problem is that the entire vmap subsystem works under a single
lock. It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.
This is a rewrite of vmap subsystem to solve those problems. The existing
vmalloc API is implemented on top of the rewritten subsystem.
The TLB flushing problem is solved by using lazy TLB unmapping. vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated. So the addresses aren't allocated again until
a subsequent TLB flush. A single TLB flush then can flush multiple
vunmaps from each CPU.
XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.
The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.
There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.
To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap. Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).
As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages. Different numbers
of tests were run in parallel on an 4 core, 2 socket opteron. Results are
in nanoseconds per map+touch+unmap.
So with a 8 cores, the rewritten version is already 25x faster.
In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram... along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system. I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now. vmap is pretty well blown off the
profiles.
vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.
[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Denys Vlasenko [Sun, 19 Oct 2008 03:27:01 +0000 (20:27 -0700)]
mmap.c: deinline a few functions
__vma_link_file and expand_downwards functions are not small, yeat they
are marked inline. They probably had one callsite sometime in the past,
but now they have more. In order to prevent similar thing, I also
deinlined expand_upwards, despite it having only pne callsite. Nowadays
gcc auto-inlines such static functions anyway. In find_extend_vma, I
removed one extra level of indirection.
Patch is deliberately generated with -U $BIGNUM to make
it easier to see that functions are big.
Nick Piggin [Sun, 19 Oct 2008 03:26:58 +0000 (20:26 -0700)]
mm: unlockless reclaim
unlock_page is fairly expensive. It can be avoided in page reclaim
success path. By definition if we have any other references to the page
it would be a bug anyway.
Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Sun, 19 Oct 2008 03:26:57 +0000 (20:26 -0700)]
mm: pagecache insertion fewer atomics
Setting and clearing the page locked when inserting it into swapcache /
pagecache when it has no other references can use non-atomic page flags
operations because no other CPU may be operating on it at this time.
This saves one atomic operation when inserting a page into pagecache.
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:56 +0000 (20:26 -0700)]
mlock: make mlock error return Posixly Correct
Rework Posix error return for mlock().
Posix requires error code for mlock*() system calls for some conditions
that differ from what kernel low level functions, such as
get_user_pages(), return for those conditions. For more info, see:
This patch provides the same translation of get_user_pages()
error codes to posix specified error codes in the context
of the mlock rework for unevictable lru.
[akpm@linux-foundation.org: fix build] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:56 +0000 (20:26 -0700)]
mlock: revert mainline handling of mlock error return
This change is intended to make mlock() error returns correct.
make_page_present() is a lower level function used by more than mlock().
Subsequent patch[es] will add this error return fixup in an mlock specific
path.
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Sun, 19 Oct 2008 03:26:55 +0000 (20:26 -0700)]
vmscan: don't accumulate scan pressure on unrelated lists
During each reclaim scan we accumulate scan pressure on unrelated lists
which will result in bogus scans and unwanted reclaims eventually.
Scanning lists with few reclaim candidates results in a lot of rotation
and therefor also disturbs the list balancing, putting even more
pressure on the wrong lists.
In a test-case with much streaming IO, and therefor a crowded inactive
file page list, swapping started because
a) anon pages were reclaimed after swap_cluster_max reclaim
invocations -- nr_scan of this list has just accumulated
b) active file pages were scanned because *their* nr_scan has also
accumulated through the same logic. And this in return created a
lot of rotation for file pages and resulted in a decrease of file
list priority, again increasing the pressure on anon pages.
The result was an evicted working set of anon pages while there were
tons of inactive file pages that should have been taken instead.
Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:53 +0000 (20:26 -0700)]
vmscan: unevictable LRU scan sysctl
This patch adds a function to scan individual or all zones' unevictable
lists and move any pages that have become evictable onto the respective
zone's inactive list, where shrink_inactive_list() will deal with them.
Adds sysctl to scan all nodes, and per node attributes to individual
nodes' zones.
Kosaki: If evictable page found in unevictable lru when write
/proc/sys/vm/scan_unevictable_pages, print filename and file offset of
these pages.
[akpm@linux-foundation.org: fix one CONFIG_MMU=n build error]
[kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:52 +0000 (20:26 -0700)]
swap: cull unevictable pages in fault path
In the fault paths that install new anonymous pages, check whether the
page is evictable or not using lru_cache_add_active_or_unevictable(). If
the page is evictable, just add it to the active lru list [via the pagevec
cache], else add it to the unevictable list.
This "proactive" culling in the fault path mimics the handling of mlocked
pages in Nick Piggin's series to keep mlocked pages off the lru lists.
Notes:
1) This patch is optional--e.g., if one is concerned about the
additional test in the fault path. We can defer the moving of
nonreclaimable pages until when vmscan [shrink_*_list()]
encounters them. Vmscan will only need to handle such pages
once, but if there are a lot of them it could impact system
performance.
2) The 'vma' argument to page_evictable() is require to notice that
we're faulting a page into an mlock()ed vma w/o having to scan the
page's rmap in the fault path. Culling mlock()ed anon pages is
currently the only reason for this patch.
3) We can't cull swap pages in read_swap_cache_async() because the
vma argument doesn't necessarily correspond to the swap cache
offset passed in by swapin_readahead(). This could [did!] result
in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
cull in this path.
4) Move set_pte_at() to after where we add page to lru to keep it
hidden from other tasks that might walk the page table.
We already do it in this order in do_anonymous() page. And,
these are COW'd anon pages. Is this safe?
[riel@redhat.com: undo an overzealous code cleanup] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rik van Riel [Sun, 19 Oct 2008 03:26:50 +0000 (20:26 -0700)]
mmap: handle mlocked pages during map, remap, unmap
Originally by Nick Piggin <npiggin@suse.de>
Remove mlocked pages from the LRU using "unevictable infrastructure"
during mmap(), munmap(), mremap() and truncate(). Try to move back to
normal LRU lists on munmap() when last mlocked mapping removed. Remove
PageMlocked() status when page truncated from file.
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:49 +0000 (20:26 -0700)]
mlock: downgrade mmap sem while populating mlocked regions
We need to hold the mmap_sem for write to initiatate mlock()/munlock()
because we may need to merge/split vmas. However, this can lead to very
long lock hold times attempting to fault in a large memory region to mlock
it into memory. This can hold off other faults against the mm
[multithreaded tasks] and other scans of the mm, such as via /proc. To
alleviate this, downgrade the mmap_sem to read mode during the population
of the region for locking. This is especially the case if we need to
reclaim memory to lock down the region. We [probably?] don't need to do
this for unlocking as all of the pages should be resident--they're already
mlocked.
Now, the caller's of the mlock functions [mlock_fixup() and
mlock_vma_pages_range()] expect the mmap_sem to be returned in write mode.
Changing all callers appears to be way too much effort at this point.
So, restore write mode before returning. Note that this opens a window
where the mmap list could change in a multithreaded process. So, at least
for mlock_fixup(), where we could be called in a loop over multiple vmas,
we check that a vma still exists at the start address and that vma still
covers the page range [start,end). If not, we return an error, -EAGAIN,
and let the caller deal with it.
Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup() if
the vma at 'start' disappears or changes so that the page range
[start,end) is no longer contained in the vma. Again, let the caller deal
with it. Looks like only sys_remap_file_pages() [via mmap_region()]
should actually care.
With this patch, I no longer see processes like ps(1) blocked for seconds
or minutes at a time waiting for a large [multiple gigabyte] region to be
locked down. However, I occassionally see delays while unlocking or
unmapping a large mlocked region. Should we also downgrade the mmap_sem
for the unlock path?
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Sun, 19 Oct 2008 03:26:44 +0000 (20:26 -0700)]
mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
unevictable pages, preventing them from getting to
page_referenced()/try_to_unmap(). Also allows separate
accounting of mlock'd pages, as Nick's original patch
did.
Note: Nick's original mlock patch used a PG_mlocked
flag. I had removed this in favor of the PG_unevictable
flag + an mlock_count [new page struct member]. I
restored the PG_mlocked flag to eliminate the new
count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on unevictable
LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull unevictable pages in fault
path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count. If this mechanism let's pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes: Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
because current get_user_pages() can't grab PROT_NONE pages theresore it
cause PROT_NONE pages can't munlock.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:43 +0000 (20:26 -0700)]
SHM_LOCKED pages are unevictable
Shmem segments locked into memory via shmctl(SHM_LOCKED) should not be
kept on the normal LRU, since scanning them is a waste of time and might
throw off kswapd's balancing algorithms. Place them on the unevictable
LRU list instead.
Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
memory regions as unevictable. Then these pages will be culled off the
normal LRU lists during vmscan.
Add new wrapper function to clear the mapping's unevictable state when/if
shared memory segment is munlocked.
Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
the shmem segment's mapping [struct address_space] for evictability now
that they're no longer locked. If so, move them to the appropriate zone
lru list.
Changes depend on [CONFIG_]UNEVICTABLE_LRU.
[kosaki.motohiro@jp.fujitsu.com: revert shm change] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:42 +0000 (20:26 -0700)]
Ramfs and Ram Disk pages are unevictable
Christoph Lameter pointed out that ram disk pages also clutter the LRU
lists. When vmscan finds them dirty and tries to clean them, the ram disk
writeback function just redirties the page so that it goes back onto the
active list. Round and round she goes...
With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no
longer the case, as ram disk pages are no longer maintained on the lru.
[This makes them unmigratable for defrag or memory hot remove, but that
can be addressed by a separate patch series.] However, the ramfs pages
behave like ram disk pages used to, so:
Define new address_space flag [shares address_space flags member with
mapping's gfp mask] to indicate that the address space contains all
unevictable pages. This will provide for efficient testing of ramfs pages
in page_evictable().
Also provide wrapper functions to set/test the unevictable state to
minimize #ifdefs in ramfs driver and any other users of this facility.
Set the unevictable state on address_space structures for new ramfs
inodes. Test the unevictable state in page_evictable() to cull
unevictable pages.
These changes depend on [CONFIG_]UNEVICTABLE_LRU.
[riel@redhat.com: undo the brd.c part] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Debugged-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:40 +0000 (20:26 -0700)]
unevictable lru: add event counting with statistics
Fix to unevictable-lru-page-statistics.patch
Add unevictable lru infrastructure vm events to the statistics patch.
Rename the "NORECL_" and "noreclaim_" symbols and text strings to
"UNEVICTABLE_" and "unevictable_", respectively.
Currently, both the infrastructure and the mlocked pages event are
added by a single patch later in the series. This makes it difficult
to add or rework the incremental patches. The events actually "belong"
with the stats, so pull them up to here.
Also, restore the event counting to putback_lru_page(). This was removed
from previous patch in series where it was "misplaced". The actual events
weren't defined that early.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>