Jan Kara [Thu, 13 Jan 2011 23:45:47 +0000 (15:45 -0800)]
writeback: stop background/kupdate works from livelocking other works
Background writeback is easily livelockable in a loop in wb_writeback() by
a process continuously re-dirtying pages (or continuously appending to a
file). This is in fact intended as the target of background writeback is
to write dirty pages it can find as long as we are over
dirty_background_threshold.
But the above behavior gets inconvenient at times because no other work
queued in the flusher thread's queue gets processed. In particular, since
e.g. sync(1) relies on flusher thread to do all the IO for it, sync(1)
can hang forever waiting for flusher thread to do the work.
Generally, when a flusher thread has some work queued, someone submitted
the work to achieve a goal more specific than what background writeback
does. Moreover by working on the specific work, we also reduce amount of
dirty pages which is exactly the target of background writeout. So it
makes sense to give specific work a priority over a generic page cleaning.
Thus we interrupt background writeback if there is some other work to do.
We return to the background writeback after completing all the queued
work.
This may delay the writeback of expired inodes for a while, however the
expired inodes will eventually be flushed to disk as long as the other
works won't livelock.
[fengguang.wu@intel.com: update comment] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 13 Jan 2011 23:45:44 +0000 (15:45 -0800)]
writeback: integrated background writeback work
Check whether background writeback is needed after finishing each work.
When bdi flusher thread finishes doing some work check whether any kind of
background writeback needs to be done (either because
dirty_background_ratio is exceeded or because we need to start flushing
old inodes). If so, just do background write back.
This way, bdi_start_background_writeback() just needs to wake up the
flusher thread. It will do background writeback as soon as there is no
other work.
This is a preparatory patch for the next patch which stops background
writeback as soon as there is other work to do.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Thu, 13 Jan 2011 23:45:43 +0000 (15:45 -0800)]
mm: vmstat: use a single setter function and callback for adjusting percpu thresholds
reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
errors due to counter drift. The functions duplicate some code so this
patch replaces them with a single set_pgdat_percpu_threshold() that takes
a callback function to calculate the desired threshold as a parameter.
Mel Gorman [Thu, 13 Jan 2011 23:45:41 +0000 (15:45 -0800)]
mm: page allocator: adjust the per-cpu counter threshold when memory is low
Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To
avoid synchronization overhead, these counters are maintained on a per-cpu
basis and drained both periodically and when a threshold is above a
threshold. On large CPU systems, the difference between the estimate and
real value of NR_FREE_PAGES can be very high. The system can get into a
case where pages are allocated far below the min watermark potentially
causing livelock issues. The commit solved the problem by taking a better
reading of NR_FREE_PAGES when memory was low.
Unfortately, as reported by Shaohua Li this accurate reading can consume a
large amount of CPU time on systems with many sockets due to cache line
bouncing. This patch takes a different approach. For large machines
where counter drift might be unsafe and while kswapd is awake, the per-cpu
thresholds for the target pgdat are reduced to limit the level of drift to
what should be a safe level. This incurs a performance penalty in heavy
memory pressure by a factor that depends on the workload and the machine
but the machine should function correctly without accidentally exhausting
all memory on a node. There is an additional cost when kswapd wakes and
sleeps but the event is not expected to be frequent - in Shaohua's test
case, there was one recorded sleep and wake event at least.
To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
introduced that takes a more accurate reading of NR_FREE_PAGES when called
from wakeup_kswapd, when deciding whether it is really safe to go back to
sleep in sleeping_prematurely() and when deciding if a zone is really
balanced or not in balance_pgdat(). We are still using an expensive
function but limiting how often it is called.
When the test case is reproduced, the time spent in the watermark
functions is reduced. The following report is on the percentage of time
spent cumulatively spent in the functions zone_nr_free_pages(),
zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
zone_page_state_snapshot(), zone_page_state().
vanilla 11.6615%
disable-threshold 0.2584%
David said:
: We had to pull aa454840 "mm: page allocator: calculate a better estimate
: of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
: internally because tests showed that it would cause the machine to stall
: as the result of heavy kswapd activity. I merged it back with this fix as
: it is pending in the -mm tree and it solves the issue we were seeing, so I
: definitely think this should be pushed to -stable (and I would seriously
: consider it for 2.6.37 inclusion even at this late date).
Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reported-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Tested-by: Nicolas Bareil <nico@chdir.org> Cc: David Rientjes <rientjes@google.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: <stable@kernel.org> [2.6.37.1, 2.6.36.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Jones [Thu, 13 Jan 2011 23:45:40 +0000 (15:45 -0800)]
sched: remove long deprecated CLONE_STOPPED flag
This warning was added in commit bdff746a3915 ("clone: prepare to recycle
CLONE_STOPPED") three years ago. 2.6.26 came and went. As far as I know,
no-one is actually using CLONE_STOPPED.
Signed-off-by: Dave Jones <davej@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Claudio Scordino [Thu, 13 Jan 2011 23:45:39 +0000 (15:45 -0800)]
atmel_serial: fix RTS high after initialization in RS485 mode
When working in RS485 mode, the atmel_serial driver keeps RTS high after
the initialization of the serial port. It goes low only after the first
character has been sent.
Eric Dumazet [Thu, 13 Jan 2011 23:45:38 +0000 (15:45 -0800)]
irq: use per_cpu kstat_irqs
Use modern per_cpu API to increment {soft|hard}irq counters, and use
per_cpu allocation for (struct irq_desc)->kstats_irq instead of an array.
This gives better SMP/NUMA locality and saves few instructions per irq.
With small nr_cpuids values (8 for example), kstats_irq was a small array
(less than L1_CACHE_BYTES), potentially source of false sharing.
In the !CONFIG_SPARSE_IRQ case, remove the huge, NUMA/cache unfriendly
kstat_irqs_all[NR_IRQS][NR_CPUS] array.
Note: we still populate kstats_irq for all possible irqs in
early_irq_init(). We probably could use on-demand allocations. (Code
included in alloc_descs()). Problem is not all IRQS are used with a prior
alloc_descs() call.
kstat_irqs_this_cpu() is not used anymore, remove it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bruce Chang [Thu, 13 Jan 2011 23:45:37 +0000 (15:45 -0800)]
MAINTAINERS: update entries affecting VIA Technologies
Since the original maintainer-Joseph Chan (josephchan@via.com.tw) doesn't
handle the Linux driver for VIA now, I would like to request to update the
maintainer for the SD/MMC CARD CONTROLLER DRIVER and VIA
UNICHROME(PRO)/CHROME9 FRAMEBUFFER DRIVER before we find a better one.
Signed-off-by: Bruce Chang <brucechang@via.com.tw> Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: Joseph Chan <JosephChan@via.com.tw> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Harald Welte <HaraldWelte@viatech.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (32 commits)
dm: raid456 basic support
dm: per target unplug callback support
dm: introduce target callbacks and congestion callback
dm mpath: delay activate_path retry on SCSI_DH_RETRY
dm: remove superfluous irq disablement in dm_request_fn
dm log: use PTR_ERR value instead of ENOMEM
dm snapshot: avoid storing private suspended state
dm snapshot: persistent make metadata_wq multithreaded
dm: use non reentrant workqueues if equivalent
dm: convert workqueues to alloc_ordered
dm stripe: switch from local workqueue to system_wq
dm: dont use flush_scheduled_work
dm snapshot: remove unused dm_snapshot queued_bios_work
dm ioctl: suppress needless warning messages
dm crypt: add loop aes iv generator
dm crypt: add multi key capability
dm crypt: add post iv call to iv generator
dm crypt: use io thread for reads only if mempool exhausted
dm crypt: scale to multiple cpus
dm crypt: simplify compatible table output
...
Linus Torvalds [Fri, 14 Jan 2011 01:30:20 +0000 (17:30 -0800)]
Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
md: Fix removal of extra drives when converting RAID6 to RAID5
md: range check slot number when manually adding a spare.
md/raid5: handle manually-added spares in start_reshape.
md: fix sync_completed reporting for very large drives (>2TB)
md: allow suspend_lo and suspend_hi to decrease as well as increase.
md: Don't let implementation detail of curr_resync leak out through sysfs.
md: separate meta and data devs
md-new-param-to_sync_page_io
md-new-param-to-calc_dev_sboffset
md: Be more careful about clearing flags bit in ->recovery
md: md_stop_writes requires mddev_lock.
md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer
md: Ensure no IO request to get md device before it is properly initialised.
md: Fix single printks with multiple KERN_<level>s
md: fix regression resulting in delays in clearing bits in a bitmap
md: fix regression with re-adding devices to arrays with no metadata
Linus Torvalds [Fri, 14 Jan 2011 01:19:38 +0000 (17:19 -0800)]
ecryptfs: fix broken build
Stephen Rothwell reports that the vfs merge broke the build of ecryptfs.
The breakage comes from commit 66cb76666d69 ("sanitize ecryptfs
->mount()") which was obviously not even build tested. Tssk, tssk, Al.
This is the minimal build fixup for the situation, although I don't have
a filesystem to actually test it with.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NeilBrown [Thu, 13 Jan 2011 22:14:34 +0000 (09:14 +1100)]
md/raid5: handle manually-added spares in start_reshape.
It is possible to manually add spares to specific slots before
starting a reshape.
raid5_start_reshape should recognised this possibility and include
it in the accounting.
NeilBrown [Thu, 13 Jan 2011 22:14:34 +0000 (09:14 +1100)]
md: allow suspend_lo and suspend_hi to decrease as well as increase.
The sysfs attributes 'suspend_lo' and 'suspend_hi' describe a region
to which read/writes are suspended so that the under lying data can be
manipulated without user-space noticing.
Currently the window they describe can only move forwards along the
device. However this is an unnecessary restriction which will cause
problems with planned developments.
So relax this restriction and allow these endpoints to move
arbitrarily.
NeilBrown [Thu, 13 Jan 2011 22:14:34 +0000 (09:14 +1100)]
md: Don't let implementation detail of curr_resync leak out through sysfs.
mddev->curr_resync has artificial values of '1' and '2' which are used
by the code which ensures only one resync is happening at a time on
any given device.
These values are internal and should never be exposed to user-space
(except when translated appropriately as in the 'pending' status in
/proc/mdstat).
Unfortunately they are as ->curr_resync is assigned to
->curr_resync_completed and that value is directly visible through
sysfs.
So change the assignments to ->curr_resync_completed to get the same
valued from elsewhere in a form that doesn't have the magic '1' or '2'
values.
Jonathan Brassow [Thu, 13 Jan 2011 22:14:33 +0000 (09:14 +1100)]
md-new-param-to_sync_page_io
Add new parameter to 'sync_page_io'.
The new parameter allows us to distinguish between metadata and data
operations. This becomes important later when we add the ability to
use separate devices for data and metadata.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Jonathan Brassow [Thu, 13 Jan 2011 22:14:33 +0000 (09:14 +1100)]
md-new-param-to-calc_dev_sboffset
When we allow for separate devices for data and metadata
in a later patch, we will need to be able to calculate
the superblock offset based on more than the bdev.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
NeilBrown [Thu, 13 Jan 2011 22:14:33 +0000 (09:14 +1100)]
md: Be more careful about clearing flags bit in ->recovery
Setting ->recovery to 0 is generally not a good idea as it could clear
bits that shouldn't be cleared. In particular, MD_RECOVERY_FROZEN
should only be cleared on explicit request from user-space.
So when we need to clear things, just clear the bits that need
clearing.
As there are a few different places which reap a resync process - and
some do an incomplte job - factor out the code for doing the from
md_check_recovery and call that function instead of open coding part
of it.
Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: Jonathan Brassow <jbrassow@redhat.com>
NeilBrown [Thu, 13 Jan 2011 22:14:33 +0000 (09:14 +1100)]
md: md_stop_writes requires mddev_lock.
As md_stop_writes manipulates the sync_thread and calls md_update_sb,
it need to be called with mddev_lock held.
In all internal cases it is, but the symbol is exported for dm-raid to
call and in that case the lock won't be help.
Do make an exported version which takes the lock, and an internal
version which does not.
Jonathan Brassow [Thu, 13 Jan 2011 22:14:33 +0000 (09:14 +1100)]
md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer
With the module parameter 'start_dirty_degraded' set,
raid5_spare_active() previously called sysfs_notify_dirent() with a NULL
argument (rdev->sysfs_state) when a rebuild finished.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
NeilBrown [Thu, 13 Jan 2011 22:14:33 +0000 (09:14 +1100)]
md: Ensure no IO request to get md device before it is properly initialised.
When an md device is in the process of coming on line it is possible
for an IO request (typically a partition table probe) to get through
before the array is fully initialised, which can cause unexpected
behaviour (e.g. a crash).
So explicitly record when the array is ready for IO and don't allow IO
through until then.
There is no possibility for a similar problem when the array is going
off-line as there must only be one 'open' at that time, and it is busy
off-lining the array and so cannot send IO requests. So no memory
barrier is needed in md_stop()
This has been a bug since commit 409c57f3801 in 2.6.30 which
introduced md_make_request. Before then, each personality would
register its own make_request_fn when it was ready.
This is suitable for any stable kernel from 2.6.30.y onwards.
NeilBrown [Thu, 13 Jan 2011 22:13:53 +0000 (09:13 +1100)]
md: fix regression resulting in delays in clearing bits in a bitmap
commit 589a594be1fb (2.6.37-rc4) fixed a problem were md_thread would
sometimes call the ->run function at a bad time.
If an error is detected during array start up after the md_thread has
been started, the md_thread is killed. This resulted in the ->run
function being called once. However the array may not be in a state
that it is safe to call ->run.
However the fix imposed meant that ->run was not called on a timeout.
This means that when an array goes idle, bitmap bits do not get
cleared promptly. While the array is busy the bits will still be
cleared when appropriate so this is not very serious. There is no
risk to data.
Change the test so that we only avoid calling ->run when the thread
is being stopped. This more explicitly addresses the problem situation.
This is suitable for 2.6.37-stable and any -stable kernel to which 589a594be1fb was applied.
Linus Torvalds [Thu, 13 Jan 2011 20:06:58 +0000 (12:06 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/avr32-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/avr32-2.6:
avr32: update default configuration files for Atmel boards
avr32: Convert to clocksource_register_hz
avr32: make architecture sys_clone prototype match asm-generic prototype
avr32: use syscall prototypes from asm-generic instead of arch
avr32: disable kprobes for all default configurations
avr32: boards: setup: use IS_ERR() instead of NULL check
Trond Myklebust [Thu, 13 Jan 2011 19:15:50 +0000 (14:15 -0500)]
NFS: Fix NFSv3 exclusive open semantics
Commit c0204fd2b8fe047b18b67e07e1bf2a03691240cd (NFS: Clean up
nfs4_proc_create()) broke NFSv3 exclusive open by removing the code
that passes the O_EXCL flag down to nfs3_proc_create(). This patch
reverts that offending hunk from the original commit.
Reported-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org [2.6.37] Tested-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NeilBrown [Thu, 13 Jan 2011 20:00:02 +0000 (20:00 +0000)]
dm: raid456 basic support
This patch is the skeleton for the DM target that will be
the bridge from DM to MD (initially RAID456 and later RAID1). It
provides a way to use device-mapper interfaces to the MD RAID456
drivers.
As with all device-mapper targets, the nominal public interfaces are the
constructor (CTR) tables and the status outputs (both STATUSTYPE_INFO
and STATUSTYPE_TABLE). The CTR table looks like the following:
Line 1 contains the standard first three arguments to any device-mapper
target - the start, length, and target type fields. The target type in
this case is "raid".
Line 2 contains the arguments that define the particular raid
type/personality/level, the required arguments for that raid type, and
any optional arguments. Possible raid types include: raid4, raid5_la,
raid5_ls, raid5_rs, raid6_zr, raid6_nr, and raid6_nc. (again, raid1 is
planned for the future.) The list of required and optional parameters
is the same for all the current raid types. The required parameters are
positional, while the optional parameters are given as key/value pairs.
The possible parameters are as follows:
<chunk_size> Chunk size in sectors.
[[no]sync] Force/Prevent RAID initialization
[rebuild <idx>] Rebuild the drive indicated by the index
[daemon_sleep <ms>] Time between bitmap daemon work to clear bits
[min_recovery_rate <kB/sec/disk>] Throttle RAID initialization
[max_recovery_rate <kB/sec/disk>] Throttle RAID initialization
[max_write_behind <value>] See '-write-behind=' (man mdadm)
[stripe_cache <sectors>] Stripe cache size for higher RAIDs
Line 3 contains the list of devices that compose the array in
metadata/data device pairs. If the metadata is stored separately, a '-'
is given for the metadata device position. If a drive has failed or is
missing at creation time, a '-' can be given for both the metadata and
data drives for a given position.
Examples:
# RAID4 - 4 data drives, 1 parity
# No metadata devices specified to hold superblock/bitmap info
# Chunk size of 1MiB
# (Lines separated for easy reading)
0 1960893648 raid \
raid4 1 2048 \
5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
# RAID4 - 4 data drives, 1 parity (no metadata devices)
# Chunk size of 1MiB, force RAID initialization,
# min recovery rate at 20 kiB/sec/disk
0 1960893648 raid \
raid4 4 2048 min_recovery_rate 20 sync\
5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
Performing a 'dmsetup table' should display the CTR table used to
construct the mapping (with possible reordering of optional
parameters).
Performing a 'dmsetup status' will yield information on the state and
health of the array. The output is as follows:
1: <s> <l> raid \
2: <raid_type> <#devices> <1 health char for each dev> <resync_ratio>
Line 1 is standard DM output. Line 2 is best shown by example:
0 1960893648 raid raid4 5 AAAAA 2/490221568
Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array is 2/490221568 complete with recovery.
Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
NeilBrown [Thu, 13 Jan 2011 20:00:01 +0000 (20:00 +0000)]
dm: introduce target callbacks and congestion callback
DM currently implements congestion checking by checking on congestion
in each component device. For raid456 we need to also check if the
stripe cache is congested.
Extending the target_callbacks structure with additional callback
functions allows for establishing multiple callbacks per-target (a
callback is also needed for unplug).
Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm mpath: delay activate_path retry on SCSI_DH_RETRY
This patch adds a user-configurable 'pg_init_delay_msecs' feature. Use
this feature to specify the number of milliseconds to delay before
retrying scsi_dh_activate, when SCSI_DH_RETRY is returned.
SCSI Device Handlers return SCSI_DH_IMM_RETRY if we could retry
activation immediately and SCSI_DH_RETRY in cases where it is better to
retry after some delay.
Currently we immediately retry scsi_dh_activate irrespective of
SCSI_DH_IMM_RETRY and SCSI_DH_RETRY.
The 'pg_init_delay_msecs' feature may be provided during table create or
load, e.g.:
dmsetup create --table "0 20971520 multipath 3 queue_if_no_path \
pg_init_delay_msecs 2500 ..." mpatha
The default for 'pg_init_delay_msecs' is 2000 milliseconds.
Maximum configurable delay is 60000 milliseconds. Specifying a
'pg_init_delay_msecs' of 0 will cause immediate retry.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Acked-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Kiyoshi Ueda [Thu, 13 Jan 2011 20:00:00 +0000 (20:00 +0000)]
dm: remove superfluous irq disablement in dm_request_fn
This patch changes spin_lock_irq() to spin_lock() in dm_request_fn().
This patch is just a clean-up and no functional change.
The spin_lock_irq() was leftover from the early request-based dm code,
where map_request() used to enable interrupts.
Since current map_request() never enables interrupts, we can change it
to spin_lock() to match the prior spin_unlock().
Auditing through the dm and block-layer code called from
map_request(), I confirmed all functions save/restore interrupt
status, so no function returning with interrupts enabled.
Also I haven't observed any problem on my test environment which
uses scsi and lpfc driver after heavy I/O testing with occasional
path down/up.
Added BUG_ON() to detect breakage in future.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Dan Carpenter [Thu, 13 Jan 2011 20:00:00 +0000 (20:00 +0000)]
dm log: use PTR_ERR value instead of ENOMEM
It's nicer to return the PTR_ERR() value instead of just returning
-ENOMEM. In the current code the PTR_ERR() value is always equal to
-ENOMEM so this doesn't actually affect anything, but still...
In addition, dm_dirty_log_create() doesn't check for a specific -ENOMEM
return. So this change is safe relative to potential for a non -ENOMEM
return in the future.
Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Tejun Heo [Thu, 13 Jan 2011 19:59:59 +0000 (19:59 +0000)]
dm snapshot: persistent make metadata_wq multithreaded
metadata_wq serves on-stack work items from chunk_io(). Even if
multiple chunk_io() are simultaneously in progress, each is
independent and queued only once, so multithreaded workqueue can be
safely used.
Switch metadata_wq to multithread and flush the work item instead of
the workqueue in chunk_io().
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Tejun Heo [Thu, 13 Jan 2011 19:59:58 +0000 (19:59 +0000)]
dm: use non reentrant workqueues if equivalent
kmirrord_wq, kcopyd_work and md->wq are created per dm instance and
serve only a single work item from the dm instance, so non-reentrant
workqueues would provide the same ordering guarantees as ordered ones
while allowing CPU affinity and use of the workqueues for other
purposes. Switch them to non-reentrant workqueues.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Tejun Heo [Thu, 13 Jan 2011 19:59:57 +0000 (19:59 +0000)]
dm: convert workqueues to alloc_ordered
Convert all create[_singlethread]_work() users to the new
alloc[_ordered]_workqueue(). This conversion is mechanical and
doesn't introduce any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Tejun Heo [Thu, 13 Jan 2011 19:59:57 +0000 (19:59 +0000)]
dm stripe: switch from local workqueue to system_wq
kstriped only serves sc->kstriped_ws which runs dm_table_event().
This doesn't need to be executed from an ordered workqueue w/ rescuer.
Drop kstriped and use the system_wq instead. While at it, rename
kstriped_ws to trigger_event so that it's consistent with other dm
modules.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Tejun Heo [Thu, 13 Jan 2011 19:59:56 +0000 (19:59 +0000)]
dm: dont use flush_scheduled_work
flush_scheduled_work() is being deprecated. Flush the used work
directly instead. In all dm targets, the only work which uses
system_wq is ->trigger_event.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm_snapshot->queued_bios_work isn't used. Remove ->queued_bios[_work]
from dm_snapshot structure, the flush_queued_bios work function and
ksnapd workqueue.
The DM snapshot changes that were going to use the ksnapd workqueue were
either superseded (fix for origin write races) or never completed
(deallocation of invalid snapshot's memory via workqueue).
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Milan Broz [Thu, 13 Jan 2011 19:59:55 +0000 (19:59 +0000)]
dm ioctl: suppress needless warning messages
The device-mapper should not send warning messages to syslog
if a device is not found. This can be done by userspace
according to the returned dm-ioctl error code.
So move these messages to debug level and use rate limiting
to not flood syslog.
Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Milan Broz [Thu, 13 Jan 2011 19:59:55 +0000 (19:59 +0000)]
dm crypt: add loop aes iv generator
This patch adds a compatible implementation of the block
chaining mode used by the Loop-AES block device encryption
system (http://loop-aes.sourceforge.net/) designed
by Jari Ruusu.
It operates on full 512 byte sectors and uses CBC
with an IV derived from the sector number, the data and
optionally extra IV seed.
This means that after CBC decryption the first block of sector
must be tweaked according to decrypted data.
Loop-AES can use three encryption schemes:
version 1: is plain aes-cbc mode (already compatible)
version 2: uses 64 multikey scheme with own IV generator
version 3: the same as version 2 with additional IV seed
(it uses 65 keys, last key is used as IV seed)
The IV generator is here named lmk (Loop-AES multikey)
and for the cipher specification looks like: aes:64-cbc-lmk
Version 2 and 3 is recognised according to length
of provided multi-key string (which is just hexa encoded
"raw key" used in original Loop-AES ioctl).
Configuration of the device and decoding key string will
be done in userspace (cryptsetup).
(Loop-AES stores keys in gpg encrypted file, raw keys are
output of simple hashing of lines in this file).
Based on an implementation by Max Vozeler:
http://article.gmane.org/gmane.linux.kernel.cryptoapi/3752/
Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> CC: Max Vozeler <max@hinterhof.net>
Milan Broz [Thu, 13 Jan 2011 19:59:54 +0000 (19:59 +0000)]
dm crypt: add multi key capability
This patch adds generic multikey handling to be used
in following patch for Loop-AES mode compatibility.
This patch extends mapping table to optional keycount and
implements generic multi-key capability.
With more keys defined the <key> string is divided into
several <keycount> sections and these are used for tfms.
The tfm is used according to sector offset
(sector 0->tfm[0], sector 1->tfm[1], sector N->tfm[N modulo keycount])
(only power of two values supported for keycount here).
Because of tfms per-cpu allocation, this mode can be take
a lot of memory on large smp systems.
Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Max Vozeler <max@hinterhof.net>
Milan Broz [Thu, 13 Jan 2011 19:59:54 +0000 (19:59 +0000)]
dm crypt: add post iv call to iv generator
IV (initialisation vector) can in principle depend not only
on sector but also on plaintext data (or other attributes).
Change IV generator interface to work directly with dmreq
structure to allow such dependence in generator.
Also add post() function which is called after the crypto
operation.
This allows tricky modification of decrypted data or IV
internals.
In asynchronous mode the post() can be called after
ctx->sector count was increased so it is needed
to add iv_sector copy directly to dmreq structure.
(N.B. dmreq always include only one sector in scatterlists)
Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Andi Kleen [Thu, 13 Jan 2011 19:59:53 +0000 (19:59 +0000)]
dm crypt: scale to multiple cpus
Currently dm-crypt does all the encryption work for a single dm-crypt
mapping in a single workqueue. This does not scale well when multiple
CPUs are submitting IO at a high rate. The single CPU running the single
thread cannot keep up with the encryption and encrypted IO performance
tanks.
This patch changes the crypto workqueue to be per CPU. This means
that as long as the IO submitter (or the interrupt target CPUs
for reads) runs on different CPUs the encryption work will be also
parallel.
To avoid a bottleneck on the IO worker I also changed those to be
per-CPU threads.
There is still some shared data, so I suspect some bouncing
cache lines. But I haven't done a detailed study on that yet.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Milan Broz [Thu, 13 Jan 2011 19:59:52 +0000 (19:59 +0000)]
dm crypt: simplify compatible table output
Rename cc->cipher_mode to cc->cipher_string and store the whole of the cipher
information so it can easily be printed when processing the DM_DEV_STATUS ioctl.
Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Jonathan Brassow [Thu, 13 Jan 2011 19:59:52 +0000 (19:59 +0000)]
dm log userspace: add version number to comms
This patch adds a 'version' field to the 'dm_ulog_request'
structure.
The 'version' field is taken from a portion of the unused
'padding' field in the 'dm_ulog_request' structure. This
was done to avoid changing the size of the structure and
possibly disrupting backwards compatibility.
The version number will help notify user-space daemons
when a change has been made to the kernel/userspace
log API.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Jonathan Brassow [Thu, 13 Jan 2011 19:59:51 +0000 (19:59 +0000)]
dm log userspace: group clear and mark requests
Allow the device-mapper log's 'mark' and 'clear' requests to be
grouped and processed in a batch. This can significantly reduce the
amount of traffic going between the kernel and userspace (where the
processing daemon resides).
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Jonathan Brassow [Thu, 13 Jan 2011 19:59:50 +0000 (19:59 +0000)]
dm log userspace: split flush queue
Split the 'flush_list', which contained a mix of both 'mark' and 'clear'
requests, into two distinct lists ('mark_list' and 'clear_list').
The device mapper log implementations (used by various DM targets) are
allowed to cache 'mark' and 'clear' requests until a 'flush' is
received. Until now, these cached requests were kept in the same list.
They will now be put into distinct lists to facilitate group processing
of these requests (in the next patch).
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Thu, 13 Jan 2011 19:59:50 +0000 (19:59 +0000)]
dm kcopyd: delay unplugging
Make kcopyd merge more I/O requests by using device unplugging.
Without this patch, each I/O request is dispatched separately to the device.
If the device supports tagged queuing, there are many small requests sent
to the device. To improve performance, this patch will batch as many requests
as possible, allowing the queue to merge consecutive requests, and send them
to the device at once.
In my tests (15k SCSI disk), this patch improves sequential write throughput:
In most common uses (snapshot or two-way mirror), kcopyd is only used for
two devices, one for reading and the other for writing, thus this optimization
is implemented only for two devices. The optimization may be extended to n-way
mirrors with some code complexity increase.
We keep track of two block devices to unplug (one for read and the
other for write) and unplug them when exiting "do_work" thread. If
there are more devices used (in theory it could happen, in practice it
is rare), we unplug immediately.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Jonathan Brassow [Thu, 13 Jan 2011 19:59:49 +0000 (19:59 +0000)]
dm log userspace: trap all failed log construction errors
When constructing a mirror log, it is possible for the initial request
to fail for other reasons besides -ESRCH. These must be handled too.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Milan Broz [Thu, 13 Jan 2011 19:59:48 +0000 (19:59 +0000)]
dm: remove dm_mutex after bkl conversion
This patch replaces dm_mutex with _minor_lock in dm_blk_close()
and then removes it.
During the BKL conversion, commit 6e9624b8caec290d28b4c6d9ec75749df6372b87
(block: push down BKL into .open and .release) pushed lock_kernel()
down into dm_blk_open/close calls.
Commit 2a48fc0ab24241755dc93bfd4f01d68efab47f5a
(block: autoconvert trivial BKL users to private mutex) converted it to a
local mutex, but _minor_lock is sufficient.
Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Peter Jones [Thu, 13 Jan 2011 19:59:47 +0000 (19:59 +0000)]
dm ioctl: allow rename to fill empty uuid
Allow the uuid of a mapped device to be set after device creation.
Previously the uuid (which is optional) could only be set by
DM_DEV_CREATE. If no uuid was supplied it could not be set later.
Sometimes it's necessary to create the device before the uuid is known,
and in such cases the uuid must be filled in after the creation.
This patch extends DM_DEV_RENAME to accept a uuid accompanied by
a new flag DM_UUID_FLAG. This can only be done once and if no
uuid was previously supplied. It cannot be used to change an
existing uuid.
DM_VERSION_MINOR is also bumped to 19 to indicate this interface
extension is available.
Signed-off-by: Peter Jones <pjones@redhat.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Snapshot exception reallocations are triggered by writes that are
usually async, so mark the associated dm_io_request as async as well.
This helps when using the CFQ I/O scheduler because it has separate
queues for sync and async I/O. Async is optimized for throughput; sync
for latency. With this change we're consciously favoring throughput over
latency.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Multipath began to use blk_abort_queue() to allow for
lower latency path deactivation. This was found to
cause list corruption:
the cmd gets blk_abort_queued/timedout run on it and the scsi eh
somehow is able to complete and run scsi_queue_insert while
scsi_request_fn is still trying to process the request.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: stable@kernel.org
Mike Snitzer [Thu, 13 Jan 2011 19:53:46 +0000 (19:53 +0000)]
dm: dont take i_mutex to change device size
No longer needlessly hold md->bdev->bd_inode->i_mutex when changing the
size of a DM device. This additional locking is unnecessary because
i_size_write() is already protected by the existing critical section in
dm_swap_table(). DM already has a reference on md->bdev so the
associated bd_inode may be changed without lifetime concerns.
A negative side-effect of having held md->bdev->bd_inode->i_mutex was
that a concurrent DM device resize and flush (via fsync) would deadlock.
Dropping md->bdev->bd_inode->i_mutex eliminates this potential for
deadlock. The following reproducer no longer deadlocks:
https://www.redhat.com/archives/dm-devel/2009-July/msg00284.html
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
avr32: update default configuration files for Atmel boards
This patch adjusts some values to make the default configuration for Atmel
boards more similar, and adds missing values to enable required functions. Also
remove defined symbols for functions not in use.
avr32: make architecture sys_clone prototype match asm-generic prototype
This patch will fix the arguments to the architecture sys_clone() function to
match the asm-generic/syscalls.h prototype. In the same go remove the
architecture specific prototype for the same function.
The sys_clone() function is only called from assembly, hence the argument types
were not having any affect.
avr32: use syscall prototypes from asm-generic instead of arch
This patch removes the redundant syscalls prototypes in the architecture
specific syscalls.h header file. These were identical with the ones in
asm-generic/syscalls.h.
Signed-off-by: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com> Reported-by: Peter Huewe <PeterHuewe@gmx.de> Reported-by: Sven Schnelle <svens@stackframe.org> Cc: stable <stable@kernel.org>
avr32: disable kprobes for all default configurations
This patch will disable kprobes for all the default AVR32 board configurations.
This works around a regression in kprobes which seems to be related to AVR32 is
now lacking the struct kprobe_ctlblk.
Linus Torvalds [Thu, 13 Jan 2011 19:02:05 +0000 (11:02 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6:
firewire: ohci: fix compilation on arches without PAGE_KERNEL_RO
Linus Torvalds [Thu, 13 Jan 2011 18:50:24 +0000 (10:50 -0800)]
Merge branch 'for-2.6.38/drivers' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.38/drivers' of git://git.kernel.dk/linux-2.6-block:
cciss: reinstate proper FIFO order of command queue list
floppy: replace NO_GEOM macro with a function
Linus Torvalds [Thu, 13 Jan 2011 18:45:01 +0000 (10:45 -0800)]
Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
Linus Torvalds [Thu, 13 Jan 2011 18:40:57 +0000 (10:40 -0800)]
Merge branch 'for-linus/i2c-2638' of git://git.fluff.org/bjdooks/linux
* 'for-linus/i2c-2638' of git://git.fluff.org/bjdooks/linux:
i2c-bfin-twi: move setup to the earlier subsys initcall
i2c-bfin-twi: handle faulty slave devices better
i2c-mv64xxx: send repeated START between messages in xfer
i2c-nomadik: fix regression on adapter name
i2c-omap: Set latency requirements only once for several messages
i2c-eg20t: add driver for Intel EG20T
i2c-ocores: add some device tree documentation
i2c-ocores: Use devres for resource allocation
i2c-ocores: Adapt for device tree
i2c-iop3xx: add iomem annotation
Linus Torvalds [Thu, 13 Jan 2011 18:40:00 +0000 (10:40 -0800)]
Merge branch 'rmobile-latest' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
* 'rmobile-latest' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6:
ARM: mach-shmobile: Kill off unused !gpio_is_valid() case
ARM: mach-shmobile: sh7372 Enable SDIO IRQs for Mackerel
ARM: mach-shmobile: sh7377 Enable SDIO IRQs
ARM: mach-shmobile: sh7367 Enable SDIO IRQs
ARM: mach-shmobile: sh7372 Enable SDIO IRQs
ARM: mach-shmobile: mackerel: Add touchscreen ST1232 support
ARM: mach-shmobile: ap4eb: SCIF port for earlyprintk when using zboot
ARM: mach-shmobile: mackerel: SCIF port for earlyprintk when using zboot
ARM: mach-shmobile: mackerel: Add support get_cd in CN23
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/fbdev-2.6: (29 commits)
video: move SH_MIPI_DSI/SH_LCD_MIPI_DSI to the top of menu
fbdev: Implement simple blanking in pseudocolor modes for vt8500lcdfb
video: imx: Update the manufacturer's name
nuc900fb: don't treat NULL clk as an error
s3c2410fb: don't treat NULL clk as an error
video: tidy up modedb formatting.
video: matroxfb: Correct video option in comments and kernel config help.
fbdev: sh_mobile_hdmi: simplify pointer handling
fbdev: sh_mobile_hdmi: framebuffer notifiers have to be registered
fbdev: sh_mobile_hdmi: add command line option to use the preferred EDID mode
OMAP: DSS2: Introduce omap_channel as an omap_dss_device parameter, add new overlay manager.
OMAP: DSS2: Use dss_features to handle DISPC bits removed on OMAP4
OMAP: DSS2: LCD2 Channel Changes for DISPC
OMAP: DSS2: Change remaining DISPC functions for new omap_channel argument
OMAP: DSS2: Introduce omap_channel argument to DISPC functions used by interface drivers
OMAP: DSS2: Represent DISPC register defines with channel as parameter
OMAP: DSS2: Add dss_features for omap4 and overlay manager related features
OMAP: DSS2: Clean up DISPC color mode validation checks
OMAP: DSS2: Add back authors of panel-generic.c based drivers
OMAP: DSS2: remove generic DPI panel driver duplicated panel drivers
...
Linus Torvalds [Thu, 13 Jan 2011 18:27:28 +0000 (10:27 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
fs: add documentation on fallocate hole punching
Gfs2: fail if we try to use hole punch
Btrfs: fail if we try to use hole punch
Ext4: fail if we try to use hole punch
Ocfs2: handle hole punching via fallocate properly
XFS: handle hole punching via fallocate properly
fs: add hole punching to fallocate
vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
fix signedness mess in rw_verify_area() on 64bit architectures
fs: fix kernel-doc for dcache::prepend_path
fs: fix kernel-doc for dcache::d_validate
sanitize ecryptfs ->mount()
switch afs
move internal-only parts of ncpfs headers to fs/ncpfs
switch ncpfs
switch 9p
pass default dentry_operations to mount_pseudo()
switch hostfs
switch affs
switch configfs
...
Linus Torvalds [Thu, 13 Jan 2011 18:25:24 +0000 (10:25 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: fix cleanup when trying to mount inexistent image
net/ceph: make ceph_msgr_wq non-reentrant
ceph: fsc->*_wq's aren't used in memory reclaim path
ceph: Always free allocated memory in osdmap_decode()
ceph: Makefile: Remove unnessary code
ceph: associate requests with opening sessions
ceph: drop redundant r_mds field
ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
ceph: add dir_layout to inode
* git://git.kernel.org/pub/scm/linux/kernel/git/wim/linux-2.6-watchdog:
watchdog: Add MCF548x watchdog driver.
watchdog: add driver for the Atheros AR71XX/AR724X/AR913X SoCs
watchdog: Add TCO support for nVidia chipsets
watchdog: Add support for sp5100 chipset TCO
watchdog: f71808e_wdt: add F71862FG, F71869 to Kconfig
watchdog: iTCO_wdt: TCO Watchdog patch for Intel DH89xxCC PCH
watchdog: iTCO_wdt: TCO Watchdog patch for Intel NM10 DeviceIDs
watchdog: ks8695_wdt: include mach/hardware.h instead of mach/timex.h.
watchdog: Propagate Book E WDT period changes to all cores
watchdog: add CONFIG_WATCHDOG_NOWAYOUT support to PowerPC Book-E watchdog driver
watchdog: alim7101_wdt: fix compiler warning on alim7101_pci_tbl
watchdog: alim1535_wdt: fix compiler warning on ali_pci_tbl
watchdog: Fix reboot on W83627ehf chipset.
watchdog: Add watchdog support for W83627DHG chip
watchdog: f71808e_wdt: Add Fintek F71869 watchdog
watchdog: add f71862fg support
watchdog: clean-up f71808e_wdt.c
Linus Torvalds [Thu, 13 Jan 2011 18:24:29 +0000 (10:24 -0800)]
Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging
* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
hwmon: (dme1737) Minor cleanups
hwmon: (dme1737) Add support for in7 for SCH5127
hwmon: (emc1403) Add EMC1423 support
hwmon: (w83627hf) Document W83627THF voltage pin mapping
hwmon: (w83793) Drop useless mutex
hwmon: (fschmd) Drop useless mutex
hwmon: (w83781d) Use pr_fmt and pr_<level>
hwmon: (pc87427) Use pr_fmt and pr_<level>
hwmon: (pc87360) Use pr_fmt and pr_<level>
hwmon: (lm78) Use pr_fmt and pr_<level>
hwmon: (it87) Use pr_fmt and pr_<level>
hwmon: Schedule the removal of the old intrusion detection interfaces
hwmon: (w83793) Implement the standard intrusion detection interface
hwmon: (w83792d) Implement the standard intrusion detection interface
hwmon: (adm9240) Implement the standard intrusion detection interface
hwmon: (via686a) Initialize fan_div values
hwmon: (w83795) Silent false warning from gcc
hwmon: (ads7828) Update email contact details
Linus Torvalds [Thu, 13 Jan 2011 18:24:07 +0000 (10:24 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6: (45 commits)
regulator: missing index in PTR_ERR() in isl6271a_probe()
regulator: Assign return value of mc13xxx_reg_rmw to ret
regulator: Add initial per-regulator debugfs support
regulator: Make regulator_has_full_constraints a bool
regulator: Clean up logging a bit
regulator: Optimise out noop voltage changes
regulator: Add API to re-apply voltage to hardware
regulator: Staticise non-exported functions in mc13892
regulator: Only notify voltage changes when they succeed
regulator: Provide a selector based set_voltage_sel() operation
regulator: Factor out voltage set operation into a separate function
regulator: Convert WM8994 to use get_voltage_sel()
regulator: Convert WM835x to use get_voltage_sel()
regulator: Allow modular build of mc13xxx-core
regulator: support PMIC mc13892
make mc13783 regulator code generic
Change the register name definitions for mc13783
mach-ux500: Updated and connected ab8500 regulator board configuration
regulators: Removed macros for initialization of ab8500 regulators
regulators: Added verbose debug messages to ab8500 regulators
...
Linus Torvalds [Thu, 13 Jan 2011 18:15:12 +0000 (10:15 -0800)]
Merge branch 'x86-olpc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-olpc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, olpc: Speed up device tree creation during boot
x86, olpc: Add OLPC device-tree support
x86, of: Define irq functions to allow drivers/of/* to build on x86
Linus Torvalds [Thu, 13 Jan 2011 18:14:24 +0000 (10:14 -0800)]
Merge branch 'kvm-updates/2.6.38' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.38' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (142 commits)
KVM: Initialize fpu state in preemptible context
KVM: VMX: when entering real mode align segment base to 16 bytes
KVM: MMU: handle 'map_writable' in set_spte() function
KVM: MMU: audit: allow audit more guests at the same time
KVM: Fetch guest cr3 from hardware on demand
KVM: Replace reads of vcpu->arch.cr3 by an accessor
KVM: MMU: only write protect mappings at pagetable level
KVM: VMX: Correct asm constraint in vmcs_load()/vmcs_clear()
KVM: MMU: Initialize base_role for tdp mmus
KVM: VMX: Optimize atomic EFER load
KVM: VMX: Add definitions for more vm entry/exit control bits
KVM: SVM: copy instruction bytes from VMCB
KVM: SVM: implement enhanced INVLPG intercept
KVM: SVM: enhance mov DR intercept handler
KVM: SVM: enhance MOV CR intercept handler
KVM: SVM: add new SVM feature bit names
KVM: cleanup emulate_instruction
KVM: move complete_insn_gp() into x86.c
KVM: x86: fix CR8 handling
KVM guest: Fix kvm clock initialization when it's configured out
...
Linus Torvalds [Thu, 13 Jan 2011 17:58:38 +0000 (09:58 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: hid-multitouch: minor fixes based on additional review
HID: Switch turbox/mosart touchscreen to hid-mosart
HID: add Add Cando touch screen 10.1-inch product id
HID: hid-mulitouch: add support for the 'Sensing Win7-TwoFinger'
HID: hid-multitouch: add support for Cypress TrueTouch panels
HID: hid-multitouch: support for PixCir-based panels
HID: set HID_MAX_FIELD at 128
HID: add feature_mapping callback
Lasse Collin [Thu, 13 Jan 2011 01:01:24 +0000 (17:01 -0800)]
x86: support XZ-compressed kernel
This integrates the XZ decompression code to the x86 pre-boot code.
mkpiggy.c is updated to reserve about 32 KiB more buffer safety margin for
kernel decompression. It is done unconditionally for all decompressors to
keep the code simpler.
The XZ decompressor needs around 30 KiB of heap, so the heap size is
increased to 32 KiB on both x86-32 and x86-64.
Documentation/x86/boot.txt is updated to list the XZ magic number.
With the x86 BCJ filter in XZ, XZ-compressed x86 kernel tends to be a few
percent smaller than the equivalent LZMA-compressed kernel.
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alain Knaff <alain@knaff.lu> Cc: Albin Tonnerre <albin.tonnerre@free-electrons.com> Cc: Phillip Lougher <phillip@lougher.demon.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lasse Collin [Thu, 13 Jan 2011 01:01:23 +0000 (17:01 -0800)]
decompressors: add boot-time XZ support
This implements the API defined in <linux/decompress/generic.h> which is
used for kernel, initramfs, and initrd decompression. This patch together
with the first patch is enough for XZ-compressed initramfs and initrd;
XZ-compressed kernel will need arch-specific changes.
The buffering requirements described in decompress_unxz.c are stricter
than with gzip, so the relevant changes should be done to the
arch-specific code when adding support for XZ-compressed kernel.
Similarly, the heap size in arch-specific pre-boot code may need to be
increased (30 KiB is enough).
The XZ decompressor needs memmove(), memeq() (memcmp() == 0), and
memzero() (memset(ptr, 0, size)), which aren't available in all
arch-specific pre-boot environments. I'm including simple versions in
decompress_unxz.c, but a cleaner solution would naturally be nicer.
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alain Knaff <alain@knaff.lu> Cc: Albin Tonnerre <albin.tonnerre@free-electrons.com> Cc: Phillip Lougher <phillip@lougher.demon.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lasse Collin [Thu, 13 Jan 2011 01:01:22 +0000 (17:01 -0800)]
decompressors: add XZ decompressor module
In userspace, the .lzma format has become mostly a legacy file format that
got superseded by the .xz format. Similarly, LZMA Utils was superseded by
XZ Utils.
These patches add support for XZ decompression into the kernel. Most of
the code is as is from XZ Embedded <http://tukaani.org/xz/embedded.html>.
It was written for the Linux kernel but is usable in other projects too.
Advantages of XZ over the current LZMA code in the kernel:
- Nice API that can be used by other kernel modules; it's
not limited to kernel, initramfs, and initrd decompression.
- Integrity check support (CRC32)
- BCJ filters improve compression of executable code on
certain architectures. These together with LZMA2 can
produce a few percent smaller kernel or Squashfs images
than plain LZMA without making the decompression slower.
This patch: Add the main decompression code (xz_dec), testing module
(xz_dec_test), wrapper script (xz_wrap.sh) for the xz command line tool,
and documentation. The xz_dec module is enough to have a usable XZ
decompressor e.g. for Squashfs.
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alain Knaff <alain@knaff.lu> Cc: Albin Tonnerre <albin.tonnerre@free-electrons.com> Cc: Phillip Lougher <phillip@lougher.demon.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lasse Collin [Thu, 13 Jan 2011 01:01:21 +0000 (17:01 -0800)]
Decompressors: check input size in decompress_unlzo.c
The code assumes that the input is valid and not truncated. Add checks to
avoid reading past the end of the input buffer. Change the type of "skip"
from u8 to int to fix a possible integer overflow.
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alain Knaff <alain@knaff.lu> Cc: Albin Tonnerre <albin.tonnerre@free-electrons.com> Cc: Phillip Lougher <phillip@lougher.demon.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lasse Collin [Thu, 13 Jan 2011 01:01:20 +0000 (17:01 -0800)]
Decompressors: check for write errors in decompress_unlzo.c
The return value of flush() is not checked in unlzo(). This means that
the decompressor won't stop even if the caller doesn't want more data.
This can happen e.g. with a corrupt LZO-compressed initramfs image.
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alain Knaff <alain@knaff.lu> Cc: Albin Tonnerre <albin.tonnerre@free-electrons.com> Cc: Phillip Lougher <phillip@lougher.demon.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lasse Collin [Thu, 13 Jan 2011 01:01:19 +0000 (17:01 -0800)]
Decompressors: validate match distance in decompress_unlzma.c
Validate the newly decoded distance (rep0) in process_bit1(). This is to
detect corrupt LZMA data quickly. The old code can run for long time
producing garbage until it hits the end of the input.
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alain Knaff <alain@knaff.lu> Cc: Albin Tonnerre <albin.tonnerre@free-electrons.com> Cc: Phillip Lougher <phillip@lougher.demon.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lasse Collin [Thu, 13 Jan 2011 01:01:18 +0000 (17:01 -0800)]
Decompressors: check for write errors in decompress_unlzma.c
The return value of wr->flush() is not checked in write_byte(). This
means that the decompressor won't stop even if the caller doesn't want
more data. This can happen e.g. with corrupt LZMA-compressed initramfs.
Returning the error quickly allows the user to see the error message
quicker.
There is a similar missing check for wr.flush() near the end of unlzma().
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alain Knaff <alain@knaff.lu> Cc: Albin Tonnerre <albin.tonnerre@free-electrons.com> Cc: Phillip Lougher <phillip@lougher.demon.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>