Jeff Moyer [Mon, 15 Aug 2011 19:37:25 +0000 (21:37 +0200)]
block: fix flush machinery for stacking drivers with differring flush flags
Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA). The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:
So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
Now, however, we don't get into the flush machinery at all. Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.
The agreed upon approach is to fix the flush machinery to allow
stacking. While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
make it function as designed.
In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io). Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers. So, I didn't see a way around the additional field.
I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance. Comments and other testers, as always, are
appreciated.
Shaohua Li [Thu, 11 Aug 2011 08:39:04 +0000 (10:39 +0200)]
block: improve rq_affinity placement
This patch reverts commit 35ae66e0a09ab70ed(block: Make rq_affinity = 1
work as expected). The purpose is to avoid an unnecessary IPI.
Let's take an example. My test box has cpu 0-7, one socket. Say request is
added from CPU 1, blk_complete_request() occurs at CPU 7. Without the reverted
patch, softirq will be done at CPU 7. With it, an IPI will be directed to CPU
0, and softirq will be done at CPU 0. In this case, doing softirq at CPU 0 and
CPU 7 have no difference from cache sharing point view and we can avoid an
ipi if doing it in CPU 7.
An immediate concern is this is just like QUEUE_FLAG_SAME_FORCE, but actually
not. blk_complete_request() is running in interrupt handler, and currently
I/O controller doesn't support multiple interrupts (I checked several LSI
cards and AHCI), so only one CPU can run blk_complete_request(). This is
still quite different as QUEUE_FLAG_SAME_FORCE.
Since only one CPU runs softirq, the only difference with below patch is
softirq not always runs at the first CPU of a group.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Namhyung Kim [Thu, 11 Aug 2011 08:36:05 +0000 (10:36 +0200)]
blktrace: add FLUSH/FUA support
Add FLUSH/FUA support to blktrace. As FLUSH precedes WRITE and/or
FUA follows WRITE, use the same 'F' flag for both cases and
distinguish them by their (relative) position. The end results
look like (other flags might be shown also):
Note that we reuse TC_BARRIER due to lack of bit space of act_mask
so that the older versions of blktrace tools will report flush
requests as barriers from now on.
Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Namhyung Kim <namhyung@gmail.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Jeff Moyer [Tue, 9 Aug 2011 18:32:09 +0000 (20:32 +0200)]
allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
blk_insert_flush has the following check:
/*
* If there's data but flush is not necessary, the request can be
* processed directly without going through flush machinery. Queue
* for normal execution.
*/
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
list_add_tail(&rq->queuelist, &q->queue_head);
return;
}
However, blk_flush_policy will not return with policy set to only
REQ_FSEQ_DATA:
static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
{
unsigned int policy = 0;
if (fflags & REQ_FLUSH) {
if (rq->cmd_flags & REQ_FLUSH)
policy |= REQ_FSEQ_PREFLUSH;
if (blk_rq_sectors(rq))
policy |= REQ_FSEQ_DATA;
if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
policy |= REQ_FSEQ_POSTFLUSH;
}
return policy;
}
Notice that REQ_FSEQ_DATA is only set if REQ_FLUSH is set. Fix this
mismatch by moving the setting of REQ_FSEQ_DATA outside of the REQ_FLUSH
check.
Tejun notes:
Hmmm... yes, this can become a correctness issue if (and only if)
blk_queue_flush() is called to change q->flush_flags while requests
are in-flight; otherwise, requests wouldn't reach the function at all.
Also, I think it would be a generally good idea to always set
FSEQ_DATA if the request has data.
Vivek Goyal [Fri, 5 Aug 2011 07:42:20 +0000 (09:42 +0200)]
cfq-iosched: Add documentation about idling
There are always questions about why CFQ is idling on various conditions.
Recent ones is Christoph asking again why to idle on REQ_NOIDLE. His
assertion is that XFS is relying more and more on workqueues and is
concerned that CFQ idling on IO from every workqueue will impact
XFS badly.
So he suggested that I add some more documentation about CFQ idling
and that can provide more clarity on the topic and also gives an
opprotunity to poke a hole in theory and lead to improvements.
So here is my attempt at that. Any comments are welcome.
Tao Ma [Fri, 5 Aug 2011 07:37:10 +0000 (09:37 +0200)]
block: Make rq_affinity = 1 work as expected
Commit 5757a6d76c introduced a new rq_affinity = 2 so as to make
the request completed in the __make_request cpu. But it makes the
old rq_affinity = 1 not work any more. The root cause is that
if the 'cpu' and 'req->cpu' is in the same group and cpu != req->cpu,
ccpu will be the same as group_cpu, so the completion will be
excuted in the 'cpu' not 'group_cpu'.
This patch fix problem by simpling removing group_cpu and the codes
are more explicit now. If ccpu == cpu, we complete in cpu, otherwise
we raise_blk_irq to ccpu.
Cc: Christoph Hellwig <hch@infradead.org> Cc: Roland Dreier <roland@purestorage.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Reviewed-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Herbert Poetzl [Tue, 2 Aug 2011 10:43:50 +0000 (12:43 +0200)]
block/genhd.c: remove useless cast in diskstats_show()
Remove the (unsigned long long) cast in diskstats_show() and adjusts the
seq_printf() format string to 'unsigned long'
diskstats_show() uses part_stat_read() to get the stats, which either
accesses the specified field in the struct disk_stats directly (non SMP)
or sums up the per CPU values in a variable of the same type as the field,
so in any case the result will have the same type and range as the
specified field which for all disk_stats entries is unsigned long
Also, for unsigned long ranges the output of %lu should be identical to
the one of %llu, so no change in the actual proc entry contents.
Signed-off-by: Herbert Poetzl <herbert@13thfloor.at> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
The buffer 'sc.cpu_mask' is a kernel buffer. If bitmap_parse is used
instead of __bitmap_parse the extra parameter that indicates a kernel
buffer is not needed.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Jens Axboe [Tue, 2 Aug 2011 08:43:35 +0000 (10:43 +0200)]
bsg-lib: add module.h include
Due to conflicts with the moduleh tree in linux-next, we
run into an include file mess. We really need export.h
in that tree, but if we add module.h locally then the
issue is easier to resolve.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Vivek Goyal [Tue, 2 Aug 2011 07:24:09 +0000 (09:24 +0200)]
cfq-iosched: Reduce linked group count upon group destruction
FQ keeps track of number of groups which are linked on blkcg->blkg_list.
This is useful to avoid races between queue exit and cgroup exit code
paths. So if at the request queue exit time linked group count is not
zero, that means there are some group out there which is yet to be
deleted under rcu read period and queue exit code should wait for
on rcu period.
In my previous patch I forgot to decrease the number of group count.
So in current form, we nr_blkcg_linked_grps is always non-zero and
we will always wait one rcu period (if BLK_CGROUP=y). The side effect
of this is that it can increase boot time. I am surprised, nobody
complained so far.
Kay Sievers [Sun, 31 Jul 2011 20:21:35 +0000 (22:21 +0200)]
loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
LOOP_CLR_FD takes lo->lo_ctl_mutex and tries to remove the loop sysfs
files. Sysfs calls show() and waits for lo->lo_ctl_mutex. LOOP_CLR_FD
waits for show() to finish to remove the sysfs file.
Kay Sievers [Sun, 31 Jul 2011 20:08:04 +0000 (22:08 +0200)]
loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
Instead of unconditionally creating a fixed number of dead loop
devices which need to be investigated by storage handling services,
even when they are never used, we allow distros start with 0
loop devices and have losetup(8) and similar switch to the dynamic
/dev/loop-control interface instead of searching /dev/loop%i for free
devices.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Kay Sievers [Sun, 31 Jul 2011 20:08:04 +0000 (22:08 +0200)]
loop: add management interface for on-demand device allocation
Loop devices today have a fixed pre-allocated number of usually 8.
The number can only be changed at module init time. To find a free
device to use, /dev/loop%i needs to be scanned, and all devices need
to be opened until a free one is possibly found.
This adds a new /dev/loop-control device node, that allows to
dynamically find or allocate a free device, and to add and remove loop
devices from the running system:
LOOP_CTL_ADD adds a specific device. Arg is the number
of the device. It returns the device i or a negative
error code.
LOOP_CTL_REMOVE removes a specific device, Arg is the
number the device. It returns the device i or a negative
error code.
LOOP_CTL_GET_FREE finds the next unbound device or allocates
a new one. No arg is given. It returns the device i or a
negative error code.
The loop kernel module gets automatically loaded when
/dev/loop-control is accessed the first time. The alias
specified in the module, instructs udev to create this
'dead' device node, even when the module is not loaded.
Example:
cfd = open("/dev/loop-control", O_RDWR);
# add a new specific loop device
err = ioctl(cfd, LOOP_CTL_ADD, devnr);
# remove a specific loop device
err = ioctl(cfd, LOOP_CTL_REMOVE, devnr);
# find or allocate a free loop device to use
devnr = ioctl(cfd, LOOP_CTL_GET_FREE);
Mike Christie [Sun, 31 Jul 2011 20:05:09 +0000 (22:05 +0200)]
block: add bsg helper library
This moves the FC classes bsg code to the block layer and
makes it a lib so that other classes like iscsi and SAS can use it.
It is helpful because working with the request queue, bios,
creating scatterlists, etc are a pain that the LLD does not
have to worry about with normal IOs and should not have to
worry about for bsg requests.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* git://git.infradead.org/battery-2.6:
gpio-charger: Fix checking return value of request_any_context_irq
power_supply: MAX17042: Support additional properties
max8903_charger: Allow platform data to be __initdata
power_supply: Add charger driver for MAX8998/LP3974
power_supply: Add charger driver for MAX8997/8966
max17042_battery: Remove obsolete cleanup for clientdata
twl4030_charger: Fix warnings
wm831x_power: Support multiple instances
wm831x_backup: Support multiple instances
apm_power: Fix style error in macros
s3c_adc_battery: Fix annotation for s3c_adc_battery_probe()
bq20z75: Enable detection after registering
bq20z75: Add support for external notification
* git://git.kernel.org/pub/scm/linux/kernel/git/brodo/cpupowerutils:
cpupower: Do detect IDA (opportunistic processor performance) via cpuid
cpupower: Show Intel turbo ratio support via ./cpupower frequency-info
cpupowerutils: increase MAX_LINE_LEN
cpupower: Rename package from cpupowerutils to cpupower
cpupowerutils: Rename: libcpufreq->libcpupower
cpupowerutils: use kernel version-derived version string
cpupowerutils: utils - ConfigStyle bugfixes
cpupowerutils: helpers - ConfigStyle bugfixes
cpupowerutils: idle_monitor - ConfigStyle bugfixes
cpupowerutils: lib - ConfigStyle bugfixes
cpupowerutils: bench - ConfigStyle bugfixes
cpupowerutils: do not update po files on each and every compile
cpupowerutils: remove ccdv, use kernel quiet/verbose mechanism
cpupowerutils: use COPYING, CREDITS from top-level directory
cpupowerutils - cpufrequtils extended with quite some features
* git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia-2.6:
smc91c92_cs.c: fix bogus compiler warning
orinoco_cs: be more careful when matching cards with ID 0x0156:0x0002
hostap_cs: support cards with "Version 01.02" as third product ID
pcmcia: add PCMCIA_DEVICE_MANF_CARD_PROD_ID3
pxa2xx pcmcia - stargate 2 use gpio array.
pcmcia: pxa2xx: remove empty socket_init / socket_resume functions.
drivers:pcmcia:soc_common: make socket_init and socket_suspend optional
Fred Isaman [Sun, 31 Jul 2011 00:52:54 +0000 (20:52 -0400)]
pnfsblock: bl_write_pagelist
Note: When upper layer's read/write request cannot be fulfilled, the block
layout driver shouldn't silently mark the page as error. It should do
what can be done and leave the rest to the upper layer. To do so, we
should set rdata/wdata->res.count properly.
When upper layer re-send the read/write request to finish the rest
part of the request, pgbase is the position where we should start at.
[pnfsblock: bl_write_pagelist support functions]
[pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.] Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new write_pagelist api] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu>
[SQUASHME: pnfsblock: mds_offset is set in the generic layer] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: mark IO error with NFS_LAYOUT_{RW|RO}_FAILED] Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: SQUASHME: adjust to API change] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fixup blksize alignment in bl_setup_layoutcommit] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.] Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new write_pagelist api] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fred Isaman [Sun, 31 Jul 2011 00:52:53 +0000 (20:52 -0400)]
pnfsblock: bl_read_pagelist
Note: When upper layer's read/write request cannot be fulfilled, the block
layout driver shouldn't silently mark the page as error. It should do
what can be done and leave the rest to the upper layer. To do so, we
should set rdata/wdata->res.count properly.
When upper layer re-send the read/write request to finish the rest
part of the request, pgbase is the position where we should start at.
[pnfsblock: mark IO error with NFS_LAYOUT_{RW|RO}_FAILED] Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: read path error handling] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.] Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new read_pagelist api] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fred Isaman [Sun, 31 Jul 2011 00:52:52 +0000 (20:52 -0400)]
pnfsblock: cleanup_layoutcommit
In blocklayout driver. There are two things happening
while layoutcommit/cleanup.
1. the modified extents are encoded.
2. On cleanup the extents are put back on the layout rw
extents list, for reads.
In the new system where actual xdr encoding is done in
encode_layoutcommit() directly into xdr buffer, these are
the new commit stages:
1. On setup_layoutcommit, the range is adjusted as before
and a structure is allocated for communication with
bl_encode_layoutcommit && bl_cleanup_layoutcommit
(Generic layer provides a void-star to hang it on)
2. bl_encode_layoutcommit is called to do the actual
encoding directly into xdr. The commit-extent-list is not
freed and is stored on above structure.
FIXME: The code is not yet converted to the new XDR cleanup
3. On cleanup the commit-extent-list is put back by a call
to set_to_rw() as before, but with no need for XDR decoding
of the list as before. And the commit-extent-list is freed.
Finally allocated structure is freed.
[rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()] Signed-off-by: Jim Rees <rees@umich.edu>
[pnfsblock: introduce bl_committing list] Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: SQUASHME: adjust to API change] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[blocklayout: encode_layoutcommit implementation] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfsblock: fix bug setting up layoutcommit.] Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
[pnfsblock: cleanup_layoutcommit wants a status parameter] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fred Isaman [Sun, 31 Jul 2011 00:52:51 +0000 (20:52 -0400)]
pnfsblock: encode_layoutcommit
In blocklayout driver. There are two things happening
while layoutcommit/cleanup.
1. the modified extents are encoded.
2. On cleanup the extents are put back on the layout rw
extents list, for reads.
In the new system where actual xdr encoding is done in
encode_layoutcommit() directly into xdr buffer, these are
the new commit stages:
1. On setup_layoutcommit, the range is adjusted as before
and a structure is allocated for communication with
bl_encode_layoutcommit && bl_cleanup_layoutcommit
(Generic layer provides a void-star to hang it on)
2. bl_encode_layoutcommit is called to do the actual
encoding directly into xdr. The commit-extent-list is not
freed and is stored on above structure.
FIXME: The code is not yet converted to the new XDR cleanup
3. On cleanup the commit-extent-list is put back by a call
to set_to_rw() as before, but with no need for XDR decoding
of the list as before. And the commit-extent-list is freed.
Finally allocated structure is freed.
[rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()]
[pnfsblock: get rid of deprecated xdr macros] Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[blocklayout: encode_layoutcommit implementation] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfsblock: fix bug setting up layoutcommit.] Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
[pnfsblock: prevent commit list corruption]
[pnfsblock: fix layoutcommit with an empty opaque] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fred Isaman [Sun, 31 Jul 2011 00:52:49 +0000 (20:52 -0400)]
pnfsblock: add extent manipulation functions
Adds working implementations of various support functions
to handle INVAL extents, needed by writes, such as
bl_mark_sectors_init and bl_is_sector_init.
Fred Isaman [Sun, 31 Jul 2011 00:52:48 +0000 (20:52 -0400)]
pnfsblock: bl_find_get_extent
Implement bl_find_get_extent(), one of the core extent manipulation
routines.
[pnfsblock: Lookup list entry of layouts and tags in reverse order] Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Jim Rees <rees@umich.edu>
pnfsblock: fix print format warnings for sector_t and size_t
gcc spews warnings about these on x86_64, e.g.:
fs/nfs/blocklayout/blocklayout.c:74: warning: format ‘%Lu’ expects type ‘long long unsigned int’, but argument 2 has type ‘sector_t’
fs/nfs/blocklayout/blocklayout.c:388: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
Fred Isaman [Sun, 31 Jul 2011 00:52:46 +0000 (20:52 -0400)]
pnfsblock: call and parse getdevicelist
Call GETDEVICELIST during mount, then call and parse GETDEVICEINFO
for each device returned.
[pnfsblock: get rid of deprecated xdr macros] Signed-off-by: Jim Rees <rees@umich.edu>
[pnfsblock: fix pnfs_deviceid references] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fix print format warnings for sector_t and size_t]
[pnfs-block: #include <linux/vmalloc.h>]
[pnfsblock: no PNFS_NFS_SERVER] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfsblock: fix bug determining size of striped volume]
[pnfsblock: fix oops when using multiple devices] Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: get rid of vmap and deviceid->area structure] Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Peng Tao [Sun, 31 Jul 2011 00:52:34 +0000 (20:52 -0400)]
pnfs: use lwb as layoutcommit length
Using NFS4_MAX_UINT64 will break current protocol.
[Needed in v3.0] CC: Stable Tree <stable@kernel.org> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Peng Tao [Sun, 31 Jul 2011 00:52:31 +0000 (20:52 -0400)]
pnfs: save layoutcommit lwb at layout header
No need to save it for every lseg.
[Needed in v3.0] CC: Stable Tree <stable@kernel.org> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
don't busy retry the inode on failed grab_super_passive()
This fixes a soft lockup on conditions
a) the flusher is working on a work by __bdi_start_writeback(), while
b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), which
grab sb->s_umount and enqueue a new work for the flusher to execute
The s_umount grabbed by (b) will fail the grab_super_passive() in (a).
Then if the inode is requeued, wb_writeback() will busy retry on it.
As a result, wb_writeback() loops for ever without releasing
wb->list_lock, which further blocks other tasks.
Fix the busy loop by redirtying the inode. This may undesirably delay
the writeback of the inode, however most likely it will be picked up
soon by the queued work by writeback_inodes_sb*(), sync_inodes_sb() or
even writeback_inodes_wb().
Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging
* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging: (24 commits)
hwmon: (lm90) Refactor reading of config2 register
hwmon: (lm90) Make SA56004 detection more robust
hwmon: (lm90) Simplify handling of extended local temp register
hwmon: (pmbus) Add client driver for LM25066, LM5064, and LM5066
hwmon: (max34440) Add support for peak attributes
hwmon: (max8688) Add support for peak attributes
hwmon: (max16064) Add support for peak attributes
hwmon: (adm1275) Add support for peak attributes
hwmon: (pmbus) Add support for peak attributes
hwmon: Add new attributes to sysfs ABI
hwmon: (pmbus) Strengthen check for status register existence
hwmon: (pmbus) Add support for virtual pages
hwmon: (pmbus) Support reading and writing of word registers in device specific code
hwmon: (pmbus) Increase attribute name size
hwmon: (pmbus) Add ADP4000, NCP4200 and NCP4208 to list of supported devices
hwmon: (pmbus) Add support for VID output voltage mode
hwmon: (pmbus) Move PMBus drivers to drivers/hwmon/pmbus
hwmon: (coretemp) Add core/pkg threshold support to Coretemp
hwmon: (lm95241) Add support for LM95231
hwmon: LM95245 driver
...
shm_lock() does a lookup of shm segment in shm_ids(ns).ipcs_idr, which
is redundant as we already know shmid_kernel address. An actual lock is
also not required for reads until we really want to destroy the segment.
exit_shm() and shm_destroy_orphaned() may avoid the loop by checking
whether there is at least one segment in current ipc_namespace.
The check of nsproxy and ipc_ns against NULL is redundant as exit_shm()
is called from do_exit() before the call to exit_notify(), so the
dereferencing current->nsproxy->ipc_ns is guaranteed to be safe.
shm_try_destroy_orphaned() and shm_try_destroy_current() didn't handle
the case of separate PID namespaces, but a single IPC namespace. If
there are tasks with the same PID values using the same shmem object,
the wrong destroy decision could be reached.
On shm segment creation store the pointer to the creator task in
shmid_kernel->shm_creator field and zero it on task exit. Then
use the ->shm_creator insread of shm_cprid in both functions. As
shmid_kernel object is already locked at this stage, no additional
locking is needed.
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (71 commits)
[SCSI] fcoe: cleanup cpu selection for incoming requests
[SCSI] fcoe: add fip retry to avoid missing critical keep alive
[SCSI] libfc: fix warn on in lport retry
[SCSI] libfc: Remove the reference to FCP packet from scsi_cmnd in case of error
[SCSI] libfc: cleanup sending SRR request
[SCSI] libfc: two minor changes in comments
[SCSI] libfc, fcoe: ignore rx frame with wrong xid info
[SCSI] libfc: release exchg cache
[SCSI] libfc: use FC_MAX_ERROR_CNT
[SCSI] fcoe: remove unused ptype field in fcoe_rcv_info
[SCSI] bnx2fc: Update copyright and bump version to 1.0.4
[SCSI] bnx2fc: Tx BDs cache in write tasks
[SCSI] bnx2fc: Do not arm CQ when there are no CQEs
[SCSI] bnx2fc: hold tgt lock when calling cmd_release
[SCSI] bnx2fc: Enable support for sequence level error recovery
[SCSI] bnx2fc: HSI changes for tape
[SCSI] bnx2fc: Handle REC_TOV error code from firmware
[SCSI] bnx2fc: REC/SRR link service request and response handling
[SCSI] bnx2fc: Support 'sequence cleanup' task
[SCSI] dh_rdac: Associate HBA and storage in rdac_controller to support partitions in storage
...
If the directory contents change, then we have to accept that the
file->f_pos value may shrink if we do a 'search-by-cookie'. In that
case, we should turn off the loop detection and let the NFS client
try to recover.
The patch also fixes a second loop detection bug by ensuring
that after turning on the ctx->duped flag, we read at least one new
cookie into ctx->dir_cookie before attempting to match with
ctx->dup_cookie.
Merge branch 'slub/lockless' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'slub/lockless' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: (21 commits)
slub: When allocating a new slab also prep the first object
slub: disable interrupts in cmpxchg_double_slab when falling back to pagelock
Avoid duplicate _count variables in page_struct
Revert "SLUB: Fix build breakage in linux/mm_types.h"
SLUB: Fix build breakage in linux/mm_types.h
slub: slabinfo update for cmpxchg handling
slub: Not necessary to check for empty slab on load_freelist
slub: fast release on full slab
slub: Add statistics for the case that the current slab does not match the node
slub: Get rid of the another_slab label
slub: Avoid disabling interrupts in free slowpath
slub: Disable interrupts in free_debug processing
slub: Invert locking and avoid slab lock
slub: Rework allocator fastpaths
slub: Pass kmem_cache struct to lock and freeze slab
slub: explicit list_lock taking
slub: Add cmpxchg_double_slab()
mm: Rearrange struct page
slub: Move page->frozen handling near where the page->freelist handling occurs
slub: Do not use frozen page flag but a bit in the page counters
...
Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6
* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6: (25 commits)
kconfig: Introduce IS_ENABLED(), IS_BUILTIN() and IS_MODULE()
xconfig: Abort close if configuration cannot be saved
kconfig: fix missing "0x" prefix from S_HEX symbol in autoconf.h
kconfig/nconf: remove useless conditionnal
kconfig/nconf: prevent segfault on empty menu
kconfig/nconf: use the generic menu_get_ext_help()
nconfig: Avoid Wunused-but-set warning
kconfig/conf: mark xfgets() private
kconfig: remove pending prototypes for kconfig_load()
kconfig/conf: add command line options' description
kconfig/conf: reduce the scope of `defconfig_file'
kconfig: use calloc() for expr allocation
kconfig: introduce specialized printer
kconfig: do not overwrite symbol direct dependency in assignment
kconfig/gconf: silent missing prototype warnings
kconfig/gconf: kill deadcode
kconfig: nuke LKC_DIRECT_LINK cruft
kconfig: nuke reference to SWIG
kconfig: add missing <stdlib.h> inclusion
kconfig: add missing <ctype.h> inclusion
...
Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6: (430 commits)
[media] ir-mce_kbd-decoder: include module.h for its facilities
[media] ov5642: include module.h for its facilities
[media] em28xx: Fix DVB-C maxsize for em2884
[media] tda18271c2dd: Fix saw filter configuration for DVB-C @6MHz
[media] v4l: mt9v032: Fix Bayer pattern
[media] V4L: mt9m111: rewrite set_pixfmt
[media] V4L: mt9m111: fix missing return value check mt9m111_reg_clear
[media] V4L: initial driver for ov5642 CMOS sensor
[media] V4L: sh_mobile_ceu_camera: fix Oops when USERPTR mapping fails
[media] V4L: soc-camera: remove soc-camera bus and devices on it
[media] V4L: soc-camera: un-export the soc-camera bus
[media] V4L: sh_mobile_csi2: switch away from using the soc-camera bus notifier
[media] V4L: add media bus configuration subdev operations
[media] V4L: soc-camera: group struct field initialisations together
[media] V4L: soc-camera: remove now unused soc-camera specific PM hooks
[media] V4L: pxa-camera: switch to using standard PM hooks
[media] NetUP Dual DVB-T/C CI RF: force card hardware revision by module param
[media] Don't OOPS if videobuf_dvb_get_frontend return NULL
[media] NetUP Dual DVB-T/C CI RF: load firmware according card revision
[media] omap3isp: Support configurable HS/VS polarities
...
Fix up conflicts:
- arch/arm/mach-omap2/board-rx51-peripherals.c:
cleanup regulator supply definitions in mach-omap2
vs
OMAP3: RX-51: define vdds_csib regulator supply
- drivers/staging/tm6000/tm6000-alsa.c (trivial)
Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
ecryptfs: Make inode bdi consistent with superblock bdi
eCryptfs: Unlock keys needed by ecryptfsd
James Bottomley [Fri, 29 Jul 2011 13:11:32 +0000 (17:11 +0400)]
ramoops: fix compile failure on parisc
Fixes this:
drivers/char/ramoops.c: In function 'ramoops_init':
drivers/char/ramoops.c:221: error: implicit declaration of function 'IS_ERR'
drivers/char/ramoops.c:222: error: implicit declaration of function 'PTR_ERR'
If it actually builds on other platforms, it's probably getting
linux/err.h via some other #include.
Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
PCI: remove printks about disabled bridge windows
PCI: fold pci_calc_resource_flags() into decode_bar()
PCI: treat mem BAR type "11" (reserved) as 32-bit, not 64-bit, BAR
PCI: correct pcie_set_readrq write size
PCI: pciehp: change wait time for valid configuration access
x86/PCI: Preserve existing pci=bfsort whitelist for Dell systems
PCI: ARI is a PCIe v2 feature
x86/PCI: quirks: Use pci_dev->revision
PCI: Make the struct pci_dev * argument of pci_fixup_irqs const.
PCI hotplug: cpqphp: use pci_dev->vendor
PCI hotplug: cpqphp: use pci_dev->subsystem_{vendor|device}
x86/PCI: config space accessor functions should not ignore the segment argument
PCI: Assign values to 'pci_obff_signal_type' enumeration constants
x86/PCI: reduce severity of host bridge window conflict warnings
PCI: enumerate the PCI device only removed out PCI hieratchy of OS when re-scanning PCI
PCI: PCIe AER: add aer_recover_queue
x86/PCI: select direct access mode for mmconfig option
PCI hotplug: Rename is_ejectable which also exists in dock.c
Merge branch 'at91/cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc
* 'at91/cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc:
at91: add arch specific ioremap support
at91: factorize sram init
at91: move register clocks to soc generic init
at91: move clock subsystem init to soc generic init
at91: use structure to store the current soc
at91: remove AT91_DBGU offset from dbgu register macro
at91: factorize at91 interrupts init to soc
at91: introduce commom AT91_BASE_SYS
Merge branch 'next/dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc
* 'next/dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc: (21 commits)
arm/dt: tegra devicetree support
arm/versatile: Add device tree support
dt/irq: add irq_domain_generate_simple() helper
irq: add irq_domain translation infrastructure
dmaengine: imx-sdma: add device tree probe support
dmaengine: imx-sdma: sdma_get_firmware does not need to copy fw_name
dmaengine: imx-sdma: use platform_device_id to identify sdma version
mmc: sdhci-esdhc-imx: add device tree probe support
mmc: sdhci-pltfm: dt device does not pass parent to sdhci_alloc_host
mmc: sdhci-esdhc-imx: get rid of the uses of cpu_is_mx()
mmc: sdhci-esdhc-imx: do not reference platform data after probe
mmc: sdhci-esdhc-imx: extend card_detect and write_protect support for mx5
net/fec: add device tree probe support
net: ibm_newemac: convert it to use of_get_phy_mode
dt/net: add helper function of_get_phy_mode
net/fec: gasket needs to be enabled for some i.mx
serial/imx: add device tree probe support
serial/imx: get rid of the uses of cpu_is_mx1()
arm/dt: Add dtb make rule
arm/dt: Add skeleton dtsi file
...
When commit 4e34e719e457 ("fs: take the ACL checks to common code")
changed the xyz_check_acl() functions into the more natural
xyz_get_acl() interface, we grew two copies of the
#define ext2_get_acl NULL
define for the non-acl case.
Remove the extra one.
Reported-by: Marco Stornelli <marco.stornelli@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Marek [Wed, 20 Jul 2011 15:38:57 +0000 (17:38 +0200)]
kconfig: Introduce IS_ENABLED(), IS_BUILTIN() and IS_MODULE()
Replace the config_is_*() macros with a variant that allows for grepping
for usage of CONFIG_* options in the code. Usage:
if (IS_ENABLED(CONFIG_NUMA))
or
#if IS_ENABLED(CONFIG_NUMA)
The IS_ENABLED() macro evaluates to 1 if the argument is set (to either 'y'
or 'm'), IS_BUILTIN() tests if the option is 'y' and IS_MODULE() test if
the option is 'm'. Only boolean and tristate options are supported.
Reviewed-by: Arnaud Lacombe <lacombar@gmail.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Michal Marek <mmarek@suse.cz>
Thomas Renninger [Thu, 21 Jul 2011 09:54:54 +0000 (11:54 +0200)]
cpupower: Do detect IDA (opportunistic processor performance) via cpuid
IA32-Intel Devel guide Volume 3A - 14.3.2.1
-------------------------------------------
...
Opportunistic processor performance operation can be disabled by setting bit 38 of
IA32_MISC_ENABLES. This mechanism is intended for BIOS only. If
IA32_MISC_ENABLES[38] is set, CPUID.06H:EAX[1] will return 0.
Better detect things via cpuid, this cleans up the code a bit
and the MSR parts were not working correctly anyway.
Thomas Renninger [Thu, 21 Jul 2011 09:54:53 +0000 (11:54 +0200)]
cpupower: Show Intel turbo ratio support via ./cpupower frequency-info
This adds the last piece missing from turbostat (if called with -v).
It shows on Intel machines supporting Turbo Boost how many cores
have to be active/idle to enter which boost mode (frequency).
Whether the HW really enters these boost modes can be verified via
./cpupower monitor.
cpupowerutils - cpufrequtils extended with quite some features
CPU power consumption vs performance tuning is no longer
limited to CPU frequency switching anymore: deep sleep states,
traditional dynamic frequency scaling and hidden turbo/boost
frequencies are tied close together and depend on each other.
The first two exist on different architectures like PPC, Itanium and
ARM, the latter (so far) only on X86. On X86 the APU (CPU+GPU) will
only run most efficiently if CPU and GPU has proper power management
in place.
Users and Developers want to have *one* tool to get an overview what
their system supports and to monitor and debug CPU power management
in detail. The tool should compile and work on as many architectures
as possible.
Once this tool stabilizes a bit, it is intended to replace the
Intel-specific tools in tools/power/x86
CC [M] drivers/net/pcmcia/smc91c92_cs.o
drivers/net/pcmcia/smc91c92_cs.c: In function ‘smc91c92_probe’:
drivers/net/pcmcia/smc91c92_cs.c:812:12: warning: ‘j’ may be used uninitialized in this function
However, "j" is only used in a branch which has the same condition as
a previous branch, where j is set, e.g.
int j;
if (CONDITION)
j = VALUE
...
if (CONDITION)
printk(j)
Still, avoid this warning, as it is easy to circumvent.
Pavel Roskin [Tue, 26 Jul 2011 22:52:41 +0000 (18:52 -0400)]
hostap_cs: support cards with "Version 01.02" as third product ID
Cards with numeric ID 0x0156:0x0002 and third ID "Version 01.02" can be
assumed to have Intersil firmware. Cards with Agere firmware use
"Version 01.01".
Signed-off-by: Pavel Roskin <proski@gnu.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Pavel Roskin [Tue, 26 Jul 2011 22:52:35 +0000 (18:52 -0400)]
pcmcia: add PCMCIA_DEVICE_MANF_CARD_PROD_ID3
This is needed to match wireless cards with Intersil firmware that have
ID 0x0156:0x0002 and the third ID "Version 01.02". Such cards are
currently matched by orinoco_cs, which doesn't support WPA. They should
be matched by hostap_cs.
The first and the second product ID vary widely, so there are few users
with some particular IDs. Of those, very few can submit a patch for
hostap_cs or write a useful bugreport. It's still important to support
their hardware properly.
With PCMCIA_DEVICE_MANF_CARD_PROD_ID3, it should be possible to cover
the remaining Intersil based designs that kept the numeric ID and the
"version" of the reference design.
Signed-off-by: Pavel Roskin <proski@gnu.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>