]> git.karo-electronics.de Git - karo-tx-linux.git/log
karo-tx-linux.git
9 years agodirect_IO: remove rw from a_ops->direct_IO()
Omar Sandoval [Mon, 16 Mar 2015 11:33:53 +0000 (04:33 -0700)]
direct_IO: remove rw from a_ops->direct_IO()

Now that no one is using rw, remove it completely.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agodirect_IO: use iov_iter_rw() instead of rw everywhere
Omar Sandoval [Mon, 16 Mar 2015 11:33:52 +0000 (04:33 -0700)]
direct_IO: use iov_iter_rw() instead of rw everywhere

The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoRemove rw from dax_{do_,}io()
Omar Sandoval [Mon, 16 Mar 2015 11:33:51 +0000 (04:33 -0700)]
Remove rw from dax_{do_,}io()

And use iov_iter_rw() instead.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoRemove rw from {,__,do_}blockdev_direct_IO()
Omar Sandoval [Mon, 16 Mar 2015 11:33:50 +0000 (04:33 -0700)]
Remove rw from {,__,do_}blockdev_direct_IO()

Most filesystems call through to these at some point, so we'll start
here.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonew helper: iov_iter_rw()
Omar Sandoval [Tue, 17 Mar 2015 21:04:02 +0000 (14:04 -0700)]
new helper: iov_iter_rw()

Get either READ or WRITE out of iter->type.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago->aio_read and ->aio_write removed
Al Viro [Sat, 4 Apr 2015 05:14:53 +0000 (01:14 -0400)]
->aio_read and ->aio_write removed

no remaining users

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agopcm: another weird API abuse
Al Viro [Sat, 4 Apr 2015 04:19:32 +0000 (00:19 -0400)]
pcm: another weird API abuse

readv() and writev() should _not_ ignore all but the first ->iov_len,
among other things.  Really weird abuse of those syscalls - it
expects a vector element per channel, with identical lengths (it
actually assumes them to be identical - no checking is done).
readv() and writev() are really bad match for that.  Unfortunately,
userland API is userland API and we can't do anything about them.

Converted to ->read_iter/->write_iter.  Please, _please_ don't do
anything of that kind when designing new interfaces.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoinfinibad: weird APIs switched to ->write_iter()
Al Viro [Sat, 4 Apr 2015 04:11:32 +0000 (00:11 -0400)]
infinibad: weird APIs switched to ->write_iter()

Things Not To Do When Writing A Driver, part 1001st:
have writev() and write() on the same file doing completely
different things.  As in, "interpret very different sets of
commands".

We _can_ handle that, but it's a bloody bad idea.
Don't do that in new drivers.  Ever.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agokill do_sync_read/do_sync_write
Al Viro [Sat, 4 Apr 2015 02:10:20 +0000 (22:10 -0400)]
kill do_sync_read/do_sync_write

all remaining instances of aio_{read,write} (all 4 of them) have explicit
->read and ->write resp.; do_sync_read/do_sync_write is never called by
__vfs_read/__vfs_write anymore and no other users had been left.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agofuse: use iov_iter_get_pages() for non-splice path
Al Viro [Sat, 4 Apr 2015 02:06:08 +0000 (22:06 -0400)]
fuse: use iov_iter_get_pages() for non-splice path

store reference to iter instead of that to iovec

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agofuse: switch to ->read_iter/->write_iter
Al Viro [Sat, 4 Apr 2015 01:53:39 +0000 (21:53 -0400)]
fuse: switch to ->read_iter/->write_iter

we just change the calling conventions here; more work to follow.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoswitch drivers/char/mem.c to ->read_iter/->write_iter
Al Viro [Fri, 3 Apr 2015 19:57:04 +0000 (15:57 -0400)]
switch drivers/char/mem.c to ->read_iter/->write_iter

Note that _these_ guys have ->read() and ->write() left in place - they are
eqiuvalent to what we'd get if we replaced those with NULL, but we are
talking about hot paths here.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agomake new_sync_{read,write}() static
Al Viro [Fri, 3 Apr 2015 19:41:18 +0000 (15:41 -0400)]
make new_sync_{read,write}() static

All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agocoredump: accept any write method
Al Viro [Fri, 3 Apr 2015 19:23:17 +0000 (15:23 -0400)]
coredump: accept any write method

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoswitch /dev/loop to vfs_iter_write()
Al Viro [Fri, 3 Apr 2015 19:21:59 +0000 (15:21 -0400)]
switch /dev/loop to vfs_iter_write()

all writable files that might be used as backing store for /dev/loop
already support ->write_iter()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoserial2002: switch to __vfs_read/__vfs_write
Al Viro [Fri, 3 Apr 2015 19:14:42 +0000 (15:14 -0400)]
serial2002: switch to __vfs_read/__vfs_write

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoashmem: use __vfs_read()
Al Viro [Fri, 3 Apr 2015 19:09:38 +0000 (15:09 -0400)]
ashmem: use __vfs_read()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoexport __vfs_read()
Al Viro [Fri, 3 Apr 2015 19:09:18 +0000 (15:09 -0400)]
export __vfs_read()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoautofs: switch to __vfs_write()
Al Viro [Fri, 3 Apr 2015 19:07:48 +0000 (15:07 -0400)]
autofs: switch to __vfs_write()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonew helper: __vfs_write()
Al Viro [Fri, 3 Apr 2015 19:06:43 +0000 (15:06 -0400)]
new helper: __vfs_write()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoMerge branch '9p-iov_iter' into for-next
Al Viro [Sat, 4 Apr 2015 05:16:54 +0000 (01:16 -0400)]
Merge branch '9p-iov_iter' into for-next

9 years agoswitch hugetlbfs to ->read_iter()
Al Viro [Fri, 3 Apr 2015 15:31:35 +0000 (11:31 -0400)]
switch hugetlbfs to ->read_iter()

... and fix the case when the area we are asked to read crosses
a hugepage boundary

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agocoda: switch to ->read_iter/->write_iter
Al Viro [Fri, 3 Apr 2015 14:58:11 +0000 (10:58 -0400)]
coda: switch to ->read_iter/->write_iter

... and request the same from the local cache - all filesystems with
anything usable for that support those already.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoncpfs: switch to ->read_iter/->write_iter
Al Viro [Fri, 3 Apr 2015 03:30:18 +0000 (23:30 -0400)]
ncpfs: switch to ->read_iter/->write_iter

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonet/9p: remove (now-)unused helpers
Al Viro [Fri, 3 Apr 2015 03:11:36 +0000 (23:11 -0400)]
net/9p: remove (now-)unused helpers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agop9_client_attach(): set fid->uid correctly
Al Viro [Fri, 3 Apr 2015 01:47:49 +0000 (21:47 -0400)]
p9_client_attach(): set fid->uid correctly

it's almost always equal to current_fsuid(), but there's an exception -
if the first writeback fid is opened by non-root *and* that happens before
root has done any lookups in /, we end up doing attach for root.  The
current code leaves the resulting FID owned by root from the server POV
and by non-root from the client one.  Unfortunately, it means that e.g.
massive dcache eviction will leave that user buggered - they'll end
up redoing walks from / *and* picking that FID every time.  As soon as
they try to create something, the things will get nasty.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: we are leaking glock.client_id in v9fs_file_getlock()
Al Viro [Thu, 2 Apr 2015 16:02:03 +0000 (12:02 -0400)]
9p: we are leaking glock.client_id in v9fs_file_getlock()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: switch to ->read_iter/->write_iter
Al Viro [Thu, 2 Apr 2015 03:59:57 +0000 (23:59 -0400)]
9p: switch to ->read_iter/->write_iter

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: get rid of v9fs_direct_file_read()
Al Viro [Thu, 2 Apr 2015 03:49:24 +0000 (23:49 -0400)]
9p: get rid of v9fs_direct_file_read()

do it in ->direct_IO()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: switch p9_client_read() to passing struct iov_iter *
Al Viro [Thu, 2 Apr 2015 03:42:28 +0000 (23:42 -0400)]
9p: switch p9_client_read() to passing struct iov_iter *

... and make it loop

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: get rid of v9fs_direct_file_write()
Al Viro [Thu, 2 Apr 2015 02:32:23 +0000 (22:32 -0400)]
9p: get rid of v9fs_direct_file_write()

just handle it in ->direct_IO()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: fold v9fs_file_write_internal() into the caller
Al Viro [Thu, 2 Apr 2015 02:04:46 +0000 (22:04 -0400)]
9p: fold v9fs_file_write_internal() into the caller

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: switch ->writepage() to direct use of p9_client_write()
Al Viro [Thu, 2 Apr 2015 01:54:42 +0000 (21:54 -0400)]
9p: switch ->writepage() to direct use of p9_client_write()

Don't mess with kmap() - just use ITER_BVEC.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years ago9p: switch p9_client_write() to passing it struct iov_iter *
Al Viro [Thu, 2 Apr 2015 00:17:51 +0000 (20:17 -0400)]
9p: switch p9_client_write() to passing it struct iov_iter *

... and make it loop until it's done

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonet/9p: switch the guts of p9_client_{read,write}() to iov_iter
Al Viro [Wed, 1 Apr 2015 23:57:53 +0000 (19:57 -0400)]
net/9p: switch the guts of p9_client_{read,write}() to iov_iter

... and have get_user_pages_fast() mapping fewer pages than requested
to generate a short read/write.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonommu: use __vfs_read()
Al Viro [Tue, 31 Mar 2015 16:35:13 +0000 (12:35 -0400)]
nommu: use __vfs_read()

... instead of open-coding the call of ->read()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoacct: check FMODE_CAN_WRITE
Al Viro [Tue, 31 Mar 2015 16:30:48 +0000 (12:30 -0400)]
acct: check FMODE_CAN_WRITE

it's not calling ->write() directly anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoaio_run_iocb(): kill dead check
Al Viro [Tue, 31 Mar 2015 15:54:59 +0000 (11:54 -0400)]
aio_run_iocb(): kill dead check

We check if ->ki_pos is positive.  However, by that point we have
already done rw_verify_area(), which would have rejected such
unless the file had been one of /dev/mem, /dev/kmem and /proc/kcore.
All of which do not have vectored rw methods, so we would've bailed
out even earlier.

This check had been introduced before rw_verify_area() had been added there
- in fact, it was a subset of checks done on sync paths by rw_verify_area()
(back then the /dev/mem exception didn't exist at all).  The rest of checks
(mandatory locking, etc.) hadn't been added until later.  Unfortunately,
by the time the call of rw_verify_area() got added, the /dev/mem exception
had already appeared, so it wasn't obvious that the older explicit check
downstream had become dead code.  It *is* a dead code, though, since the few
files for which the exception applies do not have ->aio_{read,write}() or
->{read,write}_iter() and for them we won't reach that check anyway.

What's more, even if we ever introduce vectored methods for /dev/mem
and friends, they'll have to cope with negative positions anyway, since
readv(2) and writev(2) are using the same checks as read(2) and write(2) -
i.e. rw_verify_area().

Let's bury it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoioctx_alloc(): remove pointless check
Al Viro [Tue, 31 Mar 2015 15:43:52 +0000 (11:43 -0400)]
ioctx_alloc(): remove pointless check

Way, way back kiocb used to be picked from arrays, so ioctx_alloc()
checked for multiplication overflow when calculating the size of
such array.  By the time fs/aio.c went into the tree (in 2002) they
were already allocated one-by-one by kmem_cache_alloc(), so that
check had already become pointless.  Let's bury it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agolustre: kill unused members of struct vvp_thread_info
Al Viro [Tue, 31 Mar 2015 03:39:16 +0000 (23:39 -0400)]
lustre: kill unused members of struct vvp_thread_info

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoexpand __fuse_direct_write() in both callers
Al Viro [Tue, 31 Mar 2015 02:15:58 +0000 (22:15 -0400)]
expand __fuse_direct_write() in both callers

it's actually shorter that way *and* later we'll want iocb in scope
of generic_write_check() caller.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agofuse: switch fuse_direct_io_file_operations to ->{read,write}_iter()
Al Viro [Tue, 31 Mar 2015 02:08:36 +0000 (22:08 -0400)]
fuse: switch fuse_direct_io_file_operations to ->{read,write}_iter()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agocuse: switch to iov_iter
Al Viro [Sat, 21 Mar 2015 13:01:45 +0000 (09:01 -0400)]
cuse: switch to iov_iter

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoMerge branch 'for-davem' into for-next
Al Viro [Mon, 30 Mar 2015 15:34:16 +0000 (11:34 -0400)]
Merge branch 'for-davem' into for-next

9 years agoswitch kernel_sendmsg() and kernel_recvmsg() to iov_iter_kvec()
Al Viro [Sat, 21 Mar 2015 23:56:16 +0000 (19:56 -0400)]
switch kernel_sendmsg() and kernel_recvmsg() to iov_iter_kvec()

For kernel_sendmsg() that eliminates the need to play with setfs();
for kernel_recvmsg() it does *not* - a couple of callers are using
it with non-NULL ->msg_control, which would be treated as userland
address on recvmsg side of things.

In all cases we are really setting a kvec-backed iov_iter, though.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonet: switch importing msghdr from userland to {compat_,}import_iovec()
Al Viro [Sat, 21 Mar 2015 23:29:06 +0000 (19:29 -0400)]
net: switch importing msghdr from userland to {compat_,}import_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agonet: switch sendto() and recvfrom() to import_single_range()
Al Viro [Sat, 21 Mar 2015 23:12:32 +0000 (19:12 -0400)]
net: switch sendto() and recvfrom() to import_single_range()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoMerge branch 'iov_iter' into for-davem
Al Viro [Mon, 30 Mar 2015 15:28:21 +0000 (11:28 -0400)]
Merge branch 'iov_iter' into for-davem

9 years agosg_start_req(): use import_iovec()
Al Viro [Sun, 22 Mar 2015 00:25:30 +0000 (20:25 -0400)]
sg_start_req(): use import_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agosg_start_req(): make sure that there's not too many elements in iovec
Al Viro [Sun, 22 Mar 2015 00:08:18 +0000 (20:08 -0400)]
sg_start_req(): make sure that there's not too many elements in iovec

unfortunately, allowing an arbitrary 16bit value means a possibility of
overflow in the calculation of total number of pages in bio_map_user_iov() -
we rely on there being no more than PAGE_SIZE members of sum in the
first loop there.  If that sum wraps around, we end up allocating
too small array of pointers to pages and it's easy to overflow it in
the second loop.

X-Coverup: TINC (and there's no lumber cartel either)
Cc: stable@vger.kernel.org # way, way back
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoblk_rq_map_user(): use import_single_range()
Al Viro [Sun, 22 Mar 2015 00:06:04 +0000 (20:06 -0400)]
blk_rq_map_user(): use import_single_range()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agosg_io(): use import_iovec()
Al Viro [Sun, 22 Mar 2015 00:02:55 +0000 (20:02 -0400)]
sg_io(): use import_iovec()

... and don't skip access_ok() validation.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoprocess_vm_access: switch to {compat_,}import_iovec()
Al Viro [Sat, 21 Mar 2015 18:47:11 +0000 (14:47 -0400)]
process_vm_access: switch to {compat_,}import_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoswitch keyctl_instantiate_key_common() to iov_iter
Al Viro [Tue, 17 Mar 2015 13:59:38 +0000 (09:59 -0400)]
switch keyctl_instantiate_key_common() to iov_iter

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoswitch {compat_,}do_readv_writev() to {compat_,}import_iovec()
Al Viro [Sat, 21 Mar 2015 23:40:11 +0000 (19:40 -0400)]
switch {compat_,}do_readv_writev() to {compat_,}import_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoaio_setup_vectored_rw(): switch to {compat_,}import_iovec()
Al Viro [Sat, 21 Mar 2015 23:34:53 +0000 (19:34 -0400)]
aio_setup_vectored_rw(): switch to {compat_,}import_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agovmsplice_to_user(): switch to import_iovec()
Al Viro [Sat, 21 Mar 2015 23:17:55 +0000 (19:17 -0400)]
vmsplice_to_user(): switch to import_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agokill aio_setup_single_vector()
Al Viro [Sat, 21 Mar 2015 23:11:55 +0000 (19:11 -0400)]
kill aio_setup_single_vector()

identical to import_single_range()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoMerge branch 'iov_iter' into for-next
Al Viro [Mon, 30 Mar 2015 15:22:35 +0000 (11:22 -0400)]
Merge branch 'iov_iter' into for-next

9 years agoaio: simplify arguments of aio_setup_..._rw()
Al Viro [Sat, 21 Mar 2015 00:40:18 +0000 (20:40 -0400)]
aio: simplify arguments of aio_setup_..._rw()

We don't need req in either of those.  We don't need nr_segs in caller.
We don't really need len in caller either - iov_iter_count(&iter) will do.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoaio: lift iov_iter_init() into aio_setup_..._rw()
Al Viro [Sat, 21 Mar 2015 00:17:32 +0000 (20:17 -0400)]
aio: lift iov_iter_init() into aio_setup_..._rw()

the only non-trivial detail is that we do it before rw_verify_area(),
so we'd better cap the length ourselves in aio_setup_single_rw()
case (for vectored case rw_copy_check_uvector() will do that for us).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agolift iov_iter into {compat_,}do_readv_writev()
Al Viro [Sat, 21 Mar 2015 00:10:21 +0000 (20:10 -0400)]
lift iov_iter into {compat_,}do_readv_writev()

get it closer to matching {compat_,}rw_copy_check_uvector().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agoMerge branch 'iocb' into for-next
Al Viro [Mon, 30 Mar 2015 15:20:14 +0000 (11:20 -0400)]
Merge branch 'iocb' into for-next

9 years agoMerge remote-tracking branch 'net/master' into for-davem
Al Viro [Mon, 30 Mar 2015 15:12:54 +0000 (11:12 -0400)]
Merge remote-tracking branch 'net/master' into for-davem

trivial conflict in net/socket.c and non-trivial one in crypto -
that one had evaded aio_complete() removal.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agosaner iov_iter initialization primitives
Al Viro [Sat, 21 Mar 2015 21:45:43 +0000 (17:45 -0400)]
saner iov_iter initialization primitives

iovec-backed iov_iter instances are assumed to satisfy several properties:
* no more than UIO_MAXIOV elements in iovec array
* total size of all ranges is no more than MAX_RW_COUNT
* all ranges pass access_ok().

The problem is, invariants of data structures should be established in the
primitives creating those data structures, not in the code using those
primitives.  And iov_iter_init() violates that principle.  For a while we
managed to get away with that, but once the use of iov_iter started to
spread, it didn't take long for shit to hit the fan - missed check in
sys_sendto() had introduced a roothole.

We _do_ have primitives for importing and validating iovecs (both native and
compat ones) and those primitives are almost always followed by shoving the
resulting iovec into iov_iter.  Life would be considerably simpler (and safer)
if we combined those primitives with initializing iov_iter.

That gives us two new primitives - import_iovec() and compat_import_iovec().
Calling conventions:
iovec = iov_array;
err = import_iovec(direction, uvec, nr_segs,
   ARRAY_SIZE(iov_array), &iovec,
   &iter);
imports user vector into kernel space (into iov_array if it fits, allocated
if it doesn't fit or if iovec was NULL), validates it and sets iter up to
refer to it.  On success 0 is returned and allocated kernel copy (or NULL
if the array had fit into caller-supplied one) is returned via iovec.
On failure all allocations are undone and -E... is returned.  If the total
size of ranges exceeds MAX_RW_COUNT, the excess is silently truncated.

compat_import_iovec() expects uvec to be a pointer to user array of compat_iovec;
otherwise it's identical to import_iovec().

Finally, import_single_range() sets iov_iter backed by single-element iovec
covering a user-supplied range -

err = import_single_range(direction, address, size, iovec, &iter);

does validation and sets iter up.  Again, size in excess of MAX_RW_COUNT gets
silently truncated.

Next commits will be switching the things up to use of those and reducing
the amount of iov_iter_init() instances.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 years agotipc: fix two bugs in secondary destination lookup
Jon Paul Maloy [Fri, 27 Mar 2015 14:19:19 +0000 (10:19 -0400)]
tipc: fix two bugs in secondary destination lookup

A message sent to a node after a successful name table lookup may still
find that the destination socket has disappeared, because distribution
of name table updates is non-atomic. If so, the message will be rejected
back to the sender with error code TIPC_ERR_NO_PORT. If the source
socket of the message has disappeared in the meantime, the message
should be dropped.

However, in the currrent code, the message will instead be subject to an
unwanted tertiary lookup, because the function tipc_msg_lookup_dest()
doesn't check if there is an error code present in the message before
performing the lookup. In the worst case, the message may now find the
old destination again, and be redirected once more, instead of being
dropped directly as it should be.

A second bug in this function is that the "prev_node" field in the message
is not updated after successful lookup, something that may have
unpredictable consequences.

The problems arising from those bugs occur very infrequently.

The third change in this function; the test on msg_reroute_msg_cnt() is
purely cosmetic, reflecting that the returned value never can be negative.

This commit corrects the two bugs described above.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next...
David S. Miller [Sun, 29 Mar 2015 20:38:08 +0000 (13:38 -0700)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2015-03-27

This series contains updates to i40e and i40evf.

Jesse adds new device IDs to handle the new 20G speed for KR2.

Mitch provides a fix for an issue that shows up as a panic or memory
corruption when the device is brought down while under heavy stress.
This is resolved by delaying the releasing of resources until we
receive acknowledgment from the PF driver that the rings have indeed
been stopped.  Also adds firmware version information to ethtool
reporting to align with ixgbevf behavior.

Akeem increases the polling loop limiter, sine we found that in
certain circumstances the firmware can take longer to be ready after
a reset.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'stacked_vlan_tso'
David S. Miller [Sun, 29 Mar 2015 20:33:32 +0000 (13:33 -0700)]
Merge branch 'stacked_vlan_tso'

Toshiaki Makita says:

====================
Stacked vlan TSO

On the basis of Netdev 0.1 discussion[1], I made a patch set to enable
TSO for packets with multiple vlans.

Currently, packets with multiple vlans are always segmented by software,
which is caused by that netif_skb_features() drops most feature flags
for multiple tagged packets.

To allow NICs to segment them, we need to get rid of that check from core.
Fortunately, recently introduced ndo_features_check() can be used to
move the check to each driver, and this patch set is based on the idea.

For the initial patch set, I chose 3 drivers, bonding, team, and igb, as
candidates to enable TSO. I tested them and confirmed they works fine
with this change.

Here are samples of performance test results. As I expected, %sys gets
pretty lower than before.

* TEST1: vlan (.1Q) on vlan (.1ad) on igb (I350)

- before

$ netperf -t TCP_STREAM -H 192.168.10.1 -l 60
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    60.02     933.72

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.13      0.00     11.28      0.01      0.00     88.58

- after

$ netperf -t TCP_STREAM -H 192.168.10.1 -l 60
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    60.01     936.13

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.24      0.00      4.17      0.01      0.00     95.58

* TEST2: vlan (.1Q) on bridge (.1ad vlan filtering) on team on igb (I350)

- before

$ netperf -t TCP_STREAM -H 192.168.10.1 -l 60
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    60.01     936.28

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.41      0.00     11.57      0.01      0.00     88.01

- after

$ netperf -t TCP_STREAM -H 192.168.10.1 -l 60
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    60.02     935.72

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.14      0.00      7.66      0.01      0.00     92.19

In addition to above, I tested these configurations:
- vlan (.1Q) on vlan (1.ad) on bonding on igb (I350)
- vlan (.1Q) on vlan (1.Q) on igb (I350)
- vlan (.1Q) on vlan (1.Q) on team on igb (I350)
And didn't find any problem.

[1] https://netdev01.org/sessions/18
    https://netdev01.org/docs/netdev01_bof_8021ad_makita_150212.pdf
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoigb: Enable TSO for stacked vlan
Toshiaki Makita [Fri, 27 Mar 2015 05:31:16 +0000 (14:31 +0900)]
igb: Enable TSO for stacked vlan

As datasheets for igb (I210, I350, 82576, etc.) say, maclen can be from
14 to 127, which is enough for reasonable number of vlan tags.
My netperf test showed I350's TSO works pretty fine with multiple vlans.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoteam: Don't segment multiple tagged packets on team device
Toshiaki Makita [Fri, 27 Mar 2015 05:31:15 +0000 (14:31 +0900)]
team: Don't segment multiple tagged packets on team device

Team devices don't need to segment multiple tagged packets since their
slaves can segment them.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agobonding: Don't segment multiple tagged packets on bonding device
Toshiaki Makita [Fri, 27 Mar 2015 05:31:14 +0000 (14:31 +0900)]
bonding: Don't segment multiple tagged packets on bonding device

Bonding devices don't need to segment multiple tagged packets since their
slaves can segment them.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: Introduce passthru_features_check
Toshiaki Makita [Fri, 27 Mar 2015 05:31:13 +0000 (14:31 +0900)]
net: Introduce passthru_features_check

As there are a number of (especially virtual) devices that don't
need the multiple vlan check, introduce passthru_features_check() for
convenience.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: Move check for multiple vlans to drivers
Toshiaki Makita [Fri, 27 Mar 2015 05:31:12 +0000 (14:31 +0900)]
net: Move check for multiple vlans to drivers

To allow drivers to handle the features check for multiple tags,
move the check to ndo_features_check().
As no drivers currently handle multiple tagged TSO, introduce
dflt_features_check() and call it if the driver does not have
ndo_features_check().

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agovlan: Introduce helper functions to check if skb is tagged
Toshiaki Makita [Fri, 27 Mar 2015 05:31:11 +0000 (14:31 +0900)]
vlan: Introduce helper functions to check if skb is tagged

Separate the two checks for single vlan and multiple vlans in
netif_skb_features().  This allows us to move the check for multiple
vlans to another function later.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agovlan: Add features for stacked vlan device
Toshiaki Makita [Fri, 27 Mar 2015 05:31:10 +0000 (14:31 +0900)]
vlan: Add features for stacked vlan device

Stacked vlan devices curretly have few features (GRO, HIGHDMA, LLTX).
Since we have software fallbacks in case the NIC can not handle some
features for multiple vlans, we can add the same features as the lower
vlan devices for stacked vlan devices.

This allows stacked vlan devices to create large (GSO) packets and not to
segment packets. Those packets will be segmented by software on the real
device, or even can be segmented by the NIC once TSO for multiple vlans
becomes enabled by the following patches.

The exception is those related to FCoE, which does not have a software
fallback.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotc: bpf: generalize pedit action
Alexei Starovoitov [Fri, 27 Mar 2015 02:53:57 +0000 (19:53 -0700)]
tc: bpf: generalize pedit action

existing TC action 'pedit' can munge any bits of the packet.
Generalize it for use in bpf programs attached as cls_bpf and act_bpf via
bpf_skb_store_bytes() helper function.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'dsa-hw-bridging'
David S. Miller [Sun, 29 Mar 2015 20:24:05 +0000 (13:24 -0700)]
Merge branch 'dsa-hw-bridging'

Guenter Roeck says:

====================
net: dsa: HW bridging, EEE support

Patch 1 to 7 of this series prepare the drivers using the mv88e6xxx code
for HW bridging support, without adding the code itself. For the most part
this factors out common port initialization code. There is no functional
change except for patch 3, which disables the message port bit for the
CPU port to prevent packet duplication if HW bridging is configured.

Patch 8 adds the infrastructure for hardware bridging support to the
mv88e6xxx code.

Patch 9 wires the MV88E6352 driver to support hardware bridging.

Patches 10 to 12 add support for ndo_fdb functions to the dsa subsystem,
and wire up the MV88E6352 driver to support those functions.

Patches 13 to 16 add EEE support and HW bridging support to the mv88e6171
driver. This set of patches is from Andrew, applied on top of the first
set of patches.

The series applies to net-next as of 3/24/2015.

Thanks a lot to Andrew Lunn for testing and valuable feedback.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6171: Add support for hardware bridging
Andrew Lunn [Fri, 27 Mar 2015 01:36:43 +0000 (18:36 -0700)]
net: dsa: mv88e6171: Add support for hardware bridging

Wire up the common code for setting up hardware bridging
and access to the forwarding database.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6171: Add EEE support to the mv88e6172
Andrew Lunn [Fri, 27 Mar 2015 01:36:42 +0000 (18:36 -0700)]
net: dsa: mv88e6171: Add EEE support to the mv88e6172

The mv88e6172 has support for EEE. Check for the product ID and call
the common code if applicable.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6171: Add defines for switch product IDs
Andrew Lunn [Fri, 27 Mar 2015 01:36:41 +0000 (18:36 -0700)]
net: dsa: mv88e6171: Add defines for switch product IDs

Make the code more readable by using defines for the switch IDs.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: Centralise getting switch id
Andrew Lunn [Fri, 27 Mar 2015 01:36:40 +0000 (18:36 -0700)]
net: dsa: Centralise getting switch id

Get the switch id and save it away in the private mv88x6xxx structure
in a centralised piece of code, rather than each driver doing it itself.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6352: Add support for ndo_fdb functions
Guenter Roeck [Fri, 27 Mar 2015 01:36:39 +0000 (18:36 -0700)]
net: dsa: mv88e6352: Add support for ndo_fdb functions

Add support for manipulating switch fdb entries by pointing to the
ndo_fdb functions implemented for mv88e6xxxx.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6xxx: Add support for fdb_add, fdb_del, and fdb_getnext
Guenter Roeck [Fri, 27 Mar 2015 01:36:38 +0000 (18:36 -0700)]
net: dsa: mv88e6xxx: Add support for fdb_add, fdb_del, and fdb_getnext

No vlan support at this time.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: Add basic framework to support ndo_fdb functions
Guenter Roeck [Fri, 27 Mar 2015 01:36:37 +0000 (18:36 -0700)]
net: dsa: Add basic framework to support ndo_fdb functions

Provide callbacks for ndo_fdb_add, ndo_fdb_del, and ndo_fdb_dump.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6352: Add support for hardware bridging
Guenter Roeck [Fri, 27 Mar 2015 01:36:36 +0000 (18:36 -0700)]
net: dsa: mv88e6352: Add support for hardware bridging

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6xxx: Add Hardware bridging support
Guenter Roeck [Fri, 27 Mar 2015 01:36:35 +0000 (18:36 -0700)]
net: dsa: mv88e6xxx: Add Hardware bridging support

Bridge support is similar for all chips supported by the mv88e6xxx code,
so add the code there.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6171: Use common port configuration
Guenter Roeck [Fri, 27 Mar 2015 01:36:34 +0000 (18:36 -0700)]
net: dsa: mv88e6171: Use common port configuration

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6123_61_65: Use common port configuration
Guenter Roeck [Fri, 27 Mar 2015 01:36:33 +0000 (18:36 -0700)]
net: dsa: mv88e6123_61_65: Use common port configuration

This will simplify adding offloaded bridge support later on.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6352: Use common port initialization code
Guenter Roeck [Fri, 27 Mar 2015 01:36:32 +0000 (18:36 -0700)]
net: dsa: mv88e6352: Use common port initialization code

This prepares the driver for hardware bridging.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6xxx: Split mv88e6xxx_reg_read and mv88e6xxx_reg_write
Guenter Roeck [Fri, 27 Mar 2015 01:36:31 +0000 (18:36 -0700)]
net: dsa: mv88e6xxx: Split mv88e6xxx_reg_read and mv88e6xxx_reg_write

Split mv88e6xxx_reg_read and mv88e6xxx_reg_write into two functions each,
one to acquire smi_mutex and one to get struct mii_bus *bus from
struct dsa_switch *ds and to call the actual read/write function.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6xxx: Disable Message Port bit for CPU port
Guenter Roeck [Fri, 27 Mar 2015 01:36:30 +0000 (18:36 -0700)]
net: dsa: mv88e6xxx: Disable Message Port bit for CPU port

Datasheet says that the Message Port bit should not be set for the CPU port.
Having it set causes DSA tagged packets to be sent to the CPU port roughly
every 30 seconds. Those packets are the same as real packets forwarded between
switch ports if the switch is configured for switching between multiple ports.
The packets are then bridged by the software bridge, resulting in duplicated
packets on the network.

Reported-by: Andrew Lunn <andrew@lunn.ch>
Cc: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6xxx: Provide function for common port initialization
Guenter Roeck [Fri, 27 Mar 2015 01:36:29 +0000 (18:36 -0700)]
net: dsa: mv88e6xxx: Provide function for common port initialization

Provide mv88e6xxx_setup_port_common() for common port initialization.
Currently only write Port 1 Control and VLAN configuration since
this will be needed for hardware bridging. More can be added later
if desired/needed.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: dsa: mv88e6xxx: Factor out common initialization code
Guenter Roeck [Fri, 27 Mar 2015 01:36:28 +0000 (18:36 -0700)]
net: dsa: mv88e6xxx: Factor out common initialization code

Code used and needed in mv886xxx.c should be initialized there as well,
so factor it out from the individual initialization files.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agohv_netvsc: remove vmbus_are_subchannels_present() in rndis_filter_device_add()
Haiyang Zhang [Thu, 26 Mar 2015 23:20:58 +0000 (16:20 -0700)]
hv_netvsc: remove vmbus_are_subchannels_present() in rndis_filter_device_add()

The vmbus_are_subchannels_present() also involves opening the channels, which
may be too early at this point. Checking for subchannels is not necessary here.
So this patch removes it. Subchannels will be opened when offer messages arrive.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: smc91x: make use of 4th parameter to devm_gpiod_get_index
Uwe Kleine-König [Thu, 26 Mar 2015 21:26:08 +0000 (22:26 +0100)]
net: smc91x: make use of 4th parameter to devm_gpiod_get_index

Since 39b2bbe3d715 (gpio: add flags argument to gpiod_get*() functions)
which appeared in v3.17-rc1, the gpiod_get* functions take an additional
parameter that allows to specify direction and initial value for output.
Simplify accordingly.

Moreover use devm_gpiod_get_index_optional for still simpler handling.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agohv_netvsc: Implement batching in send buffer
Haiyang Zhang [Thu, 26 Mar 2015 16:03:37 +0000 (09:03 -0700)]
hv_netvsc: Implement batching in send buffer

With this patch, we can send out multiple RNDIS data packets in one send buffer
slot and one VMBus message. It reduces the overhead associated with VMBus messages.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
David S. Miller [Sun, 29 Mar 2015 19:43:43 +0000 (12:43 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next tree.
Basically, nf_tables updates to add the set extension infrastructure and finish
the transaction for sets from Patrick McHardy. More specifically, they are:

1) Move netns to basechain and use recently added possible_net_t, from
   Patrick McHardy.

2) Use LOGLEVEL_<FOO> from nf_log infrastructure, from Joe Perches.

3) Restore nf_log_trace that was accidentally removed during conflict
   resolution.

4) nft_queue does not depend on NETFILTER_XTABLES, starting from here
   all patches from Patrick McHardy.

5) Use raw_smp_processor_id() in nft_meta.

Then, several patches to prepare ground for the new set extension
infrastructure:

6) Pass object length to the hash callback in rhashtable as needed by
   the new set extension infrastructure.

7) Cleanup patch to restore struct nft_hash as wrapper for struct
   rhashtable

8) Another small source code readability cleanup for nft_hash.

9) Convert nft_hash to rhashtable callbacks.

And finally...

10) Add the new set extension infrastructure.

11) Convert the nft_hash and nft_rbtree sets to use it.

12) Batch set element release to avoid several RCU grace period in a row
    and add new function nft_set_elem_destroy() to consolidate set element
    release.

13) Return the set extension data area from nft_lookup.

14) Refactor existing transaction code to add some helper functions
    and document it.

15) Complete the set transaction support, using similar approach to what we
    already use, to activate/deactivate elements in an atomic fashion.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'tipc-next'
David S. Miller [Sun, 29 Mar 2015 19:40:37 +0000 (12:40 -0700)]
Merge branch 'tipc-next'

Ying Xue says:

====================
tipc: fix two corner issues

The patch set aims at resolving the following two critical issues:

Patch #1: Resolve a deadlock which happens while all links are reset
Patch #2: Correct a mistake usage of RCU lock which is used to protect
          node list
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotipc: involve reference counter for node structure
Ying Xue [Thu, 26 Mar 2015 10:10:24 +0000 (18:10 +0800)]
tipc: involve reference counter for node structure

TIPC node hash node table is protected with rcu lock on read side.
tipc_node_find() is used to look for a node object with node address
through iterating the hash node table. As the entire process of what
tipc_node_find() traverses the table is guarded with rcu read lock,
it's safe for us. However, when callers use the node object returned
by tipc_node_find(), there is no rcu read lock applied. Therefore,
this is absolutely unsafe for callers of tipc_node_find().

Now we introduce a reference counter for node structure. Before
tipc_node_find() returns node object to its caller, it first increases
the reference counter. Accordingly, after its caller used it up,
it decreases the counter again. This can prevent a node being used by
one thread from being freed by another thread.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotipc: fix potential deadlock when all links are reset
Ying Xue [Thu, 26 Mar 2015 10:10:23 +0000 (18:10 +0800)]
tipc: fix potential deadlock when all links are reset

[   60.988363] ======================================================
[   60.988754] [ INFO: possible circular locking dependency detected ]
[   60.989152] 3.19.0+ #194 Not tainted
[   60.989377] -------------------------------------------------------
[   60.989781] swapper/3/0 is trying to acquire lock:
[   60.990079]  (&(&n_ptr->lock)->rlock){+.-...}, at: [<ffffffffa0006dca>] tipc_link_retransmit+0x1aa/0x240 [tipc]
[   60.990743]
[   60.990743] but task is already holding lock:
[   60.991106]  (&(&bclink->lock)->rlock){+.-...}, at: [<ffffffffa00004be>] tipc_bclink_lock+0x8e/0xa0 [tipc]
[   60.991738]
[   60.991738] which lock already depends on the new lock.
[   60.991738]
[   60.992174]
[   60.992174] the existing dependency chain (in reverse order) is:
[   60.992174]
-> #1 (&(&bclink->lock)->rlock){+.-...}:
[   60.992174]        [<ffffffff810a9c0c>] lock_acquire+0x9c/0x140
[   60.992174]        [<ffffffff8179c41f>] _raw_spin_lock_bh+0x3f/0x50
[   60.992174]        [<ffffffffa00004be>] tipc_bclink_lock+0x8e/0xa0 [tipc]
[   60.992174]        [<ffffffffa0000f57>] tipc_bclink_add_node+0x97/0xf0 [tipc]
[   60.992174]        [<ffffffffa0011815>] tipc_node_link_up+0xf5/0x110 [tipc]
[   60.992174]        [<ffffffffa0007783>] link_state_event+0x2b3/0x4f0 [tipc]
[   60.992174]        [<ffffffffa00193c0>] tipc_link_proto_rcv+0x24c/0x418 [tipc]
[   60.992174]        [<ffffffffa0008857>] tipc_rcv+0x827/0xac0 [tipc]
[   60.992174]        [<ffffffffa0002ca3>] tipc_l2_rcv_msg+0x73/0xd0 [tipc]
[   60.992174]        [<ffffffff81646e66>] __netif_receive_skb_core+0x746/0x980
[   60.992174]        [<ffffffff816470c1>] __netif_receive_skb+0x21/0x70
[   60.992174]        [<ffffffff81647295>] netif_receive_skb_internal+0x35/0x130
[   60.992174]        [<ffffffff81648218>] napi_gro_receive+0x158/0x1d0
[   60.992174]        [<ffffffff81559e05>] e1000_clean_rx_irq+0x155/0x490
[   60.992174]        [<ffffffff8155c1b7>] e1000_clean+0x267/0x990
[   60.992174]        [<ffffffff81647b60>] net_rx_action+0x150/0x360
[   60.992174]        [<ffffffff8105ec43>] __do_softirq+0x123/0x360
[   60.992174]        [<ffffffff8105f12e>] irq_exit+0x8e/0xb0
[   60.992174]        [<ffffffff8179f9f5>] do_IRQ+0x65/0x110
[   60.992174]        [<ffffffff8179da6f>] ret_from_intr+0x0/0x13
[   60.992174]        [<ffffffff8100de9f>] arch_cpu_idle+0xf/0x20
[   60.992174]        [<ffffffff8109dfa6>] cpu_startup_entry+0x2f6/0x3f0
[   60.992174]        [<ffffffff81033cda>] start_secondary+0x13a/0x150
[   60.992174]
-> #0 (&(&n_ptr->lock)->rlock){+.-...}:
[   60.992174]        [<ffffffff810a8f7d>] __lock_acquire+0x163d/0x1ca0
[   60.992174]        [<ffffffff810a9c0c>] lock_acquire+0x9c/0x140
[   60.992174]        [<ffffffff8179c41f>] _raw_spin_lock_bh+0x3f/0x50
[   60.992174]        [<ffffffffa0006dca>] tipc_link_retransmit+0x1aa/0x240 [tipc]
[   60.992174]        [<ffffffffa0001e11>] tipc_bclink_rcv+0x611/0x640 [tipc]
[   60.992174]        [<ffffffffa0008646>] tipc_rcv+0x616/0xac0 [tipc]
[   60.992174]        [<ffffffffa0002ca3>] tipc_l2_rcv_msg+0x73/0xd0 [tipc]
[   60.992174]        [<ffffffff81646e66>] __netif_receive_skb_core+0x746/0x980
[   60.992174]        [<ffffffff816470c1>] __netif_receive_skb+0x21/0x70
[   60.992174]        [<ffffffff81647295>] netif_receive_skb_internal+0x35/0x130
[   60.992174]        [<ffffffff81648218>] napi_gro_receive+0x158/0x1d0
[   60.992174]        [<ffffffff81559e05>] e1000_clean_rx_irq+0x155/0x490
[   60.992174]        [<ffffffff8155c1b7>] e1000_clean+0x267/0x990
[   60.992174]        [<ffffffff81647b60>] net_rx_action+0x150/0x360
[   60.992174]        [<ffffffff8105ec43>] __do_softirq+0x123/0x360
[   60.992174]        [<ffffffff8105f12e>] irq_exit+0x8e/0xb0
[   60.992174]        [<ffffffff8179f9f5>] do_IRQ+0x65/0x110
[   60.992174]        [<ffffffff8179da6f>] ret_from_intr+0x0/0x13
[   60.992174]        [<ffffffff8100de9f>] arch_cpu_idle+0xf/0x20
[   60.992174]        [<ffffffff8109dfa6>] cpu_startup_entry+0x2f6/0x3f0
[   60.992174]        [<ffffffff81033cda>] start_secondary+0x13a/0x150
[   60.992174]
[   60.992174] other info that might help us debug this:
[   60.992174]
[   60.992174]  Possible unsafe locking scenario:
[   60.992174]
[   60.992174]        CPU0                    CPU1
[   60.992174]        ----                    ----
[   60.992174]   lock(&(&bclink->lock)->rlock);
[   60.992174]                                lock(&(&n_ptr->lock)->rlock);
[   60.992174]                                lock(&(&bclink->lock)->rlock);
[   60.992174]   lock(&(&n_ptr->lock)->rlock);
[   60.992174]
[   60.992174]  *** DEADLOCK ***
[   60.992174]
[   60.992174] 3 locks held by swapper/3/0:
[   60.992174]  #0:  (rcu_read_lock){......}, at: [<ffffffff81646791>] __netif_receive_skb_core+0x71/0x980
[   60.992174]  #1:  (rcu_read_lock){......}, at: [<ffffffffa0002c35>] tipc_l2_rcv_msg+0x5/0xd0 [tipc]
[   60.992174]  #2:  (&(&bclink->lock)->rlock){+.-...}, at: [<ffffffffa00004be>] tipc_bclink_lock+0x8e/0xa0 [tipc]
[   60.992174]

The correct the sequence of grabbing n_ptr->lock and bclink->lock
should be that the former is first held and the latter is then taken,
which exactly happened on CPU1. But especially when the retransmission
of broadcast link is failed, bclink->lock is first held in
tipc_bclink_rcv(), and n_ptr->lock is taken in link_retransmit_failure()
called by tipc_link_retransmit() subsequently, which is demonstrated on
CPU0. As a result, deadlock occurs.

If the order of holding the two locks happening on CPU0 is reversed, the
deadlock risk will be relieved. Therefore, the node lock taken in
link_retransmit_failure() originally is moved to tipc_bclink_rcv()
so that it's obtained before bclink lock. But the precondition of
the adjustment of node lock is that responding to bclink reset event
must be moved from tipc_bclink_unlock() to tipc_node_unlock().

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>