git.karo-electronics.de Git - linux-beck.git/log

Merge branch 'pnfs'

Merge branch 'writeback'

Merge branch 'sunrpc'

SUNRPC: Fix a compiler warning in fs/nfs/clnt.c

Fix the report:

net/sunrpc/clnt.c:2580:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration]

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Remove redundant smp_mb() from pnfs_init_lseg()

It's not visible yet, and won't be until after we grab the inode->i_lock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Cleanup - do layout segment initialisation in one place

...instead of splitting the initialisation over init_lseg() and
pnfs_layout_process().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Remove redundant stateid invalidation

The layout stateid will be invalidated once it holds no more layout
segments anyway.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Remove redundant pnfs_mark_layout_returned_if_empty()

That's already being taken care of in pnfs_layout_remove_lseg().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Clear the layout metadata if the server changed the layout stateid

If the server changed the layout stateid's "other" field, then
we should treat the old layout as being completely gone. In that
case, we want to clear the metadata such as scheduled layoutreturns.

Do this by calling pnfs_mark_layout_stateid_invalid().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Cleanup - don't open code pnfs_mark_layout_stateid_invalid()

Ensure nfs42_layoutstat_done() layoutget don't open code layout stateid
invalidation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: pnfs_mark_matching_lsegs_return() should match the layout sequence id

When determining which layout segments to return, we do want
pnfs_mark_matching_lsegs_return to check that they match the layout
sequence id. This ensures that we don't waste time if the server
is replaying a layout recall that has already been satisfied.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Do not set plh_return_seq for non-callback related layoutreturns

In cases where we need to send a layoutreturn in order to propagate
an error, we should not tie that to a specific layout stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Ensure layoutreturn acts as a completion for layout callbacks

When we return NFS_OK to the CB_LAYOUTRECALL, we are required to
send a layoutreturn that "completes" that layout recall request, using
the correct stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Fix CB_LAYOUTRECALL stateid verification

We want to evaluate in this order:

If the client holds no layout for this inode, then return
NFS4ERR_NOMATCHING_LAYOUT; it probably forgot the layout.

If the client finds the inode among the list of layouts, but the corresponding
stateid has not yet been initialised, then return NFS4ERR_DELAY to ask the
server to retry once the outstanding LAYOUTGET is complete.

If the current layout stateid's "other" field does not match the recalled
stateid, return NFS4ERR_BAD_STATEID.

If already processing a layout recall with a newer stateid, return
NFS4ERR_OLD_STATEID. This can only happens for servers that are
non-compliant with the NFSv4.1 protocol.

If already processing a layout recall with an older stateid, return
NFS4ERR_DELAY to ask the server to retry once the outstanding
LAYOUTRETURN is complete. Again, this is technically incompliant with
the NFSv4.1 protocol.

If the current layout sequence id is newer than the recalled stateid's
sequence id, return NFS4ERR_OLD_STATEID. This too implies protocol
non-compliance.

If the current layout sequence id is older than the recalled stateid's
sequence id+1, return NFS4ERR_DELAY.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Always update the layout barrier seqid on LAYOUTGET

Currently, pnfs_set_layout_stateid() will update the layout sequence
id barrier only if the stateid itself is newer than the current
layout stateid. However in a situation where multiple LAYOUTGET calls
and a LAYOUTRETURN raced, it is entirely possible for one of the
LAYOUTGET to set the current stateid to something newer than the
LAYOUTRETURN that needs to set the barrier.

The fix is to allow the "update_barrier" flag to force a check as to
whether or not the barrier needs to be updated.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Always update the layout stateid if NFS_LAYOUT_INVALID_STID is set

If the layout stateid is invalid, then pnfs_set_layout_stateid() must
always initialise it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Clear the layout return tracking on layout reinitialisation

Ensure that we don't carry over layoutreturn info from a previous
incarnation of this layout.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: LAYOUTRETURN should only update the stateid if the layout is valid

If the layout was completely returned, then ignore the returned layout
stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

Merge commit 'e7bdea7750eb'

Needed in order to work on top of pNFS changes in Linus' upstream kernel.

nfs: don't create zero-length requests

NFS doesn't expect requests with wb_bytes set to zero and may make
unexpected decisions about how to handle that request at the page IO layer.
Skip request creation if we won't have any wb_bytes in the request.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

Fix NULL pointer dereference in bl_free_device().

When bl_parse_deviceid() fails in bl_alloc_deviceid_node() on
blkdev_get_by_*() step we get an pnfs_block_dev struct that is
uninitialized except for bdev field which is set to whatever error
blkdev_get_by_*() returns. bl_free_device() then tries to call
blkdev_put() if bdev is not 0 resulting in a wrong pointer dereference.

Fixing this by setting bdev in struct pnfs_block_dev only if we didn't
get an error from blkdev_get_by_*().

Signed-off-by: Artem Savkov <asavkov@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS/files: filelayout_write_done_cb must call nfs_writeback_update_inode()

All write callbacks are required to call nfs_writeback_update_inode() upon
success to ensure that file size changes are recorded, and the attribute
cache is invalidated.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

sunrpc: Prevent resvport min/max inversion via sysfs and module parameter

The current min/max resvport settings are independently limited
by the entire range of allowed ports, so max_resvport can be
set to a port lower than min_resvport.

Prevent inversion of min/max values when set through sysfs and
module parameter by setting the limits dependent on each other.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

sunrpc: Prevent resvport min/max inversion via sysctl

The current min/max resvport settings are independently limited
by the entire range of allowed ports, so max_resvport can be
set to a port lower than min_resvport.

Prevent inversion of min/max values when set through sysctl by
setting the limits dependent on each other.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

sunrpc: Fix reserved port range calculation

The range calculation for choosing the random reserved port will panic
with divide-by-zero when min_resvport == max_resvport, a range of one
port, not zero.

Fix the reserved port range calculation by adding one to the difference.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

sunrpc: Fix bit count when setting hashtable size to power-of-two

Author: Frank Sorenson <sorenson@redhat.com>
Date:   2016-06-27 13:55:48 -0500

    sunrpc: Fix bit count when setting hashtable size to power-of-two

    The hashtable size is incorrectly calculated as the next higher
    power-of-two when being set to a power-of-two.  fls() returns the
    bit number of the most significant set bit, with the least
    significant bit being numbered '1'.  For a power-of-two, fls()
    will return a bit number which is one higher than the number of bits
    required, leading to a hashtable which is twice the requested size.

    In addition, the value of (1 << nbits) will always be at least num,
    so the test will never be true.

    Fix the hash table size calculation to correctly set hashtable
    size, and eliminate the unnecessary check.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

nfs4: flexfiles: respect noresvport when establishing connections to DSes

Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

nfs4: clnt: respect noresvport when establishing connections to DSes

result:

$ mount -o vers=4.1 dcache-lab007:/ /pnfs
$ cp /etc/profile /pnfs
tcp        0      0 131.169.185.68:1005     131.169.191.141:32049   ESTABLISHED
tcp        0      0 131.169.185.68:751      131.169.191.144:2049    ESTABLISHED
$

$ mount -o vers=4.1,noresvport dcache-lab007:/ /pnfs
$ cp /etc/profile /pnfs
tcp        0      0 131.169.185.68:34894    131.169.191.141:32049   ESTABLISHED
tcp        0      0 131.169.185.68:35722    131.169.191.144:2049    ESTABLISHED
$

Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pnfs/blocklayout: put deviceid node after releasing bl_ext_lock

The last put of deviceid nodes for SCSI layouts may sleep, so we shouldn't
hold any spinlocks. Make sure we put them outside the bl_ext_lock.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

sunrpc: move NO_CRKEY_TIMEOUT to the auth->au_flags

A generic_cred can be used to look up a unx_cred or a gss_cred, so it's
not really safe to use the the generic_cred->acred->ac_flags to store
the NO_CRKEY_TIMEOUT flag.  A lookup for a unx_cred triggered while the
KEY_EXPIRE_SOON flag is already set will cause both NO_CRKEY_TIMEOUT and
KEY_EXPIRE_SOON to be set in the ac_flags, leaving the user associated
with the auth_cred to be in a state where they're perpetually doing 4K
NFS_FILE_SYNC writes.

This can be reproduced as follows:

1. Mount two NFS filesystems, one with sec=krb5 and one with sec=sys.
They do not need to be the same export, nor do they even need to be from
the same NFS server.  Also, v3 is fine.
$ sudo mount -o v3,sec=krb5 server1:/export /mnt/krb5
$ sudo mount -o v3,sec=sys server2:/export /mnt/sys

2. As the normal user, before accessing the kerberized mount, kinit with
a short lifetime (but not so short that renewing the ticket would leave
you within the 4-minute window again by the time the original ticket
expires), e.g.
$ kinit -l 10m -r 60m

3. Do some I/O to the kerberized mount and verify that the writes are
wsize, UNSTABLE:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1

4. Wait until you're within 4 minutes of key expiry, then do some more
I/O to the kerberized mount to ensure that RPC_CRED_KEY_EXPIRE_SOON gets
set.  Verify that the writes are 4K, FILE_SYNC:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1

5. Now do some I/O to the sec=sys mount.  This will cause
RPC_CRED_NO_CRKEY_TIMEOUT to be set:
$ dd if=/dev/zero of=/mnt/sys/file bs=1M count=1

6. Writes for that user will now be permanently 4K, FILE_SYNC for that
user, regardless of which mount is being written to, until you reboot
the client.  Renewing the kerberos ticket (assuming it hasn't already
expired) will have no effect.  Grabbing a new kerberos ticket at this
point will have no effect either.

Move the flag to the auth->au_flags field (which is currently unused)
and rename it slightly to reflect that it's no longer associated with
the auth_cred->ac_flags.  Add the rpc_auth to the arg list of
rpcauth_cred_key_to_expire and check the au_flags there too.  Finally,
add the inode to the arg list of nfs_ctx_key_to_expire so we can
determine the rpc_auth to pass to rpcauth_cred_key_to_expire.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

mount: use sec= that was specified on the command line

When older servers return RPC_AUTH_NULL, it means the
rpc creds will be ignored. In that case use the sec=
that was specified instead of setting sec=null

Fixes Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1112983
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Fix LAYOUTGET handling of NFS4ERR_BAD_STATEID and NFS4ERR_EXPIRED

We want to recover the open stateid if there is no layout stateid
and/or the stateid argument matches an open stateid.
Otherwise throw out the existing layout and recover from scratch, as
the layout stateid is bad.

Fixes: 183d9e7b112aa ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>

pNFS: Handle NFS4ERR_RECALLCONFLICT correctly in LAYOUTGET

Instead of giving up altogether and falling back to doing I/O
through the MDS, which may make the situation worse, wait for
2 lease periods for the callback to resolve itself, and then
try destroying the existing layout.

Only if this was an attempt at getting a first layout, do we
give up altogether, as the server is clearly crazy.

Fixes: 183d9e7b112aa ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>

pNFS: Separate handling of NFS4ERR_LAYOUTTRYLATER and RECALLCONFLICT

They are not the same error, and need to be handled differently.

Fixes: 183d9e7b112aa ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>

pNFS: Fix post-layoutget error handling in pnfs_update_layout()

The non-retry error path is currently broken and ends up releasing the
reference to the layout twice. It also can end up clearing the
NFS_LAYOUT_FIRST_LAYOUTGET flag twice, causing a race.

In addition, the retry path will fail to decrement the plh_outstanding
counter.

Fixes: 183d9e7b112aa ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>

pNFS: Don't mark the inode as revalidated if a LAYOUTCOMMIT is outstanding

We know that the attributes will need updating if there is still a
LAYOUTCOMMIT outstanding.

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

SUNRPC: Fix infinite looping in rpc_clnt_iterate_for_each_xprt

If there were less than 2 entries in the multipath list, then
xprt_iter_next_entry_multiple() would never advance beyond the
first entry, which is correct for round robin behaviour, but not
for the list iteration.

The end result would be infinite looping in rpc_clnt_iterate_for_each_xprt()
as we would never see the xprt == NULL condition fulfilled.

Reported-by: Oleg Drokin <green@linuxhacker.ru>
Fixes: 80b14d5e61ca ("SUNRPC: Add a structure to track multiple transports")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

nfs/blocklayout: Check max uuids and devices before decoding

Avoid nfs return uuids/devices larger than maximum.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

nfs/blocklayout: Make sure calculate signature length aligned

Avoid a bad nfs server return an unaligned length of signature.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

nfs/blocklayout: support RH/Fedora dm-mpath device nodes

Instead of reusing the wwn-* names for multipath devices nodes RHEL and
Fedora introduce new dm-mpath-uuid-* nodes with a slightly different
naming scheme. Try these names first to ensure we always get a
multipath-capable device if it exists.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

nfs/blocklayout: refactor open-by-wwn

The current code works with the standard udev/systemd names, but we'll have
to add another method in the next patch. Refactor it into a separate helper
to make room for the new variant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

nfs/blocklayout: use proper fmode for opening block devices

This was fixed for the original block layout code a while ago, but also
needs to be fixed for the SCSI layout path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFSv4: Revert "Truncating file opens should also sync O_DIRECT writes"

We're not holding any locks, so both nfs_wb_all() and inode_dio_wait()
are unenforcible and have livelock potential. Just limit ourselves to
flushing out the data.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS nfs_vm_page_mkwrite: Don't freeze me, Bro...

Prevent filesystem freezes while handling the write page fault.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFSv4.2: llseek(SEEK_HOLE) and llseek(SEEK_DATA) don't require data sync

We want to ensure that we write the cached data to the server, but
don't require it be synced to disk. If the server reboots, we will
get a stateid error, which will cause us to retry anyway.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFSv4.2: Fix writeback races in nfs4_copy_file_range

We need to ensure that any writes to the destination file are serialised
with the copy, meaning that the writeback has to occur under the inode lock.

Also relax the writeback requirement on the source, and rely on the
stateid checking to tell us if the source rebooted. Add the helper
nfs_filemap_write_and_wait_range() to call pnfs_sync_inode() as
is appropriate for pNFS servers that may need a layoutcommit.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFSv4.2: Fix a race in nfs42_proc_deallocate()

When punching holes in a file, we want to ensure the operation is
serialised w.r.t. other writes, meaning that we want to call
nfs_sync_inode() while holding the inode lock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Getattr doesn't require data sync semantics

When retrieving stat() information, NFS unfortunately does require us to
sync writes to disk in order to ensure that mtime and ctime are up to
date. However we shouldn't have to ensure that those writes are persisted.

Relaxing that requirement does mean that we may see an mtime/ctime change
if the server reboots and forces us to replay all writes.

The exception to this rule are pNFS clients that are required to send
layoutcommit, however that is dealt with by the call to pnfs_sync_inode()
in _nfs_revalidate_inode().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Do not aggressively cache file attributes in the case of O_DIRECT

A file that is open for O_DIRECT is by definition not obeying
close-to-open cache consistency semantics, so let's not cache
the attributes too aggressively either.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Remove unused function nfs_revalidate_mapping_protected()

Clean up...

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Remove redundant waits for O_DIRECT in fsync() and write_begin()

We're now waiting immediately after taking the locks, so waiting
in fsync() and write_begin() is either redundant or potentially
subject to livelock (if not holding the lock).

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Cleanup nfs_direct_complete()

There is only one caller that sets the "write" argument to true,
so just move the call to nfs_zap_mapping() and get rid of the
now redundant argument.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Do not serialise O_DIRECT reads and writes

Allow dio requests to be scheduled in parallel, but ensuring that they
do not conflict with buffered I/O.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Move buffered I/O locking into nfs_file_write()

Preparation for the patch that de-serialises O_DIRECT reads and
writes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS Cleanup: move call to generic_write_checks() into fs/nfs/direct.c

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Remove racy size manipulations in O_DIRECT

On success, the RPC callbacks will ensure that we make the appropriate calls
to nfs_writeback_update_inode()

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Ensure we reset the write verifier 'committed' value on resend.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Fix O_DIRECT verifier problems

We should not be interested in looking at the value of the stable field,
since that could take any value.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: pnfs_layoutcommit_outstanding() is no longer used when !CONFIG_NFS_V4_1

Cleanup...

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Ensure we layoutcommit before revalidating attributes

If we need to update the cached attributes, then we'd better make
sure that we also layoutcommit first. Otherwise, the server may have stale
attributes.

Prior to this patch, the revalidation code tried to "fix" this problem by
simply disabling attributes that would be affected by the layoutcommit.
That approach breaks nfs_writeback_check_extend(), leading to a file size
corruption.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS: Files and flexfiles always need to commit before layoutcommit

So ensure that we mark the layout for commit once the write is done,
and then ensure that the commit to ds is finished before sending
layoutcommit.

Note that by doing this, we're able to optimise away the commit
for the case of servers that don't need layoutcommit in order to
return updated attributes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS/flexfiles: Clean up calls to pnfs_set_layoutcommit()

Let's just have one place where we check ff_layout_need_layoutcommit().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS/flexfiles: Fix layoutcommit after a commit to DS

We should always do a layoutcommit after commit to DS, except if
the layout segment we're using has set FF_FLAGS_NO_LAYOUTCOMMIT.

Fixes: d67ae825a59d ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

pNFS/files: Fix layoutcommit after a commit to DS

According to the errata
https://www.rfc-editor.org/errata_search.php?rfc=5661&eid=2751
we should always send layout commit after a commit to DS.

Fixes: bc7d4b8fd091 ("nfs/filelayout: set layoutcommit...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFSv4: Allow retry of operations that used a returned delegation stateid

Fix up nfs4_do_handle_exception() so that it can check if the operation
that received the NFS4ERR_BAD_STATEID was using a defunct delegation.
Apply that to the case of SETATTR, which will currently return EIO
in some cases where this happens.

Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS/pnfs: Do not clobber existing pgio_done_cb in nfs4_proc_read_setup

If a pNFS client sets hdr->pgio_done_cb, then we should not overwrite that
in nfs4_proc_read_setup()

Fixes: 75bf47ebf6b5 ("pNFS/flexfile: Fix erroneous fall back to...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Fix an Oops in the pNFS files and flexfiles connection setup to the DS

Chris Worley reports:
RIP: 0010:[<ffffffffa0245f80>]  [<ffffffffa0245f80>] rpc_new_client+0x2a0/0x2e0 [sunrpc]
RSP: 0018:ffff880158f6f548  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880234f8bc00 RCX: 000000000000ea60
RDX: 0000000000074cc0 RSI: 000000000000ea60 RDI: ffff880234f8bcf0
RBP: ffff880158f6f588 R08: 000000000001ac80 R09: ffff880237003300
R10: ffff880201171000 R11: ffffea0000d75200 R12: ffffffffa03afc60
R13: ffff880230c18800 R14: 0000000000000000 R15: ffff880158f6f680
FS:  00007f0e32673740(0000) GS:ffff88023fc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000234886000 CR4: 00000000001406e0
Stack:
  ffffffffa047a680 0000000000000000 ffff880158f6f598 ffff880158f6f680
  ffff880158f6f680 ffff880234d11d00 ffff88023357f800 ffff880158f6f7d0
  ffff880158f6f5b8 ffffffffa024660a ffff880158f6f5b8 ffffffffa02492ec
Call Trace:
  [<ffffffffa024660a>] rpc_create_xprt+0x1a/0xb0 [sunrpc]
  [<ffffffffa02492ec>] ? xprt_create_transport+0x13c/0x240 [sunrpc]
  [<ffffffffa0246766>] rpc_create+0xc6/0x1a0 [sunrpc]
  [<ffffffffa038e695>] nfs_create_rpc_client+0xf5/0x140 [nfs]
  [<ffffffffa038f31a>] nfs_init_client+0x3a/0xd0 [nfs]
  [<ffffffffa038f22f>] nfs_get_client+0x25f/0x310 [nfs]
  [<ffffffffa025cef8>] ? rpc_ntop+0xe8/0x100 [sunrpc]
  [<ffffffffa047512c>] nfs3_set_ds_client+0xcc/0x100 [nfsv3]
  [<ffffffffa041fa10>] nfs4_pnfs_ds_connect+0x120/0x400 [nfsv4]
  [<ffffffffa03d41c7>] nfs4_ff_layout_prepare_ds+0xe7/0x330 [nfs_layout_flexfiles]
  [<ffffffffa03d1b1b>] ff_layout_pg_init_write+0xcb/0x280 [nfs_layout_flexfiles]
  [<ffffffffa03a14dc>] __nfs_pageio_add_request+0x12c/0x490 [nfs]
  [<ffffffffa03a1fa2>] nfs_pageio_add_request+0xc2/0x2a0 [nfs]
  [<ffffffffa03a0365>] ? nfs_pageio_init+0x75/0x120 [nfs]
  [<ffffffffa03a5b50>] nfs_do_writepage+0x120/0x270 [nfs]
  [<ffffffffa03a5d31>] nfs_writepage_locked+0x61/0xc0 [nfs]
  [<ffffffff813d4115>] ? __percpu_counter_add+0x55/0x70
  [<ffffffffa03a6a9f>] nfs_wb_single_page+0xef/0x1c0 [nfs]
  [<ffffffff811ca4a3>] ? __dec_zone_page_state+0x33/0x40
  [<ffffffffa0395b21>] nfs_launder_page+0x41/0x90 [nfs]
  [<ffffffff811baba0>] invalidate_inode_pages2_range+0x340/0x3a0
  [<ffffffff811bac17>] invalidate_inode_pages2+0x17/0x20
  [<ffffffffa039960e>] nfs_release+0x9e/0xb0 [nfs]
  [<ffffffffa0399570>] ? nfs_open+0x60/0x60 [nfs]
  [<ffffffffa0394dad>] nfs_file_release+0x3d/0x60 [nfs]
  [<ffffffff81226e6c>] __fput+0xdc/0x1e0
  [<ffffffff81226fbe>] ____fput+0xe/0x10
  [<ffffffff810bf2e4>] task_work_run+0xc4/0xe0
  [<ffffffff810a4188>] do_exit+0x2e8/0xb30
  [<ffffffff8102471c>] ? do_audit_syscall_entry+0x6c/0x70
  [<ffffffff811464e6>] ? __audit_syscall_exit+0x1e6/0x280
  [<ffffffff810a4a5f>] do_group_exit+0x3f/0xa0
  [<ffffffff810a4ad4>] SyS_exit_group+0x14/0x20
  [<ffffffff8179b76e>] system_call_fastpath+0x12/0x71

Which seems to be due to a call to utsname() when in a task exit context
in order to determine the hostname to set in rpc_new_client().

In reality, what we want here is not the hostname of the current task, but
the hostname that was used to set up the metadata server.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
"ARM and x86 fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: nVMX: VMX instructions: fix segment checks when L1 is in long mode.
  KVM: LAPIC: cap __delay at lapic_timer_advance_ns
  KVM: x86: move nsec_to_cycles from x86.c to x86.h
  pvclock: Get rid of __pvclock_read_cycles in function pvclock_read_flags
  pvclock: Cleanup to remove function pvclock_get_nsec_offset
  pvclock: Add CPU barriers to get correct version value
  KVM: arm/arm64: Stop leaking vcpu pid references
  arm64: KVM: fix build with CONFIG_ARM_PMU disabled

Merge tag 'arc-4.7-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc

Pull ARC fix from Vineet Gupta:
"Reinstate dwarf unwinder/loadable-modules with new gnu tools"

* tag 'arc-4.7-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
arc: unwind: warn only once if DW2_UNWIND is disabled
ARC: unwind: ensure that .debug_frame is generated (vs. .eh_frame)

Merge tag 'pwm/for-4.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm

Pull pwm fixes from Thierry Reding:
"One more fix for some fallout observed after the introduction of the
atomic API"

* tag 'pwm/for-4.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
pwm: Fix pwm_apply_args()

Merge tag 'mfd-fixes-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd

Pull MFD fixes from Lee Jones:
"Contained are some standard fixes and unusually an extension to the
  Reset API.  Some of those changes are required to fix a bug introduced
  in -rc1, which introduces extra 'reset line checks' i.e. whether the
  line is shared or not.  If a line is shared and the new *_shared() API
  is not used, the request fails with an error.  This breaks USB in v4.7
  for ST's platforms.

  Admittedly, there are some patches contained in our (MFD/Reset)
  immutable branch which are not true -fixes, but there isn't anything I
  can do about that.  Rest assured though, there aren't any API
  'changes'.  Everything is the same from the consumer's perspective.

   - Use new reset_*_get_shared() variant to prevent reset line
     obtainment failure (Fixes commit 0b52297f2288: "reset: Add support
     for shared reset controls")

   - Fix unintentional switch() fall-through into error path

   - Fix uninitialised variable compiler warning"

* tag 'mfd-fixes-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
  mfd: da9053: Fix compiler warning message for uninitialised variable
  mfd: max77620: Fix FPS switch statements
  phy: phy-stih407-usb: Inform the reset framework that our reset line may be shared
  usb: dwc3: st: Inform the reset framework that our reset line may be shared
  usb: host: ehci-st: Inform the reset framework that our reset line may be shared
  usb: host: ohci-st: Inform the reset framework that our reset line may be shared
  reset: TRIVIAL: Add line break at same place for similar APIs
  reset: Supply *_shared variant calls when using *_optional APIs
  reset: Supply *_shared variant calls when using of_* API
  reset: Ensure drivers are explicit when requesting reset lines
  reset: Reorder inline reset_control_get*() wrappers

Merge tag 'kvm-arm-for-v4.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master

KVM/ARM Fixes for v4.7-rc6:

Fixes a build issue without CONFIG_ARM_PMU and plugs pid leak on arm/arm64.

mfd: da9053: Fix compiler warning message for uninitialised variable

Fix compiler warning caused by an uninitialised variable inside
da9052_group_write() function. Defaulting the value to zero covers
the trivial case.

Signed-off-by: Steve Twiss <stwiss.opensource@diasemi.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>

mfd: max77620: Fix FPS switch statements

When configuring FPS during probe, assuming a DT node is present for
FPS, the code can run into a problem with the switch statements in
max77620_config_fps() and max77620_get_fps_period_reg_value(). Namely,
in the case of chip->chip_id == MAX77620, it will set
fps_[mix|max]_period but then fall through to the default switch case
and return -EINVAL. Returning this from max77620_config_fps() will
cause probe to fail.

Signed-off-by: Rhyland Klein <rklein@nvidia.com>
Reviewed-by: Laxman Dewangan <ldewangan@nvidia.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
Tested-by: Thierry Reding <treding@nvidia.com>
Tested-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>

phy: phy-stih407-usb: Inform the reset framework that our reset line may be shared

On the STiH410 B2120 development board the ports on the Generic PHY
share their reset lines with each other. New functionality in the
reset subsystems forces consumers to be explicit when requesting
shared/exclusive reset lines.

Signed-off-by: Lee Jones <lee.jones@linaro.org>

usb: dwc3: st: Inform the reset framework that our reset line may be shared

On the STiH410 B2120 development board the MiPHY28lp shares its reset
line with the Synopsys DWC3 SuperSpeed (SS) USB 3.0 Dual-Role-Device
(DRD). New functionality in the reset subsystems forces consumers to
be explicit when requesting shared/exclusive reset lines.

Acked-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>

usb: host: ehci-st: Inform the reset framework that our reset line may be shared

On the STiH410 B2120 development board the ST EHCI IP shares its reset
line with the OHCI IP. New functionality in the reset subsystems forces
consumers to be explicit when requesting shared/exclusive reset lines.

Acked-by: Peter Griffin <peter.griffin@linaro.org>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Lee Jones <lee.jones@linaro.org>

usb: host: ohci-st: Inform the reset framework that our reset line may be shared

On the STiH410 B2120 development board the ST EHCI IP shares its reset
line with the OHCI IP. New functionality in the reset subsystems forces
consumers to be explicit when requesting shared/exclusive reset lines.

Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Lee Jones <lee.jones@linaro.org>

Merge tag 'nfs-for-4.7-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
"Stable bugfixes:
   - Fix _cancel_empty_pagelist
   - Fix a double page unlock
   - Make nfs_atomic_open() call d_drop() on all ->open_context() errors.
   - Fix another OPEN_DOWNGRADE bug

  Other bugfixes:
   - Ensure we handle delegation errors in nfs4_proc_layoutget()
   - Layout stateids start out as being invalid
   - Add sparse lock annotations for pnfs_find_alloc_layout
   - Handle bad delegation stateids in nfs4_layoutget_handle_exception
   - Fix up O_DIRECT results
   - Fix potential use after free of state in nfs4_do_reclaim.
   - Mark the layout stateid invalid when all segments are removed
   - Don't let readdirplus revalidate an inode that was marked as stale
   - Fix potential race in nfs_fhget()
   - Fix an unused variable warning"

* tag 'nfs-for-4.7-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Fix another OPEN_DOWNGRADE bug
  make nfs_atomic_open() call d_drop() on all ->open_context() errors.
  NFS: Fix an unused variable warning
  NFS: Fix potential race in nfs_fhget()
  NFS: Don't let readdirplus revalidate an inode that was marked as stale
  NFSv4.1/pnfs: Mark the layout stateid invalid when all segments are removed
  NFS: Fix a double page unlock
  pnfs_nfs: fix _cancel_empty_pagelist
  nfs4: Fix potential use after free of state in nfs4_do_reclaim.
  NFS: Fix up O_DIRECT results
  NFS/pnfs: handle bad delegation stateids in nfs4_layoutget_handle_exception
  NFSv4.1/pnfs: Add sparse lock annotations for pnfs_find_alloc_layout
  NFSv4.1/pnfs: Layout stateids start out as being invalid
  NFSv4.1/pnfs: Ensure we handle delegation errors in nfs4_proc_layoutget()

Merge branch 'stable-4.7' of git://git.infradead.org/users/pcmoore/audit

Pull audit fixes from Paul Moore:
"Two small patches to fix audit problems in 4.7-rcX: the first fixes a
  potential kref leak, the second removes some header file noise.

  The first is an important bug fix that really should go in before 4.7
  is released, the second is not critical, but falls into the very-nice-
  to-have category so I'm including in the pull request.

  Both patches are straightforward, self-contained, and pass our
  testsuite without problem"

* 'stable-4.7' of git://git.infradead.org/users/pcmoore/audit:
  audit: move audit_get_tty to reduce scope and kabi changes
  audit: move calcs after alloc and check when logging set loginuid

reset: TRIVIAL: Add line break at same place for similar APIs

Standardise the way inline functions:

devm_reset_control_get_shared_by_index
devm_reset_control_get_exclusive_by_index

... are formatted.

Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>

reset: Supply *_shared variant calls when using *_optional APIs

Consumers need to be able to specify whether they are requesting an
'exclusive' or 'shared' reset line no matter which API (of_*, devm_*,
etc) they are using. This change allows users of the optional_* API
in particular to specify that their request is for a 'shared' line.

Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>

reset: Supply *_shared variant calls when using of_* API

Consumers need to be able to specify whether they are requesting an
'exclusive' or 'shared' reset line no matter which API (of_*, devm_*,
etc) they are using. This change allows users of the of_* API in
particular to specify that their request is for a 'shared' line.

Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>

reset: Ensure drivers are explicit when requesting reset lines

Phasing out generic reset line requests enables us to make some better
decisions on when and how to (de)assert said lines.  If an 'exclusive'
line is requested, we know a device *requires* a reset and that it's
preferable to act upon a request right away.  However, if a 'shared'
reset line is requested, we can reasonably assume sure that placing a
device into reset isn't a hard requirement, but probably a measure to
save power and is thus able to cope with not being asserted if another
device is still in use.

In order allow gentle adoption and not to forcing all consumers to
move to the API immediately, causing administration headache between
subsystems, this patch adds some temporary stand-in shim-calls.  This
will ease the burden at merge time and allow subsystems to migrate over
to the new API in a more realistic time-frame.

Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>

reset: Reorder inline reset_control_get*() wrappers

We're about to split the current API into two, where consumers will
be forced to be explicit when requesting reset lines.  The choice
will be to either the call the *_exclusive or *_shared variant
depending on whether they can actually tolorate not being asserted
when that request is made.

The new API will look like this once reorded and complete:

  reset_control_get_exclusive()
  reset_control_get_shared()
  reset_control_get_optional_exclusive()
  reset_control_get_optional_shared()
  of_reset_control_get_exclusive()
  of_reset_control_get_shared()
  of_reset_control_get_exclusive_by_index()
  of_reset_control_get_shared_by_index()
  devm_reset_control_get_exclusive()
  devm_reset_control_get_shared()
  devm_reset_control_get_optional_exclusive()
  devm_reset_control_get_optional_shared()
  devm_reset_control_get_exclusive_by_index()
  devm_reset_control_get_shared_by_index()

Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:
"I've been traveling so this accumulates more than week or so of bug
  fixing.  It perhaps looks a little worse than it really is.

   1) Fix deadlock in ath10k driver, from Ben Greear.

   2) Increase scan timeout in iwlwifi, from Luca Coelho.

   3) Unbreak STP by properly reinjecting STP packets back into the
      stack.  Regression fix from Ido Schimmel.

   4) Mediatek driver fixes (missing malloc failure checks, leaking of
      scratch memory, wrong indexing when mapping TX buffers, etc.) from
      John Crispin.

   5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic
      Sowa.

   6) Fix hashing of flows in UDP in the ruseport case, from Xuemin Su.

   7) Fix netlink notifications in ovs for tunnels, delete link messages
      are never emitted because of how the device registry state is
      handled.  From Nicolas Dichtel.

   8) Conntrack module leaks kmemcache on unload, from Florian Westphal.

   9) Prevent endless jump loops in nft rules, from Liping Zhang and
      Pablo Neira Ayuso.

  10) Not early enough spinlock initialization in mlx4, from Eric
      Dumazet.

  11) Bind refcount leak in act_ipt, from Cong WANG.

  12) Missing RCU locking in HTB scheduler, from Florian Westphal.

  13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU
      barrier, using heap for SG and IV, and erroneous use of async flag
      when allocating AEAD conext.)

  14) RCU handling fix in TIPC, from Ying Xue.

  15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in
      SIT driver, from Simon Horman.

  16) Socket timer deadlock fix in TIPC from Jon Paul Maloy.

  17) Fix potential deadlock in team enslave, from Ido Schimmel.

  18) Memory leak in KCM procfs handling, from Jiri Slaby.

  19) ESN generation fix in ipv4 ESP, from Herbert Xu.

  20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong
      WANG.

  21) Use after free in netem, from Eric Dumazet.

  22) Uninitialized last assert time in multicast router code, from Tom
      Goff.

  23) Skip raw sockets in sock_diag destruction broadcast, from Willem
      de Bruijn.

  24) Fix link status reporting in thunderx, from Sunil Goutham.

  25) Limit resegmentation of retransmit queue so that we do not
      retransmit too large GSO frames.  From Eric Dumazet.

  26) Delay bpf program release after grace period, from Daniel
      Borkmann"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits)
  openvswitch: fix conntrack netlink event delivery
  qed: Protect the doorbell BAR with the write barriers.
  neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()
  e1000e: keep VLAN interfaces functional after rxvlan off
  cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header
  qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()
  bpf, perf: delay release of BPF prog after grace period
  net: bridge: fix vlan stats continue counter
  tcp: do not send too big packets at retransmit time
  ibmvnic: fix to use list_for_each_safe() when delete items
  net: thunderx: Fix TL4 configuration for secondary Qsets
  net: thunderx: Fix link status reporting
  net/mlx5e: Reorganize ethtool statistics
  net/mlx5e: Fix number of PFC counters reported to ethtool
  net/mlx5e: Prevent adding the same vxlan port
  net/mlx5e: Check for BlueFlame capability before allocating SQ uar
  net/mlx5e: Change enum to better reflect usage
  net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices
  net/mlx5: Update command strings
  net: marvell: Add separate config ANEG function for Marvell 88E1111
  ...

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 fixes from Martin Schwidefsky:
"Another two bug fixes for 4.7:

   - The revert of patch which removed boot information for systems
     using an intermediate boot kernel, e.g. the SLES12 grub setup.

   - A fix for an incorrect inline assembly constraint that causes
     broken code to be generated with gcc 4.8.5"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390: fix test_fp_ctl inline assembly contraints
  Revert "s390/kdump: Clear subchannel ID to signal non-CCW/SCSI IPL"

Merge tag 'pinctrl-v4.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:
"Here are a bunch of fixes for pin control.  Just drivers and a
  MAINTAINERS fixup:

   - Driver fixes for i.MX, single register, Tegra and BayTrail.

   - MAINTAINERS entry for the documentation"

* tag 'pinctrl-v4.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: baytrail: Fix mingled clock pins
  MAINTAINERS: belong Documentation/pinctrl.txt properly
  pinctrl: tegra: Fix build dependency
  gpio: tegra: Make lockdep class file-scoped
  pinctrl: single: Fix missing flush of posted write for a wakeirq
  pinctrl: imx: Do not treat a PIN without MUX register as an error

Merge branch 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:
"Three fix patches.  Two are for cgroup / css init failure path.  The
  last one makes css_set_lock irq-safe as the deadline scheduler ends up
  calling put_css_set() from irq context"

* 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Disable IRQs while holding css_set_lock
  cgroup: set css->id to -1 during init
  cgroup: remove redundant cleanup in css_create

Merge tag 'mac80211-for-davem-2016-06-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just two small fixes
* fix mesh peer link counter, decrement wasn't always done at all
* fix ethertype (length) for packets without RFC 1042 or bridge
tunnel header
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: fix conntrack netlink event delivery

Only the first and last netlink message for a particular conntrack are
actually sent. The first message is sent through nf_conntrack_confirm when
the conntrack is committed. The last one is sent when the conntrack is
destroyed on timeout. The other conntrack state change messages are not
advertised.

When the conntrack subsystem is used from netfilter, nf_conntrack_confirm
is called for each packet, from the postrouting hook, which in turn calls
nf_ct_deliver_cached_events to send the state change netlink messages.

This commit fixes the problem by calling nf_ct_deliver_cached_events in the
non-commit case as well.

Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
CC: Joe Stringer <joestringer@nicira.com>
CC: Justin Pettit <jpettit@nicira.com>
CC: Andy Zhou <azhou@nicira.com>
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Samuel Gauthier <samuel.gauthier@6wind.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Protect the doorbell BAR with the write barriers.

SPQ doorbell is currently protected with the compilation barrier. Under the
stress scenarios, we may get into a state where (due to the weak ordering)
several ramrod doorbells were written to the BAR with an out-of-order
producer values. Need to change the barrier type to a write barrier to make
sure that the write buffer is flushed after each doorbell.

Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()

neigh_xmit() expects to be called inside an RCU-bh read side critical
section, and while one of its two current callers gets this right, the
other one doesn't.

More specifically, neigh_xmit() has two callers, mpls_forward() and
mpls_output(), and while both callers call neigh_xmit() under
rcu_read_lock(), this provides sufficient protection for neigh_xmit()
only in the case of mpls_forward(), as that is always called from
softirq context and therefore doesn't need explicit BH protection,
while mpls_output() can be called from process context with softirqs
enabled.

When mpls_output() is called from process context, with softirqs
enabled, we can be preempted by a softirq at any time, and RCU-bh
considers the completion of a softirq as signaling the end of any
pending read-side critical sections, so if we do get a softirq
while we are in the part of neigh_xmit() that expects to be run inside
an RCU-bh read side critical section, we can end up with an unexpected
RCU grace period running right in the middle of that critical section,
making things go boom.

This patch fixes this impedance mismatch in the callee, by making
neigh_xmit() always take rcu_read_{,un}lock_bh() around the code that
expects to be treated as an RCU-bh read side critical section, as this
seems a safer option than fixing it in the callers.

Fixes: 4fd3d7d9e868f ("neigh: Add helper function neigh_xmit")
Signed-off-by: David Barroso <dbarroso@fastly.com>
Signed-off-by: Lennert Buytenhek <lbuytenhek@fastly.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1000e: keep VLAN interfaces functional after rxvlan off

I've got a bug report about an e1000e interface, where a VLAN interface is
set up on top of it:

$ ip link add link ens1f0 name ens1f0.99 type vlan id 99
$ ip link set ens1f0 up
$ ip link set ens1f0.99 up
$ ip addr add 192.168.99.92 dev ens1f0.99

At this point, I can ping another host on vlan 99, ip 192.168.99.91.
However, if I do the following:

$ ethtool -K ens1f0 rxvlan off

Then no traffic passes on ens1f0.99. It comes back if I toggle rxvlan on
again. I'm not sure if this is actually intended behavior, or if there's a
lack of software VLAN stripping fallback, or what, but things continue to
work if I simply don't call e1000e_vlan_strip_disable() if there are
active VLANs (plagiarizing a function from the e1000 driver here) on the
interface.

Also slipped a related-ish fix to the kerneldoc text for
e1000e_vlan_strip_disable here...

Signed-off-by: Jarod Wilson <jarod@redhat.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header

The PDU length of incoming LLC frames is set to the total skb payload size
in __ieee80211_data_to_8023() of net/wireless/util.c which incorrectly
includes the length of the IEEE 802.11 header.

The resulting LLC frame header has a too large PDU length, causing the
llc_fixup_skb() function of net/llc/llc_input.c to reject the incoming
skb, effectively breaking STP.

Solve the problem by properly substracting the IEEE 802.11 frame header size
from the PDU length, allowing the LLC processor to pick up the incoming
control messages.

Special thanks to Gerry Rozema for tracking down the regression and proposing
a suitable patch.

Fixes: 2d1c304cb2d5 ("cfg80211: add function for 802.3 conversion with separate output buffer")
Cc: stable@vger.kernel.org
Reported-by: Gerry Rozema <gerryr@rozeware.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>

qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()

There is a static checker warning here "warn: mask and shift to zero"
and the code sets "ring" to zero every time. From looking at how
QLCNIC_FETCH_RING_ID() is used in qlcnic_83xx_process_rcv_ring() the
qlcnic_83xx_hndl() should be removed.

Fixes: 4be41e92f7c6 ('qlcnic: 83xx data path routines')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf, perf: delay release of BPF prog after grace period

Commit dead9f29ddcc ("perf: Fix race in BPF program unregister") moved
destruction of BPF program from free_event_rcu() callback to __free_event(),
which is problematic if used with tail calls: if prog A is attached as
trace event directly, but at the same time present in a tail call map used
by another trace event program elsewhere, then we need to delay destruction
via RCU grace period since it can still be in use by the program doing the
tail call (the prog first needs to be dropped from the tail call map, then
trace event with prog A attached destroyed, so we get immediate destruction).

Fixes: dead9f29ddcc ("perf: Fix race in BPF program unregister")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Jann Horn <jann@thejh.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: fix vlan stats continue counter

I made a dumb off-by-one mistake when I added the vlan stats counter
dumping code. The increment should happen before the check, not after
otherwise we miss one entry when we continue dumping.

Fixes: a60c090361ea ("bridge: netlink: export per-vlan stats")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: do not send too big packets at retransmit time

Arjun reported a bug in TCP stack and bisected it to a recent commit.

In case where we process SACK, we can coalesce multiple skbs
into fat ones (tcp_shift_skb_data()), to lower write queue
overhead, because we do not expect to retransmit these packets.

However, SACK reneging can happen, forcing the sender to retransmit
all these packets. If skb->len is above 64KB, we then send buggy
IP packets that could hang TSO engine on cxgb4.

Neal suggested to use tcp_tso_autosize() instead of tp->gso_segs
so that we cook packets of optimal size vs TCP/pacing.

Thanks to Arjun for reporting the bug and running the tests !

Fixes: 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Arjun V <arjun@chelsio.com>
Tested-by: Arjun V <arjun@chelsio.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmvnic: fix to use list_for_each_safe() when delete items

Since we will remove items off the list using list_del() we need
to use a safe version of the list_for_each() macro aptly named
list_for_each_safe().

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>