git.karo-electronics.de Git - karo-tx-linux.git/log

ocfs2: use 64bit variables to track heartbeat time

o2hb_elapsed_msecs computes the time taken for a disk heartbeat.  'struct
timeval' variables are used to store start and end times.  On 32-bit
systems, the 'tv_sec' component of 'struct timeval' will overflow in year
2038 and beyond.

This patch solves the overflow with the following:

1. Replace o2hb_elapsed_msecs using 'ktime_t' values to measure start
   and end time, and built-in function 'ktime_ms_delta' to compute the
   elapsed time.  ktime_get_real() is used since the code prints out the
   wallclock time.

2. Changes format string to print time as a single 64-bit nanoseconds
   value ("%lld") instead of seconds and microseconds.  This simplifies
   the code since converting ktime_t to that format would need expensive
   computation.  However, the debug log string is less readable than the
   previous format.

Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Suggested by: Arnd Bergmann <arnd@arndb.de>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: fix a tiny case that inode can not removed.

When running dirop_fileop_racer we found a case that inode
can not removed.

2 nodes, say Node A and Node B, mount the same ocfs2 volume. Create
two dirs /race/1/ and /race/2/ in the filesystem.

Node A                            Node B
rm -r /race/2/
                                  mv /race/1/ /race/2/
call ocfs2_unlink(), get
the EX mode of /race/2/
                                  wait for B unlock /race/2/
decrease i_nlink of /race/2/ to 0,
and add inode of /race/2/ into
orphan dir, unlock /race/2/
                                  got EX mode of /race/2/. because
                                  /race/1/ is dir, so inc i_nlink
                                  of /race/2/ and update into disk,
                                  unlock /race/2/
because i_nlink of /race/2/
is not zero, this inode will
always remain in orphan dir

This patch fixes this case by test whether i_nlink of new dir is zero.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: clear the rest of the buffers on error

In case a validation fails, clear the rest of the buffers and return the
error to the calling function.

This also facilitates bubbling up the error originating from ocfs2_error
to calling functions.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: acknowledge return value of ocfs2_error()

Caveat: This may return -EROFS for a read case, which seems wrong. This
is happening even without this patch series though. Should we convert
EROFS to EIO?

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: add errors=continue

OCFS2 is often used in high-availaibility systems.  However, ocfs2
converts the filesystem to read-only at the drop of the hat.  This may not
be necessary, since turning the filesystem read-only would affect other
running processes as well, decreasing availability.

This attempt is to add errors=continue, which would return the EIO to the
calling process and terminate furhter processing so that the filesystem is
not corrupted further.  However, the filesystem is not converted to
read-only.

As a future plan, I intend to create a small utility or extend fsck.ocfs2
to fix small errors such as in the inode.  The input to the utility such
as the inode can come from the kernel logs so we don't have to schedule a
downtime for fixing small-enough errors.

The patch changes the ocfs2_error to return an error.  The error returned
depends on the mount option set.  If none is set, the default is to turn
the filesystem read-only.

Perhaps errors=continue is not the best option name.  Historically it is
used for making an attempt to progress in the current process itself.
Should we call it errors=eio?  or errors=killproc?  Suggestions/Comments
welcome.

Sources are available at:
https://github.com/goldwynr/linux/tree/error-cont

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: flush inode data to disk and free inode when i_count becomes zero

Disk inode deletion may be heavily delayed when one node unlink a file
after the same dentry is freed on another node(say N1) because of memory
shrink but inode is left in memory.  This inode can only be freed while N1
doing the orphan scan work.

However, N1 may skip orphan scan for several times because other nodes may
do the work earlier.  In our tests, it may take 1 hour on 4 nodes cluster
and it hurts the user experience.  So we think the inode should be freed
after the data flushed to disk when i_count becomes zero to avoid such
circumstances.

Signed-off-by: Joyce.xue <xuejiufei@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: trusted xattr missing CAP_SYS_ADMIN check

The trusted extended attributes are only visible to the process which hvae
CAP_SYS_ADMIN capability but the check is missing in ocfs2 xattr_handler
trusted list. The check is important because this will be used for
implementing mechanisms in the userspace for which other ordinary
processes should not have access to.

Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Taesoo kim <taesoo@gatech.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: double evaluation concerns in mlog_errno()

It won't happen in real life, but for consistency etc then we should
only evaluate the "st" parameter once.

Also, since not all callers use the new return, it causes at static
checker warning:
fs/ocfs2/export.c:138 ocfs2_get_dentry() warn: unchecked 'PTR_ERR'

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: make mlog_errno return the errno

ocfs2 does

mlog_errno(v);
return v;

in many places.  Change mlog_errno() so we can do

return mlog_errno(v);

For some weird reason this patch reduces the size of ocfs2 by 6k:

akpm3:/usr/src/25> size fs/ocfs2/ocfs2.ko
   text    data     bss     dec     hex filename
1146613   82767  832192 2061572  1f7504 fs/ocfs2/ocfs2.ko-before
1140857   82767  832192 2055816  1f5e88 fs/ocfs2/ocfs2.ko-after

Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: check if the ocfs2 lock resource has been initialized before calling ocfs2_dlm_lock

If ocfs2 lockres has not been initialized before calling ocfs2_dlm_lock,
the lock won't be dropped and then will lead umount hung.  The case is
described below:

ocfs2_mknod
    ocfs2_mknod_locked
        __ocfs2_mknod_locked
            ocfs2_journal_access_di
            Failed because of -ENOMEM or other reasons, the inode lockres
            has not been initialized yet.

    iput(inode)
        ocfs2_evict_inode
            ocfs2_delete_inode
                ocfs2_inode_lock
                    ocfs2_inode_lock_full_nested
                        __ocfs2_cluster_lock
                        Succeeds and allocates a new dlm lockres.
            ocfs2_clear_inode
                ocfs2_open_unlock
                    ocfs2_drop_inode_locks
                        ocfs2_drop_lock
                        Since lockres has not been initialized, the lock
                        can't be dropped and the lockres can't be
                        migrated, thus umount will hang forever.

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: logging: remove static buffer, use vsprintf extension %pV

Use the vsprintf %pV extension to avoid using a static buffer and remove
the now unnecessary buffer.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: incorrect check for debugfs returns

debugfs_create_dir and debugfs_create_file may return -ENODEV when debugfs
is not configured, so the return value should be checked against
ERROR_VALUE as well, otherwise the later dereference of the dentry pointer
would crash the kernel.

This patch tries to solve this problem by fixing certain checks. However,
I have that found other call sites are protected by #ifdef CONFIG_DEBUG_FS.
In current implementation, if CONFIG_DEBUG_FS is defined, then the above
two functions will never return any ERROR_VALUE. So another possibility
to fix this is to surround all the buggy checks/functions with the same
#ifdef CONFIG_DEBUG_FS. But I'm not sure if this would break any functionality,
as only OCFS2_FS_STATS declares dependency on DEBUG_FS.

Signed-off-by: Chengyu Song <csong84@gatech.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: fix a typo in the copyright statement

Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
Reviewed-by: Eric Ren <zren@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: fix possible uninitialized variable access

In ocfs2_local_alloc_find_clear_bits and ocfs2_get_dentry, variable
numfound and set may be uninitialized and then used in tracepoint. In
ocfs2_xattr_block_get and ocfs2_delete_xattr_in_bucket, variable block_off
and xv may be uninitialized and then used in the following logic due to
unchecked return value.

This patch fixes these possible issues.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: remove goto statement in ocfs2_check_dir_for_entry()

Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: rollback the cleared bits if error occurs after ocfs2_block_group_clear_bits

ocfs2_block_group_clear_bits will clear bits in block group bitmap.
Once it succeeds but fails in the following step, it will cause block
group bitmap mismatch the corresponding count recorded in dinode.
So rollback the cleared bits if error occurs.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: use ENOENT instead of EEXIST when get system file fails

When ocfs2_get_system_file_inode fails, it is obscure to set the return
value to -EEXIST. So change it to -ENOENT.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: use actual name length when find entry in ocfs2_orphan_del()

If the namelen is 20 and name only has actual length 16, it will fail in
ocfs2_find_entry because of mismatch. So use actual name length when find
entry.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: dereferencing freed pointers in ocfs2_reflink()

The code at the "out" label assumes that "default_acl" and "acl" are NULL,
but actually the pointers can be NULL, unitialized, or freed.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: fix typo in ocfs2_reserve_local_alloc_bits

In ocfs2_reserve_local_alloc_bits, it calls ocfs2_error if local alloc
inode bitmap used bits mismatch, but the log mistakes it as free bits.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: do not use ocfs2_zero_extend during direct IO

In ocfs2_direct_IO_write, we use ocfs2_zero_extend to zero allocated
clusters in case of cluster not aligned. But ocfs2_zero_extend uses page
cache, this may happen that it clears the data which blockdev_direct_IO
has already written.

We should use blkdev_issue_zeroout instead of ocfs2_zero_extend during
direct IO.

So fix this issue by introducing ocfs2_direct_IO_zero_extend and
ocfs2_direct_IO_extend_no_holes.

Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Tested-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: take inode lock when get clusters

We need take inode lock when calling ocfs2_get_clusters.
And use GFP_NOFS instead of GFP_KERNEL.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: no need get dinode bh when zeroing extend

Since di_bh won't be used when zeroing extend, set it to NULL.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: fix a typing error in ocfs2_direct_IO_write

Only when direct IO succeeds we need consider zeroing out in case of
cluster not aligned.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: avoid a pointless delay in o2cb_cluster_check()

Fix an off-by-one when attempting to avoid an msleep() on the final loop
iteration.

Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: one function call less in user_cluster_connect() after error detection

kfree() was called by user_cluster_connect() even if a previous call of
the kzalloc() function failed.

Return from this implementation directly after failure detection.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: one function call less in ocfs2_init_slot_info() after error detection

__ocfs2_free_slot_info() was called by ocfs2_init_slot_info() even if a
call of the kzalloc() function failed.

Return from this implementation directly after corresponding
exception handling.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: one function call less in ocfs2_merge_rec_right() after error detection

ocfs2_free_path() was called by ocfs2_merge_rec_right() even if a call of
the ocfs2_get_right_path() function failed.

Return from this implementation directly after corresponding
exception handling.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: one function call less in ocfs2_merge_rec_left() after error detection

ocfs2_free_path() was called by ocfs2_merge_rec_left() even if a call of
the ocfs2_get_left_path() function failed.

Return from this implementation directly after corresponding
exception handling.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: less function calls in ocfs2_figure_merge_contig_type() after error detection

ocfs2_free_path() was called in some cases by
ocfs2_figure_merge_contig_type() during error handling even if the passed
variables "left_path" and "right_path" contained still a null pointer.

Corresponding implementation details could be improved by adjustments for
jump labels according to the current Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: less function calls in ocfs2_convert_inline_data_to_extents() after error detection

kfree() was called in a few cases by
ocfs2_convert_inline_data_to_extents() during error handling even if the
passed variable "pages" contained a null pointer.

* Return from this implementation directly after failure detection for
the function call "kcalloc".

* Corresponding details could be improved by the introduction of another
jump label.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: delete unnecessary checks before three function calls

kfree(), ocfs2_free_path() and __ocfs2_free_slot_info() test whether their
argument is NULL and then return immediately. Thus the test around their
calls is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

jbd2: revert must-not-fail allocation loops back to GFP_NOFAIL

This basically reverts 47def82672b3 ("jbd2: Remove __GFP_NOFAIL from jbd2
layer").  The deprecation of __GFP_NOFAIL was a bad choice because it led
to open coding the endless loop around the allocator rather than removing
the dependency on the non failing allocation.  So the deprecation was a
clear failure and the reality tells us that __GFP_NOFAIL is not even close
to go away.

It is still true that __GFP_NOFAIL allocations are generally discouraged
and new uses should be evaluated and an alternative (pre-allocations or
reservations) should be considered but it doesn't make any sense to lie
the allocator about the requirements.  Allocator can take steps to help
making a progress if it knows the requirements.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Vipul Pandya <vipul@chelsio.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/ext4/fsync.c: generic_file_fsync call based on barrier flag

generic_file_fsync has been updated to issue a flush for older
filesystems.

This patch tests for barrier flag in ext4 mount flags and calls the right
function.

Lukas said:

: Note that the actual generic_file_fsync change fixes a real bug in ext4
: where we would _not_ send a flush on sync if we have file system
: without journal.
:
: Ted, it would be useful to mention that in the commit description
: along with the commit id:
:
: ac13a829f6adb674015ab399594c089990104af7 fs/libfs.c: add generic
: data flush to fsync

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

arch/sh/kernel/dwarf.c: use mempool_create_slab_pool()

Mempools created for slab caches should use mempool_create_slab_pool().

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

arch/sh/kernel/dwarf.c: destroy mempools on cleanup

dwarf_reg_pool and dwarf_frame_pool are not properly destroyed when
cleaning up the dwarf unwinder. Destroy them with mempool_destroy().

Also mark dwarf_unwinder_cleanup() as __init.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mmotm: cxgb4-drop-__gfp_nofail-allocation-fix

Use kfree_skb instead of kfree because the allocation is done by
alloc_skb.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Hariprasad S <hariprasad@chelsio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cxgb4: drop __GFP_NOFAIL allocation

set_filter_wr is requesting __GFP_NOFAIL allocation although it can return
ENOMEM without any problems obviously (t4_l2t_set_switching does that
already).  So the non-failing requirement is too strong without any
obvious reason.  Drop __GFP_NOFAIL and reorganize the code to have the
failure paths easier.

The same applies to _c4iw_write_mem_dma_aligned which uses __GFP_NOFAIL
and then checks the return value and returns -ENOMEM on failure.  This
doesn't make any sense what so ever.  Either the allocation cannot fail or
it can.

del_filter_wr seems to be safe as well because the filter entry is not
marked as pending and the return value is propagated up the stack up to
c4iw_destroy_listen.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ptrace/x86: fix the TIF_FORCED_TF logic in handle_signal()

When the TIF_SINGLESTEP tracee dequeues a signal, handle_signal() clears
TIF_FORCED_TF and X86_EFLAGS_TF but leaves TIF_SINGLESTEP set.

If the tracer does PTRACE_SINGLESTEP again, enable_single_step() sets
X86_EFLAGS_TF but not TIF_FORCED_TF. This means that the subsequent
PTRACE_CONT doesn't not clear X86_EFLAGS_TF, and the tracee gets the wrong
SIGTRAP.

Test-case (needs -O2 to avoid prologue insns in signal handler):

#include <unistd.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/user.h>
#include <assert.h>
#include <stddef.h>

void handler(int n)
{
asm("nop");
}

int child(void)
{
assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
signal(SIGALRM, handler);
kill(getpid(), SIGALRM);
return 0x23;
}

void *getip(int pid)
{
return (void*)ptrace(PTRACE_PEEKUSER, pid,
offsetof(struct user, regs.rip), 0);
}

int main(void)
{
int pid, status;

pid = fork();
if (!pid)
return child();

assert(wait(&status) == pid);
assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGALRM);

assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0);
assert(wait(&status) == pid);
assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP);
assert((getip(pid) - (void*)handler) == 0);

assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0);
assert(wait(&status) == pid);
assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP);
assert((getip(pid) - (void*)handler) == 1);

assert(ptrace(PTRACE_CONT, pid, 0,0) == 0);
assert(wait(&status) == pid);
assert(WIFEXITED(status) && WEXITSTATUS(status) == 0x23);

return 0;
}

The last assert() fails because PTRACE_CONT wrongly triggers another
single-step and X86_EFLAGS_TF can't be cleared by debugger until the
tracee does sys_rt_sigreturn().

Change handle_signal() to do user_disable_single_step() if stepping, we do
not need to preserve TIF_SINGLESTEP because we are going to do
ptrace_notify(), and it is simply wrong to leak this bit.

While at it, change the comment to explain why we also need to clear TF
unconditionally after setup_rt_frame().

Note: in the longer term we should probably change setup_sigcontext() to
use get_flags() and then just remove this user_disable_single_step().
And, the state of TIF_FORCED_TF can be wrong after restore_sigcontext()
which can set/clear TF, this needs another fix.

Reported-by: Evan Teran <eteran@alum.rit.edu>
Reported-by: Pedro Alves <palves@redhat.com>
Tested-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: numa: disable change protection for vma(VM_HUGETLB)

Currently when a process accesses a hugetlb range protected with PROTNONE,
unexpected COWs are triggered, which finally puts the hugetlb subsystem
into a broken/uncontrollable state, where for example h->resv_huge_pages
is subtracted too much and wraps around to a very large number, and the
free hugepage pool is no longer maintainable.

This patch simply stops changing protection for vma(VM_HUGETLB) to fix the
problem. And this also allows us to avoid useless overhead of minor
faults.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

include/linux/dmapool.h: declare struct device

dmapool uses struct device in function arguments but relies on an implicit
inclusion to declare struct device causing warnings in some
configurations:

include/linux/dmapool.h:31:7: warning: 'struct device' declared inside parameter list

Fix this by adding a struct device declaration to the file.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: move zone lock to a different cache line than order-0 free page lists

Huang Ying reported the following problem due to commit 3484b2de9499 ("mm:
rearrange zone fields into read-only, page alloc, statistics and page
reclaim lines") from the Intel performance tests

    24b7e5819ad5cbef  3484b2de9499df23c4604a513b
    ----------------  --------------------------
             %stddev     %change         %stddev
                 \          |                \
        152288 \261  0%     -46.2%      81911 \261  0%  aim7.jobs-per-min
           237 \261  0%     +85.6%        440 \261  0%  aim7.time.elapsed_time
           237 \261  0%     +85.6%        440 \261  0%  aim7.time.elapsed_time.max
         25026 \261  0%     +70.7%      42712 \261  0%  aim7.time.system_time
       2186645 \261  5%     +32.0%    2885949 \261  4%  aim7.time.voluntary_context_switches
       4576561 \261  1%     +24.9%    5715773 \261  0%  aim7.time.involuntary_context_switches

The problem is specific to very large machines under stress.  It was not
reproducible with the machines I had used to justify the original patch
because large numbers of CPUs are required.  When pressure is high enough,
the cache line is bouncing between CPUs trying to acquire the lock and the
holder of the lock adjusting free lists.  The intention was that the
acquirer of the lock would automatically have the cache line holding the
free lists but according to Huang, this is not a universal win.

One possibility is to move the zone lock to its own cache line but it
increases the size of the zone.  This patch moves the lock to the other
end of the free lists where they do not contend under high pressure.  It
does mean the page allocator paths now require more cache lines but Huang
reports that it restores performance to previous levels on large machines

             %stddev     %change         %stddev
                 \          |                \
         84568 \261  1%     +94.3%     164280 \261  1%  aim7.jobs-per-min
       2881944 \261  2%     -35.1%    1870386 \261  8%  aim7.time.voluntary_context_switches
           681 \261  1%      -3.4%        658 \261  0%  aim7.time.user_time
       5538139 \261  0%     -12.1%    4867884 \261  0%  aim7.time.involuntary_context_switches
         44174 \261  1%     -46.0%      23848 \261  1%  aim7.time.system_time
           426 \261  1%     -48.4%        219 \261  1%  aim7.time.elapsed_time
           426 \261  1%     -48.4%        219 \261  1%  aim7.time.elapsed_time.max
           468 \261  1%     -43.1%        266 \261  2%  uptime.boot

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Huang Ying <ying.huang@intel.com>
Tested-by: Huang Ying <ying.huang@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Linux 4.0-rc7

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) In TCP, don't register an FRTO for cumulatively ACK'd data that was
    previously SACK'd, from Neal Cardwell.

2) Need to hold RNL mutex in ipv4 multicast code namespace cleanup,
    from Cong WANG.

3) Similarly we have to hold RNL mutex for fib_rules_unregister(), also
    from Cong WANG.

4) Revert and rework netns nsid allocation fix, from Nicolas Dichtel.

5) When we encapsulate for a tunnel device, skb->sk still points to the
    user socket.  So this leads to cases where we retraverse the
    ipv4/ipv6 output path with skb->sk being of some other address
    family (f.e. AF_PACKET).  This can cause things to crash since the
    ipv4 output path is dereferencing an AF_PACKET socket as if it were
    an ipv4 one.

    The short term fix for 'net' and -stable is to elide these socket
    checks once we've entered an encapsulation sequence by testing
    xmit_recursion.

    Longer term we have a better solution wherein we pass the tunnel's
    socket down through the output paths, but that is way too invasive
    for 'net' and -stable.

    From Hannes Frederic Sowa.

6) l2tp_init() failure path forgets to unregister per-net ops, from
    Cong WANG.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net/mlx4_core: Fix error message deprecation for ConnectX-2 cards
  net: dsa: fix filling routing table from OF description
  l2tp: unregister l2tp_net_ops on failure path
  mvneta: dont call mvneta_adjust_link() manually
  ipv6: protect skb->sk accesses from recursive dereference inside the stack
  netns: don't allocate an id for dead netns
  Revert "netns: don't clear nsid too early on removal"
  ip6mr: call del_timer_sync() in ip6mr_free_table()
  net: move fib_rules_unregister() under rtnl lock
  ipv4: take rtnl_lock and mark mrt table as freed on namespace cleanup
  tcp: fix FRTO undo on cumulative ACK of SACKed range
  xen-netfront: transmit fully GSO-sized packets

net/mlx4_core: Fix error message deprecation for ConnectX-2 cards

Commit 1daa4303b4ca ("net/mlx4_core: Deprecate error message at
ConnectX-2 cards startup to debug") did the deprecation only for port 1
of the card. Need to deprecate for port 2 as well.

Fixes: 1daa4303b4ca ("net/mlx4_core: Deprecate error message at ConnectX-2 cards startup to debug")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: fix filling routing table from OF description

According to description in 'include/net/dsa.h', in cascade switches
configurations where there are more than one interconnected devices,
'rtable' array in 'dsa_chip_data' structure is used to indicate which
port on this switch should be used to send packets to that are destined
for corresponding switch.

However, dsa_of_setup_routing_table() fills 'rtable' with port numbers
of the _target_ switch, but not current one.

This commit removes redundant devicetree parsing and adds needed port
number as a function argument. So dsa_of_setup_routing_table() now just
looks for target switch number by parsing parent of 'link' device node.

To remove possible misunderstandings with the way of determining target
switch number, a corresponding comment was added to the source code and
to the DSA device tree bindings documentation file.

This was tested on a custom board with two Marvell 88E6095 switches with
following corresponding routing tables: { -1, 10 } and { 8, -1 }.

Signed-off-by: Pavel Nakonechny <pavel.nakonechny@skitlab.ru>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:
"Updates for the input subsystem - two more tweaks for ALPS driver to
  work out kinks after splitting the touchpad, trackstick, and potential
  external PS/2 mouse into separate input devices.

  Changes to support ALPS SS4 devices (protocol V8) will be coming in
  4.1..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: alps - document stick behavior for protocol V2
  Input: alps - report V2 Dualpoint Stick events via the right evdev node
  Input: alps - report interleaved bare PS/2 packets via dev3

l2tp: unregister l2tp_net_ops on failure path

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mvneta: dont call mvneta_adjust_link() manually

mvneta_adjust_link() is a callback for of_phy_connect() and should
not be called directly. The result of calling it directly is as below:

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: protect skb->sk accesses from recursive dereference inside the stack

We should not consult skb->sk for output decisions in xmit recursion
levels > 0 in the stack. Otherwise local socket settings could influence
the result of e.g. tunnel encapsulation process.

ipv6 does not conform with this in three places:

1) ip6_fragment: we do consult ipv6_npinfo for frag_size

2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
loop the packet back to the local socket

3) ip6_skb_dst_mtu could query the settings from the user socket and
force a wrong MTU

Furthermore:
In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
PF_PACKET socket ontop of an IPv6-backed vxlan device.

Reuse xmit_recursion as we are currently only interested in protecting
tunnel devices.

Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Input: alps - document stick behavior for protocol V2

Document that protocol V2 uses standard (bare) PS/2 mouse packets for the
DualPoint stick.

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Acked-By: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: alps - report V2 Dualpoint Stick events via the right evdev node

On V2 devices the DualPoint Stick reports bare packets, these should be
reported via the "AlpsPS/2 ALPS DualPoint Stick" dev2 evdev node, which also
has the INPUT_PROP_POINTING_STICK propbit set.

Note that since there is no way to distinguish these packets from an external
PS/2 mouse (insofar as these laptops have an external PS/2 port) this means
that we will be reporting PS/2 mouse events via this evdev node too, as we've
been doing in kernel 3.19 and older.

This has been tested on a Dell Latitude D620 and a Dell Latitude E6400,
which both have a V2 touchpad + a DualPoint Stick which reports bare packets.

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: alps - report interleaved bare PS/2 packets via dev3

Bare packets should be reported via the same evdev device independent on
whether they are detected on the beginning of a packet or in the middle
of a packet.

This has been tested on a Dell Latitude E6400, where the DualPoint Stick
reports bare packets, which get reported via dev3 when the touchpad is
idle, and via dev2 when the touchpad and stick are used simultaneously.

This commit fixes this inconsistency by always reporting bare packets via
dev3. Note that since the come from a DualPoint Stick they really should be
reported via dev2, this gets fixed in a later commit.

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge tag 'usb-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here are some small USB fixes and new device ids for 4.0-rc6.  Nothing
  major, some xhci fixes for reported problems, and some usb-serial
  device ids.

  All have been in linux-next for a while"

* tag 'usb-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  USB: ftdi_sio: Use jtag quirk for SNAP Connect E10
  usb: isp1760: fix spin unlock in the error path of isp1760_udc_start
  usb: xhci: apply XHCI_AVOID_BEI quirk to all Intel xHCI controllers
  usb: xhci: handle Config Error Change (CEC) in xhci driver
  USB: keyspan_pda: add new device id
  USB: ftdi_sio: Added custom PID for Synapse Wireless product

Merge tag 'staging-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging driver fixes from Greg KH:
"Here are some staging driver fixes, well, really all just IIO driver
  fixes, for 4.0-rc6.  They fix issues that have been reported with
  these drivers.

  All of these patches have been in linux-next for a while"

* tag 'staging-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  iio: imu: Use iio_trigger_get for indio_dev->trig assignment
  iio: adc: vf610: use ADC clock within specification
  iio/adc/cc10001_adc.c: Fix !HAS_IOMEM build
  iio: core: Fix double free.
  iio:inv-mpu6050: Fix inconsistency for the scale channel
  staging: iio: dummy: Fix undefined symbol build error
  iio: inv_mpu6050: Clear timestamps fifo while resetting hardware fifo
  staging: iio: hmc5843: Set iio name property in sysfs
  iio: bmc150: change sampling frequency
  iio: fix drivers that check buffer->scan_mask

Merge tag 'tty-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial fixes from Greg KH:
"Here are 3 serial driver fixes for 4.0-rc6.  They fix some reported
  issues with the samsung and fsl_lpuart drivers.

  All have been in linux-next for a while"

* tag 'tty-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  tty: serial: fsl_lpuart: clear receive flag on FIFO flush
  tty: serial: fsl_lpuart: specify transmit FIFO size
  serial: samsung: Clear operation mode on UART shutdown

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input subsystem fixes from Dmitry Torokhov:
"A fix for ALPS driver for issue introduced in the latest update and a
  tweak for yet another Lenovo box in Synaptics.

  There will be more ALPS tweaks coming.."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: define INPUT_PROP_ACCELEROMETER behavior
  Input: synaptics - fix min-max quirk value for E440
  Input: synaptics - add quirk for Thinkpad E440
  Input: ALPS - fix max coordinates for v5 and v7 protocols
  Input: add MT_TOOL_PALM

Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block layer fix from Jens Axboe:
"Just one patch in this pull request, fixing a regression caused by a
'mathematically correct' change to lcm()"

* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix blk_stack_limits() regression due to lcm() change

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
"Misc fixes: a SYSRET single-stepping fix, a dmi-scan robustization
  fix, a reboot quirk and a kgdb fixlet"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kgdb/x86: Fix reporting of 'si' in kgdb on x86_64
  x86/asm/entry/64: Disable opportunistic SYSRET if regs->flags has TF set
  x86/reboot: Add ASRock Q1900DC-ITX mainboard reboot quirk
  MAINTAINERS: Change the x86 microcode loader maintainer
  firmware: dmi_scan: Prevent dmi_num integer overflow

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
"Two x86 Intel PMU constraint handling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix Haswell CYCLE_ACTIVITY.* counter constraints
perf/x86/intel: Filter branches for PEBS event

Merge tag 'devicetree-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/glikely/linux

Pull devicetree fix from Grant Likely:
"Simple bugfix for bad device tree data on the PA-Semi platform"

* tag 'devicetree-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/glikely/linux:
drivers/of: Add empty ranges quirk for PA-Semi

Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6

Pull CIFS fixes from Steve French:
"A set of small cifs fixes fixing a memory leak, kernel oops, and
  infinite loop (and some spotted by Coverity)"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  Fix warning
  Fix another dereference before null check warning
  CIFS: session servername can't be null
  Fix warning on impossible comparison
  Fix coverity warning
  Fix dereference before null check warning
  Don't ignore errors on encrypting password in SMBTcon
  Fix warning on uninitialized buftype
  cifs: potential memory leaks when parsing mnt opts
  cifs: fix use-after-free bug in find_writable_file
  cifs: smb2_clone_range() - exit on unhandled error

netns: don't allocate an id for dead netns

First, let's explain the problem.
Suppose you have an ipip interface that stands in the netns foo and its link
part in the netns bar (so the netns bar has an nsid into the netns foo).
Now, you remove the netns bar:
- the bar nsid into the netns foo is removed
- the netns exit method of ipip is called, thus our ipip iface is removed:
   => a netlink message is built in the netns foo to advertise this deletion
   => this netlink message requests an nsid for bar, thus a new nsid is
      allocated for bar and never removed.

This patch adds a check in peernet2id() so that an id cannot be allocated for
a netns which is currently destroyed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "netns: don't clear nsid too early on removal"

This reverts
commit 4217291e592d ("netns: don't clear nsid too early on removal").

This is not the right fix, it introduces races.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip6mr: call del_timer_sync() in ip6mr_free_table()

We need to wait for the flying timers, since we
are going to free the mrtable right after it.

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: move fib_rules_unregister() under rtnl lock

We have to hold rtnl lock for fib_rules_unregister()
otherwise the following race could happen:

fib_rules_unregister(): fib_nl_delrule():
... ...
... ops = lookup_rules_ops();
list_del_rcu(&ops->list);
list_for_each_entry(ops->rules) {
fib_rules_cleanup_ops(ops); ...
list_del_rcu(); list_del_rcu();
}

Note, net->rules_mod_lock is actually not needed at all,
either upper layer netns code or rtnl lock guarantees
we are safe.

Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: take rtnl_lock and mark mrt table as freed on namespace cleanup

This is the IPv4 part for commit 905a6f96a1b1
(ipv6: take rtnl_lock and mark mrt6 table as freed on namespace cleanup).

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
"One drm core fix, one exynos regression fix, two sets of radeon fixes
  (Alex was a bit behind last week), and two i915 fixes.

  Nothing too serious we seem to have calmed down i915 since last week"

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/radeon: fix wait in radeon_mn_invalidate_range_start
  drm/radeon: add extra check in radeon_ttm_tt_unpin_userptr
  drm: Exynos: Respect framebuffer pitch for FIMD/Mixer
  drm/i915: Reject the colorkey ioctls for primary and cursor planes
  drm/i915: Skip allocating shadow batch for 0-length batches
  drm/radeon: programm the VCE fw BAR as well
  drm/radeon: always dump the ring content if it's available
  radeon: Do not directly dereference pointers to BIOS area.
  drm/radeon/dpm: fix 120hz handling harder
  drm/edid: set ELD for firmware and debugfs override EDIDs

Merge tag 'irqchip-fixes-4.0-2' of git://git.infradead.org/users/jcooper/linux

Pull irqchip fixes from Jason Cooper:
"This is the second round of fixes for irqchip.  It contains some fixes
  found while the arm64 guys were writing the kvm gicv3 its emulation.

  GICv3 ITS:
    - Small batch of fixes discovered while writing the kvm ITS emulation"

* tag 'irqchip-fixes-4.0-2' of git://git.infradead.org/users/jcooper/linux:
  irqchip: gicv3-its: Use non-cacheable accesses when no shareability
  irqchip: gicv3-its: Fix PROP/PEND and BASE/CBASE confusion
  irqchip: gicv3-its: Fix device ID encoding
  irqchip: gicv3-its: Fix encoding of collection's target redistributor

Merge branch 'drm-fixes-4.0' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

Just two small fixes for radeon, both destined for stable.

* 'drm-fixes-4.0' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon: fix wait in radeon_mn_invalidate_range_start
drm/radeon: add extra check in radeon_ttm_tt_unpin_userptr

Merge branch 'exynos-drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos into drm-fixes

Fix display on issue to Exynos5250 based Snow(1366x768) board.

* 'exynos-drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos:
drm: Exynos: Respect framebuffer pitch for FIMD/Mixer

Merge tag 'drm-intel-fixes-2015-04-02' of git://anongit.freedesktop.org/drm-intel into drm-fixes

one oops fixes and a 0-length allocation fix from next backported.

* tag 'drm-intel-fixes-2015-04-02' of git://anongit.freedesktop.org/drm-intel:
drm/i915: Reject the colorkey ioctls for primary and cursor planes
drm/i915: Skip allocating shadow batch for 0-length batches

Merge tag 'topic/drm-fixes-2015-04-02' of git://anongit.freedesktop.org/drm-intel into drm-fixes

Here's a single drm core fix, cc: stable, that affects i915
users.

* tag 'topic/drm-fixes-2015-04-02' of git://anongit.freedesktop.org/drm-intel:
drm/edid: set ELD for firmware and debugfs override EDIDs

Merge tag 'stable/for-linus-4.0-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen regression fixes from David Vrabel:
"Fix two regressions in the balloon driver's use of memory hotplug when
  used in a PV guest"

* tag 'stable/for-linus-4.0-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/balloon: before adding hotplugged memory, set frames to invalid
  x86/xen: prepare p2m list for memory hotplug

Merge tag 'powerpc-4.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux

Pull powerpc fix from Michael Ellerman:
"Fix memory corruption by pnv_alloc_idle_core_states"

* tag 'powerpc-4.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
powerpc: fix memory corruption by pnv_alloc_idle_core_states

tcp: fix FRTO undo on cumulative ACK of SACKed range

On processing cumulative ACKs, the FRTO code was not checking the
SACKed bit, meaning that there could be a spurious FRTO undo on a
cumulative ACK of a previously SACKed skb.

The FRTO code should only consider a cumulative ACK to indicate that
an original/unretransmitted skb is newly ACKed if the skb was not yet
SACKed.

The effect of the spurious FRTO undo would typically be to make the
connection think that all previously-sent packets were in flight when
they really weren't, leading to a stall and an RTO.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: e33099f96d99c ("tcp: implement RFC5682 F-RTO")
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband

Pull infiniband/rdma fix from Roland Dreier:
"Fix for exploitable integer overflow in uverbs interface"

* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic

xen-netfront: transmit fully GSO-sized packets

xen-netfront limits transmitted skbs to be at most 44 segments in size. However,
GSO permits up to 65536 bytes, which means a maximum of 45 segments of 1448
bytes each. This slight reduction in the size of packets means a slight loss in
efficiency.

Since c/s 9ecd1a75d, xen-netfront sets gso_max_size to
    XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER,
where XEN_NETIF_MAX_TX_SIZE is 65535 bytes.

The calculation used by tcp_tso_autosize (and also tcp_xmit_size_goal since c/s
6c09fa09d) in determining when to split an skb into two is
    sk->sk_gso_max_size - 1 - MAX_TCP_HEADER.

So the maximum permitted size of an skb is calculated to be
    (XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER.

Intuitively, this looks like the wrong formula -- we don't need two TCP headers.
Instead, there is no need to deviate from the default gso_max_size of 65536 as
this already accommodates the size of the header.

Currently, the largest skb transmitted by netfront is 63712 bytes (44 segments
of 1448 bytes each), as observed via tcpdump. This patch makes netfront send
skbs of up to 65160 bytes (45 segments of 1448 bytes each).

Similarly, the maximum allowable mtu does not need to subtract MAX_TCP_HEADER as
it relates to the size of the whole packet, including the header.

Fixes: 9ecd1a75d977 ("xen-netfront: reduce gso_max_size to account for max TCP header")
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma

Pull dmaengine fixes from Vinod Koul:
"This time we have addition of caps for jz4740 which fixes intentional
  warning at boot.  Then we have memory leak issues in drivers using
  virt-dma by Peter on few drive"

* 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
  dmaengine: moxart-dma: Fix memory leak when stopping a running transfer
  dmaengine: bcm2835-dma: Fix memory leak when stopping a running transfer
  dmaengine: omap-dma: Fix memory leak when terminating running transfer
  dmaengine: edma: fix memory leak when terminating running transfers
  dmaengine: jz4740: Define capabilities

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Fix use-after-free with mac80211 RX A-MPDU reorder timer, from
    Johannes Berg.

2) iwlwifi leaks memory every module load/unload cycles, fix from Larry
    Finger.

3) Need to use for_each_netdev_safe() in rtnl_group_changelink()
    otherwise we can crash, from WANG Cong.

4) mlx4 driver does register_netdev() too early in the probe sequence,
    from Ido Shamay.

5) Don't allow router discovery hop limit to decrease the interface's
    hop limit, from D.S. Ljungmark.

6) tx_packets and tx_bytes improperly accounted for certain classes of
    USB network devices, fix from Ben Hutchings.

7) ip{6}mr_rules_init() mistakenly use plain kfree to release the ipmr
    tables in the error path, they must instead use ip{6}mr_free_table().
    Fix from WANG Cong.

8) cxgb4 doesn't properly quiesce all RX activity before unregistering
    the netdevice.  Fix from Hariprasad Shenai.

9) Fix hash corruptions in ipvlan driver, from Jiri Benc.

10) nla_memcpy(), like a real memcpy, should fully initialize the
    destination buffer, even if the source attribute is smaller.  Fix
    from Jiri Benc.

11) Fix wrong error code returned from iucv_sock_sendmsg().  We should
    use whatever sock_alloc_send_skb() put into 'err'.  From Eugene
    Crosser.

12) Fix slab object leak on module unload in TIPC, from Ying Xue.

13) Need a READ_ONCE() when reading the cached RX socket route in
    tcp_v{4,6}_early_demux().  From Michal Kubecek.

14) Still too many problems with TPC support in the ath9k driver, so
    disable it for now.  From Felix Fietkau.

15) When in AP mode the rtlwifi driver can leak DMA mappings, fix from
    Larry Finger.

16) Missing kzalloc() failure check in gs_usb CAN driver, from Colin Ian
    King.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  cxgb4: Fix to dump devlog, even if FW is crashed
  cxgb4: Firmware macro changes for fw verison 1.13.32.0
  bnx2x: Fix kdump when iommu=on
  bnx2x: Fix kdump on 4-port device
  mac80211: fix RX A-MPDU session reorder timer deletion
  MAINTAINERS: Update Intel Wired Ethernet Driver info
  tipc: fix a slab object leak
  net/usb/r8152: add device id for Lenovo TP USB 3.0 Ethernet
  af_iucv: fix AF_IUCV sendmsg() errno
  openvswitch: Return vport module ref before destruction
  netlink: pad nla_memcpy dest buffer with zeroes
  bonding: Bonding Overriding Configuration logic restored.
  ipvlan: fix check for IP addresses in control path
  ipvlan: do not use rcu operations for address list
  ipvlan: protect against concurrent link removal
  ipvlan: fix addr hash list corruption
  net: fec: setup right value for mdio hold time
  net: tcp6: fix double call of tcp_v6_fill_cb()
  cxgb4vf: Fix sparse warnings
  netns: don't clear nsid too early on removal
  ...

IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic

Properly verify that the resulting page aligned end address is larger
than both the start address and the length of the memory area requested.

Both the start and length arguments for ib_umem_get are controlled by
the user. A misbehaving user can provide values which will cause an
integer overflow when calculating the page aligned end address.

This overflow can cause also miscalculation of the number of pages
mapped, and additional logic issues.

Addresses: CVE-2014-8159
Cc: <stable@vger.kernel.org>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

perf/x86/intel: Fix Haswell CYCLE_ACTIVITY.* counter constraints

Some of the CYCLE_ACTIVITY.* events can only be scheduled on
counter 2. Due to a typo Haswell matched those with
INTEL_EVENT_CONSTRAINT, which lead to the events never
matching as the comparison does not expect anything
in the umask too. Fix the typo.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1425925222-32361-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

perf/x86/intel: Filter branches for PEBS event

For supporting Intel LBR branches filtering, Intel LBR sharing logic
mechanism is introduced from commit b36817e88630 ("perf/x86: Add Intel
LBR sharing logic"). It modifies __intel_shared_reg_get_constraints() to
config lbr_sel, which is finally used to set LBR_SELECT.

However, the intel_shared_regs_constraints() function is called after
intel_pebs_constraints(). The PEBS event will return immediately after
intel_pebs_constraints(). So it's impossible to filter branches for PEBS
events.

This patch moves intel_shared_regs_constraints() ahead of
intel_pebs_constraints().

We can safely do that because the intel_shared_regs_constraints() function
only returns empty constraint if its rejecting the event, otherwise it
returns NULL such that we continue calling intel_pebs_constraints() and
x86_get_event_constraint().

Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1427467105-9260-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

drm/radeon: fix wait in radeon_mn_invalidate_range_start

We need to wait for all fences, not just the exclusive one.

Signed-off-by: Christian König <christian.koenig@amd.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/radeon: add extra check in radeon_ttm_tt_unpin_userptr

We somehow try to free the SG table twice.

Bugs: https://bugs.freedesktop.org/show_bug.cgi?id=89734

Signed-off-by: Christian König <christian.koenig@amd.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm: Exynos: Respect framebuffer pitch for FIMD/Mixer

When performing a modeset, use the framebuffer pitch value to set FIMD
IMG_SIZE and Mixer SPAN registers. These are both defined as pitch - the
distance between contiguous lines (bytes for FIMD, pixels for mixer).

Fixes display on Snow (1366x768).

Signed-off-by: Daniel Stone <daniels@collabora.com>
Tested-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk>
Signed-off-by: Inki Dae <inki.dae@samsung.com>

kgdb/x86: Fix reporting of 'si' in kgdb on x86_64

This patch fixes an error in kgdb for x86_64 which would report
the value of dx when asked to give the value of si.

Signed-off-by: Steffen Liebergeld <steffen.liebergeld@kernkonzept.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

x86/asm/entry/64: Disable opportunistic SYSRET if regs->flags has TF set

When I wrote the opportunistic SYSRET code, I missed an important difference
between SYSRET and IRET.

Both instructions are capable of setting EFLAGS.TF, but they behave differently
when doing so:

- IRET will not issue a #DB trap after execution when it sets TF.
   This is critical -- otherwise you'd never be able to make forward progress when
   returning to userspace.

- SYSRET, on the other hand, will trap with #DB immediately after
   returning to CPL3, and the next instruction will never execute.

This breaks anything that opportunistically SYSRETs to a user
context with TF set.  For example, running this code with TF set
and a SIGTRAP handler loaded never gets past 'post_nop':

extern unsigned char post_nop[];
asm volatile ("pushfq\n\t"
      "popq %%r11\n\t"
      "nop\n\t"
      "post_nop:"
      : : "c" (post_nop) : "r11");

In my defense, I can't find this documented in the AMD or Intel manual.

Fix it by using IRET to restore TF.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2a23c6b8a9c4 ("x86_64, entry: Use sysret to return to userspace when possible")
Link: http://lkml.kernel.org/r/9472f1ca4c19a38ecda45bba9c91b7168135fcfa.1427923514.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

drm/i915: Reject the colorkey ioctls for primary and cursor planes

The legcy colorkey ioctls are only implemented for sprite planes, so
reject the ioctl for primary/cursor planes. If we want to support
colorkeying with these planes (assuming we have hw support of course)
we should just move ahead with the colorkey property conversion.

Testcase: kms_legacy_colorkey
Cc: Tommi Rantala <tt.rantala@gmail.com>
Cc: stable@vger.kernel.org
Reference: http://mid.gmane.org/CA+ydwtr+bCo7LJ44JFmUkVRx144UDFgOS+aJTfK6KHtvBDVuAw@mail.gmail.com
Reported-and-tested-by: Tommi Rantala <tt.rantala@gmail.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>

Merge tag 'wireless-drivers-for-davem-2015-04-01' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers

Kalle Valo says:

====================
iwlwifi:

* fix a memory leak, we leaked memory each time the module
was loaded.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'cxgb4-net'

Hariprasad Shenai says:

====================
cxgb4 FW macro changes for new FW

Fix to dump device log even in the case of firmware crash. Also
incorporates changes for new FW.

This patch series has been created against net tree and includes patches on
cxgb4 driver.

We have included all the maintainers of respective drivers. Kindly review the
change and let us know in case of any review comments.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Fix to dump devlog, even if FW is crashed

Add new Common Code routines to retrieve Firmware Device Log
parameters from PCIE_FW_PF[7]. The firmware initializes its Device Log very
early on and stores the parameters for its location/size in that register.
Using the parameters from the register allows us to access the Firmware
Device Log even when the firmware crashes very early on or we're not
attached to the firmware

Based on original work by Casey Leedom <leedom@chelsio.com>

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Firmware macro changes for fw verison 1.13.32.0

Adds new macro and few macro changes for fw version 1.13.32.0 also
changes version string in driver to match 1.13.32.0

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mac80211-for-davem-2015-04-01' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
This contains just a single fix for a crash I happened to randomly
run into today during testing. It's clearly been around for a while,
but is pretty hard to trigger, even when I tried explicitly (and
modified the code to make it more likely) it rarely did.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'iommu-fixes-v4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull IOMMU fixes from Joerg Roedel:
"This contains fixes for:

   - a VT-d issue where hardware domain-ids might be freed while still
     in use.

   - an ipmmu-vmsa issue where where the device-table was not zero
     terminated

   - unchecked register access issue in the arm-smmu driver"

* tag 'iommu-fixes-v4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/vt-d: Remove unused variable
  iommu: ipmmu-vmsa: Add terminating entry for ipmmu_of_ids
  iommu/vt-d: Detach domain *only* from attached iommus
  iommu/arm-smmu: fix ARM_SMMU_FEAT_TRANS_OPS condition

lguest: now needs PCI_DIRECT.

Since commit 8e7094694396 ("lguest: add a dummy PCI host bridge.")
lguest uses PCI, but it needs you to frob the ports directly.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull lazytime fixes from Ted Ts'o:
"This fixes a problem in the lazy time patches, which can cause
  frequently updated inods to never have their timestamps updated.

  These changes guarantee that no timestamp on disk will be stale by
  more than 24 hours"

* tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  fs: add dirtytime_expire_seconds sysctl
  fs: make sure the timestamps for lazytime inodes eventually get written

Merge branch 'for-4.0' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
"Two main issues:

   - We found that turning on pNFS by default (when it's configured at
     build time) was too aggressive, so we want to switch the default
     before the 4.0 release.

   - Recent client changes to increase open parallelism uncovered a
     serious bug lurking in the server's open code.

  Also fix a krb5/selinux regression.

  The rest is mainly smaller pNFS fixes"

* 'for-4.0' of git://linux-nfs.org/~bfields/linux:
  sunrpc: make debugfs file creation failure non-fatal
  nfsd: require an explicit option to enable pNFS
  NFSD: Fix bad update of layout in nfsd4_return_file_layout
  NFSD: Take care the return value from nfsd4_encode_stateid
  NFSD: Printk blocklayout length and offset as format 0x%llx
  nfsd: return correct lockowner when there is a race on hash insert
  nfsd: return correct openowner when there is a race to put one in the hash
  NFSD: Put exports after nfsd4_layout_verify fail
  NFSD: Error out when register_shrinker() fail
  NFSD: Take care the return value from nfsd4_decode_stateid
  NFSD: Check layout type when returning client layouts
  NFSD: restore trace event lost in mismerge

Merge branch 'bnx2'

Yuval Mintz says:

====================
bnx2x: kdump related fixes

This patch series aims to fix bnx2x driver issues when loading in kdump kernel.
Both issues fixed here would be fatal to the device, requiring full reset of
the system in order to recover, preventing the device from serving its purpose
in the kdump environment.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Fix kdump when iommu=on

When IOMM-vtd is active, once main kernel crashes unfinished DMAE transactions
will be blocked, putting the HW in an error state which will cause further
transactions to timeout.

Current employed logic uses wrong macros, causing the first function to be the
only function that cleanups that error state during its probe/load.

This patch allows all the functions to successfully re-load in kdump kernel.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>