Instead of (uid_t)0 use GLOBAL_ROOT_UID.
Instead of (gid_t)0 use GLOBAL_ROOT_GID.
Instead of (uid_t)-1 use INVALID_UID
Instead of (gid_t)-1 use INVALID_GID.
Instead of NOGROUP use INVALID_GID.
Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
nfs_common: Update the translation between nfsv3 acls linux posix acls
- Use kuid_t and kgit in struct nfsacl_encode_desc.
- Convert from kuids and kgids when generating on the wire values.
- Convert on the wire values to kuids and kgids when read.
- Modify cmp_acl_entry to be type safe comparison on posix acls.
Only acls with type ACL_USER and ACL_GROUP can appear more
than once and as such need to compare more than their tag.
- The e_id field is being removed from posix acls so don't initialize it.
Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
ncpfs: Support interacting with multiple user namespaces
ncpfs does not natively support uids and gids so this conversion was
simply a matter of updating the the type of the mounteduid, the uid
and the gid on the superblock. Fixing the ioctls that read them,
updating the mount option parser and the mount option printer.
Cc: Petr Vandrovec <petr@vandrovec.name> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
gfs2: Enable building with user namespaces enabled
Now that all of the necessary work has been done to push kuids and
kgids throughout gfs2 and to convert between kuids and kgids when
reading and writing the on disk structures it is safe to enable gfs2
when multiple user namespaces are enabled.
Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Where kuid_t values are compared use uid_eq and where kgid_t values
are compared use gid_eq. This is unfortunately necessary because
of the type safety that keeps someone from accidentally mixing
kuids and kgids with other types.
Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
gfs2: Remove the QUOTA_USER and QUOTA_GROUP defines
Remove the QUOTA_USER and QUOTA_GRUP defines. Remove
the last vestigal users of QUOTA_USER and QUOTA_GROUP.
Now that struct kqid is used throughout the gfs2 quota
code the need there is to use QUOTA_USER and QUOTA_GROUP
and the defines are just extraneous and confusing.
Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
gfs2: Store qd_id in struct gfs2_quota_data as a struct kqid
- Change qd_id in struct gfs2_qutoa_data to struct kqid.
- Remove the now unnecessary QDF_USER bit field in qd_flags.
- Propopoage this change through the code generally making
things simpler along the way.
Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Both qd_alloc and qd2offset perform the exact same computation
to get an index from a gfs2_quota_data. Make life a little
simpler and factor out this index computation.
Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
gfs2: Report quotas in the caller's user namespace.
When a quota is queried return the uid or the gid in the mapped into
the caller's user namespace. In addition perform the munged version
of the mapping so that instead of -1 a value that does not map is
reported as the overflowuid or the overflowgid.
Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In set_dqblk it is an error to look at fdq->d_id or fdq->d_flags.
Userspace quota applications do not set these fields when calling
quotactl(Q_XSETQLIM,...), and the kernel does not set those fields
when quota_setquota calls set_dqblk.
gfs2 never looks at fdq->d_id or fdq->d_flags after checking
to see if they match the id and type supplied to set_dqblk.
No other linux filesystem in set_dqblk looks at either fdq->d_id
or fdq->d_flags.
Therefore remove these bogus checks from gfs2 and allow normal
quota setting applications to work.
Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
ocfs2: Enable building with user namespaces enabled
Now that ocfs2 has been converted to store uids and gids in
kuid_t and kgid_t and all of the conversions have been added
to the appropriate places it is safe to allow building and
using ocfs2 with user namespace support enabled.
Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
ocfs2: convert between kuids and kgids and DLM locks
Convert between uid and gids stored in the on the wire format of dlm
locks aka struct ocfs2_meta_lvb and kuids and kgids stored in
inode->i_uid and inode->i_gid.
Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
coda: Allow coda to be built when user namespace support is enabled
Now that the coda kernel to userspace has been modified to convert
between kuids and kgids and uids and gids, and all internal
coda structures have be modified to store uids and gids as
kuids and kgids it is safe to allow code to be built with
user namespace support enabled.
Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
coda: Cache permisions in struct coda_inode_info in a kuid_t.
- Change c_uid in struct coda_indoe_info from a vuid_t to a kuid_t.
- Initialize c_uid to GLOBAL_ROOT_UID instead of 0.
- Use uid_eq to compare cached kuids.
Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
coda: Restrict coda messages to the initial user namespace
Remove the slight chance that uids and gids in coda messages will be
interpreted in the wrong user namespace.
- Only allow processes in the initial user namespace to open the coda
character device to communicate with coda filesystems.
- Explicitly convert the uids in the coda header into the initial user
namespace.
- In coda_vattr_to_attr make kuids and kgids from the initial user
namespace uids and gids in struct coda_vattr that just came from
userspace.
- In coda_iattr_to_vattr convert kuids and kgids into uids and gids
in the intial user namespace and store them in struct coda_vattr for
sending to coda userspace programs.
Nothing needs to be changed with mounts as coda does not support
being mounted in anything other than the initial user namespace.
Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
coda: Restrict coda messages to the initial pid namespace
Remove the slight chance that pids in coda messages will be
interpreted in the wrong pid namespace.
- Explicitly send all pids in coda messages in the initial pid
namespace.
- Only allow mounts from processes in the initial pid namespace.
- Only allow processes in the initial pid namespace to open the coda
character device to communicate with coda.
Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
afs: Support interacting with multiple user namespaces
Modify struct afs_file_status to store owner as a kuid_t and group as
a kgid_t.
In xdr_decode_AFSFetchStatus as owner is now a kuid_t and group is now
a kgid_t don't use the EXTRACT macro. Instead perform the work of
the extract macro explicitly. Read the value with ntohl and
convert it to the appropriate type with make_kuid or make_kgid.
Test if the value is different from what is stored in status and
update changed. Update the value in status.
In xdr_encode_AFS_StoreStatus call from_kuid or from_kgid as
we are computing the on the wire encoding.
Initialize uids with GLOBAL_ROOT_UID instead of 0.
Initialize gids with GLOBAL_ROOT_GID instead of 0.
Cc: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
While looking for kuid_t and kgid_t conversions I found this
structure that has never been used since it was added to the
kernel in 2007. The obvious for this structure to be used
is in xdr_encode_AFS_StoreStatus and that function uses a
small handful of local variables instead.
So remove the unnecessary structure to prevent confusion.
Cc: David Howells <dhowells@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
9p: Allow building 9p with user namespaces enabled.
Now that the uid_t -> kuid_t, gid_t -> kgid_t conversion
has been completed in 9p allow 9p to be built when user
namespaces are enabled.
Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
9p: Modify v9fs_get_fsgid_for_create to return a kgid
Modify v9fs_get_fsgid_for_create to return a kgid and modify all of
the variables that hold the result of v9fs_get_fsgid_for_create to be
of type kgid_t.
Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
9p: Modify struct v9fs_session_info to use a kuids and kgids
Change struct v9fs_session_info and the code that popluates it to use
kuids and kgids. When parsing the 9p mount options convert the
dfltuid, dflutgid, and the session uid from the current user namespace
into kuids and kgids. Modify V9FS_DEFUID and V9FS_DEFGUID to be kuid
and kgid values.
Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
9p: Modify struct 9p_fid to use a kuid_t not a uid_t
Change struct 9p_fid and it's associated functions to
use kuid_t's instead of uid_t.
Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
9p: Modify the stat structures to use kuid_t and kgid_t
9p has thre strucrtures that can encode inode stat information. Modify
all of those structures to contain kuid_t and kgid_t values. Modify
he wire encoders and decoders of those structures to use 'u' and 'g' instead of
'd' in the format string where uids and gids are present.
This results in all kuid and kgid conversion to and from on the wire values
being performed by the same code in protocol.c where the client is known
at the time of the conversion.
Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Modify the p9_client_rpc format specifiers of every function that
directly transmits a uid or a gid from 'd' to 'u' or 'g' as
appropriate.
Modify those same functions to take kuid_t and kgid_t parameters
instead of uid_t and gid_t parameters.
Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
9p: Add 'u' and 'g' format specifies for kuids and kgids
This allows concentrating all of the conversion to and from kuids and
kgids into the format needed by the 9p protocol into one location.
Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
ceph: Enable building when user namespaces are enabled.
Now that conversions happen from kuids and kgids when generating ceph
messages and conversion happen to kuids and kgids after receiving
celph messages, and all intermediate data structures store uids and
gids as type kuid_t and kgid_t it is safe to enable ceph with
user namespace support enabled.
Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
ceph: Convert struct ceph_mds_request to use kuid_t and kgid_t
Hold the uid and gid for a pending ceph mds request using the types
kuid_t and kgid_t. When a request message is finally created convert
the kuid_t and kgid_t values into uids and gids in the initial user
namespace.
Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
ceph: Translate between uid and gids in cap messages and kuids and kgids
- Make the uid and gid arguments of send_cap_msg() used to compose
ceph_mds_caps messages of type kuid_t and kgid_t.
- Pass inode->i_uid and inode->i_gid in __send_cap to send_cap_msg()
through variables of type kuid_t and kgid_t.
- Modify struct ceph_cap_snap to store uids and gids in types kuid_t
and kgid_t. This allows capturing inode->i_uid and inode->i_gid in
ceph_queue_cap_snap() without loss and pssing them to
__ceph_flush_snaps() where they are removed from struct
ceph_cap_snap and passed to send_cap_msg().
- In handle_cap_grant translate uid and gids in the initial user
namespace stored in struct ceph_mds_cap into kuids and kgids
before setting inode->i_uid and inode->i_gid.
Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
ceph: Only allow mounts in the initial network namespace
Today ceph opens tcp sockets from a delayed work callback. Delayed
work happens from kernel threads which are always in the initial
network namespace. Therefore fail early if someone attempts
to mount a ceph filesystem from something other than the initial
network namespace.
Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There is no backing store to tmpfs and file creation rules are the
same as for any other filesystem so it is semantically safe to allow
unprivileged users to mount it. ramfs is safe for the same reasons so
allow either flavor of tmpfs to be mounted by a user namespace root
user.
The memory control group successfully limits how much memory tmpfs can
consume on any system that cares about a user namespace root using
tmpfs to exhaust memory the memory control group can be deployed.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There is no backing store to ramfs and file creation
rules are the same as for any other filesystem so
it is semantically safe to allow unprivileged users
to mount it.
The memory control group successfully limits how much
memory ramfs can consume on any system that cares about
a user namespace root using ramfs to exhaust memory
the memory control group can be deployed.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- The context in which devpts is mounted has no effect on the creation
of ptys as the /dev/ptmx interface has been used by unprivileged
users for many years.
- Only support unprivileged mounts in combination with the newinstance
option to ensure that mounting of /dev/pts in a user namespace will
not allow the options of an existing mount of devpts to be modified.
- Create /dev/pts/ptmx as the root user in the user namespace that
mounts devpts so that it's permissions to be changed.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In the help text describing user namespaces recommend use of memory
control groups. In many cases memory control groups are the only
mechanism there is to limit how much memory a user who can create
user namespaces can use.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
userns: Allow any uid or gid mappings that don't overlap.
When I initially wrote the code for /proc/<pid>/uid_map. I was lazy
and avoided duplicate mappings by the simple expedient of ensuring the
first number in a new extent was greater than any number in the
previous extent.
Unfortunately that precludes a number of valid mappings, and someone
noticed and complained. So use a simple check to ensure that ranges
in the mapping extents don't overlap.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When freeing a deeply nested user namespace free_user_ns calls
put_user_ns on it's parent which may in turn call free_user_ns again.
When -fno-optimize-sibling-calls is passed to gcc one stack frame per
user namespace is left on the stack, potentially overflowing the
kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
so we can't count on gcc to optimize this code.
Remove struct kref and use a plain atomic_t. Making the code more
flexible and easier to comprehend. Make the loop in free_user_ns
explict to guarantee that the stack does not overflow with
CONFIG_FRAME_POINTER enabled.
I have tested this fix with a simple program that uses unshare to
create a deeply nested user namespace structure and then calls exit.
With 1000 nesteuser namespaces before this change running my test
program causes the kernel to die a horrible death. With 10,000,000
nested user namespaces after this change my test program runs to
completion and causes no harm.
Li Zefan [Thu, 27 Dec 2012 03:39:12 +0000 (11:39 +0800)]
userns: Allow unprivileged reboot
In a container with its own pid namespace and user namespace, rebooting
the system won't reboot the host, but terminate all the processes in
it and thus have the container shutdown, so it's safe.
Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
With user namespaces enabled building f2fs fails with:
CC fs/f2fs/acl.o
fs/f2fs/acl.c: In function ‘f2fs_acl_from_disk’:
fs/f2fs/acl.c:85:21: error: ‘struct posix_acl_entry’ has no member named ‘e_id’
make[2]: *** [fs/f2fs/acl.o] Error 1
make[2]: Target `__build' not remade because of errors.
e_id is a backwards compatibility field only used for file systems
that haven't been converted to use kuids and kgids. When the posix
acl tag field is neither ACL_USER nor ACL_GROUP assigning e_id is
unnecessary. Remove the assignment so f2fs will build with user
namespaces enabled.
Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Amit Sahrawat <a.sahrawat@samsung.com> Acked-by: Jaegeuk Kim <jaegeuk.kim@samsung.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Oleg pointed out that in a pid namespace the sequence.
- pid 1 becomes a zombie
- setns(thepidns), fork,...
- reaping pid 1.
- The injected processes exiting.
Can lead to processes attempting access their child reaper and
instead following a stale pointer.
That waitpid for init can return before all of the processes in
the pid namespace have exited is also unfortunate.
Avoid these problems by disabling the allocation of new pids in a pid
namespace when init dies, instead of when the last process in a pid
namespace is reaped.
Linus Torvalds [Sat, 22 Dec 2012 01:10:29 +0000 (17:10 -0800)]
Merge git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
"This includes some fixes and code improvements (like
clk_prepare_enable and clk_disable_unprepare), conversion from the
omap_wdt and twl4030_wdt drivers to the watchdog framework, addition
of the SB8x0 chipset support and the DA9055 Watchdog driver and some
OF support for the davinci_wdt driver."
* git://www.linux-watchdog.org/linux-watchdog: (22 commits)
watchdog: mei: avoid oops in watchdog unregister code path
watchdog: Orion: Fix possible null-deference in orion_wdt_probe
watchdog: sp5100_tco: Add SB8x0 chipset support
watchdog: davinci_wdt: add OF support
watchdog: da9052: Fix invalid free of devm_ allocated data
watchdog: twl4030_wdt: Change TWL4030_MODULE_PM_RECEIVER to TWL_MODULE_PM_RECEIVER
watchdog: remove depends on CONFIG_EXPERIMENTAL
watchdog: Convert dev_printk(KERN_<LEVEL> to dev_<level>(
watchdog: DA9055 Watchdog driver
watchdog: omap_wdt: eliminate goto
watchdog: omap_wdt: delete redundant platform_set_drvdata() calls
watchdog: omap_wdt: convert to devm_ functions
watchdog: omap_wdt: convert to new watchdog core
watchdog: WatchDog Timer Driver Core: fix comment
watchdog: s3c2410_wdt: use clk_prepare_enable and clk_disable_unprepare
watchdog: imx2_wdt: Select the driver via ARCH_MXC
watchdog: cpu5wdt.c: add missing del_timer call
watchdog: hpwdt.c: Increase version string
watchdog: Convert twl4030_wdt to watchdog core
davinci_wdt: preparation for switch to common clock framework
...
Linus Torvalds [Sat, 22 Dec 2012 01:09:07 +0000 (17:09 -0800)]
Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
"Misc small cifs fixes"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs: eliminate cifsERROR variable
cifs: don't compare uniqueids in cifs_prime_dcache unless server inode numbers are in use
cifs: fix double-free of "string" in cifs_parse_mount_options
Linus Torvalds [Sat, 22 Dec 2012 01:08:06 +0000 (17:08 -0800)]
Merge tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm
Pull dm update from Alasdair G Kergon:
"Miscellaneous device-mapper fixes, cleanups and performance
improvements.
Of particular note:
- Disable broken WRITE SAME support in all targets except linear and
striped. Use it when kcopyd is zeroing blocks.
- Remove several mempools from targets by moving the data into the
bio's new front_pad area(which dm calls 'per_bio_data').
- Fix a race in thin provisioning if discards are misused.
- Prevent userspace from interfering with the ioctl parameters and
use kmalloc for the data buffer if it's small instead of vmalloc.
- Throttle some annoying error messages when I/O fails."
* tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (36 commits)
dm stripe: add WRITE SAME support
dm: remove map_info
dm snapshot: do not use map_context
dm thin: dont use map_context
dm raid1: dont use map_context
dm flakey: dont use map_context
dm raid1: rename read_record to bio_record
dm: move target request nr to dm_target_io
dm snapshot: use per_bio_data
dm verity: use per_bio_data
dm raid1: use per_bio_data
dm: introduce per_bio_data
dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
dm linear: add WRITE SAME support
dm: add WRITE SAME support
dm: prepare to support WRITE SAME
dm ioctl: use kmalloc if possible
dm ioctl: remove PF_MEMALLOC
dm persistent data: improve improve space map block alloc failure message
dm thin: use DMERR_LIMIT for errors
...
This is obviously wrong, and I have no idea how I missed seeing the
warning in testing: I must just not have looked at the right logs. The
caller bumps rq_resused/rq_next_page, so it will always be hit on a
large enough read.
Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 22 Dec 2012 00:40:26 +0000 (16:40 -0800)]
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull more infiniband changes from Roland Dreier:
"Second batch of InfiniBand/RDMA changes for 3.8:
- cxgb4 changes to fix lookup engine hash collisions
- mlx4 changes to make flow steering usable
- fix to IPoIB to avoid pinning dst reference for too long"
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
RDMA/cxgb4: Fix bug for active and passive LE hash collision path
RDMA/cxgb4: Fix LE hash collision bug for passive open connection
RDMA/cxgb4: Fix LE hash collision bug for active open connection
mlx4_core: Allow choosing flow steering mode
mlx4_core: Adjustments to Flow Steering activation logic for SR-IOV
mlx4_core: Fix error flow in the flow steering wrapper
mlx4_core: Add QPN enforcement for flow steering rules set by VFs
cxgb4: Add LE hash collision bug fix path in LLD driver
cxgb4: Add T4 filter support
IPoIB: Call skb_dst_drop() once skb is enqueued for sending
Linus Torvalds [Sat, 22 Dec 2012 00:39:08 +0000 (16:39 -0800)]
Merge tag 'asm-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic cleanup from Arnd Bergmann:
"These are a few cleanups for asm-generic:
- a set of patches from Lars-Peter Clausen to generalize asm/mmu.h
and use it in the architectures that don't need any special
handling.
- A patch from Will Deacon to remove the {read,write}s{b,w,l} as
discussed during the arm64 review
- A patch from James Hogan that helps with the meta architecture
series."
* tag 'asm-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
xtensa: Use generic asm/mmu.h for nommu
h8300: Use generic asm/mmu.h
c6x: Use generic asm/mmu.h
asm-generic/mmu.h: Add support for FDPIC
asm-generic/mmu.h: Remove unused vmlist field from mm_context_t
asm-generic: io: remove {read,write} string functions
asm-generic/io.h: remove asm/cacheflush.h include
Kukjin Kim [Fri, 21 Dec 2012 18:02:13 +0000 (10:02 -0800)]
ARM: dts: fix duplicated build target and alphabetical sort out for exynos
Commit db5b0ae00712 ("Merge tag 'dt' of git://git.kernel.org/.../arm-soc")
causes a duplicated build target. This patch fixes it and sorts out the
build target alphabetically so that we can recognize something wrong
easily.
Cc: Olof Johansson <olof@lixom.net> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:41 +0000 (20:23 +0000)]
dm snapshot: do not use map_context
Eliminate struct map_info from dm-snap.
map_info->ptr was used in dm-snap to indicate if the bio was tracked.
If map_info->ptr was non-NULL, the bio was linked in tracked_chunk_hash.
This patch removes the use of map_info->ptr. We determine if the bio was
tracked based on hlist_unhashed(&c->node). If hlist_unhashed is true,
the bio is not tracked, if it is false, the bio is tracked.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:40 +0000 (20:23 +0000)]
dm raid1: dont use map_context
Don't use map_info any more in dm-raid1.
map_info was used for writes to hold the region number. For this purpose
we add a new field dm_bio_details to dm_raid1_bio_record.
map_info was used for reads to hold a pointer to dm_raid1_bio_record (if
the pointer was non-NULL, bio details were saved; if the pointer was
NULL, bio details were not saved). We use
dm_raid1_bio_record.details->bi_bdev for this purpose. If bi_bdev is
NULL, details were not saved, if bi_bdev is non-NULL, details were
saved.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:38 +0000 (20:23 +0000)]
dm: introduce per_bio_data
Introduce a field per_bio_data_size in struct dm_target.
Targets can set this field in the constructor. If a target sets this
field to a non-zero value, "per_bio_data_size" bytes of auxiliary data
are allocated for each bio submitted to the target. These data can be
used for any purpose by the target and help us improve performance by
removing some per-target mempools.
Per-bio data is accessed with dm_per_bio_data. The
argument data_size must be the same as the value per_bio_data_size in
dm_target.
If the target has a pointer to per_bio_data, it can get a pointer to
the bio with dm_bio_from_per_bio_data() function (data_size must be the
same as the value passed to dm_per_bio_data).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:37 +0000 (20:23 +0000)]
dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
Add WRITE SAME support to dm-io and make it accessible to
dm_kcopyd_zero(). dm_kcopyd_zero() provides an asynchronous interface
whereas the blkdev_issue_write_same() interface is synchronous.
WRITE SAME is a SCSI command that can be leveraged for more efficient
zeroing of a specified logical extent of a device which supports it.
Only a single zeroed logical block is transfered to the target for each
WRITE SAME and the target then writes that same block across the
specified extent.
The dm thin target uses this.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:37 +0000 (20:23 +0000)]
dm: add WRITE SAME support
WRITE SAME bios have a payload that contain a single page. When
cloning WRITE SAME bios DM has no need to modify the bi_io_vec
attributes (and doing so would be detrimental). DM need only alter the
start and end of the WRITE SAME bio accordingly.
Rather than duplicate __clone_and_map_discard, factor out a common
function that is also used by __clone_and_map_write_same.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:36 +0000 (20:23 +0000)]
dm ioctl: use kmalloc if possible
If the parameter buffer is small enough, try to allocate it with kmalloc()
rather than vmalloc().
vmalloc is noticeably slower than kmalloc because it has to manipulate
page tables.
In my tests, on PA-RISC this patch speeds up activation 13 times.
On Opteron this patch speeds up activation by 5%.
This patch introduces a new function free_params() to free the
parameters and this uses new flags that record whether or not vmalloc()
was used and whether or not the input buffer must be wiped after use.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:34 +0000 (20:23 +0000)]
dm block manager: reinstate message when validator fails
Reinstate a useful error message when the block manager buffer validator fails.
This was mistakenly eliminated when the block manager was converted to use
dm-bufio.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Jonathan Brassow [Fri, 21 Dec 2012 20:23:33 +0000 (20:23 +0000)]
dm raid: round region_size to power of two
If the user does not supply a bitmap region_size to the dm raid target,
a reasonable size is computed automatically. If this is not a power of 2,
the md code will report an error later.
This patch catches the problem early and rounds the region_size to the
next power of two.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:33 +0000 (20:23 +0000)]
dm snapshot: optimize track_chunk
track_chunk is always called with interrupts enabled. Consequently, we
do not need to save and restore interrupt state in "flags" variable.
This patch changes spin_lock_irqsave to spin_lock_irq and
spin_unlock_irqrestore to spin_unlock_irq.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:32 +0000 (20:23 +0000)]
dm thin: emit ignore_discard in status when discards disabled
If "ignore_discard" is specified when creating the thin pool device then
discard support is disabled for that device. The pool device's status
should reflect this fact rather than stating "no_discard_passdown"
(which implies discards are enabled but passdown is disabled).
Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:32 +0000 (20:23 +0000)]
dm persistent data: fix nested btree deletion
When deleting nested btrees, the code forgets to delete the innermost
btree. The thin-metadata code serendipitously compensates for this by
claiming there is one extra layer in the tree.
This patch corrects both problems.
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:31 +0000 (20:23 +0000)]
dm thin: wake worker when discard is prepared
When discards are prepared it is best to directly wake the worker that
will process them. The worker will be woken anyway, via periodic
commit, but there is no reason to not wake_worker here.
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:31 +0000 (20:23 +0000)]
dm thin: fix race between simultaneous io and discards to same block
There is a race when discard bios and non-discard bios are issued
simultaneously to the same block.
Discard support is expensive for all thin devices precisely because you
have to be careful to quiesce the area you're discarding. DM thin must
handle this conflicting IO pattern (simultaneous non-discard vs discard)
even though a sane application shouldn't be issuing such IO.
The race manifests as follows:
1. A non-discard bio is mapped in thin_bio_map.
This doesn't lock out parallel activity to the same block.
2. A discard bio is issued to the same block as the non-discard bio.
3. The discard bio is locked in a dm_bio_prison_cell in process_discard
to lock out parallel activity against the same block.
4. The non-discard bio's mapping continues and its all_io_entry is
incremented so the bio is accounted for in the thin pool's all_io_ds
which is a dm_deferred_set used to track time locality of non-discard IO.
5. The non-discard bio is finally locked in a dm_bio_prison_cell in
process_bio.
The race can result in deadlock, leaving the block layer hanging waiting
for completion of a discard bio that never completes, e.g.:
The thinp-test-suite's test_discard_random_sectors reliably hits this
deadlock on fast SSD storage.
The fix for this race is that the all_io_entry for a bio must be
incremented whilst the dm_bio_prison_cell is held for the bio's
associated virtual and physical blocks. That cell locking wasn't
occurring early enough in thin_bio_map. This patch fixes this.
Care is taken to always call the new function inc_all_io_entry() with
the relevant cells locked, but they are generally unlocked before
calling issue() to try to avoid holding the cells locked across
generic_submit_request.
Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
longer the only thread that will do so. Because of this we must be sure
to use cell_defer_except() to release all non-holder entries, that
were added by the other thread, because they must be deferred.
This patch depends on "dm thin: replace dm_cell_release_singleton with
cell_defer_except".
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@vger.kernel.org
Joe Thornber [Fri, 21 Dec 2012 20:23:31 +0000 (20:23 +0000)]
dm thin: replace dm_cell_release_singleton with cell_defer_except
Change existing users of the function dm_cell_release_singleton to share
cell_defer_except instead, and then remove the now-unused function.
Everywhere that calls dm_cell_release_singleton, the bio in question
is the holder of the cell.
If there are no non-holder entries in the cell then cell_defer_except
behaves exactly like dm_cell_release_singleton. Conversely, if there
*are* non-holder entries then dm_cell_release_singleton must not be used
because those entries would need to be deferred.
Consequently, it is safe to replace use of dm_cell_release_singleton
with cell_defer_except.
This patch is a pre-requisite for "dm thin: fix race between
simultaneous io and discards to same block".
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:30 +0000 (20:23 +0000)]
dm: disable WRITE SAME
WRITE SAME bios are not yet handled correctly by device-mapper so
disable their use on device-mapper devices by setting
max_write_same_sectors to zero.
As an example, a ciphertext device is incompatible because the data
gets changed according to the location at which it written and so the
dm crypt target cannot support it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm ioctl: prevent unsafe change to dm_ioctl data_size
Abort dm ioctl processing if userspace changes the data_size parameter
after we validated it but before we finished copying the data buffer
from userspace.
The dm ioctl parameters are processed in the following sequence:
1. ctl_ioctl() calls copy_params();
2. copy_params() makes a first copy of the fixed-sized portion of the
userspace parameters into the local variable "tmp";
3. copy_params() then validates tmp.data_size and allocates a new
structure big enough to hold the complete data and copies the whole
userspace buffer there;
4. ctl_ioctl() reads userspace data the second time and copies the whole
buffer into the pointer "param";
5. ctl_ioctl() reads param->data_size without any validation and stores it
in the variable "input_param_size";
6. "input_param_size" is further used as the authoritative size of the
kernel buffer.
The problem is that userspace code could change the contents of user
memory between steps 2 and 4. In particular, the data_size parameter
can be changed to an invalid value after the kernel has validated it.
This lets userspace force the kernel to access invalid kernel memory.
The fix is to ensure that the size has not changed at step 4.
This patch shouldn't have a security impact because CAP_SYS_ADMIN is
required to run this code, but it should be fixed anyway.