Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: ensure auth ops are defined before use
In the create_authorizer method for both the mds and osd clients,
the auth_client->ops pointer is blindly dereferenced. There is no
obvious guarantee that this pointer has been assigned. And
furthermore, even if the ops pointer is non-null there is definitely
no guarantee that the create_authorizer or destroy_authorizer
methods are defined.
Add checks in both routines to make sure they are defined (non-null)
before use. Add similar checks in a few other spots in these files
while we're at it.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: messenger: reduce args to create_authorizer
Make use of the new ceph_auth_handshake structure in order to reduce
the number of arguments passed to the create_authorizor method in
ceph_auth_client_ops. Use a local variable of that type as a
shorthand in the get_authorizer method definitions.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: define ceph_auth_handshake type
The definitions for the ceph_mds_session and ceph_osd both contain
five fields related only to "authorizers." Encapsulate those fields
into their own struct type, allowing for better isolation in some
upcoming patches.
Fix the #includes in "linux/ceph/osd_client.h" to lay out their more
complete canonical path.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: messenger: check return from get_authorizer
In prepare_connect_authorizer(), a connection's get_authorizer
method is called but ignores its return value. This function can
return an error, so check for it and return it if that ever occurs.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 17 May 2012 02:51:59 +0000 (21:51 -0500)]
ceph: messenger: check prepare_write_connect() result
prepare_write_connect() can return an error, but only one of its
callers checks for it. All the rest are in functions that already
return errors, so it should be fine to return the error if one
gets returned.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: don't set WRITE_PENDING too early
prepare_write_connect() prepares a connect message, then sets
WRITE_PENDING on the connection. Then *after* this, it calls
prepare_connect_authorizer(), which updates the content of the
connection buffer already queued for sending. It's also possible it
will result in prepare_write_connect() returning -EAGAIN despite the
WRITE_PENDING big getting set.
Fix this by preparing the connect authorizer first, setting the
WRITE_PENDING bit only after that is done.
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: drop msgr argument from prepare_write_connect()
In all cases, the value passed as the msgr argument to
prepare_write_connect() is just con->msgr. Just get the msgr
value from the ceph connection and drop the unneeded argument.
The only msgr passed to prepare_write_banner() is also therefore
just the one from con->msgr, so change that function to drop the
msgr argument as well.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: messenger: send banner in process_connect()
prepare_write_connect() has an argument indicating whether a banner
should be sent out before sending out a connection message. It's
only ever set in one of its callers, so move the code that arranges
to send the banner into that caller and drop the "include_banner"
argument from prepare_write_connect().
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: messenger: reset connection kvec caller
Reset a connection's kvec fields in the caller rather than in
prepare_write_connect(). This ends up repeating a few lines of
code but it's improving the separation between distinct operations
on the connection, which we can take advantage of later.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 10 May 2012 15:29:50 +0000 (10:29 -0500)]
ceph: messenger: change read_partial() to take "end" arg
Make the second argument to read_partial() be the ending input byte
position rather than the beginning offset it now represents. This
amounts to moving the addition "to + size" into the caller.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 10 May 2012 15:29:50 +0000 (10:29 -0500)]
ceph: messenger: update "to" in read_partial() caller
read_partial() always increases whatever "to" value is supplied by
adding the requested size to it, and that's the only thing it does
with that pointed-to value.
Do that pointer advance in the caller (and then only when the
updated value will be subsequently used), and change the "to"
parameter to be an in-only and non-pointer value.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 10 May 2012 15:29:50 +0000 (10:29 -0500)]
ceph: messenger: use read_partial() in read_partial_message()
There are two blocks of code in read_partial_message()--those that
read the header and footer of the message--that can be replaced by a
call to read_partial(). Do that.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Josh Durgin [Tue, 6 Dec 2011 02:25:13 +0000 (18:25 -0800)]
rbd: fix snapshot size type
Snapshot sizes should be the same type as regular image sizes. This
only affects their displayed size in sysfs, not the reported size of
an actual block device sizes.
Josh Durgin [Mon, 21 Nov 2011 21:04:42 +0000 (13:04 -0800)]
rbd: store snapshot id instead of index
When a device was open at a snapshot, and snapshots were deleted or
added, data from the wrong snapshot could be read. Instead of
assuming the snap context is constant, store the actual snap id when
the device is initialized, and rely on the OSDs to signal an error
if we try reading from a snapshot that was deleted.
Xi Wang [Fri, 20 Apr 2012 20:49:44 +0000 (15:49 -0500)]
rbd: fix integer overflow in rbd_header_from_disk()
ondisk->snap_count is read from disk via rbd_req_sync_read() and thus
needs validation. Otherwise, a bogus `snap_count' could overflow the
kmalloc() size, leading to memory corruption.
Also use `u32' consistently for `snap_count'.
[elder@dreamhost.com: changed to use UINT_MAX rather than ULONG_MAX]
Signed-off-by: Xi Wang <xi.wang@gmail.com> Reviewed-by: Alex Elder <elder@dreamhost.com>
Alex Elder [Fri, 20 Apr 2012 20:49:43 +0000 (15:49 -0500)]
ceph: osd_client: fix endianness bug in osd_req_encode_op()
From Al Viro <viro@zeniv.linux.org.uk>
Al Viro noticed that we were using a non-cpu-encoded value in
a switch statement in osd_req_encode_op(). The result would
clearly not work correctly on a big-endian machine.
Sage Weil [Mon, 7 May 2012 22:35:09 +0000 (15:35 -0700)]
crush: adjust local retry threshold
This small adjustment reflects a change that was made in ceph.git commit af6a9f30696c900a2a8bd7ae24e8ed15fb4964bb, about 6 months ago. An N-1
search is not exhaustive. Fixed ceph.git bug #1594.
Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 7 May 2012 22:33:36 +0000 (15:33 -0700)]
ceph: drop support for preferred_osd pgs
This was an ill-conceived feature that has been removed from Ceph. Do
this gracefully:
- reject attempts to specify a preferred_osd via the ioctl
- stop exposing this information via virtual xattrs
- always fill in -1 for requests, in case we talk to an older server
- don't calculate preferred_osd placements/pgids
Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 4 Apr 2012 18:35:44 +0000 (13:35 -0500)]
rbd: don't hold spinlock during messenger flush
A recent change made changes to the rbd_client_list be protected by
a spinlock. Unfortunately in rbd_put_client(), the lock is taken
before possibly dropping the last reference to an rbd_client, and on
the last reference that eventually calls flush_workqueue() which can
sleep.
The problem was flagged by a debug spinlock warning:
BUG: spinlock wrong CPU on CPU#3, rbd/27814
The solution is to move the spinlock acquisition and release inside
rbd_client_release(), which is the spot where it's really needed for
protecting the removal of the rbd_client from the client list.
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Josh Durgin [Tue, 22 Nov 2011 01:11:12 +0000 (17:11 -0800)]
rbd: move snap_rwsem to the device, rename to header_rwsem
A new temporary header is allocated each time the header changes, but
only the changed properties are copied over. We don't need a new
semaphore for each header update.
This addresses http://tracker.newdream.net/issues/2174
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com> Reviewed-by: Alex Elder <elder@dreamhost.com>
Alex Elder [Thu, 8 Mar 2012 22:50:09 +0000 (16:50 -0600)]
ceph: fix three bugs, two in ceph_vxattrcb_file_layout()
In ceph_vxattrcb_file_layout(), there is a check to determine
whether a preferred PG should be formatted into the output buffer.
That check assumes that a preferred PG number of 0 indicates "no
preference," but that is wrong. No preference is indicated by a
negative (specifically, -1) PG number.
In addition, if that condition yields true, the preferred value
is formatted into a sized buffer, but the size consumed by the
earlier snprintf() call is not accounted for, opening up the
possibilty of a buffer overrun.
Finally, in ceph_vxattrcb_dir_rctime() where the nanoseconds part of
the time displayed did not include leading 0's, which led to
erroneous (sub-second portion of) time values being shown.
This fixes these three issues:
http://tracker.newdream.net/issues/2155
http://tracker.newdream.net/issues/2156
http://tracker.newdream.net/issues/2157
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 7 Mar 2012 17:40:08 +0000 (11:40 -0600)]
libceph: isolate kmap() call in write_partial_msg_pages()
In write_partial_msg_pages(), every case now does an identical call
to kmap(page). Instead, just call it once inside the CRC-computing
block where it's needed. Move the definition of kaddr inside that
block, and make it a (char *) to ensure portable pointer arithmetic.
We still don't kunmap() it until after the sendpage() call, in case
that also ends up needing to use the mapping.
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 7 Mar 2012 17:40:08 +0000 (11:40 -0600)]
libceph: rename "page_shift" variable to something sensible
In write_partial_msg_pages() there is a local variable used to
track the starting offset within a bio segment to use. Its name,
"page_shift" defies the Linux convention of using that name for
log-base-2(page size).
Since it's only used in the bio case rename it "bio_offset". Use it
along with the page_pos field to compute the memory offset when
computing CRC's in that function. This makes the bio case match the
others more closely.
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 7 Mar 2012 17:40:08 +0000 (11:40 -0600)]
libceph: get rid of zero_page_address
There's not a lot of benefit to zero_page_address, which basically
holds a mapping of the zero page through the life of the messenger
module. Even with our own mapping, the sendpage interface where
it's used may need to kmap() it again. It's almost certain to
be in low memory anyway.
So stop treating the zero page specially in write_partial_msg_pages()
and just get rid of zero_page_address entirely.
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 7 Mar 2012 17:40:08 +0000 (11:40 -0600)]
libceph: use kernel_sendpage() for sending zeroes
If a message queued for send gets revoked, zeroes are sent over the
wire instead of any unsent data. This is done by constructing a
message and passing it to kernel_sendmsg() via ceph_tcp_sendmsg().
Since we are already working with a page in this case we can use
the sendpage interface instead. Create a new ceph_tcp_sendpage()
helper that sets up flags to match the way ceph_tcp_sendmsg()
does now.
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 7 Mar 2012 17:40:08 +0000 (11:40 -0600)]
libceph: fix inverted crc option logic
CRC's are computed for all messages between ceph entities. The CRC
computation for the data portion of message can optionally be
disabled using the "nocrc" (common) ceph option. The default is
for CRC computation for the data portion to be enabled.
Unfortunately, the code that implements this feature interprets the
feature flag wrong, meaning that by default the CRC's have *not*
been computed (or checked) for the data portion of messages unless
the "nocrc" option was supplied.
Fix this, in write_partial_msg_pages() and read_partial_message().
Also change the flag variable in write_partial_msg_pages() to be
"no_datacrc" to match the usage elsewhere in the file.
This fixes http://tracker.newdream.net/issues/2064
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 15 Feb 2012 13:43:55 +0000 (07:43 -0600)]
libceph: some simple changes
Nothing too big here.
- define the size of the buffer used for consuming ignored
incoming data using a symbolic constant
- simplify the condition determining whether to unmap the page
in write_partial_msg_pages(): do it for crc but not if the
page is the zero page
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 15 Feb 2012 13:43:54 +0000 (07:43 -0600)]
libceph: small refactor in write_partial_kvec()
Make a small change in the code that counts down kvecs consumed by
a ceph_tcp_sendmsg() call. Same functionality, just blocked out
a little differently.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 15 Feb 2012 13:43:54 +0000 (07:43 -0600)]
libceph: do crc calculations outside loop
Move blocks of code out of loops in read_partial_message_section()
and read_partial_message(). They were only was getting called at
the end of the last iteration of the loop anyway.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 15 Feb 2012 13:43:54 +0000 (07:43 -0600)]
libceph: use "do" in CRC-related Boolean variables
Change the name (and type) of a few CRC-related Boolean local
variables so they contain the word "do", to distingish their purpose
from variables used for holding an actual CRC value.
Note that in the process of doing this I identified a fairly serious
logic error in write_partial_msg_pages(): the value of "do_crc"
assigned appears to be the opposite of what it should be. No
attempt to fix this is made here; this change preserves the
erroneous behavior. The problem I found is documented here:
http://tracker.newdream.net/issues/2064
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 15 Feb 2012 13:43:54 +0000 (07:43 -0600)]
ceph: ensure Boolean options support both senses
Many ceph-related Boolean options offer the ability to both enable
and disable a feature. For all those that don't offer this, add
a new option so that they do.
Note that ceph_show_options()--which reports mount options currently
in effect--only reports the option if it is different from the
default value.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Tue, 14 Feb 2012 20:05:33 +0000 (14:05 -0600)]
libceph: a few small changes
This gathers a number of very minor changes:
- use %hu when formatting the a socket address's address family
- null out the ceph_msgr_wq pointer after the queue has been
destroyed
- drop a needless cast in ceph_write_space()
- add a WARN() call in ceph_state_change() in the event an
unrecognized socket state is encountered
- rearrange the logic in ceph_con_get() and ceph_con_put() so
that:
- the reference counts are only atomically read once
- the values displayed via dout() calls are known to
be meaningful at the time they are formatted
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Tue, 14 Feb 2012 20:05:33 +0000 (14:05 -0600)]
libceph: make ceph_tcp_connect() return int
There is no real need for ceph_tcp_connect() to return the socket
pointer it creates, since it already assigns it to con->sock, which
is visible to the caller. Instead, have it return an error code,
which tidies things up a bit.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Tue, 14 Feb 2012 20:05:33 +0000 (14:05 -0600)]
libceph: encapsulate connection kvec operations
Encapsulate the operation of adding a new chunk of data to the next
open slot in a ceph_connection's out_kvec array. Also add a "reset"
operation to make subsequent add operations start at the beginning
of the array again.
Use these routines throughout, avoiding duplicate code and ensuring
all calls are handled consistently.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Tue, 14 Feb 2012 20:05:33 +0000 (14:05 -0600)]
libceph: move prepare_write_banner()
One of the arguments to prepare_write_connect() indicates whether it
is being called immediately after a call to prepare_write_banner().
Move the prepare_write_banner() call inside prepare_write_connect(),
and reinterpret (and rename) the "after_banner" argument so it
indicates that prepare_write_connect() should *make* the call
rather than should know it has already been made.
This was split out from the next patch to highlight this change in
logic.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 8 Feb 2012 22:11:14 +0000 (16:11 -0600)]
rbd: don't drop the rbd_id too early
Currently an rbd device's id is released when it is removed, but it
is done before the code is run to clean up sysfs-related files (such
as /sys/bus/rbd/devices/1).
It's possible that an rbd is still in use after the rbd_remove()
call has been made. It's essentially the same as an active inode
that stays around after it has been removed--until its final close
operation. This means that the id shows up as free for reuse at a
time it should not be.
The effect of this was seen by Jens Rehpoehler, who:
- had a filesystem mounted on an rbd device
- unmapped that filesystem (without unmounting)
- found that the mount still worked properly
- but hit a panic when he attempted to re-map a new rbd device
This re-map attempt found the previously-unmapped id available.
The subsequent attempt to reuse it was met with a panic while
attempting to (re-)install the sysfs entry for the new mapped
device.
Fix this by holding off "putting" the rbd id, until the rbd_device
release function is called--when the last reference is finally
dropped.
Note: This fixes: http://tracker.newdream.net/issues/1907
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Tue, 7 Feb 2012 18:03:37 +0000 (12:03 -0600)]
rbd: small changes
Here is another set of small code tidy-ups:
- Define SECTOR_SHIFT and SECTOR_SIZE, and use these symbolic
names throughout. Tell the blk_queue system our physical
block size, in the (unlikely) event we want to use something
other than the default.
- Delete the definition of struct rbd_info, which is never used.
- Move the definition of dev_to_rbd() down in its source file,
just above where it gets first used, and change its name to
dev_to_rbd_dev().
- Replace an open-coded operation in rbd_dev_release() to use
dev_to_rbd_dev() instead.
- Calculate the segment size for a given rbd_device just once in
rbd_init_disk().
- Use the '%zd' conversion specifier in rbd_snap_size_show(),
since the value formatted is a size_t.
- Switch to the '%llu' conversion specifier in rbd_snap_id_show().
since the value formatted is unsigned.
Alex Elder [Tue, 7 Feb 2012 18:03:36 +0000 (12:03 -0600)]
rbd: do some refactoring
A few blocks of code are rearranged a bit here:
- In rbd_header_from_disk():
- Don't bother computing snap_count until we're sure the
on-disk header starts with a good signature.
- Move a few independent lines of code so they are *after* a
check for a failed memory allocation.
- Get rid of unnecessary local variable "ret".
- Make a few other changes in rbd_read_header(), similar to the
above--just moving things around a bit while preserving the
functionality.
- In rbd_rq_fn(), just assign rq in the while loop's controlling
expression rather than duplicating it before and at the end of
the loop body. This allows the use of "continue" rather than
"goto next" in a number of spots.
- Rearrange the logic in snap_by_name(). End result is the same.
Alex Elder [Tue, 7 Feb 2012 18:03:36 +0000 (12:03 -0600)]
rbd: fix module sysfs setup/teardown code
Once rbd_bus_type is registered, it allows an "add" operation via
the /sys/bus/rbd/add bus attribute, and adding a new rbd device that
way establishes a connection between the device and rbd_root_dev.
But rbd_root_dev is not registered until after the rbd_bus_type
registration is complete. This could (in principle anyway) result
in an invalid state.
Since rbd_root_dev has no tie to rbd_bus_type we can reorder these
two initializations and never be faced with this scenario.
In addition, unregister the device in the event the bus registration
fails at module init time.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Thu, 2 Feb 2012 14:13:30 +0000 (08:13 -0600)]
rbd: don't allocate mon_addrs buffer in rbd_add()
The mon_addrs buffer in rbd_add is used to hold a copy of the
monitor IP addresses supplied via /sys/bus/rbd/add. That is
passed to rbd_get_client(), which never modifies it (nor do
any of the functions it gets passed to thereafter)--the mon_addr
parameter to rbd_get_client() is a pointer to constant data, so it
can't be modifed. Furthermore, rbd_get_client() has the length of
the mon_addrs buffer and that is used to ensure nothing goes beyond
its end.
Based on all this, there is no reason that a buffer needs to
be used to hold a copy of the mon_addrs provided via
/sys/bus/rbd/add. Instead, the location within that passed-in
buffer can be provided, along with the length of the "token"
therein which represents the monitor IP's.
A small change to rbd_add_parse_args() allows the address within the
buffer to be passed back, and the length is already returned. This
now means that, at least from the perspective of this interface,
there is no such thing as a list of monitor addresses that is too
long.
Alex Elder [Thu, 2 Feb 2012 14:13:30 +0000 (08:13 -0600)]
rbd: have rbd_parse_args() report found mon_addrs size
The argument parsing routine already computes the size of the
mon_addrs buffer it extracts from the "command." Pass it to the
caller so it can use it to provide the length to rbd_get_client().
Alex Elder [Thu, 2 Feb 2012 14:13:30 +0000 (08:13 -0600)]
rbd: don't use sscanf() in rbd_add_parse_args()
Make use of a few simple helper routines to parse the arguments
rather than sscanf(). This will treat both missing and too-long
arguments as invalid input (rather than silently truncating the
input in the too-long case). In time this can also be used by
rbd_add() to use the passed-in buffer in place, rather than copying
its contents into new buffers.
It appears to me that the sscanf() previously used would not
correctly handle a supplied snapshot--the two final "%s" conversion
specifications were not separated by a space, and I'm not sure
how sscanf() handles that situation. It may not be well-defined.
So that may be a bug this change fixes (but I didn't verify that).
The sizes of the mon_addrs and options buffers are now passed to
rbd_add_parse_args(), so they can be supplied to copy_token().
Alex Elder [Thu, 2 Feb 2012 14:13:30 +0000 (08:13 -0600)]
rbd: encapsulate argument parsing for rbd_add()
Move the code that parses the arguments provided to rbd_add() (which
are supplied via /sys/bus/rbd/add) into a separate function.
Also rename the "mon_dev_name" variable in rbd_add() to be
"mon_addrs". The variable represents a list of one or more
comma-separated monitor IP addresses, each with an optional port
number. I think "mon_addrs" captures that notion a little better.
Alex Elder [Thu, 2 Feb 2012 14:13:30 +0000 (08:13 -0600)]
rbd: reduce memory used for rbd_dev fields
The length of the string containing the monitor address
specification(s) will never exceed the length of the string passed
in to rbd_add(). The same holds true for the ceph + rbd options
string. So reduce the amount of memory allocated for these to
that length rather than the maximum (1024 bytes).
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Thu, 2 Feb 2012 14:13:30 +0000 (08:13 -0600)]
rbd: have rbd_get_client() return a rbd_client
Since rbd_get_client() currently returns an error code. It assigns
the rbd_client field of the rbd_device structure it is passed if
successful. Instead, have it return the created rbd_client
structure and return a pointer-coded error if there is an error.
This makes the assignment of the client pointer more obvious at the
call site.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Sun, 29 Jan 2012 19:57:44 +0000 (13:57 -0600)]
rbd: a few simple changes
Here are a few very simple cleanups:
- Add a "RBD_" prefix to the two driver name string definitions.
- Move the definition of struct rbd_request below struct rbd_req_coll
to avoid the need for an empty declaration of the latter.
- Move and group the definitions of rbd_root_dev_release() and
rbd_root_dev, as well as rbd_bus_type and rbd_bus_attrs[],
close to the top of the file. Arrange the latter so
rbd_bus_type.bus_attrs can be initialized statically.
- Get rid of an unnecessary local variable in rbd_open().
- Rework some hokey logic in rbd_bus_add_dev(), so the value of
"ret" at the end is either 0 or -ENOENT to avoid the need for
the code duplication that was there.
- Rename a goto target in rbd_add().
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Sun, 29 Jan 2012 19:57:44 +0000 (13:57 -0600)]
rbd: move ctl_mutex lock inside rbd_get_client()
Since rbd_get_client() is only called in one place, move the
acquisition of the mutex around that call inside that function.
Furthermore, within rbd_get_client(), it appears the mutex only
needs to be held while calling rbd_client_create(). (Moving
the lock inside that function will wait for the next patch.)
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Sun, 29 Jan 2012 19:57:44 +0000 (13:57 -0600)]
rbd: restore previous rbd id sequence behavior
It used to be that selecting a new unique identifier for an added
rbd device required searching all existing ones to find the highest
id is used. A recent change made that unnecessary, but made it
so that id's used were monotonically non-decreasing. It's a bit
more pleasant to have smaller rbd id's though, and this change
makes ids get allocated as they were before--each new id is one more
than the maximum currently in use.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Thu, 2 Feb 2012 14:13:29 +0000 (08:13 -0600)]
rbd: tie rbd_dev_list changes to rbd_id operations
The only time entries are added to or removed from the global
rbd_dev_list is exactly when a "put" or "get" operation is being
performed on a rbd_dev's id. So just move the list management code
into get/put routines.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Sun, 29 Jan 2012 19:57:44 +0000 (13:57 -0600)]
rbd: protect the rbd_dev_list with a spinlock
The rbd_dev_list is just a simple list of all the current
rbd_devices. Using the ctl_mutex as a concurrency guard is
overkill. Instead, use a spinlock for that specific purpose.
This also reduces the window that the ctl_mutex needs to be held in
rbd_add().
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Sun, 29 Jan 2012 19:57:44 +0000 (13:57 -0600)]
rbd: rework calculation of new rbd id's
In order to select a new unique identifier for an added rbd device,
the list of all existing ones is searched and a value one greater
than the highest id is used.
The list search can be avoided by using an atomic variable that
keeps track of the current highest id. Using a get/put model for
id's we can limit the boundless growth of id numbers a bit by
arranging to reuse the current highest id once it gets released.
Add these calls to "put" the id when an rbd is getting removed.
Note that this changes the pattern of device id's used--new values
will never be below the highest one seen so far (even if there
exists an unused lower one). I assert this is OK because the key
property of an rbd id is its uniqueness, not its magnitude.
Regardless, a follow-on patch will restore the old way of doing
things, I just think this commit just makes the incremental change
to atomics a little easier to understand.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Josh Durgin [Tue, 22 Nov 2011 02:19:13 +0000 (18:19 -0800)]
rbd: use a single value of snap_name to mean no snap
There's already a constant for this anyway.
Since rbd_header_set_snap() is only used to set the rbd device
snap_name field, just do that within that function rather than
having it take the snap_name as an argument.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
v2: Changed interface rbd_header_set_snap() so it explicitly updates
the snap_name in the rbd_device. Also added a BUILD_BUG_ON()
to verify the size of the snap_name field is sufficient for
SNAP_HEAD_NAME.
Alex Elder [Tue, 24 Jan 2012 16:08:37 +0000 (10:08 -0600)]
rbd: do not duplicate ceph_client pointer in rbd_device
The rbd_device structure maintains a duplicate copy of the
ceph_client pointer maintained in its rbd_client structure. There
appears to be no good reason for this, and its presence presents a
risk of them getting out of synch or otherwise misused. So kill it
off, and use the rbd_client copy only.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Tue, 24 Jan 2012 16:08:36 +0000 (10:08 -0600)]
rbd: make ceph_parse_options() return a pointer
ceph_parse_options() takes the address of a pointer as an argument
and uses it to return the address of an allocated structure if
successful. With this interface is not evident at call sites that
the pointer is always initialized. Change the interface to return
the address instead (or a pointer-coded error code) to make the
validity of the returned pointer obvious.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Tue, 24 Jan 2012 16:08:36 +0000 (10:08 -0600)]
rbd: a few small cleanups
Some minor cleanups in "drivers/block/rbd.c:
- Use the more meaningful "RBD_MAX_OBJ_NAME_LEN" in place if "96"
in the definition of RBD_MAX_MD_NAME_LEN.
- Use DEFINE_SPINLOCK() to define and initialize node_lock.
- Drop a needless (char *) cast in parse_rbd_opts_token().
- Make a few minor formatting changes.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Mon, 23 Jan 2012 21:49:28 +0000 (15:49 -0600)]
ceph: avoid repeatedly computing the size of constant vxattr names
All names defined in the directory and file virtual extended
attribute tables are constant, and the size of each is known at
compile time. So there's no need to compute their length every
time any file's attribute is listed.
Record the length of each string and use it when needed to determine
the space need to represent them. In addition, compute the
aggregate size of strings in each table just once at initialization
time.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Mon, 23 Jan 2012 21:49:28 +0000 (15:49 -0600)]
ceph: encode type in vxattr callback routines
The names of the callback functions used for virtual extended
attributes are based only on the last component of the attribute
name. Because of the way these are defined, this precludes allowing
a single (lowest) attribute name for different callbacks, dependent
on the type of file being operated on. (For example, it might be
nice to support both "ceph.dir.layout" and "ceph.file.layout".)
Just change the callback names to avoid this problem.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Mon, 23 Jan 2012 21:49:28 +0000 (15:49 -0600)]
ceph: drop "_cb" from name of struct ceph_vxattr_cb
A struct ceph_vxattr_cb does not represent a callback at all, but
rather a virtual extended attribute itself. Drop the "_cb" suffix
from its name to reflect that.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Mon, 23 Jan 2012 21:49:28 +0000 (15:49 -0600)]
ceph: use macros to normalize vxattr table definitions
Entries in the ceph virtual extended attribute tables all follow a
distinct pattern in their definition. Enforce this pattern through
the use of a macro.
Also, a null name field signals the end of the table, so make that
be the first field in the ceph_vxattr_cb structure.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Mon, 23 Jan 2012 21:49:27 +0000 (15:49 -0600)]
ceph: pass inode rather than table to ceph_match_vxattr()
All callers of ceph_match_vxattr() determine what to pass as the
first argument by calling ceph_inode_vxattrs(inode). Just do that
inside ceph_match_vxattr() itself, changing it to take an inode
rather than the vxattr pointer as its first argument.
Also ensure the function works correctly for an empty table (i.e.,
containing only a terminating null entry).
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Mon, 23 Jan 2012 21:49:27 +0000 (15:49 -0600)]
ceph: don't null-terminate xattr values
For some reason, ceph_setxattr() allocates an extra byte in which a
'\0' is stored past the end of an extended attribute value. This is
not needed, and is potentially misleading, so get rid of it.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Mon, 23 Jan 2012 21:49:27 +0000 (15:49 -0600)]
ceph: eliminate some abusive casts
This fixes some spots where a type cast to (void *) was used as
as a universal type hiding mechanism. Instead, properly cast the
type to the intended target type.
Signed-off-by: Alex Elder <elder@newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Mon, 23 Jan 2012 21:49:27 +0000 (15:49 -0600)]
ceph: kill addr_str_lock spinlock; use atomic instead
A spinlock is used to protect a value used for selecting an array
index for a string used for formatting a socket address for human
consumption. The index is reset to 0 if it ever reaches the maximum
index value.
Instead, use an ever-increasing atomic variable as a sequence
number, and compute the array index by masking off all but the
sequence number's lowest bits. Make the number of entries in the
array a power of two to allow the use of such a mask (to avoid jumps
in the index value when the sequence number wraps).
The length of these strings is somewhat arbitrarily set at 60 bytes.
The worst-case length of a string produced is 54 bytes, for an IPv6
address that can't be shortened, e.g.:
[1234:5678:9abc:def0:1111:2222:123.234.210.100]:32767
Change it so we arbitrarily use 64 bytes instead; if nothing else
it will make the array of these line up better in hex dumps.
Rename a few things to reinforce the distinction between the number
of strings in the array and the length of individual strings.
Signed-off-by: Alex Elder <elder@newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Mon, 23 Jan 2012 21:49:27 +0000 (15:49 -0600)]
ceph: make use of "else" where appropriate
Rearrange ceph_tcp_connect() a bit, making use of "else" rather than
re-testing a value with consecutive "if" statements. Don't record a
connection's socket pointer unless the connect operation is
successful.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Mon, 23 Jan 2012 21:49:27 +0000 (15:49 -0600)]
ceph: use a shared zero page rather than one per messenger
Each messenger allocates a page to be used when writing zeroes
out in the event of error or other abnormal condition. Instead,
use the kernel ZERO_PAGE() for that purpose.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Xi Wang [Thu, 16 Feb 2012 16:55:48 +0000 (11:55 -0500)]
libceph: fix overflow check in crush_decode()
The existing overflow check (n > ULONG_MAX / b) didn't work, because
n = ULONG_MAX / b would both bypass the check and still overflow the
allocation size a + n * b.
The correct check should be (n > (ULONG_MAX - a) / b).
Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>