Alex Elder [Fri, 1 Jun 2012 19:56:43 +0000 (14:56 -0500)]
libceph: make ceph_con_revoke_message() a msg op
ceph_con_revoke_message() is passed both a message and a ceph
connection. A ceph_msg allocated for incoming messages on a
connection always has a pointer to that connection, so there's no
need to provide the connection when revoking such a message.
Note that the existing logic does not preclude the message supplied
being a null/bogus message pointer. The only user of this interface
is the OSD client, and the only value an osd client passes is a
request's r_reply field. That is always non-null (except briefly in
an error path in ceph_osdc_alloc_request(), and that drops the
only reference so the request won't ever have a reply to revoke).
So we can safely assume the passed-in message is non-null, but add a
BUG_ON() to make it very obvious we are imposing this restriction.
Rename the function ceph_msg_revoke_incoming() to reflect that it is
really an operation on an incoming message.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Fri, 1 Jun 2012 19:56:43 +0000 (14:56 -0500)]
libceph: make ceph_con_revoke() a msg operation
ceph_con_revoke() is passed both a message and a ceph connection.
Now that any message associated with a connection holds a pointer
to that connection, there's no need to provide the connection when
revoking a message.
This has the added benefit of precluding the possibility of the
providing the wrong connection pointer. If the message's connection
pointer is null, it is not being tracked by any connection, so
revoking it is a no-op. This is supported as a convenience for
upper layers, so they can revoke a message that is not actually
"in flight."
Rename the function ceph_msg_revoke() to reflect that it is really
an operation on a message, not a connection.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Mon, 4 Jun 2012 19:43:33 +0000 (14:43 -0500)]
libceph: have messages take a connection reference
There are essentially two types of ceph messages: incoming and
outgoing. Outgoing messages are always allocated via ceph_msg_new(),
and at the time of their allocation they are not associated with any
particular connection. Incoming messages are always allocated via
ceph_con_in_msg_alloc(), and they are initially associated with the
connection from which incoming data will be placed into the message.
When an outgoing message gets sent, it becomes associated with a
connection and remains that way until the message is successfully
sent. The association of an incoming message goes away at the point
it is sent to an upper layer via a con->ops->dispatch method.
This patch implements reference counting for all ceph messages, such
that every message holds a reference (and a pointer) to a connection
if and only if it is associated with that connection (as described
above).
For background, here is an explanation of the ceph message
lifecycle, emphasizing when an association exists between a message
and a connection.
Outgoing Messages
An outgoing message is "owned" by its allocator, from the time it is
allocated in ceph_msg_new() up to the point it gets queued for
sending in ceph_con_send(). Prior to that point the message's
msg->con pointer is null; at the point it is queued for sending its
message pointer is assigned to refer to the connection. At that
time the message is inserted into a connection's out_queue list.
When a message on the out_queue list has been sent to the socket
layer to be put on the wire, it is transferred out of that list and
into the connection's out_sent list. At that point it is still owned
by the connection, and will remain so until an acknowledgement is
received from the recipient that indicates the message was
successfully transferred. When such an acknowledgement is received
(in process_ack()), the message is removed from its list (in
ceph_msg_remove()), at which point it is no longer associated with
the connection.
So basically, any time a message is on one of a connection's lists,
it is associated with that connection. Reference counting outgoing
messages can thus be done at the points a message is added to the
out_queue (in ceph_con_send()) and the point it is removed from
either its two lists (in ceph_msg_remove())--at which point its
connection pointer becomes null.
Incoming Messages
When an incoming message on a connection is getting read (in
read_partial_message()) and there is no message in con->in_msg,
a new one is allocated using ceph_con_in_msg_alloc(). At that
point the message is associated with the connection. Once that
message has been completely and successfully read, it is passed to
upper layer code using the connection's con->ops->dispatch method.
At that point the association between the message and the connection
no longer exists.
Reference counting of connections for incoming messages can be done
by taking a reference to the connection when the message gets
allocated, and releasing that reference when it gets handed off
using the dispatch method.
We should never fail to get a connection reference for a
message--the since the caller should already hold one.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Fri, 1 Jun 2012 19:56:43 +0000 (14:56 -0500)]
libceph: have messages point to their connection
When a ceph message is queued for sending it is placed on a list of
pending messages (ceph_connection->out_queue). When they are
actually sent over the wire, they are moved from that list to
another (ceph_connection->out_sent). When acknowledgement for the
message is received, it is removed from the sent messages list.
During that entire time the message is "in the possession" of a
single ceph connection. Keep track of that connection in the
message. This will be used in the next patch (and is a helpful
bit of information for debugging anyway).
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Mon, 4 Jun 2012 19:43:32 +0000 (14:43 -0500)]
libceph: tweak ceph_alloc_msg()
The function ceph_alloc_msg() is only used to allocate a message
that will be assigned to a connection's in_msg pointer. Rename the
function so this implied usage is more clear.
In addition, make that assignment inside the function (again, since
that's precisely what it's intended to be used for). This allows us
to return what is now provided via the passed-in address of a "skip"
variable. The return type is now Boolean to be explicit that there
are only two possible outcomes.
Make sure the result of an ->alloc_msg method call always sets the
value of *skip properly.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Sun, 27 May 2012 04:26:43 +0000 (23:26 -0500)]
libceph: fully initialize connection in con_init()
Move the initialization of a ceph connection's private pointer,
operations vector pointer, and peer name information into
ceph_con_init(). Rearrange the arguments so the connection pointer
is first. Hide the byte-swapping of the peer entity number inside
ceph_con_init()
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 1 Jun 2012 03:27:50 +0000 (20:27 -0700)]
libceph: drop connection refcounting for mon_client
All references to the embedded ceph_connection come from the msgr
workqueue, which is drained prior to mon_client destruction. That
means we can ignore con refcounting entirely.
Signed-off-by: Sage Weil <sage@newdream.net> Reviewed-by: Alex Elder <elder@inktank.com>
Alex Elder [Sun, 27 May 2012 04:26:43 +0000 (23:26 -0500)]
libceph: embed ceph connection structure in mon_client
A monitor client has a pointer to a ceph connection structure in it.
This is the only one of the three ceph client types that do it this
way; the OSD and MDS clients embed the connection into their main
structures. There is always exactly one ceph connection for a
monitor client, so there is no need to allocate it separate from the
monitor client structure.
So switch the ceph_mon_client structure to embed its
ceph_connection structure.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 1 Jun 2012 03:22:18 +0000 (20:22 -0700)]
libceph: use con get/put ops from osd_client
There were a few direct calls to ceph_con_{get,put}() instead of the con
ops from osd_client.c. This is a bug since those ops aren't defined to
be ceph_con_get/put.
This breaks refcounting on the ceph_osd structs that contain the
ceph_connections, and could lead to all manner of strangeness.
The purpose of the ->get and ->put methods in a ceph connection are
to allow the connection to indicate it has a reference to something
external to the messaging system, *not* to indicate something
external has a reference to the connection.
[elder@inktank.com: added that last sentence]
Signed-off-by: Sage Weil <sage@newdream.net> Reviewed-by: Alex Elder <elder@inktank.com>
Alex Elder [Mon, 4 Jun 2012 19:43:32 +0000 (14:43 -0500)]
libceph: osd_client: don't drop reply reference too early
In ceph_osdc_release_request(), a reference to the r_reply message
is dropped. But just after that, that same message is revoked if it
was in use to receive an incoming reply. Reorder these so we are
sure we hold a reference until we're actually done with the message.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Dan Carpenter [Wed, 6 Jun 2012 14:15:33 +0000 (09:15 -0500)]
rbd: endian bug in rbd_req_cb()
Sparse complains about this because:
drivers/block/rbd.c:996:20: warning: cast to restricted __le32
drivers/block/rbd.c:996:20: warning: cast from restricted __le16
These are set in osd_req_encode_op() and they are le16.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Alex Elder <elder@inktank.com>
Alex Elder [Tue, 29 May 2012 16:04:58 +0000 (11:04 -0500)]
libceph: set CLOSED state bit in con_init
Once a connection is fully initialized, it is really in a CLOSED
state, so make that explicit by setting the bit in its state field.
It is possible for a connection in NEGOTIATING state to get a
failure, leading to ceph_fault() and ultimately ceph_con_close().
Clear that bits if it is set in that case, to reflect that the
connection truly is closed and is no longer participating in a
connect sequence.
Issue a warning if ceph_con_open() is called on a connection that
is not in CLOSED state.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 23 May 2012 03:15:49 +0000 (22:15 -0500)]
libceph: start tracking connection socket state
Start explicitly keeping track of the state of a ceph connection's
socket, separate from the state of the connection itself. Create
placeholder functions to encapsulate the state transitions.
Make the socket state an atomic variable, reinforcing that it's a
distinct transtion with no possible "intermediate/both" states.
This is almost certainly overkill at this point, though the
transitions into CONNECTED and CLOSING state do get called via
socket callback (the rest of the transitions occur with the
connection mutex held). We can back out the atomicity later.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil<sage@inktank.com>
Alex Elder [Tue, 22 May 2012 16:41:43 +0000 (11:41 -0500)]
libceph: start separating connection flags from state
A ceph_connection holds a mixture of connection state (as in "state
machine" state) and connection flags in a single "state" field. To
make the distinction more clear, define a new "flags" field and use
it rather than the "state" field to hold Boolean flag values.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil<sage@inktank.com>
Alex Elder [Sun, 27 May 2012 04:26:43 +0000 (23:26 -0500)]
libceph: embed ceph messenger structure in ceph_client
A ceph client has a pointer to a ceph messenger structure in it.
There is always exactly one ceph messenger for a ceph client, so
there is no need to allocate it separate from the ceph client
structure.
Switch the ceph_client structure to embed its ceph_messenger
structure.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 23 May 2012 19:35:23 +0000 (14:35 -0500)]
libceph: rename kvec_reset and kvec_add functions
The functions ceph_con_out_kvec_reset() and ceph_con_out_kvec_add()
are entirely private functions, so drop the "ceph_" prefix in their
name to make them slightly more wieldy.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Tue, 22 May 2012 16:41:43 +0000 (11:41 -0500)]
libceph: rename socket callbacks
Change the names of the three socket callback functions to make it
more obvious they're specifically associated with a connection's
socket (not the ceph connection that uses it).
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Yan, Zheng [Mon, 28 May 2012 06:44:30 +0000 (14:44 +0800)]
ceph: check PG_Private flag before accessing page->private
I got lots of NULL pointer dereference Oops when compiling kernel on ceph.
The bug is because the kernel page migration routine replaces some pages
in the page cache with new pages, these new pages' private can be non-zero.
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 21 May 2012 16:45:23 +0000 (09:45 -0700)]
libceph: fix pg_temp updates
Usually, we are adding pg_temp entries or removing them. Occasionally they
update. In that case, osdmap_apply_incremental() was failing because the
rbtree entry already exists.
Fix by removing the existing entry before inserting a new one.
Fixes http://tracker.newdream.net/issues/2446
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
Sage Weil [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
libceph: avoid unregistering osd request when not registered
There is a race between two __unregister_request() callers: the
reply path and the ceph_osdc_wait_request(). If we get a reply
*and* the timeout expires at roughly the same time, both callers
will try to unregister the request, and the second one will do bad
things.
Simply check if the request is still already unregistered; if so,
return immediately and do nothing.
Fixes http://tracker.newdream.net/issues/2420
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: add auth buf in prepare_write_connect()
Move the addition of the authorizer buffer to a connection's
out_kvec out of get_connect_authorizer() and into its caller. This
way, the caller--prepare_write_connect()--can avoid adding the
connect header to out_kvec before it has been fully initialized.
Prior to this patch, it was possible for a connect header to be
sent over the wire before the authorizer protocol or buffer length
fields were initialized. An authorizer buffer associated with that
header could also be queued to send only after the connection header
that describes it was on the wire.
Fixes http://tracker.newdream.net/issues/2424
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: rename prepare_connect_authorizer()
Change the name of prepare_connect_authorizer(). The next
patch is going to make this function no longer add anything to the
connection's out_kvec, so it will no longer fit the pattern of
the rest of the prepare_connect_*() functions.
In addition, pass the address of a variable that will hold the
authorization protocol to use. Move the assignment of that to the
connection's out_connect structure into prepare_write_connect().
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: use info returned by get_authorizer
Rather than passing a bunch of arguments to be filled in with the
content of the ceph_auth_handshake buffer now returned by the
get_authorizer method, just use the returned information in the
caller, and drop the unnecessary arguments.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: have get_authorizer methods return pointers
Have the get_authorizer auth_client method return a ceph_auth
pointer rather than an integer, pointer-encoding any returned
error value. This is to pave the way for making use of the
returned value in an upcoming patch.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: ensure auth ops are defined before use
In the create_authorizer method for both the mds and osd clients,
the auth_client->ops pointer is blindly dereferenced. There is no
obvious guarantee that this pointer has been assigned. And
furthermore, even if the ops pointer is non-null there is definitely
no guarantee that the create_authorizer or destroy_authorizer
methods are defined.
Add checks in both routines to make sure they are defined (non-null)
before use. Add similar checks in a few other spots in these files
while we're at it.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:39 +0000 (15:16 -0500)]
ceph: messenger: reduce args to create_authorizer
Make use of the new ceph_auth_handshake structure in order to reduce
the number of arguments passed to the create_authorizor method in
ceph_auth_client_ops. Use a local variable of that type as a
shorthand in the get_authorizer method definitions.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: define ceph_auth_handshake type
The definitions for the ceph_mds_session and ceph_osd both contain
five fields related only to "authorizers." Encapsulate those fields
into their own struct type, allowing for better isolation in some
upcoming patches.
Fix the #includes in "linux/ceph/osd_client.h" to lay out their more
complete canonical path.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: messenger: check return from get_authorizer
In prepare_connect_authorizer(), a connection's get_authorizer
method is called but ignores its return value. This function can
return an error, so check for it and return it if that ever occurs.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 17 May 2012 02:51:59 +0000 (21:51 -0500)]
ceph: messenger: check prepare_write_connect() result
prepare_write_connect() can return an error, but only one of its
callers checks for it. All the rest are in functions that already
return errors, so it should be fine to return the error if one
gets returned.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: don't set WRITE_PENDING too early
prepare_write_connect() prepares a connect message, then sets
WRITE_PENDING on the connection. Then *after* this, it calls
prepare_connect_authorizer(), which updates the content of the
connection buffer already queued for sending. It's also possible it
will result in prepare_write_connect() returning -EAGAIN despite the
WRITE_PENDING big getting set.
Fix this by preparing the connect authorizer first, setting the
WRITE_PENDING bit only after that is done.
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: drop msgr argument from prepare_write_connect()
In all cases, the value passed as the msgr argument to
prepare_write_connect() is just con->msgr. Just get the msgr
value from the ceph connection and drop the unneeded argument.
The only msgr passed to prepare_write_banner() is also therefore
just the one from con->msgr, so change that function to drop the
msgr argument as well.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: messenger: send banner in process_connect()
prepare_write_connect() has an argument indicating whether a banner
should be sent out before sending out a connection message. It's
only ever set in one of its callers, so move the code that arranges
to send the banner into that caller and drop the "include_banner"
argument from prepare_write_connect().
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 16 May 2012 20:16:38 +0000 (15:16 -0500)]
ceph: messenger: reset connection kvec caller
Reset a connection's kvec fields in the caller rather than in
prepare_write_connect(). This ends up repeating a few lines of
code but it's improving the separation between distinct operations
on the connection, which we can take advantage of later.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 10 May 2012 15:29:50 +0000 (10:29 -0500)]
ceph: messenger: change read_partial() to take "end" arg
Make the second argument to read_partial() be the ending input byte
position rather than the beginning offset it now represents. This
amounts to moving the addition "to + size" into the caller.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 10 May 2012 15:29:50 +0000 (10:29 -0500)]
ceph: messenger: update "to" in read_partial() caller
read_partial() always increases whatever "to" value is supplied by
adding the requested size to it, and that's the only thing it does
with that pointed-to value.
Do that pointer advance in the caller (and then only when the
updated value will be subsequently used), and change the "to"
parameter to be an in-only and non-pointer value.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Alex Elder [Thu, 10 May 2012 15:29:50 +0000 (10:29 -0500)]
ceph: messenger: use read_partial() in read_partial_message()
There are two blocks of code in read_partial_message()--those that
read the header and footer of the message--that can be replaced by a
call to read_partial(). Do that.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Josh Durgin [Tue, 6 Dec 2011 02:25:13 +0000 (18:25 -0800)]
rbd: fix snapshot size type
Snapshot sizes should be the same type as regular image sizes. This
only affects their displayed size in sysfs, not the reported size of
an actual block device sizes.
Josh Durgin [Mon, 21 Nov 2011 21:04:42 +0000 (13:04 -0800)]
rbd: store snapshot id instead of index
When a device was open at a snapshot, and snapshots were deleted or
added, data from the wrong snapshot could be read. Instead of
assuming the snap context is constant, store the actual snap id when
the device is initialized, and rely on the OSDs to signal an error
if we try reading from a snapshot that was deleted.
Xi Wang [Fri, 20 Apr 2012 20:49:44 +0000 (15:49 -0500)]
rbd: fix integer overflow in rbd_header_from_disk()
ondisk->snap_count is read from disk via rbd_req_sync_read() and thus
needs validation. Otherwise, a bogus `snap_count' could overflow the
kmalloc() size, leading to memory corruption.
Also use `u32' consistently for `snap_count'.
[elder@dreamhost.com: changed to use UINT_MAX rather than ULONG_MAX]
Signed-off-by: Xi Wang <xi.wang@gmail.com> Reviewed-by: Alex Elder <elder@dreamhost.com>
Alex Elder [Fri, 20 Apr 2012 20:49:43 +0000 (15:49 -0500)]
ceph: osd_client: fix endianness bug in osd_req_encode_op()
From Al Viro <viro@zeniv.linux.org.uk>
Al Viro noticed that we were using a non-cpu-encoded value in
a switch statement in osd_req_encode_op(). The result would
clearly not work correctly on a big-endian machine.
Sage Weil [Mon, 7 May 2012 22:35:09 +0000 (15:35 -0700)]
crush: adjust local retry threshold
This small adjustment reflects a change that was made in ceph.git commit af6a9f30696c900a2a8bd7ae24e8ed15fb4964bb, about 6 months ago. An N-1
search is not exhaustive. Fixed ceph.git bug #1594.
Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 7 May 2012 22:33:36 +0000 (15:33 -0700)]
ceph: drop support for preferred_osd pgs
This was an ill-conceived feature that has been removed from Ceph. Do
this gracefully:
- reject attempts to specify a preferred_osd via the ioctl
- stop exposing this information via virtual xattrs
- always fill in -1 for requests, in case we talk to an older server
- don't calculate preferred_osd placements/pgids
Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>
Alex Elder [Wed, 4 Apr 2012 18:35:44 +0000 (13:35 -0500)]
rbd: don't hold spinlock during messenger flush
A recent change made changes to the rbd_client_list be protected by
a spinlock. Unfortunately in rbd_put_client(), the lock is taken
before possibly dropping the last reference to an rbd_client, and on
the last reference that eventually calls flush_workqueue() which can
sleep.
The problem was flagged by a debug spinlock warning:
BUG: spinlock wrong CPU on CPU#3, rbd/27814
The solution is to move the spinlock acquisition and release inside
rbd_client_release(), which is the spot where it's really needed for
protecting the removal of the rbd_client from the client list.
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Josh Durgin [Tue, 22 Nov 2011 01:11:12 +0000 (17:11 -0800)]
rbd: move snap_rwsem to the device, rename to header_rwsem
A new temporary header is allocated each time the header changes, but
only the changed properties are copied over. We don't need a new
semaphore for each header update.
This addresses http://tracker.newdream.net/issues/2174
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com> Reviewed-by: Alex Elder <elder@dreamhost.com>
Alex Elder [Thu, 8 Mar 2012 22:50:09 +0000 (16:50 -0600)]
ceph: fix three bugs, two in ceph_vxattrcb_file_layout()
In ceph_vxattrcb_file_layout(), there is a check to determine
whether a preferred PG should be formatted into the output buffer.
That check assumes that a preferred PG number of 0 indicates "no
preference," but that is wrong. No preference is indicated by a
negative (specifically, -1) PG number.
In addition, if that condition yields true, the preferred value
is formatted into a sized buffer, but the size consumed by the
earlier snprintf() call is not accounted for, opening up the
possibilty of a buffer overrun.
Finally, in ceph_vxattrcb_dir_rctime() where the nanoseconds part of
the time displayed did not include leading 0's, which led to
erroneous (sub-second portion of) time values being shown.
This fixes these three issues:
http://tracker.newdream.net/issues/2155
http://tracker.newdream.net/issues/2156
http://tracker.newdream.net/issues/2157
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 7 Mar 2012 17:40:08 +0000 (11:40 -0600)]
libceph: isolate kmap() call in write_partial_msg_pages()
In write_partial_msg_pages(), every case now does an identical call
to kmap(page). Instead, just call it once inside the CRC-computing
block where it's needed. Move the definition of kaddr inside that
block, and make it a (char *) to ensure portable pointer arithmetic.
We still don't kunmap() it until after the sendpage() call, in case
that also ends up needing to use the mapping.
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 7 Mar 2012 17:40:08 +0000 (11:40 -0600)]
libceph: rename "page_shift" variable to something sensible
In write_partial_msg_pages() there is a local variable used to
track the starting offset within a bio segment to use. Its name,
"page_shift" defies the Linux convention of using that name for
log-base-2(page size).
Since it's only used in the bio case rename it "bio_offset". Use it
along with the page_pos field to compute the memory offset when
computing CRC's in that function. This makes the bio case match the
others more closely.
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 7 Mar 2012 17:40:08 +0000 (11:40 -0600)]
libceph: get rid of zero_page_address
There's not a lot of benefit to zero_page_address, which basically
holds a mapping of the zero page through the life of the messenger
module. Even with our own mapping, the sendpage interface where
it's used may need to kmap() it again. It's almost certain to
be in low memory anyway.
So stop treating the zero page specially in write_partial_msg_pages()
and just get rid of zero_page_address entirely.
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 7 Mar 2012 17:40:08 +0000 (11:40 -0600)]
libceph: use kernel_sendpage() for sending zeroes
If a message queued for send gets revoked, zeroes are sent over the
wire instead of any unsent data. This is done by constructing a
message and passing it to kernel_sendmsg() via ceph_tcp_sendmsg().
Since we are already working with a page in this case we can use
the sendpage interface instead. Create a new ceph_tcp_sendpage()
helper that sets up flags to match the way ceph_tcp_sendmsg()
does now.
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 7 Mar 2012 17:40:08 +0000 (11:40 -0600)]
libceph: fix inverted crc option logic
CRC's are computed for all messages between ceph entities. The CRC
computation for the data portion of message can optionally be
disabled using the "nocrc" (common) ceph option. The default is
for CRC computation for the data portion to be enabled.
Unfortunately, the code that implements this feature interprets the
feature flag wrong, meaning that by default the CRC's have *not*
been computed (or checked) for the data portion of messages unless
the "nocrc" option was supplied.
Fix this, in write_partial_msg_pages() and read_partial_message().
Also change the flag variable in write_partial_msg_pages() to be
"no_datacrc" to match the usage elsewhere in the file.
This fixes http://tracker.newdream.net/issues/2064
Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 15 Feb 2012 13:43:55 +0000 (07:43 -0600)]
libceph: some simple changes
Nothing too big here.
- define the size of the buffer used for consuming ignored
incoming data using a symbolic constant
- simplify the condition determining whether to unmap the page
in write_partial_msg_pages(): do it for crc but not if the
page is the zero page
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 15 Feb 2012 13:43:54 +0000 (07:43 -0600)]
libceph: small refactor in write_partial_kvec()
Make a small change in the code that counts down kvecs consumed by
a ceph_tcp_sendmsg() call. Same functionality, just blocked out
a little differently.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 15 Feb 2012 13:43:54 +0000 (07:43 -0600)]
libceph: do crc calculations outside loop
Move blocks of code out of loops in read_partial_message_section()
and read_partial_message(). They were only was getting called at
the end of the last iteration of the loop anyway.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 15 Feb 2012 13:43:54 +0000 (07:43 -0600)]
libceph: use "do" in CRC-related Boolean variables
Change the name (and type) of a few CRC-related Boolean local
variables so they contain the word "do", to distingish their purpose
from variables used for holding an actual CRC value.
Note that in the process of doing this I identified a fairly serious
logic error in write_partial_msg_pages(): the value of "do_crc"
assigned appears to be the opposite of what it should be. No
attempt to fix this is made here; this change preserves the
erroneous behavior. The problem I found is documented here:
http://tracker.newdream.net/issues/2064
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 15 Feb 2012 13:43:54 +0000 (07:43 -0600)]
ceph: ensure Boolean options support both senses
Many ceph-related Boolean options offer the ability to both enable
and disable a feature. For all those that don't offer this, add
a new option so that they do.
Note that ceph_show_options()--which reports mount options currently
in effect--only reports the option if it is different from the
default value.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Tue, 14 Feb 2012 20:05:33 +0000 (14:05 -0600)]
libceph: a few small changes
This gathers a number of very minor changes:
- use %hu when formatting the a socket address's address family
- null out the ceph_msgr_wq pointer after the queue has been
destroyed
- drop a needless cast in ceph_write_space()
- add a WARN() call in ceph_state_change() in the event an
unrecognized socket state is encountered
- rearrange the logic in ceph_con_get() and ceph_con_put() so
that:
- the reference counts are only atomically read once
- the values displayed via dout() calls are known to
be meaningful at the time they are formatted
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Tue, 14 Feb 2012 20:05:33 +0000 (14:05 -0600)]
libceph: make ceph_tcp_connect() return int
There is no real need for ceph_tcp_connect() to return the socket
pointer it creates, since it already assigns it to con->sock, which
is visible to the caller. Instead, have it return an error code,
which tidies things up a bit.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Tue, 14 Feb 2012 20:05:33 +0000 (14:05 -0600)]
libceph: encapsulate connection kvec operations
Encapsulate the operation of adding a new chunk of data to the next
open slot in a ceph_connection's out_kvec array. Also add a "reset"
operation to make subsequent add operations start at the beginning
of the array again.
Use these routines throughout, avoiding duplicate code and ensuring
all calls are handled consistently.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Tue, 14 Feb 2012 20:05:33 +0000 (14:05 -0600)]
libceph: move prepare_write_banner()
One of the arguments to prepare_write_connect() indicates whether it
is being called immediately after a call to prepare_write_banner().
Move the prepare_write_banner() call inside prepare_write_connect(),
and reinterpret (and rename) the "after_banner" argument so it
indicates that prepare_write_connect() should *make* the call
rather than should know it has already been made.
This was split out from the next patch to highlight this change in
logic.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Wed, 8 Feb 2012 22:11:14 +0000 (16:11 -0600)]
rbd: don't drop the rbd_id too early
Currently an rbd device's id is released when it is removed, but it
is done before the code is run to clean up sysfs-related files (such
as /sys/bus/rbd/devices/1).
It's possible that an rbd is still in use after the rbd_remove()
call has been made. It's essentially the same as an active inode
that stays around after it has been removed--until its final close
operation. This means that the id shows up as free for reuse at a
time it should not be.
The effect of this was seen by Jens Rehpoehler, who:
- had a filesystem mounted on an rbd device
- unmapped that filesystem (without unmounting)
- found that the mount still worked properly
- but hit a panic when he attempted to re-map a new rbd device
This re-map attempt found the previously-unmapped id available.
The subsequent attempt to reuse it was met with a panic while
attempting to (re-)install the sysfs entry for the new mapped
device.
Fix this by holding off "putting" the rbd id, until the rbd_device
release function is called--when the last reference is finally
dropped.
Note: This fixes: http://tracker.newdream.net/issues/1907
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Tue, 7 Feb 2012 18:03:37 +0000 (12:03 -0600)]
rbd: small changes
Here is another set of small code tidy-ups:
- Define SECTOR_SHIFT and SECTOR_SIZE, and use these symbolic
names throughout. Tell the blk_queue system our physical
block size, in the (unlikely) event we want to use something
other than the default.
- Delete the definition of struct rbd_info, which is never used.
- Move the definition of dev_to_rbd() down in its source file,
just above where it gets first used, and change its name to
dev_to_rbd_dev().
- Replace an open-coded operation in rbd_dev_release() to use
dev_to_rbd_dev() instead.
- Calculate the segment size for a given rbd_device just once in
rbd_init_disk().
- Use the '%zd' conversion specifier in rbd_snap_size_show(),
since the value formatted is a size_t.
- Switch to the '%llu' conversion specifier in rbd_snap_id_show().
since the value formatted is unsigned.
Alex Elder [Tue, 7 Feb 2012 18:03:36 +0000 (12:03 -0600)]
rbd: do some refactoring
A few blocks of code are rearranged a bit here:
- In rbd_header_from_disk():
- Don't bother computing snap_count until we're sure the
on-disk header starts with a good signature.
- Move a few independent lines of code so they are *after* a
check for a failed memory allocation.
- Get rid of unnecessary local variable "ret".
- Make a few other changes in rbd_read_header(), similar to the
above--just moving things around a bit while preserving the
functionality.
- In rbd_rq_fn(), just assign rq in the while loop's controlling
expression rather than duplicating it before and at the end of
the loop body. This allows the use of "continue" rather than
"goto next" in a number of spots.
- Rearrange the logic in snap_by_name(). End result is the same.
Alex Elder [Tue, 7 Feb 2012 18:03:36 +0000 (12:03 -0600)]
rbd: fix module sysfs setup/teardown code
Once rbd_bus_type is registered, it allows an "add" operation via
the /sys/bus/rbd/add bus attribute, and adding a new rbd device that
way establishes a connection between the device and rbd_root_dev.
But rbd_root_dev is not registered until after the rbd_bus_type
registration is complete. This could (in principle anyway) result
in an invalid state.
Since rbd_root_dev has no tie to rbd_bus_type we can reorder these
two initializations and never be faced with this scenario.
In addition, unregister the device in the event the bus registration
fails at module init time.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Thu, 2 Feb 2012 14:13:30 +0000 (08:13 -0600)]
rbd: don't allocate mon_addrs buffer in rbd_add()
The mon_addrs buffer in rbd_add is used to hold a copy of the
monitor IP addresses supplied via /sys/bus/rbd/add. That is
passed to rbd_get_client(), which never modifies it (nor do
any of the functions it gets passed to thereafter)--the mon_addr
parameter to rbd_get_client() is a pointer to constant data, so it
can't be modifed. Furthermore, rbd_get_client() has the length of
the mon_addrs buffer and that is used to ensure nothing goes beyond
its end.
Based on all this, there is no reason that a buffer needs to
be used to hold a copy of the mon_addrs provided via
/sys/bus/rbd/add. Instead, the location within that passed-in
buffer can be provided, along with the length of the "token"
therein which represents the monitor IP's.
A small change to rbd_add_parse_args() allows the address within the
buffer to be passed back, and the length is already returned. This
now means that, at least from the perspective of this interface,
there is no such thing as a list of monitor addresses that is too
long.
Alex Elder [Thu, 2 Feb 2012 14:13:30 +0000 (08:13 -0600)]
rbd: have rbd_parse_args() report found mon_addrs size
The argument parsing routine already computes the size of the
mon_addrs buffer it extracts from the "command." Pass it to the
caller so it can use it to provide the length to rbd_get_client().
Alex Elder [Thu, 2 Feb 2012 14:13:30 +0000 (08:13 -0600)]
rbd: don't use sscanf() in rbd_add_parse_args()
Make use of a few simple helper routines to parse the arguments
rather than sscanf(). This will treat both missing and too-long
arguments as invalid input (rather than silently truncating the
input in the too-long case). In time this can also be used by
rbd_add() to use the passed-in buffer in place, rather than copying
its contents into new buffers.
It appears to me that the sscanf() previously used would not
correctly handle a supplied snapshot--the two final "%s" conversion
specifications were not separated by a space, and I'm not sure
how sscanf() handles that situation. It may not be well-defined.
So that may be a bug this change fixes (but I didn't verify that).
The sizes of the mon_addrs and options buffers are now passed to
rbd_add_parse_args(), so they can be supplied to copy_token().
Alex Elder [Thu, 2 Feb 2012 14:13:30 +0000 (08:13 -0600)]
rbd: encapsulate argument parsing for rbd_add()
Move the code that parses the arguments provided to rbd_add() (which
are supplied via /sys/bus/rbd/add) into a separate function.
Also rename the "mon_dev_name" variable in rbd_add() to be
"mon_addrs". The variable represents a list of one or more
comma-separated monitor IP addresses, each with an optional port
number. I think "mon_addrs" captures that notion a little better.
Alex Elder [Thu, 2 Feb 2012 14:13:30 +0000 (08:13 -0600)]
rbd: reduce memory used for rbd_dev fields
The length of the string containing the monitor address
specification(s) will never exceed the length of the string passed
in to rbd_add(). The same holds true for the ceph + rbd options
string. So reduce the amount of memory allocated for these to
that length rather than the maximum (1024 bytes).
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
Alex Elder [Thu, 2 Feb 2012 14:13:30 +0000 (08:13 -0600)]
rbd: have rbd_get_client() return a rbd_client
Since rbd_get_client() currently returns an error code. It assigns
the rbd_client field of the rbd_device structure it is passed if
successful. Instead, have it return the created rbd_client
structure and return a pointer-coded error if there is an error.
This makes the assignment of the client pointer more obvious at the
call site.
Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>