[openib-general] RFC fix for userspace rdma cm crashes
Sean Hefty
mshefty at ichips.intel.com
Thu Jan 4 12:11:16 PST 2007
There's a problem with how rdma cm events are reported to userspace that can
lead to application crashes.
When a new connection request arrives, a context for the connection is allocated
in the kernel. The connection event is then reported to userspace. The
userspace library retrieves the event and allocates its own context for the
connection. The userspace context is associated with the kernel's context when
the user calls accept. This allows the kernel to return the userspace context
with later events.
A problem occurs if a second event for the same connection occurs before the
user has had a chance to call accept. The userspace context has not yet been
set, which causes librdmacm to crash. (This has been seen when the app
takes too long to call accept, resulting in the remote side timing out and
rejecting the connection.)
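
To make the failure mode concrete, here is a minimal sketch of the dispatch
step. All names (user_id, kernel_event, dispatch_event) are hypothetical
stand-ins, not the real kernel ABI or librdmacm structures. It assumes the
kernel returns the userspace context as an opaque cookie that stays 0 until
accept sets it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the real kernel ABI or librdmacm types. */
struct user_id {                /* userspace context for one connection */
    int events_seen;
};

struct kernel_event {           /* what the kernel reports with each event */
    uint64_t uid;               /* userspace context cookie; 0 until accept sets it */
    uint32_t kernel_id;         /* kernel's handle for the connection */
    int type;
};

enum { EV_CONNECT_REQUEST, EV_REJECTED };

/*
 * Returns 0 if the event was delivered to its userspace context,
 * -1 if the event carries no userspace context yet.
 */
static int dispatch_event(const struct kernel_event *ev)
{
    struct user_id *id = (struct user_id *)(uintptr_t)ev->uid;

    if (id == NULL)
        return -1;              /* dereferencing here unguarded is the crash */

    id->events_seen++;
    return 0;
}
```

A reject that arrives before accept has uid == 0, so this sketch returns -1
where an unguarded dereference would fault.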
I can think of a couple of possible fixes for this, but wanted to get input. I
believe that this can be fixed in either the kernel or userspace code. A kernel
fix could queue events until the context has been set. A userspace fix could
store its contexts in a map, and look up the correct one if it is not given.
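
As a rough illustration of the userspace option, the sketch below keeps
contexts in a small table and falls back to a lookup when an event arrives
without a context. It assumes each event also carries the kernel's id for the
connection, which is used as the key. The names (ctx_map, user_ctx,
resolve_ctx) and the fixed-size linear table are hypothetical choices for the
sketch, not existing librdmacm code.

```c
#include <stddef.h>
#include <stdint.h>

#define MAP_SLOTS 64            /* hypothetical fixed capacity for the sketch */

struct user_ctx {
    uint32_t kernel_id;         /* assumed key: kernel's id for the connection */
};

struct ctx_map {
    struct user_ctx *slot[MAP_SLOTS];
};

/* Record a context when the connection request is first seen. */
static int map_insert(struct ctx_map *m, struct user_ctx *ctx)
{
    for (size_t i = 0; i < MAP_SLOTS; i++) {
        if (m->slot[i] == NULL) {
            m->slot[i] = ctx;
            return 0;
        }
    }
    return -1;                  /* table full */
}

/* Linear search by kernel id; returns NULL if nothing matches. */
static struct user_ctx *map_find(const struct ctx_map *m, uint32_t kernel_id)
{
    for (size_t i = 0; i < MAP_SLOTS; i++) {
        if (m->slot[i] != NULL && m->slot[i]->kernel_id == kernel_id)
            return m->slot[i];
    }
    return NULL;
}

/* Prefer the context the kernel handed back; otherwise look it up. */
static struct user_ctx *resolve_ctx(const struct ctx_map *m,
                                    uint64_t uid, uint32_t kernel_id)
{
    if (uid != 0)
        return (struct user_ctx *)(uintptr_t)uid;
    return map_find(m, kernel_id);
}
```

With this shape, an event that arrives before accept resolves through the
map, while later events still use the context supplied by the kernel.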
- Sean