[openib-general] RFC fix for userspace rdma cm crashes

Steve Wise swise at opengridcomputing.com
Thu Jan 4 13:04:25 PST 2007


On Thu, 2007-01-04 at 12:11 -0800, Sean Hefty wrote:
> There's a problem with how rdma cm events are reported to userspace that can 
> lead to application crashes.
> 
> When a new connection request arrives, a context for the connection is allocated 
> in the kernel.  The connection event is then reported to userspace.  The 
> userspace library retrieves the event and allocates its own context for the 
> connection.  The userspace context is associated with the kernel's context when 
> accepting.  This allows the kernel to return the userspace context with later events.
> 
> A problem occurs if a second event for the same connection arrives before the 
> user has had a chance to call accept.  The userspace context has not yet been 
> set, which causes librdmacm to crash.  (This has been seen when the app 
> takes too long to call accept, resulting in the remote side timing out and 
> rejecting the connection.)
> 
> I can think of a couple of possible fixes for this, but wanted to get input.  I 
> believe that this can be fixed in either the kernel or userspace code.  A kernel 
> fix could queue events until the context has been set.  A userspace fix could 
> store its contexts in a map, and look up the correct one if it is not given.
> 

Does this affect kernel rdma cm users too in some way?  If not, then
perhaps the place to fix it is in userspace.
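
For discussion, here is a rough sketch of what the userspace option could
look like: the library keeps its contexts in a small map keyed by the
kernel's id for the connection, and falls back to a lookup when an event
arrives without a userspace context.  Every name here (user_ctx, ctx_map,
resolve_event_ctx, the bucket count) is made up for illustration and is not
the librdmacm API or its actual data layout; locking is also left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CTX_MAP_BUCKETS 64   /* hypothetical size, chosen arbitrarily */

/* Hypothetical userspace context for one connection. */
struct user_ctx {
    uint32_t         kernel_id;  /* id the kernel assigned to the connection */
    struct user_ctx *next;       /* chain of contexts in the same bucket */
};

/* Hypothetical map from kernel id to userspace context (chained hashing). */
struct ctx_map {
    struct user_ctx *bucket[CTX_MAP_BUCKETS];
};

static size_t ctx_bucket(uint32_t kernel_id)
{
    return kernel_id % CTX_MAP_BUCKETS;
}

/* Record a context as soon as the connection request event is retrieved. */
static void ctx_map_insert(struct ctx_map *map, struct user_ctx *ctx)
{
    assert(ctx != NULL);
    size_t b = ctx_bucket(ctx->kernel_id);
    ctx->next = map->bucket[b];
    map->bucket[b] = ctx;
}

/* Return the context registered for kernel_id, or NULL if there is none. */
static struct user_ctx *ctx_map_lookup(const struct ctx_map *map,
                                       uint32_t kernel_id)
{
    struct user_ctx *c;

    for (c = map->bucket[ctx_bucket(kernel_id)]; c != NULL; c = c->next)
        if (c->kernel_id == kernel_id)
            return c;
    return NULL;
}

/* Unlink the context for kernel_id, e.g. when the connection is destroyed. */
static void ctx_map_remove(struct ctx_map *map, uint32_t kernel_id)
{
    struct user_ctx **pp = &map->bucket[ctx_bucket(kernel_id)];

    while (*pp != NULL && (*pp)->kernel_id != kernel_id)
        pp = &(*pp)->next;
    if (*pp != NULL)
        *pp = (*pp)->next;
}

/*
 * Pick the context for an event: use the one the kernel handed back if it
 * was already set (the post-accept case), otherwise look it up by id (the
 * case where a second event arrives before accept).
 */
static struct user_ctx *resolve_event_ctx(const struct ctx_map *map,
                                          struct user_ctx *from_kernel,
                                          uint32_t kernel_id)
{
    return from_kernel != NULL ? from_kernel
                               : ctx_map_lookup(map, kernel_id);
}
```

With something like this, a reject that shows up before accept would
resolve to the context inserted when the connection request was read,
instead of dereferencing an unset pointer.  A NULL result would still have
to be handled by the caller.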








