[Openib-windows] IPoIB crash

Fabian Tillier ftillier at silverstorm.com
Wed Sep 6 10:00:40 PDT 2006


Hi Yossi,

On 9/5/06, Yossi Leybovich <sleybo at mellanox.co.il> wrote:
>
> Fab
>
> Regarding the lock when changing endpnt fields
>
> what about this code:
>
>  /* Create the AV. */
>  status = p_port->p_adapter->p_ifc->create_av(
>   p_port->ib_mgr.h_pd, &av_attr, &p_endpt->h_av );
>  if( status != IB_SUCCESS )
>  {
>   p_port->p_adapter->hung = TRUE;
>   ipoib_endpt_deref( p_endpt );
>   cl_obj_unlock( &p_endpt->obj );
>   IPOIB_PRINT_EXIT( TRACE_LEVEL_ERROR, IPOIB_DBG_ERROR,
>    ("ib_create_av failed with %s\n",
>    p_port->p_adapter->p_ifc->get_err_str( status )) );
>   return;
>  }
>
> 1. here we change the av without taking any lock, is it bug ?

I would say yes.  But the problem is actually larger.  All the changes
we did to handle SM reregister stuff totally screwed up
synchronization.  For example, the __endpt_mgr_reset_all function
destroys the endpoint AVs, but these could still be referenced in a
send posted to the send queue.  I think the only safe way to destroy
the AV is to destroy the endpoint, because that is properly reference
counted.

So anytime we want to actually preserve the endpoint, we need to
destroy it and then create a new one with the same attributes.  This
however open up a window where we fail to create the new endpoint due
to resource constraints, and we end up with a hole in our endpoint
cache while the OS's ARP table might still have an entry for this
destination.

Alternatively, we can avoid destroying the AVs all together and just
modify them.  When the endpoint is created we could create an AV with
some bogus settings, and then modify that AV with the proper settings
when needed.  Whenever we want to force an update (i.e. a new path
query), we would set a flag.  There is no harm in updating the AV
while it is in use.  I'll start implementing this and see how it turns
out.

> 2. why do we need to release the lock on the endpt->obj we never took it ?

That's definitely a bug, probably from some prior incomplete code cleanup.

- Fab




More information about the ofw mailing list