[openib-general] Bugzilla Bug 329: HCA_FATAL_EVENT cause to OpenSM to stop functioning

Michael S. Tsirkin mst at mellanox.co.il
Wed Jan 31 10:31:22 PST 2007


> Quoting Hal Rosenstock <halr at voltaire.com>:
> Subject: Re: Bugzilla Bug 329: HCA_FATAL_EVENT cause to OpenSM to stop functioning
> 
> Hi Yevgeny,
> 
> On Wed, 2007-01-31 at 05:16, Yevgeny Kliteynik wrote:
> > Hi Hal.
> > 
> > I noticed the following bug in Bugzilla:
> > 
> > 	Bugzilla Bug 329: HCA_FATAL_EVENT cause to opensm to stop functioning
> > 	  https://bugs.openfabrics.org/show_bug.cgi?id=329
> > 
> > 	When an HCA fatal event occurs on the host that opensm is running on,
> > 	opensm stops functioning (after the event, the driver restarts the device,
> > 	and the port does not return to the active state).
> > 
> > 	If opensm runs in sweep mode, you can see that it stops sweeping after
> > 	the event.
> > 
> > I remember that a couple of months ago I sent a patch that takes care of this problem:
> >  - on IBV_EVENT_DEVICE_FATAL, osm was forced to exit
> >  - on IBV_EVENT_PORT_ERR, osm initiated a heavy sweep
> > 
> > The problem with my patch was that it made osm depend on the uverbs module.
> > To resolve this, support should be added in umad, and then osm could
> > use that support.
> > 
> > Do you know if any work in this area has been done in umad?
> 
> This has been on the list but unfortunately there has been no time yet
> to work on the local events support in libibumad.

I do not think making libibumad depend on the ib_uverbs module is a good idea either.
More precisely, the problem is in ib_umad, which does not report hotplug events.
If we just make ib_umad return an error code to the user on hotplug,
the problem will go away without userspace changes.
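
To illustrate the idea, here is a minimal sketch of the kind of error path a
umad consumer might run after each receive attempt. It assumes the kernel would
signal the hotplug as ENODEV; that errno choice is hypothetical and is not
specified anywhere in this thread. The names sm_action and sm_action_for_recv
are made up for the example and are not part of opensm or libibumad.

```c
#include <errno.h>

/* What the caller should do after one receive attempt. */
enum sm_action {
    SM_CONTINUE,       /* a MAD was received; process it */
    SM_RETRY,          /* transient condition; call receive again */
    SM_REOPEN_DEVICE,  /* device went away; reopen the port and resweep */
    SM_FAIL            /* unexpected error; give up */
};

/*
 * ret: return value of the receive call (negative means failure).
 * err: errno value saved right after the failing call.
 * Treating ENODEV as the hotplug indication is an assumption made for
 * this sketch, not documented ib_umad behavior.
 */
enum sm_action sm_action_for_recv(int ret, int err)
{
    if (ret >= 0)
        return SM_CONTINUE;

    switch (err) {
    case EINTR:
    case EAGAIN:
    case ETIMEDOUT:
        return SM_RETRY;
    case ENODEV:
        return SM_REOPEN_DEVICE;
    default:
        return SM_FAIL;
    }
}
```

A receive loop would call this with the result of each receive and the saved
errno, and would only need to distinguish "try again" from "the device is gone".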

-- 
MST



