[openib-general] Re: opensm - umad_receiver break on alloc errors

Hal Rosenstock halr at voltaire.com
Sun Nov 20 05:43:57 PST 2005


Hi Yael,

On Sun, 2005-11-20 at 04:20, Yael Kalka wrote:
> Hi Hal,
> 
> While reviewing the umad_receiver function in osm_vendor_ibumad.c we've
> noticed that when umad_alloc() calls fail, the receiver breaks.
> What happens then is that the SM continues to live, though the
> umad_receiver thread doesn't exist anymore.
> I think that there is no use in keeping the SM alive in this case.
> As a result, I think we should do one of the following when umad_alloc()
> fails:
> 1. If umad_alloc() fails - issue an error to the syslog, and exit SM.
> This is a fatal case.
> 2. Use continue instead of break, on the assumption that a umad_alloc()
> failure this time doesn't mean it'll fail again.

In general, my concern with a plain continue is a tight loop: if the
alloc keeps failing, the thread just retries over and over. I thought
about some strategies to dial this back (e.g. some artificial delay
before the next alloc is retried).
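
To make the delay idea concrete, here is a rough sketch. It is not from
the opensm tree: alloc_with_backoff, the two hook types and the test
doubles are made-up names, and the doubling delay is only one possible
policy. The allocator and the sleep are passed in as hooks so the sketch
builds without libibumad.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical hooks so the sketch builds without libibumad. In the
 * receiver, alloc would wrap umad_alloc() and wait_ms a real sleep. */
typedef void *(*alloc_fn)(size_t size);
typedef void (*wait_fn)(unsigned ms);

/* Retry a failed allocation with a doubling delay, capped at
 * max_delay_ms. Returns NULL after max_tries failures so the caller
 * can treat that as fatal (log to syslog, exit the SM). */
static void *alloc_with_backoff(alloc_fn alloc, wait_fn wait_ms, size_t size,
                                unsigned max_tries, unsigned delay_ms,
                                unsigned max_delay_ms)
{
    for (unsigned attempt = 0; attempt < max_tries; attempt++) {
        void *buf = alloc(size);
        if (buf)
            return buf;
        if (attempt + 1 < max_tries) {  /* no sleep after the last try */
            wait_ms(delay_ms);
            delay_ms = (delay_ms * 2 > max_delay_ms) ? max_delay_ms
                                                     : delay_ms * 2;
        }
    }
    return NULL;
}

/* Test doubles: an allocator that fails a set number of times and a
 * wait hook that only records how long it was asked to wait. */
static unsigned fail_remaining;
static unsigned total_slept_ms;

static void *flaky_alloc(size_t size)
{
    if (fail_remaining > 0) {
        fail_remaining--;
        return NULL;
    }
    return malloc(size);
}

static void record_wait(unsigned ms)
{
    total_slept_ms += ms;
}
```

With two failures and a 10 ms starting delay, this waits 10 ms and then
20 ms before the third attempt succeeds. A NULL return after max_tries
is where option 1 (log and exit) could still apply.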

There are 2 calls to umad_alloc in the umad_receiver. The first one is
just to allocate a normal sized MAD. This is the one which has the issue
above IMO. The second call is for a larger send. That one should
definitely be changed from a break to a continue. Either you can issue a
patch for this or I can fix it. This part is a one liner :-)
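
For the second alloc, the difference between break and continue can be
shown with a toy loop. This is only an illustration, not the actual
umad_receiver code: receive_all, limited_alloc, NORMAL_MAD_SIZE and the
1024-byte limit are all made up for the sketch.

```c
#include <stddef.h>
#include <stdlib.h>

#define NORMAL_MAD_SIZE 256  /* assumed normal MAD size for this sketch */

/* Hypothetical allocator that cannot satisfy large requests. */
static void *limited_alloc(size_t size)
{
    return size > 1024 ? NULL : malloc(size);
}

/* Walk a list of incoming message lengths the way a receive loop might.
 * Messages larger than NORMAL_MAD_SIZE need a bigger buffer. Returns
 * the number of messages handled. */
static int receive_all(const size_t *lengths, int count, int keep_going)
{
    int handled = 0;

    for (int i = 0; i < count; i++) {
        void *big = NULL;

        if (lengths[i] > NORMAL_MAD_SIZE) {
            big = limited_alloc(lengths[i]);
            if (!big) {
                if (keep_going)
                    continue;  /* drop this one message, keep receiving */
                break;         /* loop ends; later messages are never seen */
            }
        }
        handled++;
        free(big);  /* free(NULL) is a no-op for normal sized messages */
    }
    return handled;
}
```

Given lengths of 256, 4096, 256 and 512, the continue variant handles
three messages and only loses the oversized one, while the break variant
stops after the first.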

Should we do something about the first alloc failure?

Thanks.

-- Hal

> What do you think?
> Yael
> 



