[openib-general] SM Bad Port Handling

Eitan Zahavi eitan at mellanox.co.il
Fri Apr 8 13:05:57 PDT 2005


Hi Hal,

I have looked up the mail thread. 
I could not find a log file indicating there was a repetitive query of the
bad port.

I know the code and I cannot find a reason for it to query a port repeatedly.
 
Can you explain how this can happen? 


Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Friday, April 08, 2005 10:07 PM
> To: Eitan Zahavi
> Cc: openib-general at openib.org
> Subject: RE: [openib-general] SM Bad Port Handling
> 
> On Fri, 2005-04-08 at 11:55, Eitan Zahavi wrote:
> > Hi Hal,
> >
> > This is a physical port attribute so the file is osm_port.h and the
> > structure is osm_physp_t.
> > From the doc on the structure:
> > *
> > *  healthy
> > *     Tracks the health of the port. Normally should be TRUE but
> > *     might change as a result of incoming traps indicating the port
> > *     healthy is questionable.
> > *
> 
> Yup. It's definitely there in the gen2 code base.
> 
> > I have been trying my best to find how it can happen that a port that
> > does not respond will cause OpenSM to continuously poll it. This can
> > not happen so unless you can explain how it happens please do not
> > contaminate the code with un-needed code.
> 
> This part of the code has not been touched. I've put all meaningful
> patches and ideas on how things might change out on this list.
> 
> > The only thing that comes to mind is the case of a failure to "Set" some
> > attributes of devices in the fabric.
> 
> I dug out Ron's emails on this. I don't think that was what was going
> on. It was a SubnGet(NodeInfo) which failed. See
> http://openib.org/pipermail/openib-general/2005-February/009125.html
> for more details.
> 
> -- Hal
> 
> > This happens only after discovery is completed and only after the
> > validity of the data base is verified (i.e. each node has ports, each
> > port has a node ...)
> >
> > In the case of a failure to set some attributes (e.g. LFT, PortInfo, etc.),
> > OpenSM will output a clear error message: "Errors in intialization"
> > and will restart a full sweep of the fabric.
> >
> > This is the only way one can get infinite polling of the entire subnet.
> 
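To make the restart behaviour described above concrete, here is a minimal sketch in C. It is only a toy model, not OpenSM code: toy_fabric_t, toy_set_attributes, toy_sweep_until_clean and the max_sweeps cap are all hypothetical names introduced for illustration, and the message it prints is not the real log text.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of a fabric: sets_to_fail is how many upcoming sweeps will
 * hit a failed Set (for example of an LFT or PortInfo).
 * Hypothetical type, not taken from OpenSM. */
typedef struct toy_fabric {
    int sets_to_fail;
} toy_fabric_t;

/* Stand-in for the attribute-setting stage, which runs only after
 * discovery and database verification have completed. */
bool toy_set_attributes(toy_fabric_t *f)
{
    if (f->sets_to_fail > 0) {
        f->sets_to_fail--;
        return false;
    }
    return true;
}

/* Sweep until the Set stage succeeds or max_sweeps is exhausted.
 * Returns the number of sweeps run, or -1 if none completed cleanly.
 * The cap is an assumption of this sketch. */
int toy_sweep_until_clean(toy_fabric_t *f, int max_sweeps)
{
    for (int sweep = 1; sweep <= max_sweeps; sweep++) {
        /* Discovery and database verification would happen here. */
        if (toy_set_attributes(f))
            return sweep;
        fprintf(stderr, "toy: Set failed, restarting full sweep\n");
    }
    return -1;
}
```

Without the cap, a Set that fails on every sweep would keep this loop running forever, which is the whole-subnet polling case described above.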
> > In general, it might make sense to improve how OpenSM qualifies each
> > fabric port, based on statistics of the number of packets dropped
> > versus good packets that passed through it. Note this is complex
> > because a port might affect packets that go through it, and there is
> > no way to know at which hop on the path a packet was dropped.
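
The per-port statistics idea in the last paragraph could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing OpenSM facility: toy_port_stats_t, toy_record_path, toy_suspect_ratio and TOY_MAX_PORTS are hypothetical names, and ports are identified by a plain array index.

```c
#include <stddef.h>

#define TOY_MAX_PORTS 64  /* arbitrary table size for the sketch */

/* Per-port counters: packets delivered over a path through this port,
 * and packets lost somewhere on a path that included this port.
 * Hypothetical type, not taken from OpenSM. */
typedef struct toy_port_stats {
    unsigned good;
    unsigned suspect;
} toy_port_stats_t;

/* Record one packet's outcome against every port on its path. A drop
 * is charged to all hops, because the dropping hop is unknown. */
void toy_record_path(toy_port_stats_t stats[], const size_t path[],
                     size_t hops, int delivered)
{
    for (size_t i = 0; i < hops; i++) {
        if (path[i] >= TOY_MAX_PORTS)
            continue;  /* ignore out-of-range port indices */
        if (delivered)
            stats[path[i]].good++;
        else
            stats[path[i]].suspect++;
    }
}

/* Fraction of packets through this port that were lost; 0.0 if unseen. */
double toy_suspect_ratio(const toy_port_stats_t *p)
{
    unsigned total = p->good + p->suspect;
    return total ? (double)p->suspect / (double)total : 0.0;
}
```

Because one lost packet raises the suspect count of every hop on its path, a healthy port that merely shares paths with a bad one will also look worse. Only traffic over many different paths can separate the two, which is the complexity noted above.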