[openib-general] Re: Some OpenSM 1.8.0 Anomalies

Hal Rosenstock halr at voltaire.com
Fri Sep 9 08:29:46 PDT 2005


On Thu, 2005-09-08 at 09:02, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> >>>>>Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find destination port with LID:0x0007
> >>>>
> >>>>This means that the LID of the port registered as the source for this inform info is not recognized as a valid LID.
> >>>>
> >>>>
> >>>>>...
> >>>>>Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find source port with GUID:0x0008f10403960559
> >>>>
> >>>>The meaning of this is that the incoming trap source is not a recognized (included in the SM database) guid
> >>>
> >>>

> > It looks like it occurs on SM port down which seems OK. 
> OK that explains it:
> The errors occur when the SM port has gone down. In that case all the ports that were previously
> found on the fabric are now inaccessible. The SM should Report(Notice with trap #65) for each of these ports.

Right, GID out of service should be and is indicated.

> For that sake it scans through the InformInfo database.
> Apparently an InformInfo with LID=7 has requested this report.
> But LID 7 does not exist anymore

It exists. It is just not reachable via GS (SA) LID routed packets.

>  - so the first message is valid:

Not sure what you mean exactly by valid here.

>  > Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find destination port with LID:0x0007
> (actually this should have caused the InformInfo record to be deleted... which I do not think is happening)

What should have caused the InformInfo record to be deleted ? This error
being detected ? If so, should the deletion wait for the error, or should
it happen when the SM port goes down (clearing the inform list, perhaps
with the exception of the local node) ? That would mean reregistration
is required when the node comes back. SA clients won't necessarily do
this when the SM port comes back without something like
ClientReregistration.
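
To make the second option concrete, here is a rough sketch of clearing the
inform list on SM port down while keeping the local node's subscriptions.
This is not OpenSM code: struct inform_rec, purge_remote_informs() and the
idea of keying the exception on a node GUID are all hypothetical, chosen
only to illustrate the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified InformInfo subscription; not the OpenSM type. */
struct inform_rec {
    uint16_t subscriber_lid;        /* LID that registered for reports */
    uint64_t subscriber_node_guid;  /* node GUID of the subscriber */
};

/*
 * Drop every subscription except those owned by the local node.
 * Compacts the array in place and returns the number of records kept.
 */
size_t purge_remote_informs(struct inform_rec *recs, size_t n,
                            uint64_t local_node_guid)
{
    size_t kept = 0;

    assert(recs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Assumed exception: the local node stays registered. */
        if (recs[i].subscriber_node_guid == local_node_guid)
            recs[kept++] = recs[i];
    }
    return kept;
}
```

Anything dropped this way would have to reregister once the SM port is
back, which is exactly where ClientReregistration would be needed.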

> Later we see the following error:
>  > Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find source port with GUID:0x0008f10403960559
> This occurs during the section where node 0x0008f10403960559 is being removed from the SMDB.
> 
> The code in osm_inform.c says:
>    /* Check if there is a pkey match. o13-17.1.1*/

Where is this performed ?

>    /* Check if the issuer of the trap is the SM. If it is, then the pkey
                                                                      ^^
                                                                     gid
>       comparison should be done on the trap source (saved as the gid in the
>       data details field).
>       If the issuer gid is not the SM - then it is the guid of the trap
>       source. */
>    if ( (cl_ntoh64(p_ntc->issuer_gid.unicast.prefix) == p_subn->opt.subnet_prefix) &&
>         (cl_ntoh64(p_ntc->issuer_gid.unicast.interface_id) == p_subn->sm_port_guid) )
>    {
>      /* The issuer is the SM this is trap 64-67 - compare the pkey
>         with the gid saved on the data details. */
>      source_gid = p_ntc->data_details.ntc_64_67.gid;
>    }
>    else
>    {
>      source_gid = p_ntc->issuer_gid;
>    }
> 
> In our case the trap is 65 and is sent by the SM. However, the spec requires checking
> that the port being torn down and the target of the Report share a PKey.

I'm not sure what you are referring to in the spec. In any case,
shouldn't the local ports perhaps be an exception to this ?

>  In our case the
> source of the event is considered to be the port that is torn down. (As we want to
> prevent any case where ports not sharing a PKey will get reports on each other).
> But since the "source" port is being torn down we cannot find its PKey table ...
> (actually we look first in the Port by LID table - and cannot find it).
> 
> This means we will never send Report(Notice trap#65) to any node.
> How do we solve that bug? Maybe we have a way to find the "source" port PKey that
> is not yet corrupted.

I'm not totally following this because of the PKey vs. GID issue above, and
I think local ports may need to be treated differently.

-- Hal



