[openib-general] Re: Some OpenSM 1.8.0 Anomalies
Hal Rosenstock
halr at voltaire.com
Fri Sep 9 08:29:46 PDT 2005
On Thu, 2005-09-08 at 09:02, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> >>>>>Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find destination port with LID:0x0007
> >>>>
> >>>>This means that the LID of the port registered as the source for this inform info is not recognized as a valid LID.
> >>>>
> >>>>
> >>>>>...
> >>>>>Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find source port with GUID:0x0008f10403960559
> >>>>
> >>>>The meaning of this is that the incoming trap source is not a recognized (included in the SM database) guid
> >>>
> >>>
> > It looks like it occurs on SM port down which seems OK.
> OK that explains it:
> The errors occur when the SM port has gone down. In that case all the ports that were previously
> found on the fabric are now inaccessible. The SM should send Report(Notice with trap #65) for each of these ports.
Right, GID out of service should be and is indicated.
> For that purpose it scans through the InformInfo database.
> Apparently an InformInfo with LID=7 has requested this report.
> But LID 7 does not exist anymore
It exists. It is just not reachable via GS (SA) LID routed packets.
> - so the first message is valid:
Not sure what you mean exactly by valid here.
> > Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find destination port with LID:0x0007
> (actually this should have caused the InformInfo record to be deleted... which I do not think is happening)
What should have caused the InformInfo record to be deleted? Detecting
this error? If so, should the deletion wait for the error, or should it
happen when the SM port goes down (clearing the inform list, perhaps with
the exception of the local node)? That would mean reregistration is
required when the node comes back. SA clients won't necessarily do this
when the SM port comes back without something like ClientReregistration.
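To make the second option concrete, here is a rough sketch of clearing the inform list on SM port down while keeping the local node's registrations. It is not taken from OpenSM: inform_rec_t, purge_remote_informs and the in_use flag are names made up for illustration, and a real implementation would work on the SA's own InformInfo list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an InformInfo subscription.
 * This is NOT the structure OpenSM uses. */
typedef struct inform_rec {
    uint16_t subscriber_lid;  /* LID that registered the InformInfo */
    int      in_use;          /* non-zero while the record is live */
} inform_rec_t;

/* On SM port down: drop every live subscription except those that were
 * registered from the local node (local_lid).
 * Returns the number of records dropped. */
static size_t purge_remote_informs(inform_rec_t *recs, size_t n,
                                   uint16_t local_lid)
{
    size_t dropped = 0;

    assert(recs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (recs[i].in_use && recs[i].subscriber_lid != local_lid) {
            recs[i].in_use = 0;
            dropped++;
        }
    }
    return dropped;
}
```

With something like this, the subscription from LID 7 would be gone before the Report scan runs, at the cost of requiring that client to register again later.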
> Later we see the following error:
> > Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find source port with GUID:0x0008f10403960559
> This is sent during the section where node 0x0008f10403960559 is being torn off from the SMDB.
>
> The code in osm_inform.c says:
> /* Check if there is a pkey match. o13-17.1.1*/
Where is this performed ?
> /* Check if the issuer of the trap is the SM. If it is, then the pkey
                                                                   ^^^^
                                                                   gid
> comparison should be done on the trap source (saved as the gid in the
> data details field).
> If the issuer gid is not the SM - then it is the guid of the trap
> source. */
> if ( (cl_ntoh64(p_ntc->issuer_gid.unicast.prefix) == p_subn->opt.subnet_prefix) &&
> (cl_ntoh64(p_ntc->issuer_gid.unicast.interface_id) == p_subn->sm_port_guid) )
> {
> /* The issuer is the SM this is trap 64-67 - compare the pkey
> with the gid saved on the data details. */
> source_gid = p_ntc->data_details.ntc_64_67.gid;
> }
> else
> {
> source_gid = p_ntc->issuer_gid;
> }
>
> In our case the trap is 65 and is sent by the SM. However, the spec requires checking that
> the torn-down port and the target of the Report share a PKey.
I'm not sure what you are referring to in the spec. In any case,
shouldn't the local ports perhaps be an exception to this ?
> In our case the
> source of the event is considered to be the port that is torn down. (As we want to
> prevent any case where ports not sharing a PKey will get reports on each other).
> But since the "source" port is being torn down we cannot find its PKey table ...
> (actually we look first in the Port by LID table - and cannot find it).
>
> This means we will never send Report(Notice trap#65) to any node.
> How do we solve that bug? Maybe we have a way to find the "source" port PKey that
> is not yet corrupted.
I'm not totally following this because of the PKey v. GID issue above, and
I think local ports may need to be treated differently.
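If the PKey check is what is wanted, one possible way around the missing table is to copy the departing port's PKeys before it is removed and match against that copy. This is only a sketch under my own assumptions: port_snapshot_t, MAX_PKEYS, pkey_match and shares_pkey are invented names, the subscriber_is_local flag is just one way to express the local port exception, and the match rule used (equal low 15 bits, at least one full member) is my reading of the usual P_Key comparison rather than anything in osm_inform.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PKEYS 4  /* arbitrary small size for the sketch */

/* Hypothetical copy of a port's PKeys, taken before the port is removed
 * from the SM database. */
typedef struct port_snapshot {
    uint64_t guid;
    uint16_t pkeys[MAX_PKEYS];
    size_t   num_pkeys;
} port_snapshot_t;

/* Two PKeys match when their 15-bit bases are equal and at least one of
 * them has the full-membership bit (bit 15) set. */
static int pkey_match(uint16_t a, uint16_t b)
{
    return ((a & 0x7fff) == (b & 0x7fff)) && (((a | b) & 0x8000) != 0);
}

/* Returns 1 if the subscriber may receive a report about the departed
 * port, 0 otherwise. Local subscribers are exempt from the check. */
static int shares_pkey(const port_snapshot_t *src,
                       const uint16_t *sub_pkeys, size_t n_sub,
                       int subscriber_is_local)
{
    assert(src != NULL);
    assert(sub_pkeys != NULL || n_sub == 0);

    if (subscriber_is_local)
        return 1;

    for (size_t i = 0; i < src->num_pkeys; i++)
        for (size_t j = 0; j < n_sub; j++)
            if (pkey_match(src->pkeys[i], sub_pkeys[j]))
                return 1;
    return 0;
}
```

The snapshot would have to be taken at the point where the port is dropped, so that the trap 65 report can still be filtered after the port is no longer in the Port by LID table.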
-- Hal