[openib-general] Re: Some OpenSM 1.8.0 Anomalies

Hal Rosenstock halr at voltaire.com
Fri Sep 9 13:03:14 PDT 2005


On Fri, 2005-09-09 at 13:22, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > On Thu, 2005-09-08 at 09:02, Eitan Zahavi wrote:
> > 
> >>Hal Rosenstock wrote:
> >>
> >>>>>>>Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find destination port with LID:0x0007
> >>>>>>
> >>>>>>This means that the LID of the port registered as the source for this inform info is not recognized as a valid LID.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>...
> >>>>>>>Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find source port with GUID:0x0008f10403960559
> >>>>>>
> >>>>>>This means that the source of the incoming trap is not a GUID recognized by (i.e. included in) the SM database.
> >>>>>
> >>>>>
> > 
> >>>It looks like it occurs on SM port down which seems OK. 
> >>
> >>OK that explains it:
> >>The errors occur when the SM port has gone down. In that case all the ports that were previously
> >>found on the fabric are now inaccessible. The SM should Report(Notice with trap #65) for each of these ports.
> > 
> > 
> > Right, GID out of service should be and is indicated.
> > 
> > 
> >>To do that, it scans through the InformInfo database.
> >>Apparently an InformInfo with LID=7 has requested this report.
> >>But LID 7 does not exist anymore
> > 
> > 
> > It exists. It is just not reachable via GS (SA) LID routed packets.
> Well, from the SM's point of view it does not exist once the SM cannot reach it.

OK.
 
> >> - so the first message is valid:
> > 
> > 
> > Not sure what you mean exactly by valid here.
> Valid means that it is correct. The destination port to send the Report to is not part of any partition any more.
> I would rephrase the error message and make it Info. There is no ERROR in losing some ports.

Right. This should be made into something less than error.
 
> >> > Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find destination port with LID:0x0007
> >>(actually this should have caused the InformInfo record to be deleted... which I do not think is happening)
> > 
> > What should have caused the InformInfo record to be deleted ? 
> "o13-17.1.2: If a Set(InformInfo) specified a valid trap source at the time of
> subscription (see o13-14.1.1: on page 746), yet Trap() forwarding fails because
> the subscriber and trap source are no longer permitted to access
> each other according to current partitioning (see o13-17.1.1: on page
> 747), then the manager shall permanently discontinue all event forwarding
> caused by the Set(InformInfo) which created a subscription to
> that trap source, except if InformInfo:LIDRangeBegin was 0xFFFF; in the
> latter case, event forwarding is discontinued only for the now-invalid trap
> source."
> Later on the same page:
> "Note also that “permanently discontinue all event forwarding” is meant to
> indicate that the subscription for forwarding is dropped by the manager; if
> the source later becomes reachable again by the subscriber, a new
> Set(InformInfo) is required to re-establish event forwarding, if that is what
> is desired. (This may not be desired; when the source becomes reachable
> again, it may have acquired new characteristics, such as new, different
> software functions, that make such forwarding inappropriate.)"
> 
> > This error being detected ? 
> Not currently
> > If so, should it wait for the error or should it occur
> > when the SM port goes down do this (clear the inform list perhaps with
> > the exception of the local node) ? 
> Maybe, or just write generic code to handle o13-17.1.2
> > That would require/mean
> > reregistration is required when the node comes back. SA clients won't
> > necessarily do this when the SM port comes back without something like
> > ClientReregistration.
> Correct. This is another reason why ClientReRegistration is an important feature of the
> access layer.

I would have ended that sentence after feature. It does not need to be
implemented in the access layer.
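
To make the o13-17.1.2 behaviour quoted above concrete, here is a minimal
sketch of how a manager might drop a subscription once forwarding fails the
partition check. It is not OpenSM code: sketch_subscription,
on_partition_failure and should_forward are hypothetical names, and
identifying the excluded trap source by LID with a fixed-size list is an
assumption made only to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ALL_SOURCES_LID 0xFFFFu /* InformInfo:LIDRangeBegin meaning "all sources" */
#define MAX_EXCLUDED    8       /* arbitrary bound chosen for this sketch */

/* Hypothetical, simplified subscription record (not the OpenSM structure). */
typedef struct {
    uint16_t lid_range_begin;        /* as given in Set(InformInfo) */
    bool     active;                 /* false once forwarding is discontinued */
    uint16_t excluded[MAX_EXCLUDED]; /* sources dropped from a 0xFFFF subscription */
    size_t   n_excluded;
} sketch_subscription;

/* Called when forwarding fails because subscriber and trap source no
   longer share a partition. */
static void on_partition_failure(sketch_subscription *sub, uint16_t source_lid)
{
    assert(sub != NULL);
    if (sub->lid_range_begin != ALL_SOURCES_LID) {
        /* Specific source: the whole subscription is dropped. */
        sub->active = false;
        return;
    }
    /* "All sources" subscription: only this source is discontinued. */
    if (sub->n_excluded < MAX_EXCLUDED)
        sub->excluded[sub->n_excluded++] = source_lid;
}

/* Returns true if an event from source_lid may still be forwarded. */
static bool should_forward(const sketch_subscription *sub, uint16_t source_lid)
{
    assert(sub != NULL);
    if (!sub->active)
        return false;
    for (size_t i = 0; i < sub->n_excluded; i++)
        if (sub->excluded[i] == source_lid)
            return false;
    return true;
}
```

Under this sketch nothing ever re-enables a dropped subscription, which
matches the spec note that a new Set(InformInfo) is required afterwards.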

> >>Later we see the following error:
> >> > Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find source port with GUID:0x0008f10403960559
> >>This is sent during the section where node 0x0008f10403960559 is being torn out of the SMDB.
> >>
> >>The code in osm_inform.c says:
> >>   /* Check if there is a pkey match. o13-17.1.1*/
> > 
> > 
> > Where is this performed ?
> osm_inform.c
> __match_notice_to_inf_rec
> 
> > 
> > 
> >>   /* Check if the issuer of the trap is the SM. If it is, then the pkey
> > 
> >                                                                       ^^
> >                                                                      gid
> The requirement is to have a shared PKey according to PKey sharing rules between the
> InformInfo requester and the Trap generator. However, in the case of traps 64-67
> the SM is the Trap generator. So we need the special logic below to obtain the port gid
> that the trap refers to from within the notice data details fields and not from the issuer field.

I think the comment in the code is wrong here and should be gid rather
than pkey. I do agree that the pkey sharing needs checking but that is
separate.
 
> >>      comparison should be done on the trap source (saved as the gid in the
> >>      data details field).
> >>      If the issuer gid is not the SM - then it is the guid of the trap
> >>      source. */
> >>   if ( (cl_ntoh64(p_ntc->issuer_gid.unicast.prefix) == p_subn->opt.subnet_prefix) &&
> >>        (cl_ntoh64(p_ntc->issuer_gid.unicast.interface_id) == p_subn->sm_port_guid) )
> >>   {
> >>     /* The issuer is the SM this is trap 64-67 - compare the pkey
> >>        with the gid saved on the data details. */
> >>     source_gid = p_ntc->data_details.ntc_64_67.gid;
> >>   }
> >>   else
> >>   {
> >>     source_gid = p_ntc->issuer_gid;
> >>   }
> >>
> >>In our case the trap is 65 and sent by the SM. However, the spec requires checking that
> >>the torn-down port and the target of the Report share a PKey.
> > 
> > 
> > I'm not sure what you are referring to in the spec. In any case,
> > shouldn't the local ports perhaps be an exception to this ?
> I do not think so. The requirement makes sense for all traps:
> If the Trap describes a port A then it should not be forwarded to another port B unless they
> share a PKey:
> "o13-17.1.1: Managers that support event forwarding and have confirmed
> a request for event subscription shall forward corresponding events to the
> subscriber using a Report(Notice) MAD, as long as the subscriber and
> Trap() source are permitted to access each other according to current partitioning."
> > 
> > 
> >> In our case the
> >>source of the event is considered to be the port that is torn down. (As we want to
> >>prevent any case where ports not sharing a PKey will get reports on each other).
> >>But since the "source" port is being torn down we cannot find its PKey table ...
> >>(actually we look first in the Port by LID table - and cannot find it).
> >>
> >>This means we will never send Report(Notice trap#65) to any node.
> >>How do we solve that bug? Maybe we have a way to find the "source" port PKey that
> >>is not yet corrupted.
> > 
> > 
> > I'm not totally following this because of the PKey v. GID issue above and 
> > I think local ports may be (needed to be) treated differently.
> I hope the above 17.1.1 convinced you. The GID vs PKey is just unclear documentation.
> The idea is that for traps 64-67, which are generated by the SM, you cannot simply use the SM PKey but
> must look up the gid of the reported port from within the notice data details and then look up that port's PKey.

OK. I'm convinced.

I'm still not sure what bug you are referring to above, though.
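
As a rough illustration of the o13-17.1.1 check discussed above (subscriber
and trap source must be permitted to access each other), here is a minimal
sketch of a PKey sharing test between two ports' PKey tables. The names
pkeys_match and ports_share_pkey are hypothetical and this is not the OpenSM
implementation. It assumes the usual layout: the low 15 bits are the
partition base, the top bit marks full membership, and two limited members
do not match. Treating a base value of 0 as invalid is my understanding of
the spec rather than something stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_BASE_MASK   0x7FFFu /* low 15 bits: partition base value */
#define PKEY_FULL_MEMBER 0x8000u /* top bit: full (vs. limited) membership */

/* Two PKeys match if the bases are equal and at least one side is a
   full member. */
static bool pkeys_match(uint16_t a, uint16_t b)
{
    if ((a & PKEY_BASE_MASK) != (b & PKEY_BASE_MASK))
        return false;
    if ((a & PKEY_BASE_MASK) == 0) /* assumed invalid entry */
        return false;
    return ((a | b) & PKEY_FULL_MEMBER) != 0;
}

/* Returns true if any entry of table A matches any entry of table B. */
static bool ports_share_pkey(const uint16_t *tbl_a, size_t n_a,
                             const uint16_t *tbl_b, size_t n_b)
{
    assert(tbl_a != NULL && tbl_b != NULL);
    for (size_t i = 0; i < n_a; i++)
        for (size_t j = 0; j < n_b; j++)
            if (pkeys_match(tbl_a[i], tbl_b[j]))
                return true;
    return false;
}
```

For traps 64-67 the first table would be that of the port named in the
notice data details, which is exactly the table that can no longer be found
once the port has been removed.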

-- Hal



