[openib-general] RE: Some OpenSM 1.8.0 Anomalies
Eitan Zahavi
eitan at mellanox.co.il
Fri Sep 9 13:14:35 PDT 2005
> On Fri, 2005-09-09 at 13:22, Eitan Zahavi wrote:
> > Hal Rosenstock wrote:
> > > On Thu, 2005-09-08 at 09:02, Eitan Zahavi wrote:
> > >
> > >>Hal Rosenstock wrote:
> > >>
> > >>>>>>>Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find destination port with LID:0x0007
> > >>>>>>
> > >>>>>>This means that the LID of the port registered as the source for this inform info
> > >>>>>>is not recognized as a valid LID.
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>>...
> > >>>>>>>Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find source port with GUID:0x0008f10403960559
> > >>>>>>
> > >>>>>>This means that the GUID of the incoming trap source is not recognized
> > >>>>>>(it is not included in the SM database).
> > >>>>>
> > >>>>>
> > >
> > >>>It looks like it occurs on SM port down which seems OK.
> > >>
> > >>OK that explains it:
> > >>The errors occur when the SM port has gone down. In that case all the ports that were previously
> > >>found on the fabric are now inaccessible. The SM should send a Report(Notice with trap #65)
> > >>for each of these ports.
> > >
> > >
> > > Right, GID out of service should be and is indicated.
> > >
> > >
> > >>For that purpose it scans through the InformInfo database.
> > >>Apparently an InformInfo with LID=7 has requested this report.
> > >>But LID 7 does not exist anymore
> > >
> > >
> > > It exists. It is just not reachable via GS (SA) LID routed packets.
> > Well, from the point of view of the SM it does not, once the SM cannot reach it.
>
> OK.
>
> > >> - so the first message is valid:
> > >
> > >
> > > Not sure what you mean exactly by valid here.
> > Valid means that it is correct. The destination port to send the Report to is no longer
> > part of any partition.
> > I would rephrase the error message and make it Info. There is no ERROR in losing some ports.
>
> Right. This should be made into something less than error.
>
> > >> > Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find destination port with LID:0x0007
> > >>(actually this should have caused the InformInfo record to be deleted... which I do
> > >>not think is happening)
> > >
> > > What should have caused the InformInfo record to be deleted ?
> > "o13-17.1.2: If a Set(InformInfo) specified a valid trap source at the time of
> > subscription (see o13-14.1.1: on page 746), yet Trap() forwarding fails because
> > the subscriber and trap source are no longer permitted to access
> > each other according to current partitioning (see o13-17.1.1: on page
> > 747), then the manager shall permanently discontinue all event forwarding
> > caused by the Set(InformInfo) which created a subscription to
> > that trap source, except if InformInfo:LIDRangeBegin was 0xFFFF; in the
> > latter case, event forwarding is discontinued only for the now-invalid trap
> > source."
> > Later on the same page:
> > "Note also that "permanently discontinue all event forwarding" is meant to
> > indicate that the subscription for forwarding is dropped by the manager; if
> > the source later becomes reachable again by the subscriber, a new
> > Set(InformInfo) is required to re-establish event forwarding, if that is what
> > is desired. (This may not be desired; when the source becomes reachable
> > again, it may have acquired new characteristics, such as new, different
> > software functions, that make such forwarding inappropriate.)"
> >
> > > This error being detected ?
> > Not currently
> > > If so, should it wait for the error or should it occur
> > > when the SM port goes down do this (clear the inform list perhaps with
> > > the exception of the local node) ?
> > Maybe, or just write generic code to handle o13-17.1.2
> > > That would require/mean
> > > reregistration is required when the node comes back. SA clients won't
> > > necessarily do this when the SM port comes back without something like
> > > ClientReregistration.
> > Correct. This is another reason why ClientReRegistration is an important feature of the
> > access layer.
>
> I would have ended that sentence after feature. It does not need to be
> implemented in the access layer.
>
> > >>Later we see the following error:
> > >> > Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find source port with GUID:0x0008f10403960559
> > >>This is sent during the section where node 0x0008f10403960559 is being removed
> > >>from the SMDB.
> > >>
> > >>The code in osm_inform.c says:
> > >> /* Check if there is a pkey match. o13-17.1.1*/
> > >
> > >
> > > Where is this performed ?
> > osm_inform.c
> > __match_notice_to_inf_rec
> >
> > >
> > >
> > >> /* Check if the issuer of the trap is the SM. If it is, then the pkey
> > >                                                                   ^^
> > >                                                                   gid
> > The requirement is to have a shared PKey, according to the PKey sharing rules, between the
> > InformInfo requester and the Trap generator. However, in the case of traps 64-67
> > the SM is the Trap generator. So we need the special logic below to obtain the GID of the port
> > that the trap refers to from within the notice data details field, and not from the
> > issuer field.
>
> I think the comment in the code is wrong here and should be gid rather
> than pkey. I do agree that the pkey sharing needs checking but that is
> separate.
[EZ] OK, we can improve the accuracy and readability of the comments. As always,
comments written by a developer are somewhat biased by the fact that he
already understands the code. So a first-time reader can do a better
job...
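
As a starting point for clearer comments, here is a minimal standalone sketch
of the source-GID selection discussed here. The types gid_sketch_t and
notice_sketch_t and both function names are hypothetical simplifications, not
the real OpenSM structures, and all values are assumed to be in host byte
order already (so no cl_ntoh64() calls appear).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the OpenSM GID and notice types. */
typedef struct {
    uint64_t prefix;
    uint64_t interface_id;
} gid_sketch_t;

typedef struct {
    gid_sketch_t issuer_gid;       /* port that generated the notice */
    gid_sketch_t data_details_gid; /* port the notice is about (traps 64-67) */
} notice_sketch_t;

/* True when the notice was issued by the SM port itself. */
static bool issuer_is_sm(const notice_sketch_t *p_ntc,
                         uint64_t subnet_prefix, uint64_t sm_port_guid)
{
    return p_ntc->issuer_gid.prefix == subnet_prefix &&
           p_ntc->issuer_gid.interface_id == sm_port_guid;
}

/*
 * Select the GID of the port whose PKeys must be checked against the
 * subscriber's PKeys (o13-17.1.1).
 *
 * If the SM issued the notice, this is one of traps 64-67, and the port
 * being reported on is carried as a GID in the data details field.
 * Otherwise the issuer is itself the trap source.
 */
static gid_sketch_t trap_source_gid(const notice_sketch_t *p_ntc,
                                    uint64_t subnet_prefix,
                                    uint64_t sm_port_guid)
{
    assert(p_ntc != NULL);
    if (issuer_is_sm(p_ntc, subnet_prefix, sm_port_guid))
        return p_ntc->data_details_gid;
    return p_ntc->issuer_gid;
}
```

The comments say "GID" where a GID is selected and mention the PKey only as
the reason for the selection, which is the distinction that was unclear.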
>
> > >>    comparison should be done on the trap source (saved as the gid in the
> > >>    data details field).
> > >>    If the issuer gid is not the SM - then it is the guid of the trap
> > >>    source. */
> > >> if ( (cl_ntoh64(p_ntc->issuer_gid.unicast.prefix) == p_subn->opt.subnet_prefix) &&
> > >>      (cl_ntoh64(p_ntc->issuer_gid.unicast.interface_id) == p_subn->sm_port_guid) )
> > >> {
> > >>   /* The issuer is the SM this is trap 64-67 - compare the pkey
> > >>      with the gid saved on the data details. */
> > >>   source_gid = p_ntc->data_details.ntc_64_67.gid;
> > >> }
> > >> else
> > >> {
> > >>   source_gid = p_ntc->issuer_gid;
> > >> }
> > >>
> > >>In our case the trap is 65 and is sent by the SM. However, the spec requires checking that
> > >>the torn-down port and the target of the Report share a PKey.
> > >
> > >
> > > I'm not sure what you are referring to in the spec. In any case,
> > > shouldn't the local ports perhaps be an exception to this ?
> > I do not think so. The requirement makes sense for all traps:
> > If the Trap describes a port A then it should not be forwarded to another port B
> > unless they share a PKey:
> > "o13-17.1.1: Managers that support event forwarding and have confirmed
> > a request for event subscription shall forward corresponding events to the
> > subscriber using a Report(Notice) MAD, as long as the subscriber and
> > Trap() source are permitted to access each other according to current partitioning."
> > >
> > >
> > >> In our case the
> > >>source of the event is considered to be the port that is torn down. (As we want to
> > >>prevent any case where ports not sharing a PKey get reports on each other.)
> > >>But since the "source" port is being torn down we cannot find its PKey table ...
> > >>(actually we look first in the Port by LID table - and cannot find it).
> > >>
> > >>This means we will never send Report(Notice trap#65) to any node.
> > >>How do we solve that bug? Maybe we have a way to find the "source" port PKey
> > >>that is not yet corrupted.
> > >
> > >
> > > I'm not totally following this because of the PKey v. GID issue above and
> > > I think local ports may be (needed to be) treated differently.
> > I hope the above 17.1.1 convinced you. The GID vs PKey is just unclear documentation.
> > The idea is that for traps 64-67, which are generated by the SM, you cannot simply
> > use the SM PKey but must
> > look up the gid of the reported port from within the notice data details and then
> > look up that port's PKey.
>
> OK. I'm convinced.
>
> I'm still not sure what is the bug you are referring to above though.
[EZ] The bug is that the code does not perform the operations required to
comply with o13-17.1.2.
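
A rough sketch of the decision o13-17.1.2 asks for, under simplifying
assumptions: the subscription record inform_sub_sketch_t, the fwd_action_t
enum and handle_forwarding() are all hypothetical names, not existing OpenSM
code, and the result of the partition check is passed in as a plain flag
rather than computed from PKey tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LID_RANGE_ALL 0xFFFF /* InformInfo:LIDRangeBegin wildcard */

/* Hypothetical, simplified subscription record. */
typedef struct {
    uint16_t subscriber_lid;
    uint16_t lid_range_begin; /* as given in the Set(InformInfo) */
    bool     active;          /* false once the subscription is dropped */
} inform_sub_sketch_t;

typedef enum {
    FWD_SEND,                /* forward the Report(Notice) */
    FWD_SKIP_SOURCE,         /* keep subscription, skip this trap source */
    FWD_SUBSCRIPTION_DROPPED /* subscription discontinued (or already gone) */
} fwd_action_t;

/*
 * Decide what to do with one notice for one subscription.
 * share_pkey is the outcome of the o13-17.1.1 check between the
 * subscriber and the trap source.
 */
static fwd_action_t handle_forwarding(inform_sub_sketch_t *p_sub,
                                      bool share_pkey)
{
    assert(p_sub != NULL);

    if (!p_sub->active)
        return FWD_SUBSCRIPTION_DROPPED;

    if (share_pkey)
        return FWD_SEND;

    /* Forwarding fails due to partitioning: apply o13-17.1.2. */
    if (p_sub->lid_range_begin == LID_RANGE_ALL)
        return FWD_SKIP_SOURCE; /* only this source is discontinued */

    p_sub->active = false;      /* drop the whole subscription */
    return FWD_SUBSCRIPTION_DROPPED;
}
```

The sketch does not record which source was skipped in the 0xFFFF case; a
real implementation would presumably need to remember that, since the spec
text says the discontinuation is permanent until a new Set(InformInfo).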
>
> -- Hal