[openib-general] Re: Some OpenSM 1.8.0 Anomalies
Eitan Zahavi
eitan at mellanox.co.il
Thu Sep 8 06:02:58 PDT 2005
Hal Rosenstock wrote:
>>>>>Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find destination port with LID:0x0007
>>>>
>>>>This means that the LID of the port registered as the source for this inform info is not recognized as a valid LID.
>>>>
>>>>
>>>>>...
>>>>>Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find source port with GUID:0x0008f10403960559
>>>>
>>>>The meaning of this is that the incoming trap source is not a recognized (included in the SM database) guid
>>>
>>>
> It looks like it occurs on SM port down which seems OK.
OK, that explains it:
The errors occur when the SM port goes down. In that case all the ports that were previously
found on the fabric are now inaccessible. The SM should send a Report(Notice with trap #65) for each of these ports.
To do so, it scans through the InformInfo database.
Apparently an InformInfo registered from LID 7 has requested this report.
But LID 7 does not exist anymore, so the first message is valid:
> Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find destination port with LID:0x0007
(Actually, this should have caused the InformInfo record to be deleted, which I do not think is happening.)
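
The cleanup suggested above could look roughly like the sketch below. It is not OpenSM code: `inform_rec`, `lid_table` and `prune_stale_informs` are hypothetical names invented for illustration, and the real InformInfo records and port-by-LID table are more involved. It only shows the idea of invalidating a subscription once its subscriber LID no longer maps to a known port.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LID 16 /* tiny table, for illustration only */

/* Hypothetical stand-in for an InformInfo subscription record. */
struct inform_rec {
    uint16_t subscriber_lid; /* LID that asked to receive reports */
    bool     valid;          /* false once the record has been dropped */
};

/* Hypothetical port-by-LID table: true when the LID maps to a live port. */
struct lid_table {
    bool in_use[MAX_LID];
};

static bool lid_is_known(const struct lid_table *t, uint16_t lid)
{
    return lid < MAX_LID && t->in_use[lid];
}

/*
 * Walk all subscriptions and invalidate those whose subscriber LID
 * is no longer in the table. Returns the number of records dropped.
 */
size_t prune_stale_informs(struct inform_rec *recs, size_t n,
                           const struct lid_table *t)
{
    size_t dropped = 0;

    for (size_t i = 0; i < n; i++) {
        if (recs[i].valid && !lid_is_known(t, recs[i].subscriber_lid)) {
            recs[i].valid = false;
            dropped++;
        }
    }
    return dropped;
}
```

With a table where only LID 3 is live, a subscription from LID 7 would be dropped on the first pass instead of logging the same "Cannot find destination port" error on every later report.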
Later we see the following error:
> Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find source port with GUID:0x0008f10403960559
This is logged during the section where port 0x0008f10403960559 is being torn off from the SMDB.
The code in osm_inform.c says:
  /* Check if there is a pkey match. o13-17.1.1*/
  /* Check if the issuer of the trap is the SM. If it is, then the pkey
     comparison should be done on the trap source (saved as the gid in the
     data details field).
     If the issuer gid is not the SM - then it is the guid of the trap
     source. */
  if ( (cl_ntoh64(p_ntc->issuer_gid.unicast.prefix) == p_subn->opt.subnet_prefix) &&
       (cl_ntoh64(p_ntc->issuer_gid.unicast.interface_id) == p_subn->sm_port_guid) )
  {
    /* The issuer is the SM this is trap 64-67 - compare the pkey
       with the gid saved on the data details. */
    source_gid = p_ntc->data_details.ntc_64_67.gid;
  }
  else
  {
    source_gid = p_ntc->issuer_gid;
  }
In our case the trap is 65 and is sent by the SM. However, the spec requires us to check that
the torn-down port and the target of the Report share a PKey. In our case the
source of the event is considered to be the port that is being torn down (as we want to
prevent any case where ports not sharing a PKey get reports on each other).
But since the "source" port is being torn down we cannot find its PKey table ...
(actually we look first in the Port by LID table, and cannot find it there).
This means we will never send a Report(Notice trap #65) to any node.
How do we solve that bug? Maybe there is a way to get at the "source" port's PKey table
while it is still intact.
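
One possible direction, sketched under assumptions: copy the port's PKey table just before the port is removed, and run the PKey-sharing check against that copy when matching the notice to InformInfo records. Nothing below is the actual OpenSM API; `pkey_snapshot`, `port`, `remove_port_keep_pkeys` and `shares_pkey` are hypothetical names. The match rule used is my reading of the IB P_Key rule (same non-zero 15-bit base, at least one side a full member) and should be checked against the spec.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PKEYS 4 /* tiny table, for illustration only */

/* Hypothetical copy of a port's PKey table. */
struct pkey_snapshot {
    uint16_t pkeys[MAX_PKEYS];
    size_t   count;
};

/* Hypothetical stand-in for a port object in the SM database. */
struct port {
    uint64_t             guid;
    struct pkey_snapshot pkeys;
    bool                 present; /* false once removed from the database */
};

/* Assumed rule: equal non-zero 15-bit base, and at least one full member. */
static bool pkeys_match(uint16_t a, uint16_t b)
{
    bool same_base = (a & 0x7fff) == (b & 0x7fff) && (a & 0x7fff) != 0;
    bool one_full  = ((a | b) & 0x8000) != 0;

    return same_base && one_full;
}

/* True if any PKey in table a matches any PKey in table b. */
bool shares_pkey(const struct pkey_snapshot *a, const struct pkey_snapshot *b)
{
    for (size_t i = 0; i < a->count; i++)
        for (size_t j = 0; j < b->count; j++)
            if (pkeys_match(a->pkeys[i], b->pkeys[j]))
                return true;
    return false;
}

/*
 * Remove the port, but hand back a copy of its PKey table taken
 * before the removal, so the notice matching can still use it.
 */
struct pkey_snapshot remove_port_keep_pkeys(struct port *p)
{
    struct pkey_snapshot saved = p->pkeys;

    p->pkeys.count = 0;
    p->present = false;
    return saved;
}
```

The alternative would be to issue the report before the port is unlinked from the database, which avoids the copy but changes the order of operations in the drop manager.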
> Here's an
> extract of that portion of the log:
>
> Sep 06 15:41:48 724961 [B76A4C40] -> __osm_state_mgr_is_sm_port_down: ]
> Sep 06 15:41:48 724980 [0000] -> SM port is down.
> Sep 06 15:41:48 724980 [B76A4C40] -> SM port is down.
> Sep 06 15:41:48 725261 [B76A4C40] -> __osm_state_mgr_sm_port_down_msg:
>
>
> ******************************************************************
> ************************** SM PORT DOWN **************************
> ******************************************************************
>
>
> Sep 06 15:41:48 725283 [B76A4C40] -> osm_drop_mgr_process: [
> Sep 06 15:41:48 725303 [B76A4C40] -> osm_drop_mgr_process: Checking node 0x0008f1040396040c.
> Sep 06 15:41:48 725324 [B76A4C40] -> __osm_drop_mgr_process_node: [
> Sep 06 15:41:48 725342 [B76A4C40] -> __osm_drop_mgr_process_node: Unreachable node 0x0008f1040396040c.
> Sep 06 15:41:48 725364 [B76A4C40] -> __osm_drop_mgr_remove_port: [
> Sep 06 15:41:48 725383 [B76A4C40] -> __osm_drop_mgr_remove_port: Unreachable port 0x0008f1040396040e.
> Sep 06 15:41:48 725417 [B76A4C40] -> __osm_drop_mgr_remove_port: Clearing abandoned LID range [0x7,0x7].
> Sep 06 15:41:48 725480 [B76A4C40] -> __osm_drop_mgr_remove_port: Unlinking local node 0x0008f1040396040c, port 0x2
>   and remote node 0x0008f10403960558, port 0x1.
> Sep 06 15:41:48 725504 [B76A4C40] -> __osm_drop_mgr_remove_port: resetting discovery count of node: 0x0008f10403960558 port num:1.
> Sep 06 15:41:48 725525 [B76A4C40] -> __osm_drop_mgr_remove_port: Clearing physical port number 2.
> Sep 06 15:41:48 725563 [B76A4C40] -> osm_report_notice: [
> Sep 06 15:41:48 725583 [B76A4C40] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0003 GID:0xfe80000000000000,0x0008f10403960559
> Sep 06 15:41:48 725612 [B76A4C40] -> __match_notice_to_inf_rec: [
> Sep 06 15:41:48 725632 [B76A4C40] -> __match_notice_to_inf_rec: Mismatch by Node Type: II=0x000003 Trap=0x000004
> Sep 06 15:41:48 725653 [B76A4C40] -> __match_notice_to_inf_rec: ]
> Sep 06 15:41:48 725671 [B76A4C40] -> __match_notice_to_inf_rec: [
> Sep 06 15:41:48 725691 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find destination port with LID:0x0007
> Sep 06 15:41:48 725710 [B76A4C40] -> __match_notice_to_inf_rec: ]
> Sep 06 15:41:48 725728 [B76A4C40] -> __match_notice_to_inf_rec: [
> Sep 06 15:41:48 725747 [B76A4C40] -> __match_notice_to_inf_rec: Mismatch by Node Type: II=0x000001 Trap=0x000004
> Sep 06 15:41:48 725767 [B76A4C40] -> __match_notice_to_inf_rec: ]
> Sep 06 15:41:48 725785 [B76A4C40] -> __match_notice_to_inf_rec: [
> Sep 06 15:41:48 725804 [B76A4C40] -> __match_notice_to_inf_rec: Mismatch by Node Type: II=0x000002 Trap=0x000004
> Sep 06 15:41:48 725823 [B76A4C40] -> __match_notice_to_inf_rec: ]
> Sep 06 15:41:48 725843 [B76A4C40] -> osm_report_notice: ]
> Sep 06 15:41:48 725862 [B76A4C40] -> Removed port with GUID:0x0008f1040396040e LID range [0x7,0x7] of node:Voltaire HCA400
> Sep 06 15:41:48 725883 [B76A4C40] -> __osm_drop_mgr_remove_port: ]
> Sep 06 15:41:48 725904 [B76A4C40] -> __osm_drop_mgr_process_node: ]
> Sep 06 15:41:48 725923 [B76A4C40] -> osm_drop_mgr_process: Checking node 0x0008f10403960558.
> Sep 06 15:41:48 725943 [B76A4C40] -> osm_drop_mgr_process: Checking full discovery of node 0x0008f10403960558.
> Sep 06 15:41:48 725964 [B76A4C40] -> osm_drop_mgr_process: Checking port 0x0008f10403960559.
> Sep 06 15:41:48 725984 [B76A4C40] -> __osm_drop_mgr_remove_port: [
> Sep 06 15:41:48 726002 [B76A4C40] -> __osm_drop_mgr_remove_port: Unreachable port 0x0008f10403960559.
> Sep 06 15:41:48 726023 [B76A4C40] -> __osm_drop_mgr_remove_port: Clearing abandoned LID range [0x3,0x3].
> Sep 06 15:41:48 726043 [B76A4C40] -> __osm_drop_mgr_remove_port: Clearing physical port number 1.
> Sep 06 15:41:48 726067 [B76A4C40] -> osm_report_notice: [
> Sep 06 15:41:48 726086 [B76A4C40] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0003 GID:0xfe80000000000000,0x0008f10403960559
> Sep 06 15:41:48 726110 [B76A4C40] -> __match_notice_to_inf_rec: [
> Sep 06 15:41:48 726129 [B76A4C40] -> __match_notice_to_inf_rec: Mismatch by Node Type: II=0x000003 Trap=0x000004
> Sep 06 15:41:48 726149 [B76A4C40] -> __match_notice_to_inf_rec: ]
> Sep 06 15:41:48 726167 [B76A4C40] -> __match_notice_to_inf_rec: [
> Sep 06 15:41:48 726186 [B76A4C40] -> __match_notice_to_inf_rec: ERR 0207: Cannot find source port with GUID:0x0008f10403960559
> Sep 06 15:41:48 726206 [B76A4C40] -> __match_notice_to_inf_rec: ]
> Sep 06 15:41:48 726225 [B76A4C40] -> __match_notice_to_inf_rec: [
> Sep 06 15:41:48 726243 [B76A4C40] -> __match_notice_to_inf_rec: Mismatch by Node Type: II=0x000001 Trap=0x000004
> Sep 06 15:41:48 726263 [B76A4C40] -> __match_notice_to_inf_rec: ]
> Sep 06 15:41:48 726281 [B76A4C40] -> __match_notice_to_inf_rec: [
> Sep 06 15:41:48 726300 [B76A4C40] -> __match_notice_to_inf_rec: Mismatch by Node Type: II=0x000002 Trap=0x000004
> Sep 06 15:41:48 726319 [B76A4C40] -> __match_notice_to_inf_rec: ]
> Sep 06 15:41:48 726339 [B76A4C40] -> osm_report_notice: ]
> Sep 06 15:41:48 726357 [B76A4C40] -> Removed port with GUID:0x0008f10403960559 LID range [0x3,0x3] of node:MT23108 InfiniHost Mellanox Technologies
> Sep 06 15:41:48 726378 [B76A4C40] -> __osm_drop_mgr_remove_port: ]
> Sep 06 15:41:48 726426 [B76A4C40] -> osm_drop_mgr_process: ]
>
>
>>Then if you can send us the log file it will help.
>
>
> I'll send you the whole log offline if you still want it.
No, no need to.