[openib-general] osm unreliable unless -d1

Sasha Khapyorsky sashak at voltaire.com
Mon Mar 6 13:44:19 PST 2006


On 11:44 Mon 06 Mar, Jean-Christophe Hugly wrote:
> 
> One more detail, I am running with LMC=2 because I wanted to check that
> the LMC>0 problems were fixed (they seem to be; I do not see any
> LMC-related misbehaviour.

Hmm, and I have some problems with LMC (seen even before this test, not
investigated yet)...

Could you try without LMC?
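
For reference when reading the LID ranges in the log quoted below: with
LMC=n each end port is assigned 2^n consecutive LIDs starting at its base
LID, so LMC=2 gives four LIDs per port, while the switches in the log show
a single LID. A minimal sketch of that arithmetic follows; lid_range is a
hypothetical helper written for illustration, not an OpenSM tool.

```shell
#!/bin/sh
# lid_range BASE_LID LMC
# Prints the inclusive LID range of a port: 2^LMC consecutive LIDs
# starting at BASE_LID. Hypothetical helper, not part of OpenSM.
lid_range() {
	base=$(($1))
	lmc=$2
	count=$((1 << $lmc))
	printf '[0x%X,0x%X]\n' "$base" "$((base + count - 1))"
}

lid_range 0x10 2	# prints [0x10,0x13]
lid_range 0x14 2	# prints [0x14,0x17]
lid_range 0x8 0		# prints [0x8,0x8]
```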

Sasha.

> With -d1 everything looks shipshape).
> 
> > Also I see that the port finally becomes active, but only after a
> > delay. Those delays look strange and inconsistent; I will need to test
> > more tomorrow. Could you try this modification to your script?
> > 
> > i=1
> > while true; do
> > 	modprobe -r ib_mthca
> > 	sleep 3
> > 	modprobe ib_mthca
> > 	count=0
> > 	while true ; do
> > 		ibstat | egrep 'State: Active$' > /dev/null
> > 		test $? -eq 0 && break
> > 		count=`expr $count + 1`
> > 		sleep 1
> > 	done
> > 	echo $i: delay $count
> > 	sleep 3
> > 	i=`expr $i + 1`
> > done
> > 
> Here's the output from your script. After the last line it doesn't make
> further progress (I waited something like 10 minutes).
> Addressing Eitan's comment, I tried the same thing with a delay of 7
> seconds rather than 3 between modprobe -r and modprobe. The results are
> the same:
> 
> 1: delay 0
> 2: delay 0
> 3: delay 0
> 4: delay 0
> 5: delay 0
> 6: delay 0
> <nothing happens>
> 
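
Since the inner loop of the script waits forever when the port never
reaches Active, a bounded variant might make such a hang show up as an
explicit timeout. This is only a sketch: the wait_active name, the timeout
argument and the optional status command are assumptions added here, not
part of the original script.

```shell
#!/bin/sh
# wait_active TIMEOUT [CMD...]
# Polls once per second until the output of CMD contains a line ending in
# "State: Active", or until TIMEOUT seconds have been waited.
# CMD defaults to ibstat; the timeout is an addition to the original loop.
# Prints the seconds waited on success; returns 1 on timeout.
wait_active() {
	timeout=$1
	shift
	[ $# -gt 0 ] || set -- ibstat
	count=0
	while ! "$@" 2>/dev/null | grep -q 'State: Active$'; do
		if [ "$count" -ge "$timeout" ]; then
			echo "timeout after $count"
			return 1
		fi
		count=$((count + 1))
		sleep 1
	done
	echo "$count"
}

# Possible use inside the reload loop (60 seconds is an arbitrary choice):
#	if delay=$(wait_active 60); then
#		echo "$i: delay $delay"
#	else
#		echo "$i: port not Active after 60 seconds"
#		break
#	fi
```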
> In case it contains useful clues, here's a sample of osm's log at
> around the point things start falling apart:
> 
> Mar 06 11:31:36 036291 [40A04960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0009 TID:0x00000000000000c4
> Mar 06 11:31:36 036452 [40A04960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x001393010b186ba0
> Mar 06 11:31:36 044333 [40A04960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0008 TID:0x00000000000000c8
> Mar 06 11:31:36 044921 [40A04960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x001393010b186b08
> Mar 06 11:31:36 056540 [40401960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 056562 [40401960] -> Discovered new port with GUID:0x001393000024a511 LID range [0x10,0x13] of node:MT25218 InfiniHostEx Mellanox Technologies
> Mar 06 11:31:36 056570 [40401960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 056578 [40401960] -> Discovered new port with GUID:0x001393000024a512 LID range [0x14,0x17] of node:MT25218 InfiniHostEx Mellanox Technologies
> Mar 06 11:31:36 056673 [40401960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches
> Mar 06 11:31:36 082257 [40A04960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches
> Mar 06 11:31:36 446369 [40602960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0010 TID:0x0000000000000000
> Mar 06 11:31:36 446400 [40401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0014 TID:0x0000000000000001
> Mar 06 11:31:36 446614 [40602960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0010 GID:0xfe80000000000000,0x001393000024a511
> Mar 06 11:31:36 446657 [40401960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0014 GID:0xfe80000000000000,0x001393000024a512
> Mar 06 11:31:36 465919 [40401960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches
> Mar 06 11:31:36 473124 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 473151 [40A04960] -> Removed port with GUID:0x001393000024a601 LID range [0x18,0x1B] of node:MT25218 InfiniHostEx Mellanox Technologies
> Mar 06 11:31:36 473196 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 473209 [40A04960] -> Removed port with GUID:0x001393000024a602 LID range [0x1C,0x1F] of node:MT25218 InfiniHostEx Mellanox Technologies
> Mar 06 11:31:36 473526 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 473568 [40A04960] -> Removed port with GUID:0x001393010b186b08 LID range [0x8,0x8] of node:MT47396 Infiniscale-III Mellanox Technologies
> Mar 06 11:31:36 473710 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 473722 [40A04960] -> Removed port with GUID:0x001393000024a511 LID range [0x10,0x13] of node:MT25218 InfiniHostEx Mellanox Technologies
> Mar 06 11:31:36 473758 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 473770 [40A04960] -> Removed port with GUID:0x001393000024a512 LID range [0x14,0x17] of node:MT25218 InfiniHostEx Mellanox Technologies
> Mar 06 11:31:36 474015 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 474050 [40A04960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x9,0x9] of node:MT47396 Infiniscale-III Mellanox Technologies
> Mar 06 11:31:36 474133 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 474165 [40A04960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
> Mar 06 11:31:36 474238 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 474249 [40A04960] -> Removed port with GUID:0x0002c90200007afe LID range [0xC,0xF] of node:MT23108 InfiniHost Mellanox Technologies
> Mar 06 11:31:36 474267 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2756
> Mar 06 11:31:36 474283 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2758
> Mar 06 11:31:36 474541 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
> Mar 06 11:31:36 474577 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2757
> Mar 06 11:31:36 474807 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x275a
> Mar 06 11:31:36 474827 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2759
> Mar 06 11:31:36 474814 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x275b
> Mar 06 11:31:36 474903 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x275c
> Mar 06 11:31:36 474999 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x275f
> Mar 06 11:31:36 475003 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x275e
> Mar 06 11:31:36 475024 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2760
> Mar 06 11:31:36 475038 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x275d
> Mar 06 11:31:36 475089 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2761
> Mar 06 11:31:36 475140 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2762
> Mar 06 11:31:36 475158 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2763
> Mar 06 11:31:36 475173 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2764
> Mar 06 11:31:36 475231 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2765
> Mar 06 11:31:36 475248 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2766
> Mar 06 11:31:36 475295 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2767
> Mar 06 11:31:36 475332 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2768
> Mar 06 11:31:36 475367 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x276a
> Mar 06 11:31:36 475350 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x2769
> Mar 06 11:31:36 475432 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x276c
> Mar 06 11:31:36 475416 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x276b
> Mar 06 11:31:36 475492 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x276d
> Mar 06 11:31:36 475522 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x276e
> Mar 06 11:31:36 475634 [40401960] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS(3) in state OSM_SM_STATE_IDLE
> Mar 06 11:31:38 040389 [40602960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:38 040409 [40602960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III Mellanox Technologies
> Mar 06 11:31:38 040419 [40602960] -> __osm_drop_mgr_remove_switch: ERR 0102: Node 0x001393010b186ba0 not in switch table
> Mar 06 11:31:38 040463 [40602960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:38 040474 [40602960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
> Mar 06 11:31:38 040486 [40803960] -> osm_si_rcv_process: ERR 3606: SwitchInfo received for nonexistent node with GUID = 0x1393010b186ba0
> Mar 06 11:31:38 040587 [40602960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
> Mar 06 11:31:44 280928 [40401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0009 TID:0x00000000000000c5
> Mar 06 11:31:44 280976 [40A04960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0008 TID:0x00000000000000c9
> Mar 06 11:31:44 282252 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:44 282266 [40A04960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III Mellanox Technologies
> Mar 06 11:31:44 282274 [40A04960] -> __osm_drop_mgr_remove_switch: ERR 0102: Node 0x001393010b186ba0 not in switch table
> Mar 06 11:31:44 282304 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:44 282315 [40A04960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
> Mar 06 11:31:44 282327 [40602960] -> osm_si_rcv_process: ERR 3606: SwitchInfo received for nonexistent node with GUID = 0x1393010b186ba0
> Mar 06 11:31:44 282441 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
> Mar 06 11:31:44 283808 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:44 283821 [40A04960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III Mellanox Technologies
> Mar 06 11:31:44 283829 [40A04960] -> __osm_drop_mgr_remove_switch: ERR 0102: Node 0x001393010b186ba0 not in switch table
> Mar 06 11:31:44 283859 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:44 283869 [40A04960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
> Mar 06 11:31:44 283882 [40401960] -> osm_si_rcv_process: ERR 3606: SwitchInfo received for nonexistent node with GUID = 0x1393010b186ba0
> Mar 06 11:31:44 283967 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
> Mar 06 11:31:48 047137 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:48 047201 [40A04960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III Mellanox Technologies
> Mar 06 11:31:48 047290 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:48 047310 [40A04960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
> Mar 06 11:31:48 047451 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
> Mar 06 11:31:48 047537 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
>                                 for parent node GUID = 0x1393010b186ba0, TID = 0x278d
> Mar 06 11:31:48 047543 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
> 
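
A log like the one above is easier to scan once the error lines are
counted per code. A minimal sketch, assuming the "function: ERR xxxx:"
line format seen in this excerpt; err_summary is a hypothetical helper and
the log file name in the usage comment is a placeholder.

```shell
#!/bin/sh
# err_summary [FILE...]
# Counts "ERR xxxx" codes per reporting function, reading log lines from
# the given files or from stdin. Leading "> " quote markers are tolerated.
# Hypothetical helper, not part of OpenSM.
err_summary() {
	sed -n 's/.*-> \([A-Za-z_0-9]*\): ERR \([0-9A-Fa-f]\{4\}\):.*/\2 \1/p' "$@" |
		sort | uniq -c | sort -rn
}

# Usage (file name is a placeholder):
#	err_summary osm.log
```

Each output line is a count followed by the error code and the function
that reported it, most frequent first.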
> -- 
> Jean-Christophe Hugly <jice at pantasys.com>
> PANTA
> 


