[openib-general] osm unreliable unless -d1

Jean-Christophe Hugly jice at pantasys.com
Mon Mar 6 11:44:28 PST 2006


Hi Eitan, Hi Sasha,


On Sun, 2006-03-05 at 00:22 +0200, Sasha Khapyorsky wrote:

> > On the other hand if you run osm with -d1 option (mostly
> > single-threaded), then it seems to work indefinitely.
> 
> I've tried your script and don't see any difference between modes with
> and without -d1, however my network is small - two hosts and switch,
> probably this is different from your.
> 
No, I am testing this on a small setup: two cpus, two switches, one
extra machine running osm. Here's the output from ibnetdiscover:

#
# Topology file: generated on Mon Mar  6 11:36:23 2006
#
# Max of 3 hops discovered
# Initiated from node 0002c90200007afc port 0002c90200007afd

vendid=0x2c9
devid=0xb924
switchguid=0x1393010b186ba0
Switch  24 "S-001393010b186ba0"         # MT47396 Infiniscale-III Mellanox Technologies port 0 lid 9
[21]    "H-001393000024a510"[1]
[17]    "H-001393000024a600"[1]
[4]     "S-001393010b186b08"[12]
[3]     "S-001393010b186b08"[11]
[2]     "S-001393010b186b08"[10]
[1]     "S-001393010b186b08"[9]
[8]     "H-0002c90200007afc"[1]

vendid=0x2c9
devid=0xb924
switchguid=0x1393010b186b08
Switch  24 "S-001393010b186b08"         # MT47396 Infiniscale-III Mellanox Technologies port 0 lid 8
[21]    "H-001393000024a510"[2]
[17]    "H-001393000024a600"[2]
[8]     "H-0002c90200007afc"[2]
[12]    "S-001393010b186ba0"[4]
[11]    "S-001393010b186ba0"[3]
[10]    "S-001393010b186ba0"[2]
[9]     "S-001393010b186ba0"[1]

vendid=0x2c9
devid=0x6282
sysimgguid=0x1393000024a516
caguid=0x1393000024a510
Ca      2 "H-001393000024a510"          # MT25218 InfiniHostEx Mellanox Technologies
[2]     "S-001393010b186b08"[21]                # lid 20 lmc 2
[1]     "S-001393010b186ba0"[21]                # lid 16 lmc 2

vendid=0x2c9
devid=0x6282
sysimgguid=0x1393000024a606
caguid=0x1393000024a600
Ca      2 "H-001393000024a600"          # MT25218 InfiniHostEx Mellanox Technologies
[2]     "S-001393010b186b08"[17]                # lid 28 lmc 2
[1]     "S-001393010b186ba0"[17]                # lid 24 lmc 2

vendid=0x2c9
devid=0x5a44
sysimgguid=0x2c90200007afc
caguid=0x2c90200007afc
Ca      2 "H-0002c90200007afc"          # MT23108 InfiniHost Mellanox Technologies
[2]     "S-001393010b186b08"[8]         # lid 12 lmc 2
[1]     "S-001393010b186ba0"[8]         # lid 4 lmc 2

There may be a number of possibly significant differences between my
setup and yours, though:

Both CPUs are quad-Opterons; the machine running osm is a dual Xeon,
where osm was also compiled. So it's all 64-bit and SMP.

The firmware versions are 0.7.0 for the switches and 5.1.0 for the cpus.
The osm host has 3.3.3.

One more detail: I am running with LMC=2 because I wanted to check that
the LMC>0 problems were fixed. They seem to be: I do not see any
LMC-related misbehaviour, and with -d1 everything looks shipshape.
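
For reference when reading the LID ranges in the log further down: with
LMC=2 each port answers to 2^2 = 4 consecutive LIDs starting at its base
LID. A minimal sketch of that arithmetic (lid_range is a made-up helper
name, not part of osm or the diag tools):

```shell
# Hypothetical helper: print the LID range covered by a port, given its
# base LID (decimal) and its LMC. A port owns 2^LMC consecutive LIDs.
lid_range() {
    base=$1
    lmc=$2
    last=$(( base + (1 << lmc) - 1 ))
    printf '[0x%X,0x%X]\n' "$base" "$last"
}

# Example: the port ibnetdiscover shows as "lid 16 lmc 2"
lid_range 16 2
```

This prints the range in the same form osm uses in its "Discovered new
port" and "Removed port" lines.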

> Also I see that finally port becomes active but after delay. Those
> delays look strange and inconsistent, I will need to test more tomorrow.
> Could you try such modification for your script?
> 
> i=1
> while true; do
> 	modprobe -r ib_mthca
> 	sleep 3
> 	modprobe ib_mthca
> 	count=0
> 	while true ; do
> 		ibstat | egrep 'State: Active$' > /dev/null
> 		test $? -eq 0 && break
> 		count=`expr $count + 1`
> 		sleep 1
> 	done
> 	echo $i: delay $count
> 	sleep 3
> 	i=`expr $i + 1`
> done
> 
Here's the output from your script. After the last line it doesn't make
any further progress (I waited something like 10 minutes).
To address Eitan's comment, I also tried the same thing with a delay of 7
seconds rather than 3 between modprobe -r and modprobe. The results are
the same:

1: delay 0
2: delay 0
3: delay 0
4: delay 0
5: delay 0
6: delay 0
<nothing happens>
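
Since the inner loop of the quoted script never gives up, a stall only
shows up as silence. A possible variant, sketched under the assumption
that an upper bound of a couple of minutes is acceptable, puts the wait
into a function with a timeout so the hang is reported explicitly.
wait_active and port_is_active are names made up for this sketch; the
ibstat check is the same one the quoted script uses.

```shell
# Hypothetical helper: run a probe command once per second until it
# succeeds or the timeout (in seconds) expires.
# Prints the number of seconds waited; returns 1 on timeout.
wait_active() {
    timeout=$1
    shift
    count=0
    while ! "$@" > /dev/null 2>&1; do
        if [ "$count" -ge "$timeout" ]; then
            echo "$count"
            return 1
        fi
        count=$((count + 1))
        sleep 1
    done
    echo "$count"
}

# Probe for a real host: succeeds once ibstat reports an active port.
port_is_active() {
    ibstat | egrep 'State: Active$'
}

# Usage inside the outer loop, after "modprobe ib_mthca":
#   if delay=$(wait_active 120 port_is_active); then
#       echo "$i: delay $delay"
#   else
#       echo "$i: TIMEOUT after $delay seconds"
#   fi
```

The 120-second limit is arbitrary; anything comfortably above the delays
seen so far would do.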

In case it contains useful clues, here's a sample of osm's log from
around the point where things start falling apart:

Mar 06 11:31:36 036291 [40A04960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0009 TID:0x00000000000000c4
Mar 06 11:31:36 036452 [40A04960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x001393010b186ba0
Mar 06 11:31:36 044333 [40A04960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0008 TID:0x00000000000000c8
Mar 06 11:31:36 044921 [40A04960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x001393010b186b08
Mar 06 11:31:36 056540 [40401960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 056562 [40401960] -> Discovered new port with GUID:0x001393000024a511 LID range [0x10,0x13] of node:MT25218 InfiniHostEx Mellanox Technologies
Mar 06 11:31:36 056570 [40401960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 056578 [40401960] -> Discovered new port with GUID:0x001393000024a512 LID range [0x14,0x17] of node:MT25218 InfiniHostEx Mellanox Technologies
Mar 06 11:31:36 056673 [40401960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches
Mar 06 11:31:36 082257 [40A04960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches
Mar 06 11:31:36 446369 [40602960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0010 TID:0x0000000000000000
Mar 06 11:31:36 446400 [40401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0014 TID:0x0000000000000001
Mar 06 11:31:36 446614 [40602960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0010 GID:0xfe80000000000000,0x001393000024a511
Mar 06 11:31:36 446657 [40401960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0014 GID:0xfe80000000000000,0x001393000024a512
Mar 06 11:31:36 465919 [40401960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches
Mar 06 11:31:36 473124 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 473151 [40A04960] -> Removed port with GUID:0x001393000024a601 LID range [0x18,0x1B] of node:MT25218 InfiniHostEx Mellanox Technologies
Mar 06 11:31:36 473196 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 473209 [40A04960] -> Removed port with GUID:0x001393000024a602 LID range [0x1C,0x1F] of node:MT25218 InfiniHostEx Mellanox Technologies
Mar 06 11:31:36 473526 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 473568 [40A04960] -> Removed port with GUID:0x001393010b186b08 LID range [0x8,0x8] of node:MT47396 Infiniscale-III Mellanox Technologies
Mar 06 11:31:36 473710 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 473722 [40A04960] -> Removed port with GUID:0x001393000024a511 LID range [0x10,0x13] of node:MT25218 InfiniHostEx Mellanox Technologies
Mar 06 11:31:36 473758 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 473770 [40A04960] -> Removed port with GUID:0x001393000024a512 LID range [0x14,0x17] of node:MT25218 InfiniHostEx Mellanox Technologies
Mar 06 11:31:36 474015 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 474050 [40A04960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x9,0x9] of node:MT47396 Infiniscale-III Mellanox Technologies
Mar 06 11:31:36 474133 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 474165 [40A04960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
Mar 06 11:31:36 474238 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 474249 [40A04960] -> Removed port with GUID:0x0002c90200007afe LID range [0xC,0xF] of node:MT23108 InfiniHost Mellanox Technologies
Mar 06 11:31:36 474267 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2756
Mar 06 11:31:36 474283 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2758
Mar 06 11:31:36 474541 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
Mar 06 11:31:36 474577 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2757
Mar 06 11:31:36 474807 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x275a
Mar 06 11:31:36 474827 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2759
Mar 06 11:31:36 474814 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x275b
Mar 06 11:31:36 474903 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x275c
Mar 06 11:31:36 474999 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x275f
Mar 06 11:31:36 475003 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x275e
Mar 06 11:31:36 475024 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2760
Mar 06 11:31:36 475038 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x275d
Mar 06 11:31:36 475089 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2761
Mar 06 11:31:36 475140 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2762
Mar 06 11:31:36 475158 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2763
Mar 06 11:31:36 475173 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2764
Mar 06 11:31:36 475231 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2765
Mar 06 11:31:36 475248 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2766
Mar 06 11:31:36 475295 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2767
Mar 06 11:31:36 475332 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2768
Mar 06 11:31:36 475367 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x276a
Mar 06 11:31:36 475350 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x2769
Mar 06 11:31:36 475432 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x276c
Mar 06 11:31:36 475416 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x276b
Mar 06 11:31:36 475492 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x276d
Mar 06 11:31:36 475522 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x276e
Mar 06 11:31:36 475634 [40401960] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS(3) in state OSM_SM_STATE_IDLE
Mar 06 11:31:38 040389 [40602960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:38 040409 [40602960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III Mellanox Technologies
Mar 06 11:31:38 040419 [40602960] -> __osm_drop_mgr_remove_switch: ERR 0102: Node 0x001393010b186ba0 not in switch table
Mar 06 11:31:38 040463 [40602960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:38 040474 [40602960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
Mar 06 11:31:38 040486 [40803960] -> osm_si_rcv_process: ERR 3606: SwitchInfo received for nonexistent node with GUID = 0x1393010b186ba0
Mar 06 11:31:38 040587 [40602960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
Mar 06 11:31:44 280928 [40401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0009 TID:0x00000000000000c5
Mar 06 11:31:44 280976 [40A04960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0008 TID:0x00000000000000c9
Mar 06 11:31:44 282252 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:44 282266 [40A04960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III Mellanox Technologies
Mar 06 11:31:44 282274 [40A04960] -> __osm_drop_mgr_remove_switch: ERR 0102: Node 0x001393010b186ba0 not in switch table
Mar 06 11:31:44 282304 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:44 282315 [40A04960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
Mar 06 11:31:44 282327 [40602960] -> osm_si_rcv_process: ERR 3606: SwitchInfo received for nonexistent node with GUID = 0x1393010b186ba0
Mar 06 11:31:44 282441 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
Mar 06 11:31:44 283808 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:44 283821 [40A04960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III Mellanox Technologies
Mar 06 11:31:44 283829 [40A04960] -> __osm_drop_mgr_remove_switch: ERR 0102: Node 0x001393010b186ba0 not in switch table
Mar 06 11:31:44 283859 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:44 283869 [40A04960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
Mar 06 11:31:44 283882 [40401960] -> osm_si_rcv_process: ERR 3606: SwitchInfo received for nonexistent node with GUID = 0x1393010b186ba0
Mar 06 11:31:44 283967 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
Mar 06 11:31:48 047137 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:48 047201 [40A04960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III Mellanox Technologies
Mar 06 11:31:48 047290 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:48 047310 [40A04960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
Mar 06 11:31:48 047451 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
Mar 06 11:31:48 047537 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
                                for parent node GUID = 0x1393010b186ba0, TID = 0x278d
Mar 06 11:31:48 047543 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0
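
To get an overview of a log like this without reading every line, the
errors can be tallied by code. This is only a rough sketch: it assumes
the format seen above, where each error line carries the word "ERR"
followed by a four-character code and a colon, and count_osm_errors is a
made-up name.

```shell
# Hypothetical helper: count occurrences of each "ERR <code>:" in an
# osm log file and print one "<code> <count>" line per code.
count_osm_errors() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "ERR") {
                code = $(i + 1)
                sub(/:$/, "", code)   # drop the trailing colon
                n[code]++
            }
    }
    END { for (c in n) print c, n[c] }' "$1" | sort
}

# Usage (the log path is whatever osm was told to write to):
#   count_osm_errors /path/to/osm.log
```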

-- 
Jean-Christophe Hugly <jice at pantasys.com>
PANTA