[ofa-general] OpenSM "Dead end on path to LID"

Nathan Dauchy Nathan.Dauchy at noaa.gov
Wed Jul 16 18:35:08 PDT 2008


Nathan Dauchy wrote:
> Greetings,
> 
> We have recently expanded our InfiniBand tree and are running into
> problems when all hosts are booted.  Details are below.  Please let me
> know if there is a more appropriate forum for this issue.  Thanks!
> 
> 
> With fewer than 600 hosts, everything seems to be working fine.  With
> more than 650 or so, we start seeing the following symptoms:
> 
> # ibdiagnet -o . -lw 4x -pc
> -I- Discovering ... 721 nodes (68 Switches & 653 CA-s) discovered.
> ...
> -I---------------------------------------------------
> -I- PM Counters Info
> -I---------------------------------------------------
> -E- Could not get PM info:
>     "pmGetPortCounters 0x0139 1" failed 4 consecutive times.
> -E- Could not get PM info:
>     "pmGetPortCounters 0x0139 4" failed 4 consecutive times.
> 
> There are 29 of those "Could not get PM info" errors.
> 
> Basic IB communication still works at this point, but after restarting
> the subnet manager, ping via IPoIB stops working between some of the
> switches, and a LOT of messages like the following show up in osm.log:
> 
> Jul 16 22:32:13 795167 [41E02940] 0x01 -> __osm_pr_rcv_get_path_parms:
> ERR 1F07: Dead end on path to LID 0x9 from switch for GUID
> 0x000002c900000023
> Jul 16 22:36:04 895497 [45007940] 0x01 -> __osm_pr_rcv_get_path_parms:
> ERR 1F07: Dead end on path to LID 0x5D7 from switch for GUID
> 0x000002c900000052
> 

Looking through osm.log a bit more, I also found a handful of errors
like these:

Jul 17 01:31:29 345329 [46E0A940] 0x01 ->
__osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for
node 0x000002c900000048(MT47396 Infiniscale-III Mellanox Technologies)
port 14. Adding to light sweep sampling list
Jul 17 01:31:29 345340 [46E0A940] 0x01 -> Directed Path Dump of 4 hop path:
                                Path = 0,1,20,7,15
Jul 17 01:31:29 345381 [46E0A940] 0x01 ->
__osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for
node 0x000002c900000049(MT47396 Infiniscale-III Mellanox Technologies)
port 15. Adding to light sweep sampling list
Jul 17 01:31:29 345390 [46E0A940] 0x01 -> Directed Path Dump of 3 hop path:
                                Path = 0,1,22,11

Does that indicate a problem as well?
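
In case it helps with triage, here is a minimal sketch for tallying both
error types per switch GUID. It assumes only the message layout shown in
the excerpts above; the function name summarize and both regex names are
made up for illustration and are not part of OpenSM. Whitespace is
collapsed first so that entries wrapped by a mail client still match.

```python
import re
from collections import Counter

# Patterns are assumptions based only on the log excerpts quoted above.
DEAD_END = re.compile(
    r"ERR 1F07: Dead end on path to LID (0x[0-9A-Fa-f]+) "
    r"from switch for GUID (0x[0-9A-Fa-f]+)"
)
UNKNOWN_REMOTE = re.compile(
    r"ERR 0108: Unknown remote side for node (0x[0-9A-Fa-f]+)"
    r"\([^)]*\) port (\d+)"
)


def summarize(log_text):
    """Count ERR 1F07 and ERR 0108 entries found in osm.log text.

    Returns two Counters:
      dead_ends keyed by (switch_guid, destination_lid)
      unknown   keyed by (node_guid, port_number)
    Hex strings are lowercased so that keys compare consistently.
    """
    # Join wrapped lines into one stream separated by single spaces.
    flat = " ".join(log_text.split())
    dead_ends = Counter(
        (guid.lower(), lid.lower()) for lid, guid in DEAD_END.findall(flat)
    )
    unknown = Counter(
        (guid.lower(), int(port)) for guid, port in UNKNOWN_REMOTE.findall(flat)
    )
    return dead_ends, unknown
```

Calling summarize(open(path).read()) on the log file and printing
dead_ends.most_common(10) should show whether the 1F07 errors cluster on
a few switches or are spread across the whole fabric.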

Thanks,
Nathan


