[ofa-general] OpenSM "Dead end on path to LID"
Nathan Dauchy
Nathan.Dauchy at noaa.gov
Wed Jul 16 18:35:08 PDT 2008
Nathan Dauchy wrote:
> Greetings,
>
> We have recently expanded our Infiniband tree and are running into
> problems when all hosts are booted. Details are below. Please let me
> know if there is a more appropriate forum for this issue. Thanks!
>
>
> With fewer than 600 hosts, everything seems to be working fine. With
> more than 650 or so, we start seeing the following symptoms:
>
> # ibdiagnet -o . -lw 4x -pc
> -I- Discovering ... 721 nodes (68 Switches & 653 CA-s) discovered.
> ...
> -I---------------------------------------------------
> -I- PM Counters Info
> -I---------------------------------------------------
> -E- Could not get PM info:
> "pmGetPortCounters 0x0139 1" failed 4 consecutive times.
> -E- Could not get PM info:
> "pmGetPortCounters 0x0139 4" failed 4 consecutive times.
>
> There are 29 of those "Could not get PM info" errors.
>
> Basic IB communication still works at this point, but after restarting
> the subnet manager, ping via IPoIB stops working between some of the
> switches, and a LOT of messages like the following show up in osm.log:
>
> Jul 16 22:32:13 795167 [41E02940] 0x01 -> __osm_pr_rcv_get_path_parms:
> ERR 1F07: Dead end on path to LID 0x9 from switch for GUID
> 0x000002c900000023
> Jul 16 22:36:04 895497 [45007940] 0x01 -> __osm_pr_rcv_get_path_parms:
> ERR 1F07: Dead end on path to LID 0x5D7 from switch for GUID
> 0x000002c900000052
>
Looking through osm.log a bit more, I also found a handful of errors
like these:
Jul 17 01:31:29 345329 [46E0A940] 0x01 ->
__osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for
node 0x000002c900000048(MT47396 Infiniscale-III Mellanox Technologies)
port 14. Adding to light sweep sampling list
Jul 17 01:31:29 345340 [46E0A940] 0x01 -> Directed Path Dump of 4 hop path:
Path = 0,1,20,7,15
Jul 17 01:31:29 345381 [46E0A940] 0x01 ->
__osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for
node 0x000002c900000049(MT47396 Infiniscale-III Mellanox Technologies)
port 15. Adding to light sweep sampling list
Jul 17 01:31:29 345390 [46E0A940] 0x01 -> Directed Path Dump of 3 hop path:
Path = 0,1,22,11
Does that indicate a problem as well?
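To gauge how widespread these are, a rough tally per switch GUID might help. The following is only a minimal Python sketch, assuming the messages in osm.log match the excerpts quoted above; the function name summarize and both regular expressions are illustrative and not part of OpenSM:

```python
import re
from collections import Counter

# Patterns are inferred from the log excerpts in this thread (an assumption);
# any other ERR codes in the log are ignored.
DEAD_END = re.compile(
    r"ERR 1F07: Dead end on path to LID (0x[0-9A-Fa-f]+) "
    r"from switch for GUID (0x[0-9A-Fa-f]+)"
)
UNKNOWN_REMOTE = re.compile(
    r"ERR 0108: Unknown remote side for node (0x[0-9A-Fa-f]+)"
    r"\([^)]*\) port (\d+)"
)


def summarize(log_text):
    """Return (dead-end count per switch GUID, sorted (node GUID, port) pairs)."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    dead_ends = Counter(guid for _lid, guid in DEAD_END.findall(flat))
    unknown = sorted({(guid, int(port)) for guid, port in UNKNOWN_REMOTE.findall(flat)})
    return dead_ends, unknown
```

Calling summarize() on the contents of osm.log (wherever your OpenSM instance writes it) gives a count of 1F07 errors per switch GUID and the list of node/port pairs that reported ERR 0108, which should show whether the errors cluster on a few switches.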
Thanks,
Nathan