[ofa-general] OpenSM "Dead end on path to LID"

Nathan Dauchy Nathan.Dauchy at noaa.gov
Wed Jul 16 16:46:51 PDT 2008


Greetings,

We recently expanded our InfiniBand tree and are running into problems
when all hosts are booted.  Details are below.  Please let me know if
there is a more appropriate forum for this issue.  Thanks!


With fewer than 600 hosts, everything seems to work fine.  With more
than about 650, we start seeing the following symptoms:

# ibdiagnet -o . -lw 4x -pc
-I- Discovering ... 721 nodes (68 Switches & 653 CA-s) discovered.
...
-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-E- Could not get PM info:
    "pmGetPortCounters 0x0139 1" failed 4 consecutive times.
-E- Could not get PM info:
    "pmGetPortCounters 0x0139 4" failed 4 consecutive times.

There are 29 of those "Could not get PM info" errors.
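
To see which ports those failures cluster on, the ibdiagnet output can be
tallied with a short script.  This is only a minimal sketch: it assumes
the two arguments to "pmGetPortCounters" are a LID and a port number, and
the helper name failed_pm_ports is made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
#     "pmGetPortCounters 0x0139 1" failed 4 consecutive times.
# Assumption: first argument is the LID, second is the port number.
PM_FAIL = re.compile(r'"pmGetPortCounters\s+(0x[0-9a-fA-F]+)\s+(\d+)"\s+failed')


def failed_pm_ports(lines):
    """Return {lid: sorted list of port numbers} for failed PM queries."""
    ports = defaultdict(set)
    for line in lines:
        match = PM_FAIL.search(line)
        if match:
            lid = int(match.group(1), 16)
            ports[lid].add(int(match.group(2)))
    return {lid: sorted(nums) for lid, nums in ports.items()}


# The two failures quoted above, used as sample input.
sample = '''-E- Could not get PM info:
    "pmGetPortCounters 0x0139 1" failed 4 consecutive times.
-E- Could not get PM info:
    "pmGetPortCounters 0x0139 4" failed 4 consecutive times.
'''.splitlines()

for lid, port_list in sorted(failed_pm_ports(sample).items()):
    print(f"LID 0x{lid:04x}: ports {port_list}")
```

Grouping by LID should show whether the 29 failures sit on a handful of
switches or are spread across the fabric.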

Basic IB communication still works at this point, but after restarting
the subnet manager, ping via IPoIB stops working between some of the
switches, and a LOT of messages like the following show up in osm.log:

Jul 16 22:32:13 795167 [41E02940] 0x01 -> __osm_pr_rcv_get_path_parms:
ERR 1F07: Dead end on path to LID 0x9 from switch for GUID
0x000002c900000023
Jul 16 22:36:04 895497 [45007940] 0x01 -> __osm_pr_rcv_get_path_parms:
ERR 1F07: Dead end on path to LID 0x5D7 from switch for GUID
0x000002c900000052
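
Since there are too many of these to read by eye, a tally per destination
LID and per switch GUID may help narrow things down.  The following is a
rough sketch, not an OpenSM tool: the function name tally_dead_ends is
hypothetical, and the pattern is built only from the two messages quoted
above (it tolerates the line wrapping introduced by mail clients).

```python
import re
from collections import Counter

# Pattern derived from the ERR 1F07 messages quoted above; \s+ lets it
# match whether the entry is on one line or wrapped across several.
DEAD_END = re.compile(
    r"ERR\s+1F07:\s+Dead\s+end\s+on\s+path\s+to\s+LID\s+(0x[0-9A-Fa-f]+)"
    r"\s+from\s+switch\s+for\s+GUID\s+(0x[0-9A-Fa-f]+)"
)


def tally_dead_ends(log_text):
    """Count ERR 1F07 entries per destination LID and per switch GUID."""
    by_lid = Counter()
    by_guid = Counter()
    for lid, guid in DEAD_END.findall(log_text):
        by_lid[int(lid, 16)] += 1
        by_guid[guid.lower()] += 1
    return by_lid, by_guid


# The two log entries quoted above, used as sample input.
sample_log = """\
Jul 16 22:32:13 795167 [41E02940] 0x01 -> __osm_pr_rcv_get_path_parms:
ERR 1F07: Dead end on path to LID 0x9 from switch for GUID
0x000002c900000023
Jul 16 22:36:04 895497 [45007940] 0x01 -> __osm_pr_rcv_get_path_parms:
ERR 1F07: Dead end on path to LID 0x5D7 from switch for GUID
0x000002c900000052
"""

by_lid, by_guid = tally_dead_ends(sample_log)
for guid, count in by_guid.most_common():
    print(f"switch GUID {guid}: {count} dead-end error(s)")
for lid, count in by_lid.most_common():
    print(f"destination LID 0x{lid:x}: {count} dead-end error(s)")
```

If a few GUIDs or LIDs dominate the counts, that would point at specific
switches or hosts rather than the fabric as a whole.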

I have tried modifying "opensm.conf" to include:
	LMC=0 (was 2)
	TIMEOUT=500 (was 200)
but that did not seem to help.

Subnet manager host is running CentOS-5.1, kernel 2.6.18-53.1.21.el5,
OFED-1.3.1, OpenSM 3.1.11

Hosts are running one of:
	RHEL-4.4, kernel 2.6.20.20, OFED-1.2.5.1
	CentOS-5.1, kernel 2.6.22.19, OFED-1.3.1
	Storage vendor OS based on CentOS, kernel 2.6.9-42.0.10.ELsmp, OFED-1.2.5.1

Can anyone suggest a fix or other diagnostics we can run to help narrow
down the problem?

-Nathan



