[Users] InfiniBand Troubleshooting

Kenja, Krishna (kenjakt) kenjakt at mail.uc.edu
Sun Oct 5 11:02:05 PDT 2014


You are right, I don't see the device driver. Here is the output from "lsmod | grep ib":

ib_ipoib               81001  0
ib_ucm                 12121  0
ib_uverbs              36124  2 rdma_ucm,ib_ucm
ib_umad                11802  0
ib_cm                  36580  3 ib_ipoib,ib_ucm,rdma_cm
ib_addr                 6440  2 rdma_ucm,rdma_cm
ib_sa                  23964  4 ib_ipoib,rdma_ucm,rdma_cm,ib_cm
ib_mad                 39162  3 ib_umad,ib_cm,ib_sa
ib_core                74355  10 ib_ipoib,rdma_ucm,ib_ucm,ib_uverbs,ib_umad,rdma_cm,ib_cm,iw_cm,ib_sa,ib_mad
libfcoe                56791  2 bnx2fc,fcoe
libfc                 108670  3 bnx2fc,fcoe,libfcoe
scsi_transport_fc      55299  3 bnx2fc,fcoe,libfc
ipv6                  317829  207 ib_ipoib,ib_addr,cnic

And "lsmod | grep mlx" returned nothing.

So how do you suggest I rectify this?

Regards
Krishna
________________________________________
From: Weiny, Ira <ira.weiny at intel.com>
Sent: Sunday, October 5, 2014 1:57 PM
To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
Subject: RE: [Users] InfiniBand Troubleshooting

lsmod | grep mlx

Or

lsmod | grep ib

Make sure you see the device driver (mlx4_ib, I think) and the ib_umad module.
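
A minimal sketch of that check, for readers who want it scripted. The function name check_modules is hypothetical; the module names are the two mentioned above (mlx4_ib is only a guess there, so adjust it for your adapter). It reads lsmod-style text on stdin, so it can also be run against a saved listing.

```shell
# Report which of the named kernel modules are absent from lsmod-style
# input on stdin. Prints one "missing: NAME" line per absent module and
# returns non-zero if any are missing.
# NOTE: check_modules is an illustrative helper, not part of any IB package.
check_modules() {
    listing=$(cat)
    status=0
    for mod in "$@"; do
        # lsmod prints the module name in the first column.
        if ! printf '%s\n' "$listing" |
            awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
            echo "missing: $mod"
            status=1
        fi
    done
    return "$status"
}

# Usage on a live system:
#   lsmod | check_modules mlx4_ib ib_umad
```

Fed the "lsmod | grep ib" listing quoted in this thread, it should print only "missing: mlx4_ib", since ib_umad is present there.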

Ira

________________________________
From: Kenja, Krishna (kenjakt)
Sent: Sunday, October 05, 2014 10:51:35 AM
To: Weiny, Ira; users at lists.openfabrics.org
Subject: Re: [Users] InfiniBand Troubleshooting

"lspci | grep Mell" returned "27:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]"

How do I make sure that the driver for HCA is loaded properly?

Regards
Krishna
________________________________________
From: Weiny, Ira <ira.weiny at intel.com>
Sent: Sunday, October 5, 2014 1:47 PM
To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
Subject: RE: [Users] InfiniBand Troubleshooting

If ibstat is not working, are you sure the driver for your HCA is loaded properly?


lspci and lsmod can help here.


The output you show indicates no HCAs are present.

Ira


________________________________
From: users-bounces at lists.openfabrics.org on behalf of Kenja, Krishna (kenjakt)
Sent: Sunday, October 05, 2014 10:35:02 AM
To: users at lists.openfabrics.org
Subject: [Users] InfiniBand Troubleshooting


We have a Mellanox MT27500 Family, ConnectX-3 FDR InfiniBand card set up in the cluster. Everything was working fine until a week ago, when InfiniBand suddenly stopped working for no apparent reason. I have been trying to troubleshoot this issue with no success and am in need of some help.

When I try to start the subnet manager on the master node using the command,

[user at server ~]# /etc/init.d/opensm start

I get an error saying it failed to start, and the following message gets logged in the log file.

Sep 30 10:36:58 137756 [DE707700] 0x80 -> OpenSM 3.3.15
Entering DISCOVERING state

Sep 30 10:36:58 144767 [DE707700] 0x02 -> osm_vendor_init: 1000 pending umads specified
Sep 30 10:36:58 148482 [DE707700] 0x80 -> Entering DISCOVERING state

No local ports detected!
Sep 30 10:36:58 148959 [DE707700] 0x01 -> perfmgr_mad_unbind: ERR 5405: No previous bind
Sep 30 10:36:58 148969 [DE707700] 0x01 -> osm_congestion_control_shutdown: ERR C108: No previous bind
Sep 30 10:36:58 149163 [DE707700] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind
Exiting SM

The most curious thing is that the command ibstat returns nothing, which is making it really hard for me to troubleshoot this issue. However, trying it in debug mode gives the following output.

[user at server ~] ibstat -dd
ibwarn: [29989] umad_init: umad_init
ibwarn: [29989] umad_get_cas_names: max 32
ibwarn: [29989] umad_get_cas_names: return 0 cas
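
The last line of that debug output is the telling one: umad found zero channel adapters. A small sketch for pulling that number out in a script follows. The function name cas_count is hypothetical, and it assumes the "umad_get_cas_names: return N cas" wording shown above. Whether ibstat writes these ibwarn lines to stdout or stderr is also an assumption, hence the 2>&1 in the usage comment.

```shell
# Extract the number of channel adapters (CAs) reported in ibstat debug
# output read from stdin. Prints the count from the first matching line,
# or returns non-zero if no "umad_get_cas_names: return N cas" line exists.
# NOTE: cas_count is an illustrative helper, not an infiniband-diags tool.
cas_count() {
    awk '/umad_get_cas_names: return [0-9]+ cas/ {
             # The count is the field right after the word "return".
             for (i = 1; i < NF; i++)
                 if ($i == "return") { print $(i + 1); found = 1; exit }
         }
         END { exit !found }'
}

# Usage on a live system:
#   ibstat -dd 2>&1 | cas_count
```

A result of 0 means the umad layer sees no adapters at all, which is consistent with OpenSM reporting "No local ports detected!".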

I am more than willing to provide any other information you need to get to the bottom of it.

Any help is greatly appreciated!


