[Users] InfiniBand Troubleshooting

Kenja, Krishna (kenjakt) kenjakt at mail.uc.edu
Mon Oct 6 06:49:22 PDT 2014


RDMA status says that it can't find any low level hardware support loaded (this is supposed to be mlx4_ib). So I restarted the rdma service using:

service rdma stop
service rdma start

But I still get the same result.

service rdma status
Low level hardware support loaded:
	none found

Upper layer protocol modules:
	ib_ipoib 

User space access modules:
	rdma_ucm ib_ucm ib_uverbs ib_umad 

Connection management modules:
	rdma_cm ib_cm iw_cm 

Configured IPoIB interfaces: none
Currently active IPoIB interfaces: none

So what do you think is happening here?
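
A minimal sketch of how the missing low-level driver could be checked for and loaded by hand. It assumes mlx4_core and mlx4_ib are the right modules for this ConnectX-3 card; the helper name missing_mlx4_modules is made up for illustration and is not part of any package.

```shell
# Print which of the expected mlx4 modules are absent from lsmod-style
# text read on stdin (first column = module name).
# Assumption: mlx4_core and mlx4_ib are the modules this HCA needs.
missing_mlx4_modules() {
    loaded=$(awk '{print $1}')
    for mod in mlx4_core mlx4_ib; do
        printf '%s\n' "$loaded" | grep -qx "$mod" || printf '%s\n' "$mod"
    done
}

# Possible use, as root (modprobe only succeeds if the module exists
# for the running kernel):
#   for m in $(lsmod | missing_mlx4_modules); do modprobe "$m"; done
#   service rdma status
```

If modprobe itself reports an error, that message is probably the most useful thing to post back to the list.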
________________________________________
From: Weiny, Ira <ira.weiny at intel.com>
Sent: Sunday, October 5, 2014 7:30 PM
To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
Subject: RE: [Users] InfiniBand Troubleshooting

> -----Original Message-----
> From: Kenja, Krishna (kenjakt) [mailto:kenjakt at mail.uc.edu]
>
> You are right, I don't see the device driver. Here is the output from "lsmod |
> grep ib"
>
> ib_ipoib               81001  0
> ib_ucm                 12121  0
> ib_uverbs              36124  2 rdma_ucm,ib_ucm
> ib_umad                11802  0
> ib_cm                  36580  3 ib_ipoib,ib_ucm,rdma_cm
> ib_addr                 6440  2 rdma_ucm,rdma_cm
> ib_sa                  23964  4 ib_ipoib,rdma_ucm,rdma_cm,ib_cm
> ib_mad                 39162  3 ib_umad,ib_cm,ib_sa
> ib_core                74355  10 ib_ipoib,rdma_ucm,ib_ucm,ib_uverbs,ib_umad,rdma_cm,ib_cm,iw_cm,ib_sa,ib_mad
> libfcoe                56791  2 bnx2fc,fcoe
> libfc                 108670  3 bnx2fc,fcoe,libfcoe
> scsi_transport_fc      55299  3 bnx2fc,fcoe,libfc
> ipv6                  317829  207 ib_ipoib,ib_addr,cnic
>
> And "lsmod | grep mlx" returned nothing.
>
> So how do you suggest I rectify this?

Because you stated that this "used to work", I would suggest following whatever procedure you had before to load those drivers.

With RHEL the startup script to load RDMA drivers is "rdma".  I would have to look up how other distros start the rdma stack.  OFED used to use openibd or something like that.

Furthermore I suggest that "something" must have changed for the driver to now be failing.  Have you looked at dmesg or the syslog to see if the driver is trying to load and is getting some errors?
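
One way the log check could look, as a rough sketch. The helper name mlx4_log_lines is hypothetical, and /var/log/messages is only the usual RHEL syslog location, so adjust the path for other setups.

```shell
# Keep only the lines that mention the mlx4 driver family.
# Reads log text on stdin, so it works with dmesg or a saved log file.
# Returns success even when nothing matches (empty output = no messages).
mlx4_log_lines() {
    grep -i 'mlx4' || true
}

# Possible use (reading the kernel log may require root):
#   dmesg | mlx4_log_lines
#   mlx4_log_lines < /var/log/messages   # assumed syslog path
```

No output at all would suggest the driver is never being asked to load, rather than loading and failing.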

Did you update your kernel?  Your distro?  Some OFED distro?
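
If the kernel was updated, one thing worth checking is whether the driver module is installed for the kernel that is actually running. A small sketch, assuming the standard /lib/modules/<version> layout; find_module_file is a made-up helper name.

```shell
# Print any files named <module>.ko (optionally compressed, e.g. .ko.xz)
# under a kernel's module tree. Prints nothing if the module is absent.
find_module_file() {
    moddir=$1   # e.g. /lib/modules/$(uname -r)
    module=$2   # e.g. mlx4_ib
    find "$moddir" -type f -name "${module}.ko*" 2>/dev/null
}

# Possible use:
#   uname -r
#   find_module_file "/lib/modules/$(uname -r)" mlx4_ib
#   find_module_file "/lib/modules/$(uname -r)" mlx4_core
```

Empty output for the running kernel, with the files present under an older kernel's directory, would point at the update as the cause.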

Ira



>
> Regards
> Krishna
> ________________________________________
> From: Weiny, Ira <ira.weiny at intel.com>
> Sent: Sunday, October 5, 2014 1:57 PM
> To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
> Subject: RE: [Users] InfiniBand Troubleshooting
>
> lsmod | grep mlx
>
> Or
>
> lsmod | grep ib
>
> Make sure you see the device driver (mlx4_ib  I think) and the ib_umad
> module.
>
> Ira
>
> ________________________________
> From: Kenja, Krishna (kenjakt)
> Sent: Sunday, October 05, 2014 10:51:35 AM
> To: Weiny, Ira; users at lists.openfabrics.org
> Subject: Re: [Users] InfiniBand Troubleshooting
>
> "lspci | grep Mell" returned "27:00.0 Network controller: Mellanox
> Technologies MT27500 Family [ConnectX-3]"
>
> How do I make sure that the driver for HCA is loaded properly?
>
> Regards
> Krishna
> ________________________________________
> From: Weiny, Ira <ira.weiny at intel.com>
> Sent: Sunday, October 5, 2014 1:47 PM
> To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
> Subject: RE: [Users] InfiniBand Troubleshooting
>
> If ibstat is not working, are you sure the driver for your HCA is loaded properly?
>
>
> lspci and lsmod can help here.
>
>
> The output you show indicates no HCAs are present.
>
> Ira
>
>
> ________________________________
> From: users-bounces at lists.openfabrics.org on behalf of Kenja, Krishna (kenjakt)
> Sent: Sunday, October 05, 2014 10:35:02 AM
> To: users at lists.openfabrics.org
> Subject: [Users] InfiniBand Troubleshooting
>
>
> We have a Mellanox MT27500 Family, ConnectX-3 FDR InfiniBand card set up
> in the cluster. Everything was working fine until a week ago when InfiniBand
> suddenly stopped working for no apparent reason. I have been trying to
> troubleshoot this issue with no success and am in need of some help.
>
> When I try to start the subnet manager on the master node using the
> command,
>
> [user at server ~]# /etc/init.d/opensm start
>
> I get an error saying it failed to start and the following message gets logged in
> the log file.
>
> Sep 30 10:36:58 137756 [DE707700] 0x80 -> OpenSM 3.3.15 Entering DISCOVERING state
>
> Sep 30 10:36:58 144767 [DE707700] 0x02 -> osm_vendor_init: 1000 pending umads specified
> Sep 30 10:36:58 148482 [DE707700] 0x80 -> Entering DISCOVERING state
>
> No local ports detected!
> Sep 30 10:36:58 148959 [DE707700] 0x01 -> perfmgr_mad_unbind: ERR 5405: No previous bind
> Sep 30 10:36:58 148969 [DE707700] 0x01 -> osm_congestion_control_shutdown: ERR C108: No previous bind
> Sep 30 10:36:58 149163 [DE707700] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind
> Exiting SM
>
> The most curious thing is that the command ibstat returns nothing, which is
> making it really hard for me to troubleshoot this issue. However, trying it in
> debug mode gives the following output.
>
> [user at server ~] ibstat -dd
> ibwarn: [29989] umad_init: umad_init
> ibwarn: [29989] umad_get_cas_names: max 32
> ibwarn: [29989] umad_get_cas_names: return 0 cas
>
> I am more than willing to provide any other information you need to get to the
> bottom of it.
>
> Any help is greatly appreciated!


