[Users] InfiniBand Troubleshooting
Weiny, Ira
ira.weiny at intel.com
Sun Oct 5 16:30:32 PDT 2014
> -----Original Message-----
> From: Kenja, Krishna (kenjakt) [mailto:kenjakt at mail.uc.edu]
>
> You are right, I don't see the device driver. Here is the output from "lsmod |
> grep ib"
>
> ib_ipoib 81001 0
> ib_ucm 12121 0
> ib_uverbs 36124 2 rdma_ucm,ib_ucm
> ib_umad 11802 0
> ib_cm 36580 3 ib_ipoib,ib_ucm,rdma_cm
> ib_addr 6440 2 rdma_ucm,rdma_cm
> ib_sa 23964 4 ib_ipoib,rdma_ucm,rdma_cm,ib_cm
> ib_mad 39162 3 ib_umad,ib_cm,ib_sa
> ib_core 74355 10 ib_ipoib,rdma_ucm,ib_ucm,ib_uverbs,ib_umad,rdma_cm,ib_cm,iw_cm,ib_sa,ib_mad
> libfcoe 56791 2 bnx2fc,fcoe
> libfc 108670 3 bnx2fc,fcoe,libfcoe
> scsi_transport_fc 55299 3 bnx2fc,fcoe,libfc
> ipv6 317829 207 ib_ipoib,ib_addr,cnic
>
> And "lsmod | grep mlx" returned nothing.
>
> So how do you suggest I rectify this?
Because you stated that this "used to work", I would suggest following whatever procedure you used before to load those drivers.
With RHEL, the startup script that loads the RDMA drivers is "rdma". I would have to look up how other distros start the RDMA stack. OFED used to use openibd or something like that.
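A rough sketch of checking by hand which modules are missing, before reaching for the startup script. It assumes the mlx4 family (mlx4_core, mlx4_ib) is the right driver set for a ConnectX-3, and the helper name missing_modules is made up for this example:

```shell
#!/bin/sh
# Sketch: report which expected kernel modules are absent from lsmod output.
# Assumption: a ConnectX-3 HCA needs mlx4_core and mlx4_ib, plus ib_umad
# for tools such as ibstat and opensm.

# missing_modules MODULE... : reads lsmod-style text on stdin and prints
# each named module that does not appear in the first column.
missing_modules() {
    loaded=$(awk '{print $1}')
    for m in "$@"; do
        printf '%s\n' "$loaded" | grep -qx "$m" || printf '%s\n' "$m"
    done
}

# Only query the live system when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    for m in $(lsmod | missing_modules mlx4_core mlx4_ib ib_umad); do
        # Loading needs root; on RHEL the "rdma" script normally does this.
        echo "not loaded: $m (as root, try: modprobe $m)"
    done
fi
```

If modprobe itself fails, its error message and the kernel log are usually the next place to look.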
Furthermore, I suggest that "something" must have changed for the driver to now be failing. Have you looked at dmesg or the syslog to see whether the driver is trying to load and is getting errors?
Did you update your kernel? Your distro? Your OFED distribution?
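Two quick checks along those lines, as a sketch only. It assumes the HCA driver is mlx4_ib and that modules live under /lib/modules/<kernel release>; the helper names module_files and driver_messages are made up for this example:

```shell
#!/bin/sh
# Sketch: two quick checks for "what changed?".
# Assumptions: the driver is mlx4_ib and kernel modules are installed
# under /lib/modules/<kernel release>.

# module_files ROOT MODULE : list files for MODULE under ROOT
# (also matches compressed variants such as .ko.xz).
module_files() {
    find "$1" -type f -name "$2.ko*" 2>/dev/null
}

# driver_messages PATTERN : keep only log lines on stdin that mention PATTERN.
driver_messages() {
    grep -i -- "$1"
}

# After a kernel update, modules built for the old kernel may be absent here.
kdir="/lib/modules/$(uname -r)"
if [ -z "$(module_files "$kdir" mlx4_ib)" ]; then
    echo "no mlx4_ib module found under $kdir"
fi

# Show what the kernel logged about the driver, if dmesg is readable.
if command -v dmesg >/dev/null 2>&1; then
    dmesg 2>/dev/null | driver_messages mlx4 || echo "no mlx4 messages in dmesg"
fi
```

An empty result from the first check would point at a kernel or OFED update that left no driver built for the running kernel.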
Ira
>
> Regards
> Krishna
> ________________________________________
> From: Weiny, Ira <ira.weiny at intel.com>
> Sent: Sunday, October 5, 2014 1:57 PM
> To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
> Subject: RE: [Users] InfiniBand Troubleshooting
>
> lsmod | grep mlx
>
> Or
>
> lsmod | grep ib
>
> Make sure you see the device driver (mlx4_ib I think) and the ib_umad
> module.
>
> Ira
>
> ________________________________
> From: Kenja, Krishna (kenjakt)
> Sent: Sunday, October 05, 2014 10:51:35 AM
> To: Weiny, Ira; users at lists.openfabrics.org
> Subject: Re: [Users] InfiniBand Troubleshooting
>
> "lspci | grep Mell" returned "27:00.0 Network controller: Mellanox
> Technologies MT27500 Family [ConnectX-3]"
>
> How do I make sure that the driver for HCA is loaded properly?
>
> Regards
> Krishna
> ________________________________________
> From: Weiny, Ira <ira.weiny at intel.com>
> Sent: Sunday, October 5, 2014 1:47 PM
> To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
> Subject: RE: [Users] InfiniBand Troubleshooting
>
> If ibstat is not working, are you sure the driver for your HCA is loaded properly?
>
> lspci and lsmod can help here.
>
> The output you show indicates no HCAs are present.
>
> Ira
>
>
> ________________________________
> From: users-bounces at lists.openfabrics.org on behalf of Kenja, Krishna (kenjakt)
> Sent: Sunday, October 05, 2014 10:35:02 AM
> To: users at lists.openfabrics.org
> Subject: [Users] InfiniBand Troubleshooting
>
>
> We have a Mellanox MT27500 Family, ConnectX-3 FDR InfiniBand card set up
> in the cluster. Everything was working fine until a week ago when InfiniBand
> suddenly stopped working for no apparent reason. I have been trying to
> troubleshoot this issue with no success and am in need of some help.
>
> When I try to start the subnet manager on the master node using the
> command,
>
> [user at server ~]# /etc/init.d/opensm start
>
> I get an error saying it failed to start and the following message gets logged in
> the log file.
>
> Sep 30 10:36:58 137756 [DE707700] 0x80 -> OpenSM 3.3.15 Entering DISCOVERING state
> Sep 30 10:36:58 144767 [DE707700] 0x02 -> osm_vendor_init: 1000 pending umads specified
> Sep 30 10:36:58 148482 [DE707700] 0x80 -> Entering DISCOVERING state
> No local ports detected!
> Sep 30 10:36:58 148959 [DE707700] 0x01 -> perfmgr_mad_unbind: ERR 5405: No previous bind
> Sep 30 10:36:58 148969 [DE707700] 0x01 -> osm_congestion_control_shutdown: ERR C108: No previous bind
> Sep 30 10:36:58 149163 [DE707700] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind
> Exiting SM
>
> The most curious thing is that the command ibstat returns nothing which is
> making it really hard for me to troubleshoot this issue. However trying it in
> debug mode gives the following output.
>
> [user at server ~] ibstat -dd
> ibwarn: [29989] umad_init: umad_init
> ibwarn: [29989] umad_get_cas_names: max 32
> ibwarn: [29989] umad_get_cas_names: return 0 cas
>
> I am more than willing to provide any other information you need to get to the
> bottom of it.
>
> Any help is greatly appreciated!