[Users] InfiniBand Troubleshooting
Kenja, Krishna (kenjakt)
kenjakt at mail.uc.edu
Mon Oct 6 07:48:45 PDT 2014
I checked dmesg and it pointed me to the libmlx4.conf file. Fixing the contents of that file solved the problem.
Thank you all for the help.
Regards
Krishna
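
For anyone who lands on this thread later, here is a minimal sketch of how the module configuration could be reviewed once dmesg points at a file such as libmlx4.conf. The thread does not say where the file lives or what was wrong with it; the /etc/modprobe.d directory in the usage note and the helper name conf_lines_matching are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of any package.
# Print the non-comment lines that mention PATTERN in every *.conf file
# under DIR, prefixed with file name and line number for review.
conf_lines_matching() {
    local dir=$1 pattern=$2
    # -H forces the file name even when only one file matches the glob.
    grep -Hn -- "$pattern" "$dir"/*.conf 2>/dev/null |
        grep -v -E '^[^:]+:[0-9]+:[[:space:]]*#'
}
```

Assuming the configuration is under /etc/modprobe.d, it could be called as: conf_lines_matching /etc/modprobe.d mlx4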
________________________________________
From: Rupert Dance <rsdance at soft-forge.com>
Sent: Monday, October 6, 2014 10:26 AM
To: Kenja, Krishna (kenjakt); 'Weiny, Ira'; users at lists.openfabrics.org
Subject: RE: [Users] InfiniBand Troubleshooting
Have you tried pulling the card and either re-seating it or putting it in another PCIe slot? If the card is then discovered, make sure you are running the latest firmware.
-----Original Message-----
From: users-bounces at lists.openfabrics.org [mailto:users-bounces at lists.openfabrics.org] On Behalf Of Kenja, Krishna (kenjakt)
Sent: Monday, October 06, 2014 9:49 AM
To: Weiny, Ira; users at lists.openfabrics.org
Subject: Re: [Users] InfiniBand Troubleshooting
RDMA status says that it can't find any low level hardware support loaded (this is supposed to be mlx4_ib). So I restarted the rdma service using
service rdma stop
service rdma start
But I still get the same result.
service rdma status
Low level hardware support loaded:
none found
Upper layer protocol modules:
ib_ipoib
User space access modules:
rdma_ucm ib_ucm ib_uverbs ib_umad
Connection management modules:
rdma_cm ib_cm iw_cm
Configured IPoIB interfaces: none
Currently active IPoIB interfaces: none
So what do you think is happening here?
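
The "none found" entry in the status output above corresponds to the low level driver being absent from lsmod. A minimal sketch of that check follows, assuming the expected modules are mlx4_core and mlx4_ib plus ib_umad. mlx4_ib and ib_umad are named in this thread; mlx4_core is an assumption about this card's driver stack, and the helper name missing_modules is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read lsmod-style output on stdin and print each
# module name given as an argument that is NOT in the first column.
missing_modules() {
    local loaded m
    loaded=$(awk '{ print $1 }')   # first column holds the module names
    for m in "$@"; do
        # -x: whole-line match, -F: treat the name as a fixed string
        if ! printf '%s\n' "$loaded" | grep -qxF -- "$m"; then
            printf '%s\n' "$m"
        fi
    done
}
```

It could be used as: lsmod | missing_modules mlx4_core mlx4_ib ib_umad
Any name it prints is a module that is not loaded.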
________________________________________
From: Weiny, Ira <ira.weiny at intel.com>
Sent: Sunday, October 5, 2014 7:30 PM
To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
Subject: RE: [Users] InfiniBand Troubleshooting
> -----Original Message-----
> From: Kenja, Krishna (kenjakt) [mailto:kenjakt at mail.uc.edu]
>
> You are right, I don't see the device driver. Here is the output from
> "lsmod | grep ib"
>
> ib_ipoib 81001 0
> ib_ucm 12121 0
> ib_uverbs 36124 2 rdma_ucm,ib_ucm
> ib_umad 11802 0
> ib_cm 36580 3 ib_ipoib,ib_ucm,rdma_cm
> ib_addr 6440 2 rdma_ucm,rdma_cm
> ib_sa 23964 4 ib_ipoib,rdma_ucm,rdma_cm,ib_cm
> ib_mad 39162 3 ib_umad,ib_cm,ib_sa
> ib_core 74355 10 ib_ipoib,rdma_ucm,ib_ucm,ib_uverbs,ib_umad,rdma_cm,ib_cm,iw_cm,ib_sa,ib_mad
> libfcoe 56791 2 bnx2fc,fcoe
> libfc 108670 3 bnx2fc,fcoe,libfcoe
> scsi_transport_fc 55299 3 bnx2fc,fcoe,libfc
> ipv6 317829 207 ib_ipoib,ib_addr,cnic
>
> And "lsmod | grep mlx" returned nothing.
>
> So how do you suggest I rectify this?
Because you stated that this "used to work" I would suggest following whatever procedure you had before to load those drivers.
With RHEL the start up script to load RDMA drivers is "rdma". I would have to look up how other distros start the rdma stack. OFED used to use openibd or something like that.
Furthermore I suggest that "something" must have changed for the driver to now be failing. Have you looked at dmesg or the syslog to see if the driver is trying to load and is getting some errors?
Did you update your kernel? Your distro? Some OFED distro?
Ira
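
A minimal sketch of the dmesg/syslog check suggested above: it filters kernel log text for lines that mention the driver, so load failures are easier to spot. The default search string mlx4 and the helper name driver_log_lines are assumptions; the actual messages depend on the kernel and driver version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read log text on stdin and print only the lines that
# mention the given driver name (case-insensitive, fixed-string match).
# Returns non-zero when no line matches.
driver_log_lines() {
    local driver=${1:-mlx4}
    grep -iF -- "$driver"
}
```

It could be used as: dmesg | driver_log_lines mlx4
The same filter can be fed from a syslog file, for example /var/log/messages (an assumed path that varies by distro).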
>
> Regards
> Krishna
> ________________________________________
> From: Weiny, Ira <ira.weiny at intel.com>
> Sent: Sunday, October 5, 2014 1:57 PM
> To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
> Subject: RE: [Users] InfiniBand Troubleshooting
>
> lsmod | grep mlx
>
> Or
>
> lsmod | grep ib
>
> Make sure you see the device driver (mlx4_ib I think) and the ib_umad
> module.
>
> Ira
>
> ________________________________
> From: Kenja, Krishna (kenjakt)
> Sent: Sunday, October 05, 2014 10:51:35 AM
> To: Weiny, Ira; users at lists.openfabrics.org
> Subject: Re: [Users] InfiniBand Troubleshooting
>
> "lspci | grep Mell" returned "27:00.0 Network controller: Mellanox
> Technologies MT27500 Family [ConnectX-3]"
>
> How do I make sure that the driver for HCA is loaded properly?
>
> Regards
> Krishna
> ________________________________________
> From: Weiny, Ira <ira.weiny at intel.com>
> Sent: Sunday, October 5, 2014 1:47 PM
> To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
> Subject: RE: [Users] InfiniBand Troubleshooting
>
> If ibstat is not working are you sure the driver for your HCA is loaded properly?
>
>
> lspci and lsmod can help here.
>
>
> The output you show indicates no HCAs are present.
>
> Ira
>
>
> ________________________________
> From: users-bounces at lists.openfabrics.org on behalf of Kenja, Krishna
> (kenjakt)
> Sent: Sunday, October 05, 2014 10:35:02 AM
> To: users at lists.openfabrics.org
> Subject: [Users] InfiniBand Troubleshooting
>
>
> We have a Mellanox MT27500 Family, ConnectX-3 FDR InfiniBand card set
> up in the cluster. Everything was working fine until a week ago when
> InfiniBand suddenly stopped working for no apparent reason. I have
> been trying to troubleshoot this issue with no success and am in need of some help.
>
> When I try to start the subnet manager on the master node using the
> command,
>
> [user at server ~]# /etc/init.d/opensm start
>
> I get an error saying it failed to start and the following message
> gets logged in the log file.
>
> Sep 30 10:36:58 137756 [DE707700] 0x80 -> OpenSM 3.3.15 Entering
> DISCOVERING state
>
> Sep 30 10:36:58 144767 [DE707700] 0x02 -> osm_vendor_init: 1000 pending umads specified
> Sep 30 10:36:58 148482 [DE707700] 0x80 -> Entering DISCOVERING state
>
> No local ports detected!
> Sep 30 10:36:58 148959 [DE707700] 0x01 -> perfmgr_mad_unbind: ERR 5405: No previous bind
> Sep 30 10:36:58 148969 [DE707700] 0x01 -> osm_congestion_control_shutdown: ERR C108: No previous bind
> Sep 30 10:36:58 149163 [DE707700] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind
> Exiting SM
>
> The most curious thing is that the command ibstat returns nothing
> which is making it really hard for me to troubleshoot this issue.
> However trying it in debug mode gives the following output.
>
> [user at server ~] ibstat -dd
> ibwarn: [29989] umad_init: umad_init
> ibwarn: [29989] umad_get_cas_names: max 32
> ibwarn: [29989] umad_get_cas_names: return 0 cas
>
> I am more than willing to provide any other information you need to
> get to the bottom of it.
>
> Any help is greatly appreciated!
_______________________________________________
Users mailing list
Users at lists.openfabrics.org
http://lists.openfabrics.org/mailman/listinfo/users