[Users] InfiniBand Troubleshooting

Lloyd Brown lloyd_brown at byu.edu
Mon Oct 6 10:15:08 PDT 2014


We've all typo'd something like that in the past.  If it weren't so
off-topic, I'm sure we could all relate at least a few stories along
those lines.

Glad you found the problem.
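
For anyone who finds this thread later: a misspelled parameter name on an
"options" line in /etc/modprobe.d/ can be caught before a reboot. Below is
a minimal sketch, not a tested tool. The helper name check_options is made
up for illustration, and the exact typo in this thread was never posted, so
the file contents in the usage comment (including the value 24) are
assumed. As far as I know, log_num_mtt is a parameter of the mlx4_core
module.

```shell
#!/bin/sh
# Sketch: flag parameter names on "options MODULE ..." lines that are not in
# a caller-supplied list of known parameters.
# check_options is a hypothetical helper name, not part of any package.
#
# usage: check_options CONF_FILE MODULE "param1 param2 ..."
# e.g.   check_options /etc/modprobe.d/libmlx4.conf mlx4_core "log_num_mtt"
#        against a file containing: options mlx4_core log_num_mtt=24
check_options() {
    conf=$1
    module=$2
    known=$3
    bad=0
    # Print every key=value token that follows "options MODULE" in the file.
    for token in $(awk -v m="$module" \
        '$1 == "options" && $2 == m { for (i = 3; i <= NF; i++) print $i }' \
        "$conf"); do
        name=${token%%=*}          # strip "=value", keep the parameter name
        case " $known " in
            *" $name "*) ;;        # name is in the known list
            *) echo "unknown parameter for $module: $name"; bad=1 ;;
        esac
    done
    return $bad
}
```

On a live system the known list could probably be built from
"modinfo -p mlx4_core | cut -d: -f1", since modinfo -p prints one
name:description line per parameter, but check that against your own
modinfo output first.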



Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu
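
P.S. The checks Ira walked through in the quoted messages (lspci to see
whether the card is on the bus, lsmod to see whether mlx4_ib is loaded) can
be rolled into one small function. This is only a sketch under
assumptions: diagnose_hca is a made-up name, and it reads saved command
output from files so it can be tried on a machine without an HCA.

```shell
#!/bin/sh
# Sketch: tell "card missing" apart from "card present, driver not loaded".
# diagnose_hca is a hypothetical helper name.
#
# usage: lspci > lspci.txt; lsmod > lsmod.txt
#        diagnose_hca lspci.txt lsmod.txt
diagnose_hca() {
    lspci_out=$1
    lsmod_out=$2
    # Step 1: is a Mellanox device visible on the PCI bus at all?
    if ! grep -q 'Mellanox' "$lspci_out"; then
        echo "no Mellanox device on the PCI bus: check seating and slot"
        return 2
    fi
    # Step 2: lsmod lists the module name in the first column.
    if ! awk '$1 == "mlx4_ib" { found = 1 } END { exit !found }' "$lsmod_out"; then
        echo "device present but mlx4_ib not loaded: check dmesg and /etc/modprobe.d"
        return 1
    fi
    echo "device present and mlx4_ib loaded"
    return 0
}
```

With the lspci and lsmod output quoted in this thread, it would take the
second branch, which matches where the problem turned out to be.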

On 10/06/2014 11:09 AM, Kenja, Krishna (kenjakt) wrote:
> Yes, the file I mentioned was /etc/modprobe.d/libmlx4.conf
> 
> It's embarrassing, actually. I edited this file because I was having a memory issue and added the log_num_mtt option. Only now did I realize that I made a spelling mistake, which caused the InfiniBand drivers to fail to load when I rebooted. I should be more careful in the future.
> 
> Regards
> Krishna
> ________________________________________
> From: users-bounces at lists.openfabrics.org <users-bounces at lists.openfabrics.org> on behalf of Lloyd Brown <lloyd_brown at byu.edu>
> Sent: Monday, October 6, 2014 1:01 PM
> To: users at lists.openfabrics.org
> Subject: Re: [Users] InfiniBand Troubleshooting
> 
> Krishna,
> 
> I assume this is a file in /etc/modprobe.d/ or similar.  For future
> readers of this email thread, can you explain what was wrong with that
> file, what you changed, etc.?  That way someone else who has a similar
> problem in the future, may be able to find the solution more easily.
> 
> Also, I agree with Ira; something had to have changed to cause this.
> Any idea what that might have been?  Did the host recently get rebooted?
>  Is it possible that someone installed a package (kernel, etc.) that
> might not have changed anything until that reboot?
> 
> 
> 
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
> 
> On 10/06/2014 08:48 AM, Kenja, Krishna (kenjakt) wrote:
>> I checked dmesg and it pointed me to the libmlx4.conf file. Fixing the contents of that file solved the problem.
>>
>> Thank you all for the help.
>>
>> Regards
>> Krishna
>> ________________________________________
>> From: Rupert Dance <rsdance at soft-forge.com>
>> Sent: Monday, October 6, 2014 10:26 AM
>> To: Kenja, Krishna (kenjakt); 'Weiny, Ira'; users at lists.openfabrics.org
>> Subject: RE: [Users] InfiniBand Troubleshooting
>>
>> Have you tried pulling the card and either re-seating it or putting it in another PCIe slot? If the card is then discovered, make sure you are running the latest firmware.
>>
>> -----Original Message-----
>> From: users-bounces at lists.openfabrics.org [mailto:users-bounces at lists.openfabrics.org] On Behalf Of Kenja, Krishna (kenjakt)
>> Sent: Monday, October 06, 2014 9:49 AM
>> To: Weiny, Ira; users at lists.openfabrics.org
>> Subject: Re: [Users] InfiniBand Troubleshooting
>>
>> The rdma service status says that it can't find any low-level hardware support loaded (this is supposed to be mlx4_ib). So I restarted the rdma service using
>>
>> service rdma stop
>> service rdma start
>>
>> But I still get the same result.
>>
>> service rdma status
>> Low level hardware support loaded:
>>         none found
>>
>> Upper layer protocol modules:
>>         ib_ipoib
>>
>> User space access modules:
>>         rdma_ucm ib_ucm ib_uverbs ib_umad
>>
>> Connection management modules:
>>         rdma_cm ib_cm iw_cm
>>
>> Configured IPoIB interfaces: none
>> Currently active IPoIB interfaces: none
>>
>> So what do you think is happening here?
>> ________________________________________
>> From: Weiny, Ira <ira.weiny at intel.com>
>> Sent: Sunday, October 5, 2014 7:30 PM
>> To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
>> Subject: RE: [Users] InfiniBand Troubleshooting
>>
>>> -----Original Message-----
>>> From: Kenja, Krishna (kenjakt) [mailto:kenjakt at mail.uc.edu]
>>>
>>> You are right, I don't see the device driver. Here is the output from
>>> "lsmod | grep ib"
>>>
>>> ib_ipoib               81001  0
>>> ib_ucm                 12121  0
>>> ib_uverbs              36124  2 rdma_ucm,ib_ucm
>>> ib_umad                11802  0
>>> ib_cm                  36580  3 ib_ipoib,ib_ucm,rdma_cm
>>> ib_addr                 6440  2 rdma_ucm,rdma_cm
>>> ib_sa                  23964  4 ib_ipoib,rdma_ucm,rdma_cm,ib_cm
>>> ib_mad                 39162  3 ib_umad,ib_cm,ib_sa
>>> ib_core                74355  10 ib_ipoib,rdma_ucm,ib_ucm,ib_uverbs,ib_umad,rdma_cm,ib_cm,iw_cm,ib_sa,ib_mad
>>> libfcoe                56791  2 bnx2fc,fcoe
>>> libfc                 108670  3 bnx2fc,fcoe,libfcoe
>>> scsi_transport_fc      55299  3 bnx2fc,fcoe,libfc
>>> ipv6                  317829  207 ib_ipoib,ib_addr,cnic
>>>
>>> And "lsmod | grep mlx" returned nothing.
>>>
>>> So how do you suggest I rectify this?
>>
>> Because you stated that this "used to work" I would suggest following whatever procedure you had before to load those drivers.
>>
>> With RHEL the start up script to load RDMA drivers is "rdma".  I would have to look up how other distros start the rdma stack.  OFED used to use openibd or something like that.
>>
>> Furthermore I suggest that "something" must have changed for the driver to now be failing.  Have you looked at dmesg or the syslog to see if the driver is trying to load and is getting some errors?
>>
>> Did you update your kernel?  Your distro?  Some OFED distro?
>>
>> Ira
>>
>>
>>
>>>
>>> Regards
>>> Krishna
>>> ________________________________________
>>> From: Weiny, Ira <ira.weiny at intel.com>
>>> Sent: Sunday, October 5, 2014 1:57 PM
>>> To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
>>> Subject: RE: [Users] InfiniBand Troubleshooting
>>>
>>> lsmod | grep mlx
>>>
>>> Or
>>>
>>> lsmod | grep ib
>>>
>>> Make sure you see the device driver (mlx4_ib  I think) and the ib_umad
>>> module.
>>>
>>> Ira
>>>
>>> ________________________________
>>> From: Kenja, Krishna (kenjakt)
>>> Sent: Sunday, October 05, 2014 10:51:35 AM
>>> To: Weiny, Ira; users at lists.openfabrics.org
>>> Subject: Re: [Users] InfiniBand Troubleshooting
>>>
>>> "lspci | grep Mell" returned "27:00.0 Network controller: Mellanox
>>> Technologies MT27500 Family [ConnectX-3]"
>>>
>>> How do I make sure that the driver for HCA is loaded properly?
>>>
>>> Regards
>>> Krishna
>>> ________________________________________
>>> From: Weiny, Ira <ira.weiny at intel.com>
>>> Sent: Sunday, October 5, 2014 1:47 PM
>>> To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
>>> Subject: RE: [Users] InfiniBand Troubleshooting
>>>
>>> If ibstat is not working are you sure the driver for your HCA is loaded properly?
>>>
>>>
>>> lspci and lsmod can help here.
>>>
>>>
>>> The output you show indicates no HCAs are present.
>>>
>>> Ira
>>>
>>>
>>> ________________________________
>>> From: users-bounces at lists.openfabrics.org on behalf of Kenja, Krishna
>>> (kenjakt)
>>> Sent: Sunday, October 05, 2014 10:35:02 AM
>>> To: users at lists.openfabrics.org
>>> Subject: [Users] InfiniBand Troubleshooting
>>>
>>>
>>> We have a Mellanox MT27500 Family, ConnectX-3 FDR InfiniBand card set
>>> up in the cluster. Everything was working fine until a week ago when
>>> InfiniBand suddenly stopped working for no apparent reason. I have
>>> been trying to troubleshoot this issue with no success and am in need
>>> of some help.
>>>
>>> When I try to start the subnet manager on the master node using the
>>> command,
>>>
>>> [user at server ~]# /etc/init.d/opensm start
>>>
>>> I get an error saying it failed to start, and the following message
>>> gets logged in the log file.
>>>
>>> Sep 30 10:36:58 137756 [DE707700] 0x80 -> OpenSM 3.3.15 Entering
>>> DISCOVERING state
>>>
>>> Sep 30 10:36:58 144767 [DE707700] 0x02 -> osm_vendor_init: 1000 pending umads specified
>>> Sep 30 10:36:58 148482 [DE707700] 0x80 -> Entering DISCOVERING state
>>>
>>> No local ports detected!
>>> Sep 30 10:36:58 148959 [DE707700] 0x01 -> perfmgr_mad_unbind: ERR 5405: No previous bind
>>> Sep 30 10:36:58 148969 [DE707700] 0x01 -> osm_congestion_control_shutdown: ERR C108: No previous bind
>>> Sep 30 10:36:58 149163 [DE707700] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind
>>> Exiting SM
>>>
>>> The most curious thing is that the command ibstat returns nothing,
>>> which is making it really hard for me to troubleshoot this issue.
>>> However, trying it in debug mode gives the following output.
>>>
>>> [user at server ~] ibstat -dd
>>> ibwarn: [29989] umad_init: umad_init
>>> ibwarn: [29989] umad_get_cas_names: max 32
>>> ibwarn: [29989] umad_get_cas_names: return 0 cas
>>>
>>> I am more than willing to provide any other information you need to
>>> get to the bottom of it.
>>>
>>> Any help is greatly appreciated!
>>
>> _______________________________________________
>> Users mailing list
>> Users at lists.openfabrics.org
>> http://lists.openfabrics.org/mailman/listinfo/users
>>
> 


