[Users] InfiniBand Troubleshooting
Lloyd Brown
lloyd_brown at byu.edu
Mon Oct 6 10:01:29 PDT 2014
Krishna,
I assume this is a file in /etc/modprobe.d/ or similar. For future
readers of this email thread, can you explain what was wrong with that
file and what you changed? That way, someone else who has a similar
problem in the future may be able to find the solution more easily.
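For anyone finding this thread later, here is a minimal sketch of how one might locate the relevant file and the kernel messages that point to it. It assumes the file lives under /etc/modprobe.d/ (not confirmed in this thread) and that the driver messages mention "mlx4". The function names are made up for illustration.

```shell
#!/bin/sh
# List config lines that mention mlx4, with file name and line number.
# The directory defaults to /etc/modprobe.d; that location is an assumption,
# so pass another directory as the first argument if needed.
find_mlx4_conf() {
    dir="${1:-/etc/modprobe.d}"
    grep -rn 'mlx4' "$dir" 2>/dev/null
}

# Filter kernel log text on stdin down to lines that mention mlx4 or modprobe.
mlx4_log_lines() {
    grep -iE 'mlx4|modprobe'
}

# Typical use on the affected host (reading the kernel log may need root):
#   find_mlx4_conf
#   dmesg | mlx4_log_lines
```

If the first command lists a file and the second shows a parse or load error naming the same file, that file is the place to start looking.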
Also, I agree with Ira; something had to have changed to cause this.
Any idea what that might have been? Did the host recently get rebooted?
Is it possible that someone installed a package (kernel, etc.) that
might not have changed anything until that reboot?
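A rough sketch of how one could check for that kind of change, assuming an RPM-based host (the use of "service rdma" suggests RHEL, but that is a guess). The package-name prefixes in the filter are my own pick and probably incomplete, and the function name is hypothetical.

```shell
#!/bin/sh
# Filter a package list on stdin down to entries likely to affect the
# InfiniBand stack. The prefix list is an assumption, not a complete inventory.
ib_related_packages() {
    grep -iE '^(kernel|kmod|rdma|libmlx4|libibverbs|opensm|ofed)'
}

# Typical use on an RPM-based host:
#   rpm -qa --last | ib_related_packages | head -n 20   # newest installs first
#   who -b      # time of the last boot
#   uname -r    # kernel that is actually running now
```

A package installed before the last boot but after the last time InfiniBand was known to work would be a good suspect.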
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu
On 10/06/2014 08:48 AM, Kenja, Krishna (kenjakt) wrote:
> I checked dmesg and it pointed me to libmlx4.conf file. Fixing the contents of that file solved the problem.
>
> Thank you all for the help.
>
> Regards
> Krishna
> ________________________________________
> From: Rupert Dance <rsdance at soft-forge.com>
> Sent: Monday, October 6, 2014 10:26 AM
> To: Kenja, Krishna (kenjakt); 'Weiny, Ira'; users at lists.openfabrics.org
> Subject: RE: [Users] InfiniBand Troubleshooting
>
> Have you tried pulling the card and either re-seating it or putting it in another PCIe slot? If the card is then discovered, make sure you are running the latest firmware.
>
> -----Original Message-----
> From: users-bounces at lists.openfabrics.org [mailto:users-bounces at lists.openfabrics.org] On Behalf Of Kenja, Krishna (kenjakt)
> Sent: Monday, October 06, 2014 9:49 AM
> To: Weiny, Ira; users at lists.openfabrics.org
> Subject: Re: [Users] InfiniBand Troubleshooting
>
> RDMA status says that it can't find any low level hardware support loaded (this is supposed to be mlx4_ib). So I restarted the rdma service using
>
> service rdma stop
> service rdma start
>
> But I still get the same result.
>
> service rdma status
> Low level hardware support loaded:
> none found
>
> Upper layer protocol modules:
> ib_ipoib
>
> User space access modules:
> rdma_ucm ib_ucm ib_uverbs ib_umad
>
> Connection management modules:
> rdma_cm ib_cm iw_cm
>
> Configured IPoIB interfaces: none
> Currently active IPoIB interfaces: none
>
> So what do you think is happening here?
> ________________________________________
> From: Weiny, Ira <ira.weiny at intel.com>
> Sent: Sunday, October 5, 2014 7:30 PM
> To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
> Subject: RE: [Users] InfiniBand Troubleshooting
>
>> -----Original Message-----
>> From: Kenja, Krishna (kenjakt) [mailto:kenjakt at mail.uc.edu]
>>
>> You are right, I don't see the device driver. Here is the output from
>> "lsmod | grep ib"
>>
>> ib_ipoib 81001 0
>> ib_ucm 12121 0
>> ib_uverbs 36124 2 rdma_ucm,ib_ucm
>> ib_umad 11802 0
>> ib_cm 36580 3 ib_ipoib,ib_ucm,rdma_cm
>> ib_addr 6440 2 rdma_ucm,rdma_cm
>> ib_sa 23964 4 ib_ipoib,rdma_ucm,rdma_cm,ib_cm
>> ib_mad 39162 3 ib_umad,ib_cm,ib_sa
>> ib_core 74355 10 ib_ipoib,rdma_ucm,ib_ucm,ib_uverbs,ib_umad,rdma_cm,ib_cm,iw_cm,ib_sa,ib_mad
>> libfcoe 56791 2 bnx2fc,fcoe
>> libfc 108670 3 bnx2fc,fcoe,libfcoe
>> scsi_transport_fc 55299 3 bnx2fc,fcoe,libfc
>> ipv6 317829 207 ib_ipoib,ib_addr,cnic
>>
>> And "lsmod | grep mlx" returned nothing.
>>
>> So how do you suggest I rectify this?
>
> Because you stated that this "used to work", I would suggest following whatever procedure you used before to load those drivers.
>
> On RHEL, the startup script that loads the RDMA drivers is "rdma". I would have to look up how other distros start the rdma stack. OFED used to use openibd or something like that.
>
> Furthermore I suggest that "something" must have changed for the driver to now be failing. Have you looked at dmesg or the syslog to see if the driver is trying to load and is getting some errors?
>
> Did you update your kernel? Your distro? Some OFED distro?
>
> Ira
>
>
>
>>
>> Regards
>> Krishna
>> ________________________________________
>> From: Weiny, Ira <ira.weiny at intel.com>
>> Sent: Sunday, October 5, 2014 1:57 PM
>> To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
>> Subject: RE: [Users] InfiniBand Troubleshooting
>>
>> lsmod | grep mlx
>>
>> Or
>>
>> lsmod | grep ib
>>
>> Make sure you see the device driver (mlx4_ib I think) and the ib_umad
>> module.
>>
>> Ira
>>
>> ________________________________
>> From: Kenja, Krishna (kenjakt)
>> Sent: Sunday, October 05, 2014 10:51:35 AM
>> To: Weiny, Ira; users at lists.openfabrics.org
>> Subject: Re: [Users] InfiniBand Troubleshooting
>>
>> "lspci | grep Mell" returned "27:00.0 Network controller: Mellanox
>> Technologies MT27500 Family [ConnectX-3]"
>>
>> How do I make sure that the driver for HCA is loaded properly?
>>
>> Regards
>> Krishna
>> ________________________________________
>> From: Weiny, Ira <ira.weiny at intel.com>
>> Sent: Sunday, October 5, 2014 1:47 PM
>> To: Kenja, Krishna (kenjakt); users at lists.openfabrics.org
>> Subject: RE: [Users] InfiniBand Troubleshooting
>>
>> If ibstat is not working, are you sure the driver for your HCA is loaded properly?
>>
>> lspci and lsmod can help here.
>>
>> The output you show indicates that no HCAs are present.
>>
>> Ira
>>
>>
>> ________________________________
>> From: users-bounces at lists.openfabrics.org on behalf of Kenja, Krishna
>> (kenjakt)
>> Sent: Sunday, October 05, 2014 10:35:02 AM
>> To: users at lists.openfabrics.org
>> Subject: [Users] InfiniBand Troubleshooting
>>
>>
>> We have a Mellanox MT27500 Family, ConnectX-3 FDR InfiniBand card set
>> up in the cluster. Everything was working fine until a week ago when
>> InfiniBand suddenly stopped working for no apparent reason. I have
>> been trying to troubleshoot this issue with no success and am in need of some help.
>>
>> When I try to start the subnet manager on the master node using the
>> command
>>
>> [user@server ~]# /etc/init.d/opensm start
>>
>> I get an error saying it failed to start, and the following message
>> gets logged in the log file.
>>
>> Sep 30 10:36:58 137756 [DE707700] 0x80 -> OpenSM 3.3.15 Entering DISCOVERING state
>>
>> Sep 30 10:36:58 144767 [DE707700] 0x02 -> osm_vendor_init: 1000 pending umads specified
>> Sep 30 10:36:58 148482 [DE707700] 0x80 -> Entering DISCOVERING state
>>
>> No local ports detected!
>> Sep 30 10:36:58 148959 [DE707700] 0x01 -> perfmgr_mad_unbind: ERR 5405: No previous bind
>> Sep 30 10:36:58 148969 [DE707700] 0x01 -> osm_congestion_control_shutdown: ERR C108: No previous bind
>> Sep 30 10:36:58 149163 [DE707700] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind
>> Exiting SM
>>
>> The most curious thing is that the ibstat command returns nothing,
>> which is making it really hard for me to troubleshoot this issue.
>> However, running it in debug mode gives the following output.
>>
>> [user@server ~] ibstat -dd
>> ibwarn: [29989] umad_init: umad_init
>> ibwarn: [29989] umad_get_cas_names: max 32
>> ibwarn: [29989] umad_get_cas_names: return 0 cas
>>
>> I am more than willing to provide any other information you need to
>> get to the bottom of it.
>>
>> Any help is greatly appreciated!
>
> _______________________________________________
> Users mailing list
> Users at lists.openfabrics.org
> http://lists.openfabrics.org/mailman/listinfo/users
>