[Users] mthca lockup

Orion Poplawski orion at cora.nwra.com
Thu Jan 2 10:15:23 PST 2014


On 09/30/2013 12:33 PM, Bart Van Assche wrote:
> On 09/23/13 20:37, Orion Poplawski wrote:
>> On 09/23/2013 12:00 PM, Rupert Dance wrote:
>>> When OFA software is installed from the OFED distribution, it includes a
>>> utility called "ofed_info" that prints a lot of data about what was
>>> installed. A simpler form, "ofed_info -s", gives just the version. Things
>>> may be slightly different in the packaging from various distros.
>>>
>>> The reason I asked about the version is that OFED 3.5-2 includes an updated
>>> version of the mthca module, so I was curious whether this could be related.
>>> If you want to try the latest build from the OFA, you can find it at the
>>> link below, but be aware that you can get conflicts between the distro
>>> version of the OFA software and OFED itself. So try to remove all support
>>> for OFED before you install the 3.5-2 package. If this is a production
>>> cluster, it may be best to try it on a test cluster first.
>>>
>>> http://www.openfabrics.org/downloads/OFED/ofed-3.5-2/OFED-3.5-2-rc1.tgz
>>>
>>
>> Thanks, but I don't see any evidence that 3.5-2 actually has an updated
>> libmthca.  It seems to have libmthca-1.0.6-1.src.rpm which seems to be
>> the same version I have via the distro.
>>
>> The release notes indicate an updated libmthca compared to 3.5-1, but
>> this appears to be a mistake.  It is updated compared to 3.5 though.
>>
>> Also, apparently err -16 indicates EBUSY so perhaps the hardware had
>> locked up somehow.
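
The mapping from "err -16" to EBUSY mentioned above can be checked with a
short Python sketch. It assumes the driver follows the usual Linux kernel
convention of returning a negated errno value; the helper name
describe_kernel_error is made up for illustration and is not part of any
OFED tool.

```python
import errno
import os


def describe_kernel_error(ret: int) -> str:
    """Translate a negative kernel-style return code into an errno name.

    Assumes the value is a negated errno, as is conventional in Linux
    drivers. Codes not known to this platform are reported as UNKNOWN.
    """
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{ret}: {name} ({os.strerror(code)})"


if __name__ == "__main__":
    # On Linux, errno 16 is EBUSY.
    print(describe_kernel_error(-16))
```

The symbolic name only says that the device or resource was reported as
busy; it does not by itself show whether the hardware was locked up.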
> 
> It might be a good idea to log queue pair numbers just after a queue pair has
> been created and just before a queue pair is destroyed. That will make it
> possible to figure out whether queue pair numbers are reused too quickly. A patch
> that resolved a similar issue for the mlx4 driver (but that is not in RHEL 6.4
> AFAIK) can be found here:
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=f4ec9e9531ac79ee2521faf7ad3d98978f747e42.
> 
> 
> Bart.
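
Once queue pair numbers are logged at creation and destruction as suggested
above, checking for quick reuse could be sketched roughly as follows. The
event tuples, the function name find_fast_reuse, and the min_gap threshold
are all assumptions for illustration; the real log format depends on how the
logging is added to the driver or application.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical event record: (timestamp in seconds, "create" or "destroy", QPN).
Event = Tuple[float, str, int]


def find_fast_reuse(events: Iterable[Event], min_gap: float) -> List[Tuple[int, float]]:
    """Return (qpn, gap) pairs where a QPN was recreated within min_gap
    seconds of its previous destruction."""
    last_destroy: Dict[int, float] = {}
    flagged: List[Tuple[int, float]] = []
    # Sort by timestamp only; the stable sort keeps input order for ties.
    for ts, kind, qpn in sorted(events, key=lambda e: e[0]):
        if kind == "destroy":
            last_destroy[qpn] = ts
        elif kind == "create" and qpn in last_destroy:
            gap = ts - last_destroy.pop(qpn)
            if gap < min_gap:
                flagged.append((qpn, gap))
    return flagged


if __name__ == "__main__":
    sample = [
        (0.0, "create", 0x404),
        (5.0, "destroy", 0x404),
        (5.001, "create", 0x404),   # reused about 1 ms after destruction
        (9.0, "destroy", 0x405),
        (20.0, "create", 0x405),    # reused after 11 s
    ]
    for qpn, gap in find_fast_reuse(sample, min_gap=1.0):
        print(f"QPN {qpn:#x} reused after {gap:.3f} s")
```

What counts as "too quickly" is not stated in the thread, so min_gap is left
as a parameter to be chosen by whoever inspects the log.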

Thanks, but it does appear that that change is in the 2.6.32-358 kernel.

Luckily, I have not seen this lockup since.


-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion at nwra.com
Boulder, CO 80301                   http://www.nwra.com


