[Users] Question about CQ overrun

Moye,Roger V RVMoye at mdanderson.org
Tue Mar 29 14:45:10 PDT 2016


We have been running our HPC cluster on RHEL 6.5 with OFED 2.4 for several months. Suddenly we are seeing errors like this on our compute nodes:

Mar 29 09:29:17 cnode301 kernel: mlx4_core 0000:87:00.0: CQ overrun on CQN 00009c
Mar 29 09:30:54 cnode301 kernel: ib0: timing out; 1 sends not completed still waiting..
Mar 29 09:30:59 cnode301 kernel: ib0: timing out; 1 sends not completed still waiting..
Mar 29 09:31:04 cnode301 kernel: ib0: timing out; 1 sends not completed still waiting..
Mar 29 09:31:09 cnode301 kernel: ib0: timing out; 1 sends not completed still waiting..
Mar 29 09:31:14 cnode301 kernel: ib0: timing out; 1 sends not completed still waiting..
Mar 29 09:31:14 cnode301 kernel: ib0: ipoib_cm_tx_destroy: 1 not completed force cleanup.

It is at this point that the compute node has to be rebooted.
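
To check how widespread the pattern is across nodes, the two message types above can be counted per host. The following is only a minimal sketch: the function name `summarize` and both regular expressions are assumptions derived solely from the log lines quoted in this message, and other driver versions may word these messages differently.

```python
import re
from collections import defaultdict

# Patterns are based only on the log lines quoted above (assumption:
# they may need adjusting for other kernel or driver versions).
CQ_OVERRUN = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+kernel:.*CQ overrun on CQN\s+(\w+)"
)
SEND_TIMEOUT = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+kernel:\s+(\S+): timing out; "
    r"(\d+) sends not completed"
)


def summarize(lines):
    """Collect CQ overrun CQNs and count IPoIB send timeouts per host."""
    summary = defaultdict(lambda: {"cq_overruns": [], "send_timeouts": 0})
    for line in lines:
        m = CQ_OVERRUN.search(line)
        if m:
            # group(1) is the hostname, group(2) the reported CQ number
            summary[m.group(1)]["cq_overruns"].append(m.group(2))
            continue
        m = SEND_TIMEOUT.search(line)
        if m:
            summary[m.group(1)]["send_timeouts"] += 1
    return dict(summary)
```

It could be run against a collected syslog file, for example `summarize(open("/var/log/messages"))`, where the path is an assumption about where kernel messages are logged on the node.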

The user is not running MPI code, though presumably he is doing I/O to the cluster filesystem, which is mounted over the InfiniBand network. The particular application that is running seems to provoke this error more often than anything else, so we assume the app is hitting a bug somewhere within our cluster config (firmware, filesystem, OS, or OFED stack).

Is an error like this a result of a driver issue, firmware issue, or something else?

Any suggestions on where to look to find the problem would be appreciated.

Thanks so much!
-Roger

===============================
Roger V. Moye
UNIX Systems Administrator
XSEDE Campus Champion
University of Texas - MD Anderson Cancer Center
Research Information Systems and Technology Services
Houston, Texas
1MC 13.2430
(713) 792-2134
===============================