[ofa-general] MPI_Test: ibv_poll_cq(): bad status 12

Glanfield, Wayne Wayne.Glanfield at uk.renaultf1.com
Fri Dec 12 09:04:12 PST 2008


I'm not sure whether this is the correct forum, but we are experiencing problems with IB when running a commercial CFD code: jobs crash with the errors below. Could someone explain the likely cause of these errors and how we can minimise their occurrence?

Thanks, Wayne

starccm+: Rank 0:172: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:172: MPI_Test: self cfd-cnsl-0230 peer cfd-cnsl-0144 (rank: 219)
starccm+: Rank 0:172: MPI_Test: error message: transport retry exceeded error

Error: {'In': ['Machine::main', 'SimulationIterator::startIterating', 'SteadySolver::step', 'SegregatedFlowSolver::iterationUpdate'], 'Neo.Error': 'Error', 'Processor': 172, 'Severity': 'EXCEPTION', 'message': 'MPI Error : MPI_Test: Internal MPI error'}
Synchronizing parallel nodes (attempt 0)


starccm+: Rank 0:71: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:68: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:71: MPI_Test: self cfd-cnsl-0196 peer cfd-cnsl-0214 (rank: 92)
starccm+: Rank 0:71: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:68: MPI_Test: self cfd-cnsl-0196 peer cfd-cnsl-0214 (rank: 93)
starccm+: Rank 0:68: MPI_Test: error message: transport retry exceeded error

Error: {'In': ['Machine::main', 'SimulationIterator::startIterating', 'SteadySolver::step', 'SegregatedFlowSolver::iterationUpdate', 'AMGLinearSolver::solve'], 'Neo.Error': 'Error', 'Processor': 71, 'Severity': 'EXCEPTION', 'message': 'MPI Error : MPI_Test: Internal MPI error'}
Synchronizing parallel nodes (attempt 0)
starccm+: Rank 0:68: MPI_Gather: ibv_poll_cq(): bad status 5
starccm+: Rank 0:68: MPI_Gather: self cfd-cnsl-0196 peer cfd-cnsl-0214 (rank: 93)
starccm+: Rank 0:68: MPI_Gather: error message: work request flushed error
starccm+: Rank 0:71: MPI_Gather: ibv_poll_cq(): bad status 12
starccm+: Rank 0:71: MPI_Gather: self cfd-cnsl-0196 peer cfd-cnsl-0214 (rank: 91)
starccm+: Rank 0:71: MPI_Gather: error message: transport retry exceeded error
/apps/CFD/CD-ADAPCO/Linux/starccm+3.04.008/star/bin/starenv: line 961:  5745 Segmentation fault      "$@"

starccm+: Rank 0:118: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:46: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:42: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:118: MPI_Test: self cfd-cnsl-0408 peer cfd-cnsl-0452 (rank: 229)
starccm+: Rank 0:118: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:42: MPI_Test: self cfd-cnsl-0271 peer cfd-cnsl-0452 (rank: 229)
starccm+: Rank 0:42: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:46: MPI_Test: self cfd-cnsl-0271 peer cfd-cnsl-0452 (rank: 228)
starccm+: Rank 0:46: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:86: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:87: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:93: MPI_Test: ibv_poll_cq(): bad status 12

starccm+: Rank 0:244: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:244: MPI_Test: self cfd-cnsl-0342 peer cfd-cnsl-0257 (rank: 26)
starccm+: Rank 0:244: MPI_Test: error message: transport retry exceeded error

Error: {'In': ['Machine::main', 'SimulationIterator::startIterating', 'SteadySolver::step', 'RsTurbSolver::iterationUpdate'], 'Neo.Error': 'Error', 'Processor': 244, 'Severity': 'EXCEPTION', 'message': 'MPI Error : MPI_Test: Internal MPI error'}
Synchronizing parallel nodes (attempt 0)
starccm+: Rank 0:26: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:26: MPI_Cancel: self cfd-cnsl-0257 peer cfd-cnsl-0342 (rank: 244)
starccm+: Rank 0:26: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:244: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:244: MPI_Cancel: self cfd-cnsl-0342 peer cfd-cnsl-0257 (rank: 26)
starccm+: Rank 0:244: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:244: MPI_Cancel: MPI BUG: no requests done
/apps/CFD/CD-ADAPCO/Linux/starccm+3.04.008/star/bin/starenv: line 961:  5729 Segmentation fault      "$@"
MPI Application rank 244 exited before MPI_Finalize() with status 139

hung

starccm+: Rank 0:58: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:57: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:57: MPI_Test: self cfd-cnsl-0401 peer cfd-cnsl-0448 (rank: 40)
starccm+: Rank 0:57: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:58: MPI_Test: self cfd-cnsl-0401 peer cfd-cnsl-0448 (rank: 42)
starccm+: Rank 0:58: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:72: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:72: MPI_Test: self cfd-cnsl-0371 peer cfd-cnsl-0277 (rank: 1)
starccm+: Rank 0:72: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:74: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:74: MPI_Test: self cfd-cnsl-0371 peer cfd-cnsl-0277 (rank: 1)
starccm+: Rank 0:74: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:75: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:75: MPI_Test: self cfd-cnsl-0371 peer cfd-cnsl-0448 (rank: 40)
starccm+: Rank 0:75: MPI_Test: error message: transport retry exceeded error

starccm+: Rank 0:26: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:29: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:29: MPI_Test: self cfd-cnsl-0349 peer cfd-cnsl-0418 (rank: 252)
starccm+: Rank 0:29: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:26: MPI_Test: self cfd-cnsl-0349 peer cfd-cnsl-0418 (rank: 254)
starccm+: Rank 0:26: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:134: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:129: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:135: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:131: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:130: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:134: MPI_Test: self cfd-cnsl-0386 peer cfd-cnsl-0418 (rank: 250)
starccm+: Rank 0:134: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:131: MPI_Test: self cfd-cnsl-0386 peer cfd-cnsl-0418 (rank: 255)
starccm+: Rank 0:131: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:130: MPI_Test: self cfd-cnsl-0386 peer cfd-cnsl-0418 (rank: 254)
starccm+: Rank 0:130: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:129: MPI_Test: self cfd-cnsl-0386 peer cfd-cnsl-0418 (rank: 254)
starccm+: Rank 0:129: MPI_Test: error message: transport retry exceeded error
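
To help narrow down which nodes are involved, output like the above can be tallied by status code and by peer host. The Python below is only a minimal sketch and was not part of the runs that produced these logs. The function name summarize and the two regular expressions are assumptions based solely on the line formats shown above, and the status names are the libibverbs enum names matching the two messages in the log (5, work request flushed; 12, transport retry exceeded). A host that shows up as the peer in reports from several different nodes may be worth checking first (cable, port, HCA), though a high count on its own does not prove a fault.

```python
import re
from collections import Counter

# Names from libibverbs' enum ibv_wc_status for the two codes seen in the log.
WC_STATUS = {5: "IBV_WC_WR_FLUSH_ERR", 12: "IBV_WC_RETRY_EXC_ERR"}

# Patterns are assumptions derived from the log line formats in this post.
STATUS_RE = re.compile(r"ibv_poll_cq\(\): bad status (\d+)")
PEER_RE = re.compile(r"self (\S+) peer (\S+) \(rank: (\d+)\)")


def summarize(lines):
    """Count bad-status codes and peer host names in MPI error output."""
    statuses = Counter()
    peers = Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            statuses[int(m.group(1))] += 1
            continue
        m = PEER_RE.search(line)
        if m:
            peers[m.group(2)] += 1  # group(2) is the peer host name
    return statuses, peers


# A few lines copied from the output above, used as a small sample.
SAMPLE = """\
starccm+: Rank 0:118: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:118: MPI_Test: self cfd-cnsl-0408 peer cfd-cnsl-0452 (rank: 229)
starccm+: Rank 0:42: MPI_Test: self cfd-cnsl-0271 peer cfd-cnsl-0452 (rank: 229)
starccm+: Rank 0:244: MPI_Cancel: ibv_poll_cq(): bad status 5
""".splitlines()

statuses, peers = summarize(SAMPLE)
for code, count in statuses.most_common():
    print(code, WC_STATUS.get(code, "unknown"), count)
for host, count in peers.most_common():
    print(host, count)
```

To run it over a full job log, pass the open file object to summarize in place of SAMPLE.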






