[ofa-general] Multiple job failures at same time

Glanfield, Wayne Wayne.Glanfield at uk.renaultf1.com
Wed Dec 3 09:37:33 PST 2008


We have just experienced a problem where 5 jobs failed at the same time (~15:50 GMT) with similar messages in their respective output files. Does anybody have any idea what could have caused this, and what the messages mean? One of the nodes, "cfd-cnsl-0364", was found to have shut down, but could this take out other jobs? They were not running on this node.

This is a commercial CFD code using HP-MPI 2.2.5. We are running OFED 1.3.1 and using the verbs API with Mellanox ConnectX HCAs.
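
For reference when reading the output below: the numbers after "bad status" appear to be libibverbs work-completion codes (enum ibv_wc_status), where 12 is IBV_WC_RETRY_EXC_ERR and 5 is IBV_WC_WR_FLUSH_ERR. That matches the "transport retry exceeded" and "work request flushed" text printed next to them. A minimal Python sketch for decoding them follows; the names `WC_STATUS` and `decode_status` are made up for illustration, and the table deliberately covers only the two codes seen in these logs.

```python
import re

# Subset of libibverbs' `enum ibv_wc_status`; only the two values that
# appear in the job output are listed (assumption: no others are needed).
WC_STATUS = {
    5: "IBV_WC_WR_FLUSH_ERR",    # "work request flushed error"
    12: "IBV_WC_RETRY_EXC_ERR",  # "transport retry exceeded error"
}

# Matches the "ibv_poll_cq(): bad status N" fragment of an HP-MPI log line.
BAD_STATUS = re.compile(r"ibv_poll_cq\(\): bad status (\d+)")


def decode_status(line):
    """Return the symbolic status name for a log line, or None if absent."""
    match = BAD_STATUS.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return WC_STATUS.get(code, f"unknown ({code})")


print(decode_status("starccm+: Rank 0:52: MPI_Test: ibv_poll_cq(): bad status 12"))
```

Status 5 normally follows a status 12 on the same connection: once the queue pair has gone into the error state, the remaining outstanding work requests are flushed.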

Thanks
Wayne


JOB #1
starccm+: Rank 0:52: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:50: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:55: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:55: MPI_Test: self cfd-cnsl-0355 peer cfd-cnsl-0365 (rank: 126)
starccm+: Rank 0:55: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:52: MPI_Test: self cfd-cnsl-0355 peer cfd-cnsl-0365 (rank: 120)
starccm+: Rank 0:52: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:50: MPI_Test: self cfd-cnsl-0355 peer cfd-cnsl-0365 (rank: 127)
starccm+: Rank 0:50: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:38: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:35: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:33: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:35: MPI_Test: self cfd-cnsl-0352 peer cfd-cnsl-0364 (rank: 119)
starccm+: Rank 0:35: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:38: MPI_Test: self cfd-cnsl-0352 peer cfd-cnsl-0365 (rank: 121)
starccm+: Rank 0:38: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:33: MPI_Test: self cfd-cnsl-0352 peer cfd-cnsl-0365 (rank: 121)
starccm+: Rank 0:33: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:46: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:46: MPI_Test: self cfd-cnsl-0353 peer cfd-cnsl-0365 (rank: 123)
starccm+: Rank 0:46: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:87: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:80: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:87: MPI_Test: self cfd-cnsl-0360 peer cfd-cnsl-0365 (rank: 121)
starccm+: Rank 0:87: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:80: MPI_Test: self cfd-cnsl-0360 peer cfd-cnsl-0365 (rank: 124)
starccm+: Rank 0:80: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:126: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:122: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:122: MPI_Test: self cfd-cnsl-0365 peer cfd-cnsl-0364 (rank: 116)
starccm+: Rank 0:122: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:126: MPI_Test: self cfd-cnsl-0365 peer cfd-cnsl-0364 (rank: 117)
starccm+: Rank 0:126: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:124: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:124: MPI_Test: self cfd-cnsl-0365 peer cfd-cnsl-0350 (rank: 21)
starccm+: Rank 0:124: MPI_Test: error message: transport retry exceeded error

JOB #2
starccm+: Rank 0:138: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:138: MPI_Test: self cfd-cnsl-0143 peer cfd-cnsl-0343 (rank: 103)
starccm+: Rank 0:138: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:103: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:103: MPI_Test: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 138)
starccm+: Rank 0:103: MPI_Test: error message: transport retry exceeded error

Error: {'In': ['Machine::main', 'SimulationIterator::startIterating', 'SteadySolver::step', 'SegregatedFlowSolver::iterationUpdate'], 'Neo.Error': 'Error', 'Processor': 138, 'Severity': 'EXCEPTION', 'message': 'MPI Error : MPI_Test: Internal MPI error'}
Synchronizing parallel nodes (attempt 0)
starccm+: Rank 0:98: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:99: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:98: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 140)
starccm+: Rank 0:98: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:98: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:100: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:99: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0159 (rank: 179)
starccm+: Rank 0:99: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:99: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:101: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:98: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 140)
starccm+: Rank 0:98: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:99: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0159 (rank: 179)
starccm+: Rank 0:99: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:100: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 139)
starccm+: Rank 0:100: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:100: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:101: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0159 (rank: 179)
starccm+: Rank 0:101: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:101: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:100: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 139)
starccm+: Rank 0:100: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:100: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:101: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0159 (rank: 179)
starccm+: Rank 0:101: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:100: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 139)
starccm+: Rank 0:100: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:100: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:100: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 136)
starccm+: Rank 0:100: MPI_Cancel: error message: transport retry exceeded error

JOB #3
starccm+: Rank 0:219: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:222: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:222: MPI_Test: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 212)
starccm+: Rank 0:222: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:219: MPI_Test: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 215)
starccm+: Rank 0:219: MPI_Test: error message: transport retry exceeded error

Error: {'In': ['Machine::main', 'SimulationIterator::startIterating', 'SteadySolver::step', 'SegregatedFlowSolver::iterationUpdate', 'AMGLinearSolver::solve'], 'Neo.Error': 'Error', 'Processor': 222, 'Severity': 'EXCEPTION', 'message': 'MPI Error : MPI_Test: Internal MPI error'}
Synchronizing parallel nodes (attempt 0)
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 215)
starccm+: Rank 0:219: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:222: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 214)
starccm+: Rank 0:219: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:222: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 212)
starccm+: Rank 0:222: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:222: MPI_Cancel: MPI BUG: no requests done
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 214)
starccm+: Rank 0:219: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 213)
starccm+: Rank 0:219: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 213)
starccm+: Rank 0:219: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 212)
starccm+: Rank 0:219: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 212)
starccm+: Rank 0:219: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 12

JOB #4
starccm+: Rank 0:25: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:24: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:28: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:30: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:27: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:31: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:29: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:30: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 47)
starccm+: Rank 0:30: MPI_Waitall: error message: transport retry exceeded error
starccm+: Rank 0:30: MPI_Allreduce: ibv_poll_cq(): bad status 5
starccm+: Rank 0:25: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 40)
starccm+: Rank 0:25: MPI_Waitall: error message: transport retry exceeded error
starccm+: Rank 0:25: MPI_Allreduce: ibv_poll_cq(): bad status 5
starccm+: Rank 0:28: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 44)
starccm+: Rank 0:28: MPI_Waitall: error message: transport retry exceeded error
starccm+: Rank 0:28: MPI_Allreduce: ibv_poll_cq(): bad status 5
starccm+: Rank 0:27: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 46)
starccm+: Rank 0:27: MPI_Waitall: error message: transport retry exceeded error
starccm+: Rank 0:27: MPI_Allreduce: ibv_poll_cq(): bad status 5
starccm+: Rank 0:31: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 46)
starccm+: Rank 0:31: MPI_Waitall: error message: transport retry exceeded error
starccm+: Rank 0:31: MPI_Allreduce: ibv_poll_cq(): bad status 5
starccm+: Rank 0:29: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 44)
starccm+: Rank 0:29: MPI_Waitall: error message: transport retry exceeded error
starccm+: Rank 0:29: MPI_Allreduce: ibv_poll_cq(): bad status 5
starccm+: Rank 0:24: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 47

JOB #5
starccm+: Rank 0:6: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:4: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:3: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:61: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:60: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:119: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:6: MPI_Test: self cfd-cnsl-0541 peer cfd-cnsl-0341 (rank: 18)
starccm+: Rank 0:6: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:60: MPI_Test: self cfd-cnsl-0506 peer cfd-cnsl-0341 (rank: 22)
starccm+: Rank 0:60: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:4: MPI_Test: self cfd-cnsl-0541 peer cfd-cnsl-0341 (rank: 16)
starccm+: Rank 0:4: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:3: MPI_Test: self cfd-cnsl-0541 peer cfd-cnsl-0341 (rank: 22)
starccm+: Rank 0:3: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:61: MPI_Test: self cfd-cnsl-0506 peer cfd-cnsl-0341 (rank: 23)
starccm+: Rank 0:61: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:119: MPI_Test: self cfd-cnsl-0514 peer cfd-cnsl-0341 (rank: 22)
starccm+: Rank 0:119: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:38: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:98: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:47: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:38: MPI_Test: self cfd-cnsl-0502 peer cfd-cnsl-0341 (rank: 23)
starccm+: Rank 0:38: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:53: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:49: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:98: MPI_Test: self cfd-cnsl-0511 peer cfd-cnsl-0341 (rank: 23)
starccm+: Rank 0:98: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:75: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:47: MPI_Test: self cfd-cnsl-0503 peer cfd-cnsl-0341 (rank: 18)
starccm+: Rank 0:47: MPI_Test: error message: transport retry exceeded error
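
One way to look for a common cause across the five jobs is to count how often each node is named as the unreachable "peer". In the JOB #5 excerpt above, for example, every reported peer is cfd-cnsl-0341. The Python sketch below is only an illustration: `tally_peers` and `PEER_LINE` are hypothetical names, and the sample text is three lines copied from JOB #5.

```python
import re
from collections import Counter

# Captures the local node, the peer node and the peer rank from lines such as
# "... MPI_Test: self cfd-cnsl-0541 peer cfd-cnsl-0341 (rank: 18)".
# The closing parenthesis is not required, so a truncated line still matches.
PEER_LINE = re.compile(r"self (\S+) peer (\S+) \(rank: (\d+)")


def tally_peers(log_text):
    """Count how many times each node is reported as the failing peer."""
    peers = Counter()
    for line in log_text.splitlines():
        match = PEER_LINE.search(line)
        if match:
            peers[match.group(2)] += 1
    return peers


sample = """\
starccm+: Rank 0:6: MPI_Test: self cfd-cnsl-0541 peer cfd-cnsl-0341 (rank: 18)
starccm+: Rank 0:60: MPI_Test: self cfd-cnsl-0506 peer cfd-cnsl-0341 (rank: 22)
starccm+: Rank 0:4: MPI_Test: self cfd-cnsl-0541 peer cfd-cnsl-0341 (rank: 16)
"""

print(tally_peers(sample).most_common())  # [('cfd-cnsl-0341', 3)]
```

Running this over each job's full output file would show whether one or two nodes dominate the peer column for that job.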



Regards
Wayne




