[ofa-general] Multiple job failures at same time
Glanfield, Wayne
Wayne.Glanfield at uk.renaultf1.com
Wed Dec 3 09:37:33 PST 2008
We have just experienced a problem where 5 jobs failed at the same time (~15:50 GMT) with similar messages in their respective output files. Does anybody have any idea what could have caused this, and what the messages mean? One of the nodes, "cfd-cnsl-0364", was found to have shut down, but could this take out other jobs? They were not running on this node.
This is a commercial CFD code using HP-MPI 2.2.5; we are running OFED 1.3.1 and using the verbs API with Mellanox ConnectX HCAs.
Thanks
Wayne
JOB #1
starccm+: Rank 0:52: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:50: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:55: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:55: MPI_Test: self cfd-cnsl-0355 peer cfd-cnsl-0365 (rank: 126)
starccm+: Rank 0:55: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:52: MPI_Test: self cfd-cnsl-0355 peer cfd-cnsl-0365 (rank: 120)
starccm+: Rank 0:52: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:50: MPI_Test: self cfd-cnsl-0355 peer cfd-cnsl-0365 (rank: 127)
starccm+: Rank 0:50: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:38: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:35: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:33: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:35: MPI_Test: self cfd-cnsl-0352 peer cfd-cnsl-0364 (rank: 119)
starccm+: Rank 0:35: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:38: MPI_Test: self cfd-cnsl-0352 peer cfd-cnsl-0365 (rank: 121)
starccm+: Rank 0:38: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:33: MPI_Test: self cfd-cnsl-0352 peer cfd-cnsl-0365 (rank: 121)
starccm+: Rank 0:33: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:46: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:46: MPI_Test: self cfd-cnsl-0353 peer cfd-cnsl-0365 (rank: 123)
starccm+: Rank 0:46: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:87: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:80: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:87: MPI_Test: self cfd-cnsl-0360 peer cfd-cnsl-0365 (rank: 121)
starccm+: Rank 0:87: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:80: MPI_Test: self cfd-cnsl-0360 peer cfd-cnsl-0365 (rank: 124)
starccm+: Rank 0:80: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:126: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:122: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:122: MPI_Test: self cfd-cnsl-0365 peer cfd-cnsl-0364 (rank: 116)
starccm+: Rank 0:122: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:126: MPI_Test: self cfd-cnsl-0365 peer cfd-cnsl-0364 (rank: 117)
starccm+: Rank 0:126: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:124: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:124: MPI_Test: self cfd-cnsl-0365 peer cfd-cnsl-0350 (rank: 21)
starccm+: Rank 0:124: MPI_Test: error message: transport retry exceeded error
JOB #2
starccm+: Rank 0:138: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:138: MPI_Test: self cfd-cnsl-0143 peer cfd-cnsl-0343 (rank: 103)
starccm+: Rank 0:138: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:103: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:103: MPI_Test: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 138)
starccm+: Rank 0:103: MPI_Test: error message: transport retry exceeded error
Error: {'In': ['Machine::main', 'SimulationIterator::startIterating', 'SteadySolver::step', 'SegregatedFlowSolver::iterationUpdate'], 'Neo.Error': 'Error', 'Processor': 138, 'Severity': 'EXCEPTION', 'message': 'MPI Error : MPI_Test: Internal MPI error'}
Synchronizing parallel nodes (attempt 0)
starccm+: Rank 0:98: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:99: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:98: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 140)
starccm+: Rank 0:98: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:98: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:100: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:99: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0159 (rank: 179)
starccm+: Rank 0:99: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:99: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:101: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:98: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 140)
starccm+: Rank 0:98: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:99: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0159 (rank: 179)
starccm+: Rank 0:99: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:100: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 139)
starccm+: Rank 0:100: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:100: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:101: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0159 (rank: 179)
starccm+: Rank 0:101: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:101: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:100: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 139)
starccm+: Rank 0:100: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:100: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:101: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0159 (rank: 179)
starccm+: Rank 0:101: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:100: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 139)
starccm+: Rank 0:100: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:100: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:100: MPI_Cancel: self cfd-cnsl-0343 peer cfd-cnsl-0143 (rank: 136)
starccm+: Rank 0:100: MPI_Cancel: error message: transport retry exceeded error
JOB #3
starccm+: Rank 0:219: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:222: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:222: MPI_Test: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 212)
starccm+: Rank 0:222: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:219: MPI_Test: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 215)
starccm+: Rank 0:219: MPI_Test: error message: transport retry exceeded error
Error: {'In': ['Machine::main', 'SimulationIterator::startIterating', 'SteadySolver::step', 'SegregatedFlowSolver::iterationUpdate', 'AMGLinearSolver::solve'], 'Neo.Error': 'Error', 'Processor': 222, 'Severity': 'EXCEPTION', 'message': 'MPI Error : MPI_Test: Internal MPI error'}
Synchronizing parallel nodes (attempt 0)
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 215)
starccm+: Rank 0:219: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:222: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 214)
starccm+: Rank 0:219: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:222: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 212)
starccm+: Rank 0:222: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:222: MPI_Cancel: MPI BUG: no requests done
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 214)
starccm+: Rank 0:219: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 213)
starccm+: Rank 0:219: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 213)
starccm+: Rank 0:219: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 212)
starccm+: Rank 0:219: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:219: MPI_Cancel: self cfd-cnsl-0339 peer cfd-cnsl-0337 (rank: 212)
starccm+: Rank 0:219: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:219: MPI_Cancel: ibv_poll_cq(): bad status 12
JOB #4
starccm+: Rank 0:25: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:24: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:28: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:30: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:27: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:31: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:29: MPI_Waitall: ibv_poll_cq(): bad status 12
starccm+: Rank 0:30: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 47)
starccm+: Rank 0:30: MPI_Waitall: error message: transport retry exceeded error
starccm+: Rank 0:30: MPI_Allreduce: ibv_poll_cq(): bad status 5
starccm+: Rank 0:25: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 40)
starccm+: Rank 0:25: MPI_Waitall: error message: transport retry exceeded error
starccm+: Rank 0:25: MPI_Allreduce: ibv_poll_cq(): bad status 5
starccm+: Rank 0:28: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 44)
starccm+: Rank 0:28: MPI_Waitall: error message: transport retry exceeded error
starccm+: Rank 0:28: MPI_Allreduce: ibv_poll_cq(): bad status 5
starccm+: Rank 0:27: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 46)
starccm+: Rank 0:27: MPI_Waitall: error message: transport retry exceeded error
starccm+: Rank 0:27: MPI_Allreduce: ibv_poll_cq(): bad status 5
starccm+: Rank 0:31: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 46)
starccm+: Rank 0:31: MPI_Waitall: error message: transport retry exceeded error
starccm+: Rank 0:31: MPI_Allreduce: ibv_poll_cq(): bad status 5
starccm+: Rank 0:29: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 44)
starccm+: Rank 0:29: MPI_Waitall: error message: transport retry exceeded error
starccm+: Rank 0:29: MPI_Allreduce: ibv_poll_cq(): bad status 5
starccm+: Rank 0:24: MPI_Waitall: self cfd-cnsl-0376 peer cfd-cnsl-0369 (rank: 47
JOB #5
starccm+: Rank 0:6: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:4: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:3: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:61: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:60: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:119: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:6: MPI_Test: self cfd-cnsl-0541 peer cfd-cnsl-0341 (rank: 18)
starccm+: Rank 0:6: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:60: MPI_Test: self cfd-cnsl-0506 peer cfd-cnsl-0341 (rank: 22)
starccm+: Rank 0:60: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:4: MPI_Test: self cfd-cnsl-0541 peer cfd-cnsl-0341 (rank: 16)
starccm+: Rank 0:4: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:3: MPI_Test: self cfd-cnsl-0541 peer cfd-cnsl-0341 (rank: 22)
starccm+: Rank 0:3: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:61: MPI_Test: self cfd-cnsl-0506 peer cfd-cnsl-0341 (rank: 23)
starccm+: Rank 0:61: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:119: MPI_Test: self cfd-cnsl-0514 peer cfd-cnsl-0341 (rank: 22)
starccm+: Rank 0:119: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:38: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:98: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:47: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:38: MPI_Test: self cfd-cnsl-0502 peer cfd-cnsl-0341 (rank: 23)
starccm+: Rank 0:38: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:53: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:49: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:98: MPI_Test: self cfd-cnsl-0511 peer cfd-cnsl-0341 (rank: 23)
starccm+: Rank 0:98: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:75: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:47: MPI_Test: self cfd-cnsl-0503 peer cfd-cnsl-0341 (rank: 18)
starccm+: Rank 0:47: MPI_Test: error message: transport retry exceeded error
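To see which hosts the failing ranks name as their peer, the "self ... peer ..." lines can be tallied. The following is only a minimal sketch, assuming the output files are plain text in the format shown above; the function name count_peers and the three-line sample are illustrative and not part of HP-MPI or StarCCM+.

```python
import re
from collections import Counter

# Matches lines such as:
#   starccm+: Rank 0:55: MPI_Test: self cfd-cnsl-0355 peer cfd-cnsl-0365 (rank: 126)
PEER_LINE = re.compile(r"self (\S+) peer (\S+)")


def count_peers(log_text):
    """Count how often each host is named as the peer in an error line.

    Hypothetical helper for illustration; it only looks at the
    'self <host> peer <host>' lines and ignores everything else.
    """
    peers = Counter()
    for line in log_text.splitlines():
        match = PEER_LINE.search(line)
        if match:
            peers[match.group(2)] += 1
    return peers


# Three lines copied from the JOB #1 output, used here as sample input.
sample = """\
starccm+: Rank 0:55: MPI_Test: self cfd-cnsl-0355 peer cfd-cnsl-0365 (rank: 126)
starccm+: Rank 0:35: MPI_Test: self cfd-cnsl-0352 peer cfd-cnsl-0364 (rank: 119)
starccm+: Rank 0:38: MPI_Test: self cfd-cnsl-0352 peer cfd-cnsl-0365 (rank: 121)
"""

# Print peers from most to least frequently named.
for host, count in count_peers(sample).most_common():
    print(host, count)
```

Running this over each job's full output file would show whether the errors in a job point at one peer host or at several, which may help narrow down where to look.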
Regards
Wayne