[ofa-general] RE: uDAPL question
Boris Shpolyansky
boris at mellanox.com
Sun Mar 11 16:43:20 PDT 2007
On the other hand, after reviewing the source code, it seems that
DAT_CONNECTION_EVENT_BROKEN
is returned when a connection fails to be established, so this looks
more like a CM (connection manager) issue.
Any suggestions on how to debug this?
Thanks,
Boris.
________________________________
From: Boris Shpolyansky
Sent: Friday, March 09, 2007 8:40 PM
To: 'general at lists.openfabrics.org'
Subject: uDAPL question
Hi,
I'm trying to get a simple Intel MPI benchmark running over IB (uDAPL)
using the OFED-1.1 stack.
I'm consistently getting the following error:
[root@ibd005 ~]# ./runjob_I_MPI.boris 2
Task 0 of 2 tasks started on host ibd005.ibd.mti.com
clock_resolution = 1.00e-06 s
Task 1 of 2 tasks started on host ibd006.ibd.mti.com
[0:ibd005] unexpected DAPL event 4006 from 1:ibd006
[1:ibd006] unexpected DAPL event 4006 from 0:ibd005
rank 0 in job 14 ibd005_36193 caused collective abort of all ranks
exit status of rank 0: return code 254
I did some digging and found that event 4006 (actually 0x4006) is
DAT_CONNECTION_EVENT_BROKEN
and that it is returned by the function dat_rmr_bind.
So my question is: why does this function consistently fail?
I'm using the standard dat.conf file:
OpenIB-cma u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
Appreciate your help,
Boris Shpolyansky