[ofa-general] uDAPL question

Woodruff, Robert J robert.j.woodruff at intel.com
Wed Apr 4 12:58:55 PDT 2007


Yes, this is a problem that I have also seen and that we are
investigating. It looks like it is hung waiting for a connection,
perhaps because of a problem with running 32-bit applications using
the rdma_cm. Please open a bug against this and assign it to Arlin,
and he will work with Sean to debug the problem.

woody
 

-----Original Message-----
From: Yong Qin [mailto:yong.qin at qlogic.com] 
Sent: Wednesday, April 04, 2007 12:43 PM
To: Woodruff, Robert J
Cc: general at lists.openfabrics.org
Subject: RE: [ofa-general] uDAPL question

Thanks for the tip, woody. The bug is gone in OFED 1.2. However, we are
still experiencing other issues here.

Let me explain. We are trying to run both 32-bit and 64-bit applications
on an Opteron cluster with RHEL 4U4. When we were testing 64-bit
applications on OFED 1.2 beta1, uDAPL worked fine. However, when we
switched to 32-bit applications, it hung in the RDMA progress engine:

0: [0] MPIDI_CH3_RDMA_Progress(): entering rdma progress engine, blocking=true
1: [1] MPIDI_CH3_RDMA_Progress(): entering rdma progress engine, blocking=true

With the nightly build 20070404, both 32-bit and 64-bit applications hung
in RDMA_init. All the testing was done with Intel MPI 3.0.

Any thoughts?

Thanks again,

Yong

 

-----Original Message-----
From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] 
Sent: Tuesday, April 03, 2007 5:59 PM
To: Yong Qin; Boris Shpolyansky; Hefty, Sean
Cc: general at lists.openfabrics.org
Subject: RE: [ofa-general] uDAPL question

This should now be fixed in OFED 1.2.  

woody


-----Original Message-----
From: Yong Qin [mailto:yong.qin at qlogic.com] 
Sent: Tuesday, April 03, 2007 12:43 PM
To: Boris Shpolyansky; Woodruff, Robert J; Hefty, Sean
Cc: general at lists.openfabrics.org
Subject: RE: [ofa-general] uDAPL question

Is there any progress on this issue? We are seeing exactly the same
error on OFED 1.1 + Intel MPI 3.0 -- "unexpected DAPL event 4006" and
wondering if there is a fix.

Thanks,

Yong

-----Original Message-----
From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Boris
Shpolyansky
Sent: Monday, March 12, 2007 11:28 AM
To: Woodruff, Robert J; general at lists.openfabrics.org; Hefty, Sean
Subject: RE: [ofa-general] uDAPL question

Hi Woody,

Thanks for your help.
I guess the problem is in the CM, is it?
Can you point me to relevant communication/bug reports that explain the
fix for this issue?
Would Sean be the right person to ask regarding what exact patch should
be added/removed?
I would prefer to stick to the OFED-1.1 code with minimal changes, if
possible, to avoid compatibility issues.

Thanks,
Boris

-----Original Message-----
From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] 
Sent: Monday, March 12, 2007 8:24 AM
To: Boris Shpolyansky; general at lists.openfabrics.org; Hefty, Sean
Subject: RE: [ofa-general] uDAPL question

This is a known problem and should be fixed by now. There was a bad
patch that somehow got into OFED that was not in Sean's main tree.
Assuming this bad patch has been removed, the problem should be fixed.

 
woody
 

________________________________

From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Boris
Shpolyansky
Sent: Friday, March 09, 2007 8:40 PM
To: general at lists.openfabrics.org
Subject: [ofa-general] uDAPL question


Hi, 
 
I'm trying to get a simple Intel MPI benchmark running over IB (uDAPL)
using the OFED-1.1 stack.
I'm consistently getting the following error:
 
[root at ibd005 ~]# ./runjob_I_MPI.boris 2
Task 0 of 2 tasks started on host ibd005.ibd.mti.com
clock_resolution = 1.00e-06 s
Task 1 of 2 tasks started on host ibd006.ibd.mti.com
[0:ibd005] unexpected DAPL event 4006 from 1:ibd006
[1:ibd006] unexpected DAPL event 4006 from 0:ibd005
rank 0 in job 14  ibd005_36193   caused collective abort of all ranks
  exit status of rank 0: return code 254

I did some digging and found out that event 4006 (actually 0x4006) means
DAT_CONNECTION_EVENT_BROKEN and it is returned by function dat_rmr_bind.
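
For reference, the number in the log is printed in hex without a prefix,
so it can be decoded with a small lookup. This is a minimal Python sketch,
not part of uDAPL or Intel MPI. The function name decode_dapl_event is
hypothetical. Only the 0x4006 entry comes from this thread; the other
values are assumed from the DAT 1.2 dat.h header and should be checked
against the installed copy.

```python
# Connection event numbers. Only 0x4006 is confirmed by this thread;
# the remaining entries are an assumption based on the DAT 1.2 dat.h header.
DAT_CONNECTION_EVENTS = {
    0x4001: "DAT_CONNECTION_EVENT_ESTABLISHED",
    0x4005: "DAT_CONNECTION_EVENT_DISCONNECTED",
    0x4006: "DAT_CONNECTION_EVENT_BROKEN",
    0x4007: "DAT_CONNECTION_EVENT_TIMED_OUT",
}


def decode_dapl_event(text):
    """Map the event number from an 'unexpected DAPL event NNNN' message to a name.

    The log prints the number in hex without the 0x prefix, so it is
    parsed as base 16.
    """
    number = int(text, 16)
    return DAT_CONNECTION_EVENTS.get(number, "unknown event 0x%x" % number)


print(decode_dapl_event("4006"))
```

Numbers missing from the table are reported as unknown rather than
guessed, so the lookup never invents a name.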

So my question is why this function consistently fails.
I'm using the standard dat.conf file:
 
OpenIB-cma u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
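
The entry above can be split into its fields to see which provider library
and device the registry is being asked to load. This is a minimal Python
sketch. The field names are assumed from the dat.conf format description
and are not defined anywhere in this thread, and parse_dat_conf_entry is a
hypothetical helper, not part of uDAPL.

```python
import shlex

# Assumed field names for one dat.conf entry (not taken from this thread).
DAT_CONF_FIELDS = (
    "ia_name", "api_version", "threadsafety", "default",
    "lib_path", "provider_version", "ia_params", "platform_params",
)


def parse_dat_conf_entry(line):
    """Split one dat.conf entry into a dict keyed by the assumed field names."""
    # shlex keeps the quoted "ib0 0" together and turns "" into an empty field.
    values = shlex.split(line)
    if len(values) != len(DAT_CONF_FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(DAT_CONF_FIELDS), len(values))
        )
    return dict(zip(DAT_CONF_FIELDS, values))


entry = parse_dat_conf_entry(
    'OpenIB-cma u1.2 nonthreadsafe default '
    '/usr/local/ofed/lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" ""'
)
print(entry["lib_path"])
print(entry["ia_params"])
```

The entry is treated as a single logical line, which is why the wrapped
text from the mail is joined before parsing.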

Appreciate your help,
 
Boris Shpolyansky 