[openib-general] uDAPL open HCA problem

LEI CHAI chai.15 at osu.edu
Fri Oct 21 16:43:51 PDT 2005


Hi,
I'm from the same lab as Sayantan. Thanks for your suggestion. We have not been able to reproduce that problem since, but we have run into another one. When I tear down a connection between two nodes, I often get messages like this:

  [ 0] 005e0406
  [ 4] 00000000
  [ 8] 00000000
  [ c] 00000000
  [10] 05f90000
  [14] 00000000
  [18] 00000008
  [1c] fe100000

The program still runs to completion and exits, though.

After enabling the debug option you suggested, I got the following log. It starts at the point where I begin freeing the resources and disconnecting the nodes:

dapl_lmr_free (0x76f3b0)
dapl_lmr_free (0x76f4e0)
dapl_lmr_free (0x76f650)
 dapli_cq_event_cb(0x5c40c0)
 dapli_cm_event()
 dapli_cm_event: EVENT=0x7 ID=0x76fa70 CTX=0x76fb00
 passive_cb: conn 0x76fb00 id 7797360 event 7
 dapli_async_event_cb(0x5c40c0)
dapl_lmr_free (0x76fee0)
dapl_lmr_free (0x7a9150)
dapl_lmr_free (0x7a9280)
dapl_lmr_free (0x7a93b0)
dapl_lmr_free (0x7a94e0)
dapl_lmr_free (0x7a9610)
dapl_lmr_free (0x7a9740)
dapl_lmr_free (0x7a9870)
dapl_lmr_free (0x7a99a0)
dapl_lmr_free (0x7a9ad0)
dapl_ep_disconnect (0x69b070, 1)
 disconnect(ep 0x69b070, conn 0x76f7a0, id 7797184 flags 1)
dapl_ep_disconnect () returns 0x0
 dapli_cq_event_cb(0x5c4410)
 dapli_cm_event()
 dapli_cm_event: EVENT=0x8 ID=0x76f9c0 CTX=0x76f7a0
 active_cb: conn 0x76f7a0 id 7797184 event 8
 dapli_async_event_cb(0x5c4410)
dapl_evd_wait (0x5c89b0, -1, 1, 0x7fffffebf7c0, 0x7fffffebf7bc)
dapl_evd_wait: EVD 0x5c89b0, CQ (nil)
dapl_evd_wait (0x5c89b0, -1, 1, 0x7fffff9a9b50, 0x7fffff9a9b4c)
dapl_evd_wait: EVD 0x5c89b0, CQ (nil)
 dapli_cq_event_cb(0x5c4410)
 dapli_cm_event()
 dapli_cm_event: EVENT=0x9 ID=0x76f9c0 CTX=0x76f7a0
 active_cb: conn 0x76f7a0 id 7797184 event 9
--> dapl_evd_connection_callback: ctxt: 0x69b070 event: 1 cm_handle 0x76f7a0
dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005
 disconnect(ep 0x69b070, conn 0x76f7a0, id 7797184 flags 0)
 destroy_cm_id: conn 0x76f7a0 id 7797184
 modify_qp: qp 0x69b3a0, state 6 qp_num 0x2c0406
dapli_evd_post_event: Called with event # 4005
dapl_evd_connection_callback () returns
 active_cb: DESTROY conn 0x76f7a0 id 7797184
 dapli_async_event_cb(0x5c4410)
dapl_evd_wait () returns 0x0
 dapli_cq_event_cb(0x5c40c0)
 dapli_cm_event()
 dapli_cm_event: EVENT=0x9 ID=0x76fa70 CTX=0x76fb00
 passive_cb: conn 0x76fb00 id 7797360 event 9
--> dapl_cr_callback! context: 0x5c8b20 event: 1 cm_handle 0x76fb00
dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005
 disconnect(ep 0x69b070, conn 0x76fb00, id 7797360 flags 0)
 destroy_cm_id: conn 0x76fb00 id 7797360
 modify_qp: qp 0x69b3a0, state 6 qp_num 0x4e0406
dapli_evd_post_event: Called with event # 4005
dapl_evd_wait () returns 0x0
 dapli_async_event_cb(0x5c40c0)
 dapli_cq_event_cb(0x5c40c0)
 dapli_cm_event()
 dapli_cm_event: EVENT=0x7 ID=0x76f910 CTX=0x7a9120
 passive_cb: conn 0x7a9120 id 7797008 event 7
 dapli_async_event_cb(0x5c40c0)
dapl_ep_disconnect (0x69bd20, 1)
 disconnect(ep 0x69bd20, conn 0x76fa50, id 7797872 flags 1)
dapl_ep_disconnect () returns 0x0
dapl_evd_wait (0x5ccb00, -1, 1, 0x7fffffebf7c0, 0x7fffffebf7b8)
dapl_evd_wait: EVD 0x5ccb00, CQ (nil)
 dapli_cq_event_cb(0x5c4410)
 dapli_cm_event()
 dapli_cm_event: EVENT=0x8 ID=0x76fc70 CTX=0x76fa50
 active_cb: conn 0x76fa50 id 7797872 event 8
 dapli_async_event_cb(0x5c4410)
dapl_evd_wait (0x5ccb00, -1, 1, 0x7fffff9a9b50, 0x7fffff9a9b48)
dapl_evd_wait: EVD 0x5ccb00, CQ (nil)
 dapli_cq_event_cb(0x5c4410)
 dapli_cm_event()
 dapli_cm_event: EVENT=0x9 ID=0x76fc70 CTX=0x76fa50
 active_cb: conn 0x76fa50 id 7797872 event 9
--> dapl_evd_connection_callback: ctxt: 0x69bd20 event: 1 cm_handle 0x76fa50
 dapli_cq_event_cb(0x5c40c0)
 dapli_cm_event()
 dapli_cm_event: EVENT=0x9 ID=0x76f910 CTX=0x7a9120
 passive_cb: conn 0x7a9120 id 7797008 event 9
--> dapl_cr_callback! context: 0x5ccc70 event: 1 cm_handle 0x7a9120
dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005
 disconnect(ep 0x69bd20, conn 0x7a9120, id 7797008 flags 0)
 destroy_cm_id: conn 0x7a9120 id 7797008
 modify_qp: qp 0x76f220, state 6 qp_num 0x4e0407
dapli_evd_post_event: Called with event # 4005
dapl_evd_wait () returns 0x0
dapl_ep_free (0x69b070)
dapl_ep_disconnect (0x69b070, 0)
dapl_ep_disconnect () returns 0x0
dapl_ep_free: Free EP: b, ep 0x69b070 qp_state 1 qp_handle 69b3a0
 qp_free:  ep_ptr 0x69b070 qp 0x69b3a0
 modify_qp: qp 0x69b3a0, state 6 qp_num 0x4e0406
 dapli_async_event_cb(0x5c40c0)
dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005
 disconnect(ep 0x69bd20, conn 0x76fa50, id 7797872 flags 0)
 destroy_cm_id: conn 0x76fa50 id 7797872
 modify_qp: qp 0x76f220, state 6 qp_num 0x2c0407
dapli_evd_post_event: Called with event # 4005
dapl_evd_connection_callback () returns
 active_cb: DESTROY conn 0x76fa50 id 7797872
 dapli_async_event_cb(0x5c4410)
dapl_evd_wait () returns 0x0
dapl_ep_free (0x69b070)
dapl_ep_disconnect (0x69b070, 0)
dapl_ep_disconnect () returns 0x0
dapl_ep_free: Free EP: b, ep 0x69b070 qp_state 1 qp_handle 69b3a0
 qp_free:  ep_ptr 0x69b070 qp 0x69b3a0
 modify_qp: qp 0x69b3a0, state 6 qp_num 0x2c0406
>>> dapl_psp_free 0x5c8b20
>>> dapl_psp_free: state 1 cr_list_count 0
 remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5c8b20 cm_ptr 0x5c8be0)
 destroy_cm_id: conn 0x5c8be0 id 6065664
dapl_evd_free (0x5c89b0)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5c8840)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5c85e0)
  [ 0] 002c0406
  [ 4] 00000000
  [ 8] 00000000
  [ c] 00000000
  [10] 05f90000
  [14] 00000000
  [18] 00000008
  [1c] fe100000
>>> dapl_psp_free 0x5c8b20
>>> dapl_psp_free: state 1 cr_list_count 0
 remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5c8b20 cm_ptr 0x5c8be0)
 destroy_cm_id: conn 0x5c8be0 id 6065664
dapl_evd_free (0x5c89b0)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5c8840)
dapl_evd_free () returns 0x0
 cq_object_destroy: wait_obj=0x5c8750
dapl_evd_free () returns 0x0
dapl_ep_free (0x69bd20)
dapl_ep_disconnect (0x69bd20, 0)
dapl_ep_disconnect () returns 0x0
dapl_ep_free: Free EP: b, ep 0x69bd20 qp_state 1 qp_handle 76f220
 qp_free:  ep_ptr 0x69bd20 qp 0x76f220
 modify_qp: qp 0x76f220, state 6 qp_num 0x2c0407
>>> dapl_psp_free 0x5ccc70
>>> dapl_psp_free: state 1 cr_list_count 0
 remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5ccc70 cm_ptr 0x5ccd30)
 destroy_cm_id: conn 0x5ccd30 id 6082384
dapl_evd_free (0x5ccb00)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5cc990)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5cc730)
 cq_object_destroy: wait_obj=0x5cc8a0
dapl_evd_free () returns 0x0
dapl_pz_free (0x5c8510)
dapl_ia_query (0x5c8000, (nil), 0x0, (nil), 0x3ffffff, 0x7fffff9a9900)
dapl_ia_query () returns 0x0
dapl_ia_close (0x5c8000, 1)
 setup_async_cb: ia 0x5c8000 type 0 hdl (nil) cb (nil) ctx (nil)
 setup_async_cb: ia 0x5c8000 type 1 hdl (nil) cb (nil) ctx (nil)
 setup_async_cb: ia 0x5c8000 type 3 hdl (nil) cb (nil) ctx (nil)
dapl_evd_free (0x5c80f0)
dapl_evd_free () returns 0x0
 close_hca: 0x5c4390->0x5ca3b0
 ib_thread_destroy: wait on hca 0x2 destroy
dapl_evd_free (0x5c85e0)
 cq_object_destroy: wait_obj=0x5c8750
dapl_evd_free () returns 0x0
dapl_ep_free (0x69bd20)
dapl_ep_disconnect (0x69bd20, 0)
dapl_ep_disconnect () returns 0x0
dapl_ep_free: Free EP: b, ep 0x69bd20 qp_state 1 qp_handle 76f220
 qp_free:  ep_ptr 0x69bd20 qp 0x76f220
 modify_qp: qp 0x76f220, state 6 qp_num 0x4e0407
>>> dapl_psp_free 0x5ccc70
>>> dapl_psp_free: state 1 cr_list_count 0
 remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5ccc70 cm_ptr 0x5ccd30)
 destroy_cm_id: conn 0x5ccd30 id 6082384
dapl_evd_free (0x5ccb00)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5cc990)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5cc730)
 cq_object_destroy: wait_obj=0x5cc8a0
dapl_evd_free () returns 0x0
dapl_pz_free (0x5c8510)
dapl_ia_query (0x5c8000, (nil), 0x0, (nil), 0x3ffffff, 0x7fffffebf570)
dapl_ia_query () returns 0x0
dapl_ia_close (0x5c8000, 1)
 setup_async_cb: ia 0x5c8000 type 0 hdl (nil) cb (nil) ctx (nil)
 setup_async_cb: ia 0x5c8000 type 1 hdl (nil) cb (nil) ctx (nil)
 setup_async_cb: ia 0x5c8000 type 3 hdl (nil) cb (nil) ctx (nil)
dapl_evd_free (0x5c80f0)
dapl_evd_free () returns 0x0
 close_hca: 0x5c4040->0x5ca3b0
DAPL: Stopped (dapl_fini)
 dapl_ib_release:
 ib_thread_destroy(8512)
 ib_thread_destroy: waiting for ib_thread
 ib_thread(8512) EXIT
DAPL: Stopped (dapl_fini)
 dapl_ib_release:
 ib_thread_destroy(8081)
 ib_thread_destroy: waiting for ib_thread
 ib_thread(8081) EXIT
 ib_thread_destroy(8512) exit
 ib_thread_destroy(8081) exit
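
One thing that may be worth checking: the first word of the dump in the log above (002c0406) ends in the same value as one of the qp_num values printed by modify_qp (0x2c0406). Below is a minimal Python sketch for checking this on other runs. It assumes that the low 24 bits of the word at offset 0 are a QP number; that is only a guess based on this one match, not something the log states. The function names and the SAMPLE list are made up for illustration.

```python
import re

# Dump line, e.g. "  [ 0] 002c0406": hex offset in brackets, then one 32-bit word.
DUMP_RE = re.compile(r"^\s*\[\s*([0-9a-f]+)\]\s+([0-9a-f]{8})\s*$")
# Debug line, e.g. " modify_qp: qp 0x69b3a0, state 6 qp_num 0x2c0406".
QP_RE = re.compile(r"modify_qp: qp (0x[0-9a-f]+), state \d+ qp_num (0x[0-9a-f]+)")


def dump_qp_candidates(lines):
    """Return the low 24 bits of the offset-0 word of every dump found.

    ASSUMPTION: those bits hold a QP number. The log does not say so.
    """
    found = []
    for line in lines:
        m = DUMP_RE.match(line)
        if m and int(m.group(1), 16) == 0:
            found.append(int(m.group(2), 16) & 0xFFFFFF)
    return found


def qp_handles_by_num(lines):
    """Map each qp_num seen in modify_qp lines to the qp handles printed with it."""
    seen = {}
    for line in lines:
        m = QP_RE.search(line)
        if m:
            seen.setdefault(int(m.group(2), 16), set()).add(m.group(1))
    return seen


def correlate(lines):
    """Pair every dump candidate with the handles that used that qp_num."""
    lines = list(lines)
    known = qp_handles_by_num(lines)
    return [(qpn, sorted(known.get(qpn, ()))) for qpn in dump_qp_candidates(lines)]


# A few lines copied from the log above, used as a quick self-check.
SAMPLE = [
    " modify_qp: qp 0x69b3a0, state 6 qp_num 0x2c0406",
    " modify_qp: qp 0x69b3a0, state 6 qp_num 0x4e0406",
    "  [ 0] 002c0406",
    "  [ 4] 00000000",
]

for qpn, handles in correlate(SAMPLE):
    print(f"dump candidate 0x{qpn:06x}: qp handles {handles or 'not seen in log'}")
```

The log seems to contain the output of both processes mixed together (there are two ib_thread PIDs, 8512 and 8081), which would explain why the same qp handle appears with two different qp_num values. Capturing each process's output in a separate file before running a check like this would probably give a clearer picture.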

Any suggestions would be highly appreciated.

Thanks.
Lei



----- Original Message -----
From: Arlin Davis <ardavis at ichips.intel.com>
Date: Friday, October 21, 2005 2:59 pm
Subject: Re: [openib-general] uDAPL open HCA problem

> Sayantan Sur wrote:
> 
> >Hello,
> >
> >I have udapl over Gen2 setup on our cluster and am able to run udapl
> >programs. However, sometimes I get this error (after a few runs of the
> >same program):
> >
> > open_hca: ERR ib_at_ips_by_gid for mthca0
> >dapls_ib_open_hca failed 40000
> >  
> >
> 
> uDAPL uses uAT to get the IP address using the GID (ATS records via SA)
> of the local device/port. The SA query for this record is failing for
> some reason. Did your SM bounce during this time? Did you bounce or
> reconfigure the IPoIB network device?
> 
> You can set "env DAPL_DBG_TYPE=0xffff"  for more information.
> 
> -arlin
> 
> >The machine is an AMD Opteron (Tyan S2895), with Mellanox MemFree cards
> >(fw ver 5.1.0).
> >
> >lsmod on my machine shows this:
> >
> >[surs at ro0:~] lsmod | grep ^ib
> >ib_ipoib               48008  0 
> >ib_uat                 14840  0 
> >ib_at                  25696  1 ib_uat
> >ib_sa                  17804  2 ib_ipoib,ib_at
> >ib_ucm                 22280  0 
> >ib_cm                  37744  1 ib_ucm
> >ib_uverbs              35992  0 
> >ib_umad                18208  0 
> >ib_mthca              122656  0 
> >ib_mad                 44072  4 ib_sa,ib_cm,ib_umad,ib_mthca
> >ib_core                56192  8 ib_ipoib,ib_sa,ib_ucm,ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad
> >
> >My infiniband devices are (created by hand):
> >
> >[surs at ro0:~] ls -l /dev/infiniband/
> >total 0
> >crw-rw-rw-  1 root root 231, 191 2005-10-20 21:13 uat
> >crw-rw-rw-  1 root root 231, 224 2005-10-20 21:12 ucm0
> >crwxrwxrwx  1 root root 231, 192 2005-09-21 04:37 umad0
> >crwxrwxrwx  1 root root 231, 192 2005-09-16 19:29 uverbs0
> >crwxrwxrwx  1 root root 231, 192 2005-09-16 19:29 uverbs1
> >
> >
> >I'd really appreciate if someone could help me understand what might be
> >going wrong.
> >
> >Thanks,
> >Sayantan.
> >
> >  
> >
> 



