[openib-general] uDAPL open HCA problem
LEI CHAI
chai.15 at osu.edu
Fri Oct 21 16:43:51 PDT 2005
Hi,
I'm from the same lab as Sayantan. Thanks for your suggestion. We have not been able to reproduce that problem so far, but we have run into another one. When I tear down a connection between two nodes, I often get messages like this:
[ 0] 005e0406
[ 4] 00000000
[ 8] 00000000
[ c] 00000000
[10] 05f90000
[14] 00000000
[18] 00000008
[1c] fe100000
The program still runs to completion and exits, though.
After enabling the debug option as you suggested, I got the following log. It starts at the point where I begin to free the resources and disconnect the nodes:
dapl_lmr_free (0x76f3b0)
dapl_lmr_free (0x76f4e0)
dapl_lmr_free (0x76f650)
dapli_cq_event_cb(0x5c40c0)
dapli_cm_event()
dapli_cm_event: EVENT=0x7 ID=0x76fa70 CTX=0x76fb00
passive_cb: conn 0x76fb00 id 7797360 event 7
dapli_async_event_cb(0x5c40c0)
dapl_lmr_free (0x76fee0)
dapl_lmr_free (0x7a9150)
dapl_lmr_free (0x7a9280)
dapl_lmr_free (0x7a93b0)
dapl_lmr_free (0x7a94e0)
dapl_lmr_free (0x7a9610)
dapl_lmr_free (0x7a9740)
dapl_lmr_free (0x7a9870)
dapl_lmr_free (0x7a99a0)
dapl_lmr_free (0x7a9ad0)
dapl_ep_disconnect (0x69b070, 1)
disconnect(ep 0x69b070, conn 0x76f7a0, id 7797184 flags 1)
dapl_ep_disconnect () returns 0x0
dapli_cq_event_cb(0x5c4410)
dapli_cm_event()
dapli_cm_event: EVENT=0x8 ID=0x76f9c0 CTX=0x76f7a0
active_cb: conn 0x76f7a0 id 7797184 event 8
dapli_async_event_cb(0x5c4410)
dapl_evd_wait (0x5c89b0, -1, 1, 0x7fffffebf7c0, 0x7fffffebf7bc)
dapl_evd_wait: EVD 0x5c89b0, CQ (nil)
dapl_evd_wait (0x5c89b0, -1, 1, 0x7fffff9a9b50, 0x7fffff9a9b4c)
dapl_evd_wait: EVD 0x5c89b0, CQ (nil)
dapli_cq_event_cb(0x5c4410)
dapli_cm_event()
dapli_cm_event: EVENT=0x9 ID=0x76f9c0 CTX=0x76f7a0
active_cb: conn 0x76f7a0 id 7797184 event 9
--> dapl_evd_connection_callback: ctxt: 0x69b070 event: 1 cm_handle 0x76f7a0
dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005
disconnect(ep 0x69b070, conn 0x76f7a0, id 7797184 flags 0)
destroy_cm_id: conn 0x76f7a0 id 7797184
modify_qp: qp 0x69b3a0, state 6 qp_num 0x2c0406
dapli_evd_post_event: Called with event # 4005
dapl_evd_connection_callback () returns
active_cb: DESTROY conn 0x76f7a0 id 7797184
dapli_async_event_cb(0x5c4410)
dapl_evd_wait () returns 0x0
dapli_cq_event_cb(0x5c40c0)
dapli_cm_event()
dapli_cm_event: EVENT=0x9 ID=0x76fa70 CTX=0x76fb00
passive_cb: conn 0x76fb00 id 7797360 event 9
--> dapl_cr_callback! context: 0x5c8b20 event: 1 cm_handle 0x76fb00
dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005
disconnect(ep 0x69b070, conn 0x76fb00, id 7797360 flags 0)
destroy_cm_id: conn 0x76fb00 id 7797360
modify_qp: qp 0x69b3a0, state 6 qp_num 0x4e0406
dapli_evd_post_event: Called with event # 4005
dapl_evd_wait () returns 0x0
dapli_async_event_cb(0x5c40c0)
dapli_cq_event_cb(0x5c40c0)
dapli_cm_event()
dapli_cm_event: EVENT=0x7 ID=0x76f910 CTX=0x7a9120
passive_cb: conn 0x7a9120 id 7797008 event 7
dapli_async_event_cb(0x5c40c0)
dapl_ep_disconnect (0x69bd20, 1)
disconnect(ep 0x69bd20, conn 0x76fa50, id 7797872 flags 1)
dapl_ep_disconnect () returns 0x0
dapl_evd_wait (0x5ccb00, -1, 1, 0x7fffffebf7c0, 0x7fffffebf7b8)
dapl_evd_wait: EVD 0x5ccb00, CQ (nil)
dapli_cq_event_cb(0x5c4410)
dapli_cm_event()
dapli_cm_event: EVENT=0x8 ID=0x76fc70 CTX=0x76fa50
active_cb: conn 0x76fa50 id 7797872 event 8
dapli_async_event_cb(0x5c4410)
dapl_evd_wait (0x5ccb00, -1, 1, 0x7fffff9a9b50, 0x7fffff9a9b48)
dapl_evd_wait: EVD 0x5ccb00, CQ (nil)
dapli_cq_event_cb(0x5c4410)
dapli_cm_event()
dapli_cm_event: EVENT=0x9 ID=0x76fc70 CTX=0x76fa50
active_cb: conn 0x76fa50 id 7797872 event 9
--> dapl_evd_connection_callback: ctxt: 0x69bd20 event: 1 cm_handle 0x76fa50
dapli_cq_event_cb(0x5c40c0)
dapli_cm_event()
dapli_cm_event: EVENT=0x9 ID=0x76f910 CTX=0x7a9120
passive_cb: conn 0x7a9120 id 7797008 event 9
--> dapl_cr_callback! context: 0x5ccc70 event: 1 cm_handle 0x7a9120
dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005
disconnect(ep 0x69bd20, conn 0x7a9120, id 7797008 flags 0)
destroy_cm_id: conn 0x7a9120 id 7797008
modify_qp: qp 0x76f220, state 6 qp_num 0x4e0407
dapli_evd_post_event: Called with event # 4005
dapl_evd_wait () returns 0x0
dapl_ep_free (0x69b070)
dapl_ep_disconnect (0x69b070, 0)
dapl_ep_disconnect () returns 0x0
dapl_ep_free: Free EP: b, ep 0x69b070 qp_state 1 qp_handle 69b3a0
qp_free: ep_ptr 0x69b070 qp 0x69b3a0
modify_qp: qp 0x69b3a0, state 6 qp_num 0x4e0406
dapli_async_event_cb(0x5c40c0)
dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005
disconnect(ep 0x69bd20, conn 0x76fa50, id 7797872 flags 0)
destroy_cm_id: conn 0x76fa50 id 7797872
modify_qp: qp 0x76f220, state 6 qp_num 0x2c0407
dapli_evd_post_event: Called with event # 4005
dapl_evd_connection_callback () returns
active_cb: DESTROY conn 0x76fa50 id 7797872
dapli_async_event_cb(0x5c4410)
dapl_evd_wait () returns 0x0
dapl_ep_free (0x69b070)
dapl_ep_disconnect (0x69b070, 0)
dapl_ep_disconnect () returns 0x0
dapl_ep_free: Free EP: b, ep 0x69b070 qp_state 1 qp_handle 69b3a0
qp_free: ep_ptr 0x69b070 qp 0x69b3a0
modify_qp: qp 0x69b3a0, state 6 qp_num 0x2c0406
>>> dapl_psp_free 0x5c8b20
>>> dapl_psp_free: state 1 cr_list_count 0
remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5c8b20 cm_ptr 0x5c8be0)
destroy_cm_id: conn 0x5c8be0 id 6065664
dapl_evd_free (0x5c89b0)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5c8840)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5c85e0)
[ 0] 002c0406
[ 4] 00000000
[ 8] 00000000
[ c] 00000000
[10] 05f90000
[14] 00000000
[18] 00000008
[1c] fe100000
>>> dapl_psp_free 0x5c8b20
>>> dapl_psp_free: state 1 cr_list_count 0
remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5c8b20 cm_ptr 0x5c8be0)
destroy_cm_id: conn 0x5c8be0 id 6065664
dapl_evd_free (0x5c89b0)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5c8840)
dapl_evd_free () returns 0x0
cq_object_destroy: wait_obj=0x5c8750
dapl_evd_free () returns 0x0
dapl_ep_free (0x69bd20)
dapl_ep_disconnect (0x69bd20, 0)
dapl_ep_disconnect () returns 0x0
dapl_ep_free: Free EP: b, ep 0x69bd20 qp_state 1 qp_handle 76f220
qp_free: ep_ptr 0x69bd20 qp 0x76f220
modify_qp: qp 0x76f220, state 6 qp_num 0x2c0407
>>> dapl_psp_free 0x5ccc70
>>> dapl_psp_free: state 1 cr_list_count 0
remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5ccc70 cm_ptr 0x5ccd30)
destroy_cm_id: conn 0x5ccd30 id 6082384
dapl_evd_free (0x5ccb00)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5cc990)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5cc730)
cq_object_destroy: wait_obj=0x5cc8a0
dapl_evd_free () returns 0x0
dapl_pz_free (0x5c8510)
dapl_ia_query (0x5c8000, (nil), 0x0, (nil), 0x3ffffff, 0x7fffff9a9900)
dapl_ia_query () returns 0x0
dapl_ia_close (0x5c8000, 1)
setup_async_cb: ia 0x5c8000 type 0 hdl (nil) cb (nil) ctx (nil)
setup_async_cb: ia 0x5c8000 type 1 hdl (nil) cb (nil) ctx (nil)
setup_async_cb: ia 0x5c8000 type 3 hdl (nil) cb (nil) ctx (nil)
dapl_evd_free (0x5c80f0)
dapl_evd_free () returns 0x0
close_hca: 0x5c4390->0x5ca3b0
ib_thread_destroy: wait on hca 0x2 destroy
dapl_evd_free (0x5c85e0)
cq_object_destroy: wait_obj=0x5c8750
dapl_evd_free () returns 0x0
dapl_ep_free (0x69bd20)
dapl_ep_disconnect (0x69bd20, 0)
dapl_ep_disconnect () returns 0x0
dapl_ep_free: Free EP: b, ep 0x69bd20 qp_state 1 qp_handle 76f220
qp_free: ep_ptr 0x69bd20 qp 0x76f220
modify_qp: qp 0x76f220, state 6 qp_num 0x4e0407
>>> dapl_psp_free 0x5ccc70
>>> dapl_psp_free: state 1 cr_list_count 0
remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5ccc70 cm_ptr 0x5ccd30)
destroy_cm_id: conn 0x5ccd30 id 6082384
dapl_evd_free (0x5ccb00)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5cc990)
dapl_evd_free () returns 0x0
dapl_evd_free (0x5cc730)
cq_object_destroy: wait_obj=0x5cc8a0
dapl_evd_free () returns 0x0
dapl_pz_free (0x5c8510)
dapl_ia_query (0x5c8000, (nil), 0x0, (nil), 0x3ffffff, 0x7fffffebf570)
dapl_ia_query () returns 0x0
dapl_ia_close (0x5c8000, 1)
setup_async_cb: ia 0x5c8000 type 0 hdl (nil) cb (nil) ctx (nil)
setup_async_cb: ia 0x5c8000 type 1 hdl (nil) cb (nil) ctx (nil)
setup_async_cb: ia 0x5c8000 type 3 hdl (nil) cb (nil) ctx (nil)
dapl_evd_free (0x5c80f0)
dapl_evd_free () returns 0x0
close_hca: 0x5c4040->0x5ca3b0
DAPL: Stopped (dapl_fini)
dapl_ib_release:
ib_thread_destroy(8512)
ib_thread_destroy: waiting for ib_thread
ib_thread(8512) EXIT
DAPL: Stopped (dapl_fini)
dapl_ib_release:
ib_thread_destroy(8081)
ib_thread_destroy: waiting for ib_thread
ib_thread(8081) EXIT
ib_thread_destroy(8512) exit
ib_thread_destroy(8081) exit
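In case it helps with reading the log, here is a minimal sketch of a script that pulls out the 8-word dumps and checks whether the first word of each dump equals a qp_num reported on a modify_qp line. That the first word is a QP number is only an assumption on my part: the second dump starts with 002c0406 and the log shows qp_num 0x2c0406, but I have not confirmed what the dump actually contains. The function names parse_log and match_dumps are made up for this example.

```python
import re

# Matches dump lines such as "[ 0] 002c0406" or "[1c] fe100000".
DUMP_RE = re.compile(r"^\[\s*([0-9a-f]+)\]\s+([0-9a-f]{8})\s*$")
# Matches the qp_num field of lines such as
# "modify_qp: qp 0x69b3a0, state 6 qp_num 0x2c0406".
QPN_RE = re.compile(r"^modify_qp: .*qp_num (0x[0-9a-f]+)")


def parse_log(lines):
    """Return (dumps, qp_nums); dumps is a list of {offset: word} dicts."""
    dumps, qp_nums, current = [], set(), None
    for line in lines:
        line = line.strip()
        m = DUMP_RE.match(line)
        if m:
            offset, word = int(m.group(1), 16), int(m.group(2), 16)
            if offset == 0:
                # Offset 0 is treated as the start of a new dump.
                current = {}
                dumps.append(current)
            if current is not None:
                current[offset] = word
            continue
        m = QPN_RE.match(line)
        if m:
            qp_nums.add(int(m.group(1), 16))
    return dumps, qp_nums


def match_dumps(dumps, qp_nums):
    """Pair each dump's first word with whether it equals a logged qp_num.

    Assumption (not confirmed): the first word of a dump is a QP number.
    """
    return [(hex(d[0]), d[0] in qp_nums) for d in dumps if 0 in d]


if __name__ == "__main__":
    sample = [
        "modify_qp: qp 0x69b3a0, state 6 qp_num 0x2c0406",
        "[ 0] 002c0406",
        "[ 4] 00000000",
        "[1c] fe100000",
    ]
    dumps, qp_nums = parse_log(sample)
    print(match_dumps(dumps, qp_nums))
```

Run over the log above, this should flag 002c0406 as matching a logged qp_num, while the 005e0406 from the first dump does not match any qp_num in this part of the log.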
Any suggestions would be highly appreciated.
Thanks.
Lei
----- Original Message -----
From: Arlin Davis <ardavis at ichips.intel.com>
Date: Friday, October 21, 2005 2:59 pm
Subject: Re: [openib-general] uDAPL open HCA problem
> Sayantan Sur wrote:
>
> >Hello,
> >
> >I have udapl over Gen2 setup on our cluster and am able to run udapl
> >programs. However, sometimes I get this error (after a few runs of the
> >same program):
> >
> > open_hca: ERR ib_at_ips_by_gid for mthca0
> >dapls_ib_open_hca failed 40000
> >
> >
>
> uDAPL uses uAT to get the IP address using the GID (ATS records via SA)
> of the local device/port. The SA query for this record is failing for
> some reason. Did your SM bounce during this time? Did you bounce or
> reconfigure the IPoIB network device?
>
> You can set "env DAPL_DBG_TYPE=0xffff" for more information.
>
> -arlin
>
> >The machine is an AMD Opteron (Tyan S2895), with Mellanox MemFree cards
> >(fw ver 5.1.0).
> >
> >lsmod on my machine shows this:
> >
> >[surs at ro0:~] lsmod | grep ^ib
> >ib_ipoib 48008 0
> >ib_uat 14840 0
> >ib_at 25696 1 ib_uat
> >ib_sa 17804 2 ib_ipoib,ib_at
> >ib_ucm 22280 0
> >ib_cm 37744 1 ib_ucm
> >ib_uverbs 35992 0
> >ib_umad 18208 0
> >ib_mthca 122656 0
> >ib_mad 44072 4 ib_sa,ib_cm,ib_umad,ib_mthca
> >ib_core 56192 8 ib_ipoib,ib_sa,ib_ucm,ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad
> >
> >My infiniband devices are (created by hand):
> >
> >[surs at ro0:~] ls -l /dev/infiniband/
> >total 0
> >crw-rw-rw- 1 root root 231, 191 2005-10-20 21:13 uat
> >crw-rw-rw- 1 root root 231, 224 2005-10-20 21:12 ucm0
> >crwxrwxrwx 1 root root 231, 192 2005-09-21 04:37 umad0
> >crwxrwxrwx 1 root root 231, 192 2005-09-16 19:29 uverbs0
> >crwxrwxrwx 1 root root 231, 192 2005-09-16 19:29 uverbs1
> >
> >
> >I'd really appreciate if someone could help me understand what might be
> >going wrong.
> >
> >Thanks,
> >Sayantan.
> >
> >
> >
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>