[ofa-general] Re: [OMPI users] openMPI over uDAPL doesn't work

Jeff Squyres jsquyres at cisco.com
Tue May 8 06:16:57 PDT 2007


I'm forwarding this to the OpenFabrics general list. As came up just
the other day, we know that Open MPI's uDAPL support works on Solaris,
but we have done little to no testing of it on OFED (I personally know
almost nothing about uDAPL).

Can the uDAPL OFED wizards shed any light on the error messages
listed below?  In particular, these seem worrisome:

>  setup_listener Permission denied
>  setup_listener Address already in use
and
>  create_qp Address already in use

Thanks...


On May 8, 2007, at 5:37 AM, Boris Bierbaum wrote:

> Hi,
>
> we (my colleague Andreas and I) are still trying to solve this problem.
> I have compiled some additional information; maybe somebody has an idea
> about what's going on.
>
> OS: Debian GNU/Linux 4.0, Kernel 2.6.18, x86, 32-Bit
> IB software: OFED 1.1
> SM: OpenSM from OFED 1.1
> uDAPL: DAPL reference implementation version gamma 3.02 (using DAPL from
> OFED 1.1 doesn't change anything, I suppose it's the same code, at least
> roughly)
> Test program: Intel MPI Benchmarks Version 2.3
> OpenMPI version: 1.2.1
>
> Running Open MPI directly over IB verbs (mpirun --mca btl self,sm,openib
> ...) works. Here's the output of ibv_devinfo and ifconfig for the two
> nodes on which we tried to run the benchmark (ulimit -l is unlimited on
> both machines):
>
> ------------ 1st node -------------------------------
>
> boris@pd-04:/work/boris/IMB_2.3/src$ /opt/infiniband/bin/ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         1.2.0
>         node_guid:                      0002:c902:0020:b528
>         sys_image_guid:                 0002:c902:0020:b52b
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25204
>         hw_ver:                         0xA0
>         board_id:                       MT_0230000001
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 1
>                         port_lid:               9
>                         port_lmc:               0x00
>
> boris@pd-04:/work/boris/IMB_2.3/src$ /sbin/ifconfig
>
> ...
>
> ib0       Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
>           inet addr:192.168.0.14  Bcast:192.168.0.255  Mask:255.255.255.0
>           inet6 addr: fe80::202:c902:20:b529/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>           RX packets:67 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:16 errors:0 dropped:2 overruns:0 carrier:0
>           collisions:0 txqueuelen:128
>           RX bytes:3752 (3.6 KiB)  TX bytes:968 (968.0 b)
>
> ...
>
> ------------ 2nd node -------------------------------
>
> boris@pd-05:~$ /opt/infiniband/bin/ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         1.2.0
>         node_guid:                      0002:c902:0020:b4f4
>         sys_image_guid:                 0002:c902:0020:b4f7
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25204
>         hw_ver:                         0xA0
>         board_id:                       MT_0230000001
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 1
>                         port_lid:               10
>                         port_lmc:               0x00
>
> boris@pd-05:~$ /sbin/ifconfig
>
> ...
>
> ib0       Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
>           inet addr:192.168.0.15  Bcast:192.168.0.255  Mask:255.255.255.0
>           inet6 addr: fe80::202:c902:20:b4f5/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>           RX packets:67 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:18 errors:0 dropped:2 overruns:0 carrier:0
>           collisions:0 txqueuelen:128
>           RX bytes:3752 (3.6 KiB)  TX bytes:1088 (1.0 KiB)
>
>
> ...
>
> -------------------------------------------------------------------------
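
As a cross-check on the node information above, the IPv6 link-local
address that ifconfig reports for ib0 can be derived from the port GUID.
The sketch below assumes the usual Linux IPoIB behaviour of using the
64-bit port GUID as the interface identifier with the universal/local
bit (0x02 in the first byte) set. The function name is made up for
illustration, and the port GUIDs are the ones the DAPL debug output
reports for open_hca (node GUID plus one on these HCAs).

```python
import ipaddress


def ipoib_link_local(port_guid_hex: str) -> str:
    """Derive the fe80::/64 address for an IPoIB port from its GUID.

    Assumption: the interface identifier is the port GUID with the
    universal/local bit (0x02 of the first byte) set.
    """
    guid = bytearray.fromhex(port_guid_hex.replace(":", ""))
    if len(guid) != 8:
        raise ValueError("a port GUID must be exactly 8 bytes")
    guid[0] |= 0x02
    # Link-local prefix fe80::/64 followed by the interface identifier.
    packed = bytes.fromhex("fe80000000000000") + bytes(guid)
    return str(ipaddress.IPv6Address(packed))


if __name__ == "__main__":
    for node, guid in (("pd-04", "0002c9020020b529"),
                       ("pd-05", "0002c9020020b4f5")):
        print(node, ipoib_link_local(guid))
```

If the derived addresses match the inet6 addresses listed for ib0, the
DAPL provider and the IPoIB interface are looking at the same HCA port.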
>
>
> Here's the output from the failed run, with every DAT and DAPL debug
> output enabled:
>
>
>
> boris@pd-04:/work/boris/IMB_2.3/src$ mpirun -np 2 -x DAT_DBG_TYPE -x DAPL_DBG_TYPE -x DAT_OVERRIDE --mca btl self,sm,udapl --host pd-04,pd-05 /work/boris/IMB_2.3/src/IMB-MPI1 pingpong
> DAT Registry: Started (dat_init)
> DAT Registry: static registry file </home/boris/dapl_on_dope_gamma3.2/doc/dat.conf>
>
> DAT Registry: token
>  type  string
>  value <OpenIB-cma>
>
>
> DAT Registry: token
>  type  string
>  value <u1.2>
>
>
> DAT Registry: token
>  type  string
>  value <nonthreadsafe>
>
>
> DAT Registry: token
>  type  string
>  value <default>
>
>
> DAT Registry: token
>  type  string
 value </home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so>
>
>
> DAT Registry: token
>  type  string
>  value <mv_dapl.1.2>
>
>
> DAT Registry: token
>  type  string
>  value <ib0 0>
>
>
> DAT Registry: token
>  type  string
>  value <>
>
>
> DAT Registry: token
>  type  eor
>  value <>
>
>
> DAT Registry: entry
>  ia_name OpenIB-cma
>  api_version
>      type 0x0
>      major.minor 1.2
>  is_thread_safe 0
>  is_default 1
>  lib_path /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
>  provider_version
>      id mv_dapl
>      major.minor 1.2
>  ia_params ib0 0
>
> DAT Registry: loading provider for OpenIB-cma
>
> DAT Registry: token
>  type  eof
>  value <>
>
> DAT Registry: dat_registry_list_providers () called
> DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called
> DAT Registry: IA OpenIB-cma, trying to load library /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
> DAPL: NOT Setting Loopback
>  dapl_ib_init:
> DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0)
>  open_hca: ib0 - 0x807cf28
>  ib_thread_init(17919)
>  ib_thread_init: waiting for ib_thread
>  ib_thread(17919,0xa7b08bb0): ENTER: pipe 8 ucma 12
> DAT Registry: Started (dat_init)
> DAT Registry: static registry file </home/boris/dapl_on_dope_gamma3.2/doc/dat.conf>
>
> DAT Registry: token
>  type  string
>  value <OpenIB-cma>
>
>
> DAT Registry: token
>  type  string
>  value <u1.2>
>
>
> DAT Registry: token
>  type  string
>  value <nonthreadsafe>
>
>
> DAT Registry: token
>  type  string
>  value <default>
>
>
> DAT Registry: token
>  type  string
 value </home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so>
>
>
> DAT Registry: token
>  type  string
>  value <mv_dapl.1.2>
>
>
> DAT Registry: token
>  type  string
>  value <ib0 0>
>
>
> DAT Registry: token
>  type  string
>  value <>
>
>
> DAT Registry: token
>  type  eor
>  value <>
>
>
> DAT Registry: entry
>  ia_name OpenIB-cma
>  api_version
>      type 0x0
>      major.minor 1.2
>  is_thread_safe 0
>  is_default 1
>  lib_path /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
>  provider_version
>      id mv_dapl
>      major.minor 1.2
>  ia_params ib0 0
>
> DAT Registry: loading provider for OpenIB-cma
>
> DAT Registry: token
>  type  eof
>  value <>
>
> DAT Registry: dat_registry_list_providers () called
> DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called
> DAT Registry: IA OpenIB-cma, trying to load library /home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
>  ib_thread_init(17919) exit
> DAPL: NOT Setting Loopback
>  dapl_ib_init:
> DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0)
>  open_hca: ib0 - 0x807cf18
>  ib_thread_init(12326)
>  ib_thread_init: waiting for ib_thread
>  ib_thread(12326,0xa7b75bb0): ENTER: pipe 8 ucma 12
>  ib_thread_init(12326) exit
>  getipaddr: family 2 port 0 addr 192.168.0.14
>  open_hca: ctx=0x809ecd0 port=1 GID subnet fe80000000000000 id 0002c9020020b529
>  open_hca: ib0, AF_INET 192.168.0.14 INLINE_MAX=128
>  ib_thread(17919) poll_event:  async=0x1 pipe=0x1 cm=0x0 cq=0x0
>  ib_thread(17919) poll_fd: hca[134729592]=0xb, async=8 pipe=12 cm=13 cq=d
>  query_hca: ib0 AF_INET  192.168.0.14
>  query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
>  query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
>  setup_async_cb: ia 0x80a1648 type 0 hdl (nil) cb 0xa7b1ec6c ctx 0x80a16d0
>  setup_async_cb: ia 0x80a1648 type 1 hdl (nil) cb 0xa7b1e9c0 ctx 0x80a16d0
>  setup_async_cb: ia 0x80a1648 type 3 hdl (nil) cb 0xa7b1eb50 ctx 0x80a1648
> dat_set_handle 0x80a1648 to 1
> dat_get_ia_handle from 1 to 0x80a1648
>  pd_alloc: pd_handle=0x80a1928
> dat_get_ia_handle from 1 to 0x80a1648
>  query_hca: ib0 AF_INET  192.168.0.14
>  query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
>  query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
> dat_get_ia_handle from 1 to 0x80a1648
>  cq_object_create: (0x80a1958,0x80a1a44)
> dapls_ib_cq_alloc: evd 0x80a1958 cqlen=32
> dapls_ib_cq_alloc: new_cq 0x80a1a68 cqlen=63
>  setup_async_cb: ia 0x80a1648 type 2 hdl 0x80a1958 cb 0xa7b1f174 ctx 0x80a1958
> dat_get_ia_handle from 1 to 0x80a1648
> dat_get_ia_handle from 1 to 0x80a1648
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Address already in use
>  listen(ia_ptr 0x80a1648 SID 1025 sp 0x80a7a00 conn 0x80a7a70 id 134904736)
>  listen(conn=0x80a7a70 cm_id=134904736)
> dat_get_ia_handle from 1 to 0x80a1648
>  mr_register: ia=0x80a1648, lmr=0x80a3718 va=0x80ae000 ln=266240 pv=0x0
>  mr_register: mr=0x80a37c8 h 4 pd 0x80a1928 ctx 0x809ecd0 lkey=0x72002700 rkey=0x72002700 priv=41000
> dat_get_ia_handle from 1 to 0x80a1648
>  mr_register: ia=0x80a1648, lmr=0x80a7f18 va=0x80ef000 ln=528384 pv=0x0
>  mr_register: mr=0x80a7fc8 h 5 pd 0x80a1928 ctx 0x809ecd0 lkey=0xf2002800 rkey=0xf2002800 priv=81000
>  getipaddr: family 2 port 0 addr 192.168.0.15
>  open_hca: ctx=0x809ecc0 port=1 GID subnet fe80000000000000 id 0002c9020020b4f5
>  open_hca: ib0, AF_INET 192.168.0.15 INLINE_MAX=128
>  ib_thread(12326) poll_event:  async=0x1 pipe=0x1 cm=0x0 cq=0x0
>  ib_thread(12326) poll_fd: hca[134729576]=0xb, async=8 pipe=12 cm=13 cq=d
>  query_hca: ib0 AF_INET  192.168.0.15
>  query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
>  query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
>  setup_async_cb: ia 0x80a1638 type 0 hdl (nil) cb 0xa7b8bc6c ctx 0x80a16c0
>  setup_async_cb: ia 0x80a1638 type 1 hdl (nil) cb 0xa7b8b9c0 ctx 0x80a16c0
>  setup_async_cb: ia 0x80a1638 type 3 hdl (nil) cb 0xa7b8bb50 ctx 0x80a1638
> dat_set_handle 0x80a1638 to 1
> dat_get_ia_handle from 1 to 0x80a1638
>  pd_alloc: pd_handle=0x80a1918
> dat_get_ia_handle from 1 to 0x80a1638
>  query_hca: ib0 AF_INET  192.168.0.15
>  query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
>  query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
> dat_get_ia_handle from 1 to 0x80a1638
>  cq_object_create: (0x80a1948,0x80a1a34)
> dapls_ib_cq_alloc: evd 0x80a1948 cqlen=32
> dapls_ib_cq_alloc: new_cq 0x80a1a58 cqlen=63
>  setup_async_cb: ia 0x80a1638 type 2 hdl 0x80a1948 cb 0xa7b8c174 ctx 0x80a1948
> dat_get_ia_handle from 1 to 0x80a1638
> dat_get_ia_handle from 1 to 0x80a1638
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  setup_listener Permission denied
>  listen(ia_ptr 0x80a1638 SID 1024 sp 0x80a7a00 conn 0x80a7a70 id 134904736)
>  listen(conn=0x80a7a70 cm_id=134904736)
> dat_get_ia_handle from 1 to 0x80a1638
>  mr_register: ia=0x80a1638, lmr=0x80a3708 va=0x80ae000 ln=266240 pv=0x0
>  mr_register: mr=0x80a37b8 h 1 pd 0x80a1918 ctx 0x809ecc0 lkey=0x60002400 rkey=0x60002400 priv=41000
> dat_get_ia_handle from 1 to 0x80a1638
>  mr_register: ia=0x80a1638, lmr=0x80a7ee8 va=0x80ef000 ln=528384 pv=0x0
>  mr_register: mr=0x80a7f98 h 2 pd 0x80a1918 ctx 0x809ecc0 lkey=0x60002500 rkey=0x60002500 priv=81000
> #---------------------------------------------------
> #    Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
> #---------------------------------------------------
> # Date       : Tue May  8 11:16:58 2007
> # Machine    : i686# System     : Linux
> # Release    : 2.6.18
> # Version    : #1 SMP Tue Nov 14 18:02:03 CET 2006
>
> #
> # Minimum message length in bytes:   0
> # Maximum message length in bytes:   16777216
> #
> # MPI_Datatype                   :   MPI_BYTE
> # MPI_Datatype for reductions    :   MPI_FLOAT
> # MPI_Op                         :   MPI_SUM
> #
> #
>
> # List of Benchmarks to run:
>
> # PingPong
> dat_get_ia_handle from 1 to 0x80a1638
>  query_hca: MAX msg 2147483648 dto 16384 iov 30 rdma i4,o4
>  qp_alloc: ia_ptr 0x80a1638 ep_ptr 0x81741f8 ep_ctx_ptr 0x81741f8
>  create_qp Address already in use
>
> -------------------------------------------------------------------------
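
The registry tokens printed at the start of the log can be mapped back
to a single dat.conf entry. The following is a rough sketch of that
mapping, not the DAT registry's actual parser: the sample line is
reconstructed from the logged tokens (the real file contents are not
shown in this thread), the class and function names are invented, and
the name platform_params for the final empty token is an assumption.

```python
import shlex
from dataclasses import dataclass

# Reconstructed from the tokens in the debug log; not copied from the real file.
SAMPLE_LINE = (
    'OpenIB-cma u1.2 nonthreadsafe default '
    '/home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so '
    'mv_dapl.1.2 "ib0 0" ""'
)


@dataclass
class DatEntry:
    ia_name: str
    api_kind: str          # leading letter of the API token, e.g. "u"
    api_version: str       # "major.minor"
    is_thread_safe: bool
    is_default: bool
    lib_path: str
    provider_id: str
    provider_version: str  # "major.minor"
    ia_params: str
    platform_params: str   # hypothetical name for the last token


def parse_dat_conf_line(line: str) -> DatEntry:
    """Split one registry line into the eight fields seen in the log."""
    tokens = shlex.split(line, comments=True)
    if len(tokens) != 8:
        raise ValueError(f"expected 8 fields, got {len(tokens)}")
    name, api, thread, default, lib, provider, ia_params, platform = tokens
    provider_id, major, minor = provider.rsplit(".", 2)
    return DatEntry(
        ia_name=name,
        api_kind=api[0],
        api_version=api[1:],
        is_thread_safe=(thread == "threadsafe"),
        is_default=(default == "default"),
        lib_path=lib,
        provider_id=provider_id,
        provider_version=f"{major}.{minor}",
        ia_params=ia_params,
        platform_params=platform,
    )


if __name__ == "__main__":
    print(parse_dat_conf_line(SAMPLE_LINE))
```

The resulting fields line up with the "DAT Registry: entry" dump in the
log (ia_name, api_version, is_thread_safe, is_default, lib_path,
provider_version, ia_params).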
>
> The job hangs at this point. From the output of another simple test
> program I assume that it hangs inside a receive operation. Of course,
> I have noticed the "Permission denied" messages, but I don't think that
> the problem is there. These messages seem to come from RDMA CM when
> things are set up, but execution continues from there and I have
> seen these messages on successful DAPL runs, too. I'm not very familiar
> with RDMA CM, though, so I don't know the cause of these messages.
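
One possible reading of the setup_listener messages, offered as a guess
from the log rather than anything confirmed in the DAPL or RDMA CM
source: the listener probes service IDs upward from some starting
value, and an unprivileged process is refused every port below 1024.
The log shows 24 "Permission denied" lines followed by SID 1024 on one
node, and 24 such lines plus one "Address already in use" followed by
SID 1025 on the other. The simulation below reproduces that pattern.
The starting port of 1000, the privileged-port boundary of 1024, and
the function name are all assumptions, and no real sockets are opened.

```python
import errno
import os

# Assumption: ports below this need elevated privileges (common Linux default).
FIRST_UNPRIVILEGED_PORT = 1024


def scan_for_listen_port(start, in_use, privileged=False, limit=65535):
    """Simulate probing ports upward until one can be bound.

    Returns the chosen port and the error strings a caller would have
    logged along the way.
    """
    messages = []
    for port in range(start, limit + 1):
        if port < FIRST_UNPRIVILEGED_PORT and not privileged:
            messages.append(os.strerror(errno.EACCES))
        elif port in in_use:
            messages.append(os.strerror(errno.EADDRINUSE))
        else:
            return port, messages
    raise RuntimeError("no free port found")


if __name__ == "__main__":
    # Hypothetical scenario: port 1024 already taken on the first node only.
    for node, busy in (("pd-04", {1024}), ("pd-05", set())):
        port, msgs = scan_for_listen_port(1000, busy)
        print(node, "listens on", port, "after", len(msgs), "failed attempts")
```

Under these assumptions the listener messages would be harmless, which
is consistent with their showing up on successful DAPL runs. The later
"create_qp Address already in use" line is not explained by this sketch.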
>
> That's a lot of information, I know, but it would be great if someone
> would have a look at it.
>
> Thanks in advance
> Boris
>
>
>
> Donald Kerr wrote:
>> I have not tried Open MPI uDAPL on Linux, nor do I have access to a Linux
>> box, so I am having a difficult time trying to find a way to help you
>> debug this issue.
>>
>> -DON
>>
>> Andreas Kuntze wrote:
>>
>>> On Linux you needn't initialise the dat registry. Your program prints:
>>> "provider 1: OpenIB-cma". I successfully tested Intel MPI and mvapich2
>>> with uDAPL.
>>>
>>> Andreas
>>>
>>> Donald Kerr wrote:
>>>
>>>
>>>> Andreas,
>>>>
>>>> I am going to guess at a minimum the interfaces are up and you can
>>>> ping them.  On Solaris there is an additional step required and that
>>>> is initializing the dat registry. If "/usr/sbin/datadm -v" does not
>>>> show some driver output then you would need to run
>>>> "/usr/sbin/datadm -a /usr/share/dat/SUNWudaplt.conf". I don't know
>>>> if there is an equivalent on Linux.
>>>>
>>>> Attached is a simple udapl program which will check if the interfaces
>>>> are available in the dat registry.
>>>>
>>>> -DON
>>>>
>>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users at open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>
>
>
> -- 
> |  _  RWTH | Boris Bierbaum
> |_|_`_     | Lehrstuhl fuer Betriebssysteme
>    | |_) _  | RWTH Aachen D-52056 Aachen
>      |_)(_` | Tel: +49-241-80-27805
>         ._) | Fax: +49-241-80-22339
> <config.log.gz>
> <ompi_info.out.gz>


-- 
Jeff Squyres
Cisco Systems
