[libfabric-users] intel mpi with libfabric

Mohammed Shaheen m_shaheen1984 at yahoo.com
Mon Dec 3 04:48:20 PST 2018


PS: When I set FI_LOG_LEVEL to debug, the program runs successfully, but a lot of error messages are printed; below is an excerpt. Are these bad?
libfabric:verbs:core:ofi_check_ep_type():628<info> Unsupported endpoint type
libfabric:verbs:core:ofi_check_ep_type():629<info> Supported: FI_EP_DGRAM
libfabric:verbs:core:ofi_check_ep_type():629<info> Requested: FI_EP_MSG
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: No such device(19)
[previous line repeated 16 more times]
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: No such device(19)
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: No such device(19)
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: No such device(19)
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: No such device(19)
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)

libfabric:ofi_rxm:ep_ctrl:rxm_eq_sread():575<warn> fi_eq_readerr: err: 111, prov_err: Unknown error -28 (-28)
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:verbs:domain:fi_ibv_ep_close():169<info> EP 0x1d72e60 was closed
libfabric:ofi_rxm:ep_ctrl:rxm_eq_sread():575<warn> fi_eq_readerr: err: 111, prov_err: Unknown error -28 (-28)
libfabric:ofi_rxm:ep_ctrl:rxm_eq_sread():575<warn> fi_eq_readerr: err: 111, prov_err: Unknown error -28 (-28)
libfabric:ofi_rxm:ep_ctrl:rxm_eq_sread():575<warn> fi_eq_readerr: err: 111, prov_err: Unknown error -28 (-28)
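
For reference, a minimal way to capture and sift such a log (a sketch, assuming a bash shell; fi_debug.log is a hypothetical file name). The <info> lines above are typically just libfabric probing providers and endpoint types during startup, so filtering on <warn> and <error> narrows the output to messages that may actually matter:

    export FI_LOG_LEVEL=debug
    # libfabric writes its log records to stderr
    mpirun -np 8 -perhost 4 --hostfile hosts ./test.e 2> fi_debug.log
    # first pass: keep only the warning/error records
    grep -E '<warn>|<error>' fi_debug.log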

Regards,
Mohammed



    On Monday, December 3, 2018, 13:44:36 CET, Mohammed Shaheen via Libfabric-users <libfabric-users at lists.openfabrics.org> wrote:
 
  Hi Dmitry,
The problem appeared only once; it did not appear again.
Since you brought it up, do I have to set the FI_PROVIDER variable? Is it not enough to source the mpivars script? If I source the mpivars script and set the debug level to 3, I get the right provider, verbs;ofi_rxm, as seen below.


lce62:~ # mpirun -np 8 -perhost 4 --hostfile hosts ./test.e
[0] MPI startup(): libfabric version: 1.7.0a1-impi
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
Hello world from process 7 of 8
Hello world from process 3 of 8
Hello world from process 5 of 8
Hello world from process 1 of 8
Hello world from process 4 of 8
Hello world from process 6 of 8
Hello world from process 2 of 8
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       11242    lce63      {0,1,2,3,4,5,24,25,26,27,28,29}
[0] MPI startup(): 1       11243    lce63      {6,7,8,9,10,11,30,31,32,33,34,35}
[0] MPI startup(): 2       11244    lce63      {12,13,14,15,16,17,36,37,38,39,40,41}
[0] MPI startup(): 3       11245    lce63      {18,19,20,21,22,23,42,43,44,45,46,47}
[0] MPI startup(): 4       29238    lce62      {0,1,2,3,4,5,24,25,26,27,28,29}
[0] MPI startup(): 5       29239    lce62      {6,7,8,9,10,11,30,31,32,33,34,35}
[0] MPI startup(): 6       29240    lce62      {12,13,14,15,16,17,36,37,38,39,40,41}
[0] MPI startup(): 7       29241    lce62      {18,19,20,21,22,23,42,43,44,45,46,47}
Hello world from process 0 of 8
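
As a cross-check (a sketch, assuming the fi_info utility that ships with standalone libfabric is available on the system), the available providers can also be queried outside of MPI:

    # list every provider this libfabric build can load
    fi_info -l
    # show what the verbs provider itself reports
    fi_info -p verbs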


Regards,
Mohammed Shaheen


    On Monday, December 3, 2018, 13:37:20 CET, Gladkov, Dmitry <dmitry.gladkov at intel.com> wrote:
 
Hi Mohammed,
 
Could we ask you to submit a ticket on IDZ: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology?
 
Did you use the verbs provider in the test? Please set FI_PROVIDER=verbs to ensure it. Also, if possible, please send the FI_LOG_LEVEL=debug log; it can help with further investigation.
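
Concretely, that could look like the following (a sketch, assuming a bash shell and the mpirun line from the earlier mail; fi_debug.log is a hypothetical capture file):

    export FI_PROVIDER=verbs    # restrict libfabric to the verbs provider
    export FI_LOG_LEVEL=debug   # verbose libfabric logging, written to stderr
    mpirun -np 8 -perhost 4 --hostfile hosts ./test.e 2> fi_debug.log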
 
--
 
Dmitry
 
From: Mohammed Shaheen [mailto:m_shaheen1984 at yahoo.com]
Sent: Thursday, November 29, 2018 2:34 PM
To: libfabric-users at lists.openfabrics.org; ofiwg at lists.openfabrics.org; Ilango, Arun <arun.ilango at intel.com>; Gladkov, Dmitry <dmitry.gladkov at intel.com>; Hefty, Sean <sean.hefty at intel.com>
Subject: Re: [libfabric-users] intel mpi with libfabric
 
Hi,
 
Now I am trying to use Intel MPI U1 with the libfabric that ships with it. I get the following error:
 
lce62:~ # mpirun -np 8 -perhost 4 --hostfile hosts ./test.e
Hello world from process 6 of 8
Hello world from process 4 of 8
Hello world from process 7 of 8
Hello world from process 5 of 8
Hello world from process 0 of 8
Hello world from process 1 of 8
Hello world from process 2 of 8
Hello world from process 3 of 8
Abort(809595151) on node 4 (rank 4 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
PMPI_Finalize(356).............: MPI_Finalize failed
PMPI_Finalize(266).............:
MPID_Finalize(959).............:
MPIDI_NM_mpi_init_hook(1299)...:
MPIR_Reduce_intra_binomial(157):
MPIC_Send(149).................:
MPID_Send(256).................:
MPIDI_OFI_send_normal(429).....:
MPIDI_OFI_send_handler(733)....: OFI tagged send failed (ofi_impl.h:733:MPIDI_OFI_send_handler:Invalid argument)
[cli_4]: readline failed
 
However, when I set I_MPI_DEBUG to any value, the error disappears and the run completes successfully. Any thoughts?
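
For the record, the two runs being compared would look roughly like this (a sketch, assuming a bash shell; per the report above, only the first form fails):

    mpirun -np 8 -perhost 4 --hostfile hosts ./test.e                # fails in MPI_Finalize
    I_MPI_DEBUG=3 mpirun -np 8 -perhost 4 --hostfile hosts ./test.e  # reportedly succeeds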
 
Regards
 
Mohammed Shaheen
 
On Monday, November 26, 2018, 18:10:58 CET, Hefty, Sean <sean.hefty at intel.com> wrote:
 
> I have tried with the development version from the master branch. I
> get the following errors while building the library (make)
> prov/verbs/src/verbs_ep.c: In function
> 'fi_ibv_msg_xrc_ep_atomic_write':
> prov/verbs/src/verbs_ep.c:1770: error: unknown field 'qp_type'
> specified in initializer

These errors are coming from having an older verbs.h file.  The qp_type field was added as part of XRC support, about 5 years ago.  I think this maps to v13.
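
One quick way to check the installed header (a sketch; the path assumes the usual libibverbs install location, and remote_srqn is the field inside the XRC qp_type union of struct ibv_send_wr):

    # no match here suggests verbs.h predates XRC support
    grep -n 'remote_srqn' /usr/include/infiniband/verbs.h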

> prov/verbs/src/verbs_ep.c:1770: error: unknown field 'qp_type'
> specified in initializer

This is the same error, which suggests that it is still picking up an old verbs.h file. Maybe Mellanox ships a different verbs.h file than what's upstream, but I doubt it would remove fields.
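
To see which verbs.h a build actually resolves, something like this can help (a sketch, assuming gcc; -H prints each header as it is included):

    # the first line printed is the verbs.h the compiler picks up
    echo '#include <infiniband/verbs.h>' | gcc -E -H -x c - -o /dev/null 2>&1 | head -n 1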
 


- Sean
 
