[ofa-general] Multiports single HCA uDAPL program problem

Jie Cai Jie.Cai at cs.anu.edu.au
Sun Feb 1 23:14:35 PST 2009


One more problem came up when trying to establish one connection per
rail, as illustrated in the diagram below.

          node0                    node1
rail0: psp0 <----------------> ep0         (port 0 on hca)
rail1: psp1 <----------------> ep1         (port 1 on hca)

rail0 is connected first, and its connection is always stable and correct.
However, rail1 sometimes connects properly and sometimes does not.
The error message is as follows:

11836 Waiting for connect response
11836 Error unexpected conn event : DAT_CONNECTION_EVENT_NON_PEER_REJECTED
11836 Error connect_ep: DAT_ABORT

The program establishes the connection for both rails in exactly the same way.
What might cause this?
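
For reference, here is a minimal sketch for checking whether the
arp_ignore settings suggested in the quoted reply below are actually in
effect on a node. The interface names ib0 and ib1 come from that reply;
the function name check_arp_ignore and the ARP_CONF_DIR override are
hypothetical and only there for illustration. It has not been confirmed
that ARP behaviour explains the rail1 rejections.

```shell
#!/bin/sh
# Report the arp_ignore value for each interface name given as an argument.
# ARP_CONF_DIR is a hypothetical override so the function can be pointed at
# a test directory; by default it reads the kernel's procfs entries.
check_arp_ignore() {
    conf_dir="${ARP_CONF_DIR:-/proc/sys/net/ipv4/conf}"
    status=0
    for ifname in "$@"; do
        file="$conf_dir/$ifname/arp_ignore"
        if [ ! -r "$file" ]; then
            echo "$ifname: not found"
            status=1
            continue
        fi
        value=$(cat "$file")
        if [ "$value" = "1" ]; then
            echo "$ifname: arp_ignore=1 (ok)"
        else
            echo "$ifname: arp_ignore=$value (expected 1)"
            status=1
        fi
    done
    # Non-zero if any interface is missing or not set to 1.
    return "$status"
}
```

It would be run on both nodes as: check_arp_ignore all ib0 ib1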

Regards,

-- 
Jie Cai




Davis, Arlin R wrote:
> This looks like an ARP issue across your IPoIB interfaces. 
>
> Please see section 6 of the uDAPL OFED BKM.
>
> http://www.openfabrics.org/downloads/dapl/documentation/uDAPL_ofed_testing_bkm.pdf
>  
> 6. Multi IB port configuration, IPoIB arp reply issues
>
> When two interfaces are running, one interface may reply to an ARP
> request directed to the other interface on the system. The following
> configuration will cause the interfaces to ignore ARP requests that
> are not specifically for their own IP address.
>
> Add the following lines to /etc/sysctl.conf
> net.ipv4.conf.all.arp_ignore=1
> net.ipv4.conf.ib0.arp_ignore=1
> net.ipv4.conf.ib1.arp_ignore=1
>
> or use sysctl:
> sysctl -w net.ipv4.conf.all.arp_ignore=1
> sysctl -w net.ipv4.conf.ib0.arp_ignore=1
> sysctl -w net.ipv4.conf.ib1.arp_ignore=1
>
> -arlin
>
>   
>> -----Original Message-----
>> From: general-bounces at lists.openfabrics.org 
>> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jie Cai
>> Sent: Thursday, January 29, 2009 10:53 PM
>> To: general at lists.openfabrics.org
>> Subject: [ofa-general] Multiports single HCA uDAPL program problem
>>
>> Hi All,
>>
>> I am kind of a noob with IB and uDAPL programming. Currently, I am
>> trying to write a multirail program that uses 2 ports on a single
>> Mellanox ConnectX HCA on both nodes.
>>
>> OFED 1.3 has been installed on a SUSE 10.3 Linux system.
>>
>> The current problem is that IB connections via uDAPL are very unstable,
>> and sometimes the connection can't be established.
>> The error message usually looks like this:
>>
>> 20350 Server waiting for connect request on port 45248
>> accept: ERR dev(0x61d0e0!=0x61d0e0) or port mismatch(1!=2)
>> 20350 Error dat_cr_accept: DAT_INTERNAL_ERROR
>> 20350 Error connect_ep: DAT_INTERNAL_ERROR
>>
>> Both ports are in the active state:
>> hca_id:    mlx4_0
>>    fw_ver:                2.3.000
>>    node_guid:            0003:ba00:0100:702c
>>    sys_image_guid:            0003:ba00:0100:702f
>>    vendor_id:            0x02c9
>>    vendor_part_id:            25418
>>    hw_ver:                0xA0
>>    board_id:            SUN0070000001
>>    phys_port_cnt:            2
>>        port:    1
>>            state:            PORT_ACTIVE (4)
>>            max_mtu:        2048 (4)
>>            active_mtu:        2048 (4)
>>            sm_lid:            10
>>            port_lid:        8
>>            port_lmc:        0x00
>>
>>        port:    2
>>            state:            PORT_ACTIVE (4)
>>            max_mtu:        2048 (4)
>>            active_mtu:        2048 (4)
>>            sm_lid:            10
>>            port_lid:        9
>>            port_lmc:        0x00
>>
>>
>> I haven't done any specific configuration for multi-port. I assume that
>> OFED1.3 can do it automatically.
>>
>> Could anyone please help me with this?
>>
>> Regards,
>> Jie
>>
>> --
>> Jie Cai
>>
>>
>>