[ofa-general] Multiports single HCA uDAPL program problem
Jie Cai
Jie.Cai at cs.anu.edu.au
Sun Feb 1 23:14:35 PST 2009
One more problem happened when trying to establish 1 connection per
rail, as illustrated in the graph.
        node0                     node1
rail0:  psp0 <----------------> ep0 (port 0 on hca)
rail1:  psp1 <----------------> ep1 (port 1 on hca)
rail0 is connected first, and its connection is always stable and correct.
However, rail1 sometimes connects properly and sometimes does not.
Following is the error message:
11836 Waiting for connect response
11836 Error unexpected conn event : DAT_CONNECTION_EVENT_NON_PEER_REJECTED
11836 Error connect_ep: DAT_ABORT
The program establishes the connection for both rails in exactly the same way.
What might be causing this?
Regards,
--
Jie Cai
Davis, Arlin R wrote:
> This looks like an ARP issue across your IPoIB interfaces.
>
> Please see section 6 of the uDAPL OFED BKM.
>
> http://www.openfabrics.org/downloads/dapl/documentation/uDAPL_ofed_testing_bkm.pdf
>
> 6. Multi IB port configuration, IPoIB arp reply issues
>
> When two interfaces are running, one interface may reply to an ARP
> request directed to the other interface on the system. The following
> configuration will cause the interfaces to ignore ARP requests that
> are not specifically for their IP address.
>
> Add the following lines to /etc/sysctl.conf
> net.ipv4.conf.all.arp_ignore=1
> net.ipv4.conf.ib0.arp_ignore=1
> net.ipv4.conf.ib1.arp_ignore=1
>
> or use sysctl:
> sysctl -w net.ipv4.conf.all.arp_ignore=1
> sysctl -w net.ipv4.conf.ib0.arp_ignore=1
> sysctl -w net.ipv4.conf.ib1.arp_ignore=1
>
> -arlin
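
To confirm that the arp_ignore settings quoted above are actually in effect, something like the following sketch could be used. It assumes the IPoIB interfaces are named ib0 and ib1, as in the reply. The function name check_arp_ignore and the CONF_ROOT variable are made up for illustration and are not part of OFED or uDAPL. The sketch only reads the standard /proc/sys/net/ipv4/conf/<interface>/arp_ignore files.

```shell
#!/bin/sh
# Sketch: report whether arp_ignore is set to 1 for each named interface.
# CONF_ROOT is a hypothetical override so the check can be pointed at a
# test directory; by default it reads the live kernel settings.
CONF_ROOT="${CONF_ROOT:-/proc/sys/net/ipv4/conf}"

check_arp_ignore() {
    rc=0
    for ifname in "$@"; do
        file="$CONF_ROOT/$ifname/arp_ignore"
        if [ ! -r "$file" ]; then
            # Interface does not exist (or is not readable) on this system.
            echo "$ifname: missing"
            rc=1
            continue
        fi
        value=$(cat "$file")
        if [ "$value" = "1" ]; then
            echo "$ifname: ok"
        else
            echo "$ifname: arp_ignore=$value (expected 1)"
            rc=1
        fi
    done
    # Non-zero if any interface is missing or not set to 1.
    return $rc
}
```

For the setup in this thread the call would be `check_arp_ignore all ib0 ib1`, run on both nodes. Note that lines added to /etc/sysctl.conf only take effect after `sysctl -p` or a reboot, so a mismatch here may simply mean the file has not been reloaded yet.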
>
>
>> -----Original Message-----
>> From: general-bounces at lists.openfabrics.org
>> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jie Cai
>> Sent: Thursday, January 29, 2009 10:53 PM
>> To: general at lists.openfabrics.org
>> Subject: [ofa-general] Multiports single HCA uDAPL program problem
>>
>> Hi All,
>>
>> I am kind of a noob at IB and uDAPL programming. Currently, I am
>> trying to write a multirail program that uses 2 ports on a single
>> Mellanox ConnectX HCA on both nodes.
>>
>> OFED1.3 has been installed on a SUSE 10.3 linux system.
>>
>> The current problem is that IB connections via uDAPL are very unstable,
>> and sometimes the connection can't be established.
>> The error message usually looks like this:
>>
>> 20350 Server waiting for connect request on port 45248
>> accept: ERR dev(0x61d0e0!=0x61d0e0) or port mismatch(1!=2)
>> 20350 Error dat_cr_accept: DAT_INTERNAL_ERROR
>> 20350 Error connect_ep: DAT_INTERNAL_ERROR
>>
>> Both ports are in the active state:
>> hca_id: mlx4_0
>>     fw_ver:          2.3.000
>>     node_guid:       0003:ba00:0100:702c
>>     sys_image_guid:  0003:ba00:0100:702f
>>     vendor_id:       0x02c9
>>     vendor_part_id:  25418
>>     hw_ver:          0xA0
>>     board_id:        SUN0070000001
>>     phys_port_cnt:   2
>>     port: 1
>>         state:       PORT_ACTIVE (4)
>>         max_mtu:     2048 (4)
>>         active_mtu:  2048 (4)
>>         sm_lid:      10
>>         port_lid:    8
>>         port_lmc:    0x00
>>
>>     port: 2
>>         state:       PORT_ACTIVE (4)
>>         max_mtu:     2048 (4)
>>         active_mtu:  2048 (4)
>>         sm_lid:      10
>>         port_lid:    9
>>         port_lmc:    0x00
>>
>>
>> I haven't done any specific configuration for multi-port. I assumed that
>> OFED1.3 would handle it automatically.
>>
>> Would anyone please help me with this?
>>
>> Regards,
>> Jie
>>
>> --
>> Jie Cai
>>
>>
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general
>>