[ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes

Tang, Changqing changquing.tang at hp.com
Fri Jul 6 12:00:30 PDT 2007


Sean:
	I have 6 nodes with two IB cards in each node. If I configure
the first card on all nodes as one subnet and the second card on all
nodes as another subnet, plus set arp_ignore=2, then jobs on either the
first subnet or the second subnet work fine.

	But when I configure all 12 cards into a single subnet, jobs on
all the first cards work fine, while jobs on all the second cards hang.
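For reference, the arp_ignore setting mentioned above can be applied with sysctl along these lines (a sketch; the arp_announce=2 line is a setting commonly paired with arp_ignore on multi-homed hosts, included here as an assumption rather than something stated in this message):

```shell
# arp_ignore=2: answer ARP requests only for addresses configured on the
# receiving interface, and only when the sender is on that interface's
# subnet (as set in the two-subnet configuration described above).
sysctl -w net.ipv4.conf.all.arp_ignore=2

# arp_announce=2 is commonly paired with arp_ignore on multi-homed IPoIB
# hosts -- an assumption here, not confirmed by this thread.
sysctl -w net.ipv4.conf.all.arp_announce=2
```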

	Here is one node IP info:

ib0       Link encap:InfiniBand  HWaddr
80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:172.200.0.5  Bcast:172.200.0.255  Mask:255.255.255.0
          inet6 addr: fe80::219:bbff:fff7:ace5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:12375 errors:0 dropped:0 overruns:0 frame:0
          TX packets:155 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128 
          RX bytes:1293846 (1.2 MiB)  TX bytes:16008 (15.6 KiB)

ib1       Link encap:InfiniBand  HWaddr
80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:172.200.0.11  Bcast:172.200.0.255  Mask:255.255.255.0
          inet6 addr: fe80::219:bbff:fff7:6ba9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:12299 errors:0 dropped:0 overruns:0 frame:0
          TX packets:155 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128 
          RX bytes:1280105 (1.2 MiB)  TX bytes:25117 (24.5 KiB)

	Do you have any idea what's wrong?  Thanks.

--CQ



> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of 
> Tang, Changqing
> Sent: Friday, July 06, 2007 1:08 PM
> To: Sean Hefty; Arlin Davis
> Cc: Vladimir Sokolovsky; OpenFabrics General
> Subject: RE: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
> 
> 
> Sean:
> 	Thanks, I think this solves our problem. Currently the two 
> cards are on different subnets. Code on either subnet is 
> working reliably. I have not tried putting all cards on the 
> same subnet.
> 
> 	Do you recommend configuring a single subnet or two subnets?
> 
> 
> --CQ 
> 
> > -----Original Message-----
> > From: Sean Hefty [mailto:sean.hefty at intel.com]
> > Sent: Friday, July 06, 2007 11:48 AM
> > To: Tang, Changqing; Arlin Davis
> > Cc: Vladimir Sokolovsky; OpenFabrics General
> > Subject: RE: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
> > 
> > >Even though I force all ranks to use only the first card (ib0), it
> > >works for a while and then fails with NON_PEER_REJECTED when one
> > >rank tries to connect to another rank (dat_connect() and
> > >dat_evd_wait()). (I run a simple MPI job in an infinite loop; it
> > >fails after hundreds of runs.)
> > 
> > This sounds like it could be a race condition as a result of running
> > the test in a loop.  If the client starts before the server is
> > listening, it will receive this sort of reject event.
> > 
> > >It works on the first card (ib0) but fails on the second card (ib1).
> > 
> > Please take a look at the following thread:
> > 
> > http://lists.openfabrics.org/pipermail/general/2007-May/036559.html
> > 
> > In particular, see Steve's message about this:
> > 
> > http://lists.openfabrics.org/pipermail/general/2007-May/036571.html
> > 
> > and let me know if his suggestion fixes your problem.
> > 
> > I will update the librdmacm documentation with this information as 
> > well.
> > 
> > - Sean
> > 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> 