[ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
Tang, Changqing
changquing.tang at hp.com
Fri Jul 6 12:00:30 PDT 2007
Sean:
I have 6 nodes with two IB cards on each node. If I configure
the first card on all nodes as one subnet and the second card on all nodes
as another subnet, and set arp_ignore=2, jobs on either the first or the
second subnet work fine.
But when I configure all 12 cards into a single subnet, jobs on
all first cards work fine, but jobs on all second cards hang.
Here is one node IP info:
ib0       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:172.200.0.5  Bcast:172.200.0.255  Mask:255.255.255.0
          inet6 addr: fe80::219:bbff:fff7:ace5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:12375 errors:0 dropped:0 overruns:0 frame:0
          TX packets:155 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:1293846 (1.2 MiB)  TX bytes:16008 (15.6 KiB)

ib1       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:172.200.0.11  Bcast:172.200.0.255  Mask:255.255.255.0
          inet6 addr: fe80::219:bbff:fff7:6ba9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:12299 errors:0 dropped:0 overruns:0 frame:0
          TX packets:155 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:1280105 (1.2 MiB)  TX bytes:25117 (24.5 KiB)
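
For reference, a quick way to confirm that the ib0 and ib1 addresses listed above fall in the same IP network is sketched below. This is only an illustration using Python's standard ipaddress module; the helper name same_ip_subnet is made up for this note, and the check by itself does not establish why the jobs on the second cards hang.

```python
import ipaddress


def same_ip_subnet(addr_a, addr_b, netmask):
    """Return True if both addresses fall in the same network for the given mask.

    Hypothetical helper for illustration; not part of OFED or uDAPL.
    """
    net_a = ipaddress.ip_interface(f"{addr_a}/{netmask}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{netmask}").network
    return net_a == net_b


# Addresses taken from the ib0 and ib1 entries on this node.
print(same_ip_subnet("172.200.0.5", "172.200.0.11", "255.255.255.0"))
```

With the addresses from this node the two interfaces share the network 172.200.0.0/24, which is the single-subnet configuration described above.
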
Do you have any idea what's wrong? Thanks.
--CQ
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> Tang, Changqing
> Sent: Friday, July 06, 2007 1:08 PM
> To: Sean Hefty; Arlin Davis
> Cc: Vladimir Sokolovsky; OpenFabrics General
> Subject: RE: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
>
>
> Sean:
> Thanks, I think this solves our problem. Currently the two
> cards are on different subnets. Code on either subnet is
> working reliably. I have not tried putting all cards on the
> same subnet.
>
> Do you recommend configuring a single subnet or two subnets?
>
>
> --CQ
>
> > -----Original Message-----
> > From: Sean Hefty [mailto:sean.hefty at intel.com]
> > Sent: Friday, July 06, 2007 11:48 AM
> > To: Tang, Changqing; Arlin Davis
> > Cc: Vladimir Sokolovsky; OpenFabrics General
> > Subject: RE: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
> >
> > >Even though I force all ranks to use only the first card (ib0), it
> > >works for a while and then fails with NON_PEER_REJECTED when one rank
> > >tries to connect to another rank (dat_connect() and dat_evd_wait()).
> > >(I run a simple MPI job in an infinite loop; it fails after hundreds
> > >of runs.)
> >
> > This sounds like it could be a race condition as a result of running
> > the test in a loop. If the client starts before the server is
> > listening, it will receive this sort of reject event.
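
The usual way to tolerate this kind of startup race is for the client to retry the connect a few times before giving up. The sketch below shows the pattern with plain TCP sockets in Python; it is only an analogy, not uDAPL or librdmacm code, and the function name connect_with_retry and its attempts/delay parameters are assumptions made for illustration. A uDAPL client would apply the same idea around dat_connect() and the reject event it gets back.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay=0.2):
    """Connect to (host, port), retrying while the peer is not yet listening.

    Hypothetical helper: plain TCP stands in for the real transport here.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=2.0)
        except ConnectionRefusedError as exc:
            # The server is not listening yet; back off a little longer
            # on each attempt, then try again.
            last_error = exc
            time.sleep(delay * (attempt + 1))
    raise last_error
```

If every attempt is refused, the last error is re-raised so the caller can still see the failure.
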
> >
> > >It works on the first card (ib0) but fails on the second card (ib1).
> >
> > Please take a look at the following thread:
> >
> > http://lists.openfabrics.org/pipermail/general/2007-May/036559.html
> >
> > In particular, see Steve's message about this:
> >
> > http://lists.openfabrics.org/pipermail/general/2007-May/036571.html
> >
> > and let me know if his suggestion fixes your problem.
> >
> > I will update the librdmacm documentation with this information as
> > well.
> >
> > - Sean
> >