[ewg] Update from September OpenFabrics Interoperability Event at UNH-IOL
Bob Noseworthy
ren at iol.unh.edu
Tue Oct 21 09:40:16 PDT 2008
Greetings EWG members,
A bug for the observed IPoIB issue was logged last Friday and updated
yesterday to confirm that RC3 still exhibits the issue. It is logged
as #1287:
https://bugs.openfabrics.org/show_bug.cgi?id=1287
Further issues/observations from the recent OFA Interoperability Logo
Group's September Interoperability Event are at the end of this email.
Summary of the reported IPoIB issue:
When IPoIB runs in datagram mode and no ARP entry exists for the
destination, the first IP frame of 8K or larger is always lost
(observed with ping), regardless of the timeout setting (tested with
timeouts as high as 15s).
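The reproduction steps described above can be scripted. The following is a minimal, hypothetical Python sketch, not the procedure used at UNH-IOL; it assumes iproute2's `ip` and iputils' `ping` are installed, that it runs as root, and that the caller supplies the IPoIB interface name (e.g. "ib0") and destination address. It deletes any ARP entry for the destination, then sends exactly one ping with an 8192-byte payload, which yields an IP datagram just over 8K.

```python
"""Hypothetical helper to reproduce the IPoIB first-frame-loss report.

Assumptions (not from the report): iproute2 `ip` and iputils `ping` are
installed, the script runs as root, and the caller names the interface.
"""
import subprocess
from pathlib import Path


def ipoib_mode(iface: str) -> str:
    # IPoIB reports "datagram" or "connected" in this sysfs attribute.
    return Path(f"/sys/class/net/{iface}/mode").read_text().strip()


def build_commands(dest: str, iface: str, payload: int = 8192, timeout_s: int = 15):
    """Return (flush, ping): drop the ARP entry, then send one large ping."""
    flush = ["ip", "neigh", "del", dest, "dev", iface]
    # 8192-byte ICMP payload -> IP datagram slightly larger than 8K.
    ping = ["ping", "-c", "1", "-W", str(timeout_s), "-s", str(payload), dest]
    return flush, ping


def first_ping_lost(dest: str, iface: str, payload: int = 8192, timeout_s: int = 15) -> bool:
    flush, ping = build_commands(dest, iface, payload, timeout_s)
    # `ip neigh del` fails if no entry exists; that is fine, so ignore it.
    subprocess.run(flush, check=False, capture_output=True)
    # ping exits non-zero when no reply arrives within the timeout.
    return subprocess.run(ping, capture_output=True).returncode != 0


if __name__ == "__main__":
    import sys

    dest, iface = sys.argv[1], sys.argv[2]
    print("IPoIB mode:", ipoib_mode(iface))
    print("first large ping lost:", first_ping_lost(dest, iface))
```

Running it repeatedly in datagram mode and then in connected mode would show whether the loss is specific to datagram mode, as the report states.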
The following is a short summary of various updates from the September
OpenFabrics Interoperability Event. For confidentiality reasons, many
details have been omitted. Per the IWG's request of Oct 14, this
information is being shared with the EWG.
==================
Below are rough notes from our testers, principally Nick Wood and Mike
Hagen.
IB update:
1. An SDP issue was observed once and could not be reproduced. It is
suspected to have been caused by starting the tests too soon after
netserver was started, with all three SDP tests running simultaneously.
On retest, the tests were not run simultaneously and no issues were seen.
2. An SRP issue was observed once and could not be reproduced. A
vendor's SRP target became unresponsive when srp_sg_tablesize was
increased to 255. Subsequent testing did not reproduce this behavior,
but it is still being investigated.
2a. A vendor's HCA performed slowly on SRP transfers. This was traced
to the default srp_sg_tablesize of 16, which had to be increased to 131
for reasonable performance. Reminder: performance is outside the scope
of the Logo program.
Tziporet: perhaps this default value could be increased as recommended,
unless there is a reason 16 is preferred.
3. There is a link issue between two vendors' HCA cards. The fix that
was introduced allows the link indication light to come up; however,
ibdiagnet never completes (it hangs at the IPoIB subnets check) and had
to be killed. ibdiagnet also reports the following error:
-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-E- Could not get PM info:
"pmGetPortCounters 0xffff 1" failed 4 consecutive times.
-E- Could not get PM info:
"pmGetPortCounters 0xffff 1" failed 4 consecutive times.
-I- No illegal PM counters values were found
This happens with both VendorA cards when linked to a VendorB card of
any speed *without* an SM running. If an SM is running and the fix is in
place on the machines housing the VendorA cards, then everything works
flawlessly when linked with a VendorB card of any speed.
With the fix in place and an SM running, removing the cable from the
VendorA card puts that card into a bad state: the SM does not activate
the newly established link. This happened with VendorA cards connected
to any VendorB card. OpenSM also reports the following error on screen:
"OpenSM: SM port is down". Reestablishing the connection that was in
place when the opensm instance was started restores the active state.
One final piece of information I have been able to glean: it does not
appear to matter whether you restore the original connection that
opensm was started on. The only connection that brings the card back to
an active state is a link to a QDR HCA, even if that connection was not
the original. If you then attempt to restore the original connection,
the active state is not restored.
Currently this issue is presumed to be principally a vendor matter, but
if evidence points to additional issues with ibdiagnet, or other OFED
matters, then bugs will be filed.
4. Similar to the above issue, two vendors' HCAs that should link at DDR
when directly connected were actually linking at SDR speeds, regardless
of the cable used. This is a known issue; however, it appears to be a
failure of the Link Init test procedure, since the highest common speed
is not achieved.
5. A vendor discovered an issue with ibdiagnet (unrelated to issue 3
above) and submitted bugs.
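For items 2 and 2a, the srp_sg_tablesize value is an ib_srp module parameter. The following hypothetical Python sketch reads the currently loaded value and renders a modprobe options line. It assumes that ib_srp exposes the parameter read-only under /sys/module and that a file such as /etc/modprobe.d/ib_srp.conf (the file name is an assumption) is honored when the module is loaded. The value 131 is the one reported above, not a general recommendation.

```python
from pathlib import Path

# Assumed location of the loaded module's parameter (read-only in sysfs).
PARAM = Path("/sys/module/ib_srp/parameters/srp_sg_tablesize")


def current_sg_tablesize(param: Path = PARAM):
    """Return the loaded ib_srp value, or None if it cannot be read."""
    try:
        return int(param.read_text().strip())
    except (OSError, ValueError):
        # Module not loaded, parameter not exposed, or unexpected content.
        return None


def modprobe_options_line(value: int = 131) -> str:
    """Render an options line for a modprobe.d file (file name is up to you)."""
    if value < 1:
        raise ValueError("srp_sg_tablesize must be positive")
    return f"options ib_srp srp_sg_tablesize={value}"


if __name__ == "__main__":
    print("loaded value:", current_sg_tablesize())
    print(modprobe_options_line())
```

Because the parameter is read when the module loads, a new value would typically take effect only after ib_srp is reloaded. Note also item 2 above: 255 was the value at which a target was once seen to hang.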
==================
iWARP update:
1. "dapltest -T P" does not work between two cards. Each vendor has
implemented a different peer2peer protocol that ensures the client
performs a transfer before the server, to work around the requirement in
the iWARP standard that the client must send the first data or the
connection must be torn down.
2. The section of the IWG test suite covering dapl must be updated to
include at least some reference to /etc/dat.conf, which must be
configured in order to use any dapl-based application, including many
MPIs and dapltest. (This was being addressed by Arlin Davis.)
3. dapl2.0 and dapltest2.0 do not work with iWARP devices. Starting from
the base OFED 1.4 install, dapl2.0-utils must be uninstalled and
compat-dapl must be installed from the OFED website.
4. Due to the dapl problems, Intel MPI works in single vendor
environments but will not work in multi-vendor environments.
5. The default OpenMPI installed with OFED 1.4 is version 1.2.7; iWARP
support is not officially added until OpenMPI 1.3.
6. Loopback functionality is still not supported by all vendors (this is
relevant to OFED feature enhancement #1275,
<https://bugs.openfabrics.org/show_bug.cgi?id=1275>).
7. Dynamic link support was not observed with all vendors when using Intel MPI.
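For item 2, it may help test-suite readers to see the shape of a dat.conf entry. The following is a minimal Python sketch that parses dat.conf text, assuming the commonly documented layout of eight whitespace-separated fields (IA name, API version, thread safety, default flag, library, provider version, IA parameters, platform parameters) with quoted fields for parameters containing spaces. The sample entry names and the "eth2" interface are hypothetical; consult the installed dapl or compat-dapl documentation for the authoritative format.

```python
import shlex
from dataclasses import dataclass


@dataclass
class DatEntry:
    # Field names follow the commonly documented dat.conf layout (assumption).
    ia_name: str
    api_version: str
    thread_safety: str
    default: str
    lib_path: str
    provider_version: str
    ia_params: str
    platform_params: str


def parse_dat_conf(text: str):
    """Parse dat.conf text into entries, skipping blank and comment lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # shlex keeps quoted fields such as "ib0 0" together and keeps "" as ''.
        fields = shlex.split(line)
        if len(fields) != 8:
            raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
        entries.append(DatEntry(*fields))
    return entries


# Hypothetical entries: one IB adapter and one iWARP adapter (names invented).
SAMPLE = '''
# ia_name api thread default library provider ia_params platform_params
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
'''

if __name__ == "__main__":
    for entry in parse_dat_conf(SAMPLE):
        print(entry.ia_name, entry.api_version, repr(entry.ia_params))
```

Both sample entries use the 1.2 API, consistent with item 3's note that compat-dapl rather than dapl2.0 was needed for iWARP devices at the time.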
==================
Testing is ongoing with RC3 and future 1.4 RCs on a best-effort basis
until GA, at which time the Logo Event will be held for those
participating.
If you have additional questions about these comments, the
Interoperability Events, Logo Events, or the OFA Interoperability Test
Plan, please feel free to contact us here at UNH-IOL. Our OFA
Interoperability Logo Group team can be reached at ofalab at iol.unh.edu.
The testplan, logo list and past logo reports can be reviewed at
http://www.iol.unh.edu/services/testing/ofa/
Best Regards,
- Bob Noseworthy
Chief Engineer / Technical Sherpa
+1-909-891-0090 {unified phone number for office, cell, etc}
+1-603-862-0090 {IOL Main number-associate this with any shipments}
UNH-IOL
Rupert Dance wrote:
> I have sent another reminder to UNH IOL to get this logged. I will continue
> to follow up on this.
>
> Thanks
>
> Rupert
>
> -----Original Message-----
> From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il]
> Sent: Sunday, October 19, 2008 8:48 AM
> To: Rupert Dance
> Cc: EWG
> Subject: Have you opened bugs to OFED 1.4
>
> I mean the bugs you explained in the last OFED meeting.
>
> Thanks
> Tziporet
>
>