[ofw] ping on WinOF

Wed Jul 1 09:27:19 PDT 2009

We have seen issues with IPoIB in datagram mode, particularly when you use a
large packet size (8192 bytes and greater). This was reported as OFA Bugzilla
Bug #1287 <https://bugs.openfabrics.org/show_bug.cgi?id=1287>. Yosef Etigin
looked into this and suggested a workaround that did affect the first packet
drop. Here is his comment:

It is a network stack limitation and not related to ipoib in particular.

There's a limit (default = 3) on the number of pending skb's before a
neighbour is resolved. You can increase it with sysctl
net.ipv4.neigh.ib0.unres_qlen.

Obviously, same thing happens with Ethernet interface.
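
To put rough numbers on the limit described above, here is a minimal Python
sketch that counts how many IP fragments a single ping produces and compares
that with the number of packets the stack will hold while the neighbour is
unresolved. The 2044-byte MTU for IPoIB datagram mode, the 20-byte IPv4 header
without options, and the function names are assumptions for illustration and
are not taken from the bug report. The sketch also ignores which of the queued
packets the kernel actually discards.

```python
import math

IPV4_HEADER = 20   # bytes, assuming no IP options
ICMP_HEADER = 8    # bytes, ICMP echo header


def fragments_for_ping(data_size, mtu):
    """Number of IP fragments needed for one ICMP echo request."""
    # Every fragment except the last carries a payload that is a multiple of 8.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil((data_size + ICMP_HEADER) / per_fragment)


def fits_unresolved_queue(data_size, mtu, unres_qlen=3):
    """True if all fragments of one ping can wait for neighbour resolution."""
    return fragments_for_ping(data_size, mtu) <= unres_qlen


if __name__ == "__main__":
    # Assumed IPoIB datagram-mode MTU of 2044 bytes.
    for size in (32, 4096, 8192):
        n = fragments_for_ping(size, mtu=2044)
        print(size, n, fits_unresolved_queue(size, mtu=2044))
```

Under these assumptions an 8192-byte ping needs 5 fragments, which is more
than the default limit of 3, so the first ping to an unresolved neighbour
cannot be delivered whole.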

When testing at UNH-IOL for the Logo program, this is what we did:

After working with Sasha Khapyorsky on this issue we have a working fix. To
further explain the situation, the large packet sizes we are using are
overflowing the buffers so there is no room to append the arp request on to
the beginning of the cmd. This results in a dropped packet because the
system doesn't know how to get to the destination due to an empty arp table.
The fix is to increase the buffer size via:

sysctl net.ipv4.neigh.ib0.unres_qlen=17 # default is the value 3
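
For anyone who wants to confirm the value actually in effect, the sysctl key
above corresponds to a file under /proc/sys on Linux. A small Python sketch
that reads it follows. The helper name and the proc_root parameter are made up
for illustration, and it assumes the interface is named ib0 as in the command
above.

```python
from pathlib import Path


def read_unres_qlen(iface="ib0", proc_root="/proc/sys"):
    """Return the unres_qlen setting for iface, or None if it is not present."""
    # sysctl net.ipv4.neigh.<iface>.unres_qlen maps to this path under /proc/sys.
    path = Path(proc_root) / "net" / "ipv4" / "neigh" / iface / "unres_qlen"
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, NotADirectoryError):
        return None


if __name__ == "__main__":
    print("ib0 unres_qlen:", read_unres_qlen())
```

The function returns None when the interface does not exist on the host, so
it can be run safely on a machine without an HCA.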

Thanks

Rupert Dance

From: ofw-bounces at lists.openfabrics.org
[mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of David Brean
Sent: Wednesday, July 01, 2009 11:39 AM
To: ofw at lists.openfabrics.org
Subject: [ofw] ping on WinOF

Hello,

An internal customer is using WinOF 2.0.X and has reported to me the
following behavior related to IPoIB and ping:

Do you have any ideas on why a Windows 2008 client with an HCA may initially
time out when pinging other clients on the fabric?

Initially ping fails but then starts working.

Example: ping is invoked three times in succession.

C:\GRITS>ping -a 192.168.100.235

Pinging 192.168.100.235 with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.

Ping statistics for 192.168.100.235:
   Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),

C:\GRITS>ping -a 192.168.100.235

Pinging 192.168.100.235 with 32 bytes of data:
Request timed out.
Request timed out.
Reply from 192.168.100.235: bytes=32 time<1ms TTL=255
Reply from 192.168.100.235: bytes=32 time<1ms TTL=255

Ping statistics for 192.168.100.235:
   Packets: Sent = 4, Received = 2, Lost = 2 (50% loss),
Approximate round trip times in milli-seconds:
   Minimum = 0ms, Maximum = 0ms, Average = 0ms

C:\GRITS>ping -a 192.168.100.235

Pinging 192.168.100.235 with 32 bytes of data:
Reply from 192.168.100.235: bytes=32 time<1ms TTL=255
Reply from 192.168.100.235: bytes=32 time<1ms TTL=255
Reply from 192.168.100.235: bytes=32 time<1ms TTL=255
Reply from 192.168.100.235: bytes=32 time<1ms TTL=255

Ping statistics for 192.168.100.235:
   Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
   Minimum = 0ms, Maximum = 0ms, Average = 0ms

Then we are good for some time before this starts again if the network on
the fabric is idle.

Has this sort of behavior been observed before?  The Linux and Solaris nodes
sharing the same IP subnet appear to be behaving normally.  Windows server
is the "out-of-the-box" configuration with Voltaire switch configured with
only the default partition (0xFFFF).

-David
