Thanks for the reply!<br><br>Good to know about the "true" MTU vs. the synthetic MTU. I wasn't aware of that.<br><br>The NFS mounts are NFS over TCP, and the read/write sizes are both set to 32768.<br><br>I don't have any routes that I know of on the IB fabric, and a traceroute seemed to confirm this. I used tracepath to show the MTU information between the two hosts. On the second attempt it looks like it "discovered" the correct MTU:<br>

<br>[root@headnode ~]# tracepath thumper1-ib<br> 1:  headnode (10.0.1.1)                       0.133ms pmtu 65520<br> 1:  thumper1-ib (10.0.1.245)                0.161ms reached<br>     Resume: pmtu 2044 hops 1 back 1 <br>

[root@headnode ~]# tracepath thumper1-ib<br> 1:  headnode (10.0.1.1)                       0.122ms pmtu 2044<br> 1:  thumper1-ib (10.0.1.245)                0.121ms reached<br>     Resume: pmtu 2044 hops 1 back 1 <br><br>
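For a sense of scale, here is a rough sketch of what the two path MTUs reported above (65520 and 2044) mean for a 32768-byte NFS transfer over TCP, and of the largest ping payload that should pass with the do-not-fragment flag set. It assumes plain IPv4 with no IP or TCP options and ignores RPC/NFS header overhead, so the numbers are estimates rather than measurements from this fabric; the function names are made up for illustration.<br>

```python
import math

# Assumed header sizes: plain IPv4, no IP or TCP options.
IPV4_HEADER = 20   # bytes
TCP_HEADER = 20    # bytes (TCP timestamps, if enabled, would add 12)
ICMP_HEADER = 8    # bytes, ICMP echo request header


def tcp_segments(transfer_bytes, mtu):
    """Estimate how many TCP segments one transfer needs at a given MTU."""
    mss = mtu - IPV4_HEADER - TCP_HEADER
    return math.ceil(transfer_bytes / mss)


def ping_payload(mtu):
    """Largest ICMP echo payload that fits in a single packet of this MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER


if __name__ == "__main__":
    for mtu in (2044, 65520):
        print(f"mtu={mtu} segments={tcp_segments(32768, mtu)} "
              f"max_ping_payload={ping_payload(mtu)}")
```

Under these assumptions a 32768-byte read or write fits in a single segment at MTU 65520 but takes 17 segments at MTU 2044.<br><br>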

We rebooted the InfiniBand switch, which cleared up the NFS issues for now. The one thing I noticed after the reboot was that the Solaris storage servers were back in the multicast group (saquery -m). It's definitely an odd situation...<br>

<br>Thanks again for your help<br><br><div class="gmail_quote">On Thu, Aug 27, 2009 at 11:26 AM, Nifty Tom Mitchell <span dir="ltr"><<a href="mailto:niftyompi@niftyegg.com">niftyompi@niftyegg.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div></div><div class="h5">On Thu, Aug 27, 2009 at 08:30:52AM -0400, Aaron Knister wrote:<br>


><br>

> Hi!<br>

><br>

> I'm having some strange problems on an InfiniBand fabric at work. We<br>

> have upwards of 30 nodes running OFED 1.4 with DDR HCAs and a Cisco 7012<br>

> IB switch. There are also several Sun "thumpers" running Solaris that<br>

> are also connected to the infiniband fabric, however their HCAs are only<br>

> SDR. There are several 20 odd terabyte nfs mounts exported from the<br>

> thumpers and mounted to the compute nodes over IPoIB (we're not using<br>

> NFS RDMA). Opensm is running on the head node and all of the compute<br>

> nodes for redundancy's sake. Things were running OK until yesterday when<br>

> a user crashed the head node by sucking up all of its memory, and at the<br>

> time the head node's subnet manager was in the master state. Well, a<br>

> different node quickly picked up subnet management until the head node<br>

> was rebooted at which point the head node became the subnet master.<br>

><br>

> Since logging back in to the cluster after rebooting the head node, the<br>

> nfs mounts from the thumpers have been hanging periodically all over the<br>

> place. I know that two of the thumpers and their nfs exports are being<br>

> hit with an aggregate of about 120MB/s of nfs traffic from about 30 or<br>

> so compute nodes, so I'm sure that's not helping things, however one of<br>

> the other thumpers that has no active jobs hitting its exports<br>

> periodically shows NFS server "not responding" messages on the<br>

> clients/compute nodes. I checked the log files for the past week- these<br>

> nfs server not responding messages all started since the head node crash<br>

> yesterday. From what I've been told, every time this happens the only<br>

> fix is to reboot the switch.<br>

><br>

> Of course, any general debugging suggestions would be appreciated, but I<br>

> have a few specific questions regarding IPoIB and connected vs datagram.<br>

> All of the compute nodes and the head node (running ofed 1.4) are using<br>

> "connected mode" for IPoIB -><br>

><br>

> [root@headnode ~]# cat /sys/class/net/ib0/mode<br>

> connected<br>

><br>

> and the mtu of the interface is 65520<br>

><br>

> I don't know how to determine if the solaris (the thumpers) systems are<br>

> using connected mode, but their MTUs are 2044 which leads me to believe<br>

> they're probably not. I cannot log into these machines as I don't manage<br>

> them, but is there a way to determine the IPoIB mtu using an ib*<br>

> utility? Or am I misunderstanding IPoIB that such information wouldn't<br>

> be useful.<br>

><br>

> And lastly, I recall that with TCP over ethernet if you have the mtu<br>

> set to, say, 9000 and try to sling data to a box with an MTU of 1500 you<br>

> get some weird performance hits. Is it likely that the compute nodes' use<br>

> of the larger MTU + connected mode paired with the thumpers' much smaller<br>

> MTU + probably datagram mode could be causing timeouts under heavy load?<br>

> Does anybody think that setting the compute/head nodes to datagram mode<br>

> and subsequently dropping the mtu to 2044 would help my situation?<br>

><br>

> Again, any suggestions are greatly appreciated, and thanks in advance<br>

> for any replies!<br>

<br>

</div></div>Look at the MTU choices again.<br>

With InfiniBand the "true" MTU is fixed at 2K (or 4K) and often limited<br>

to 2K by the switch firmware.   Larger MTUs are thus synthetic and force software to<br>

assemble and disassemble the transfers.  On a fabric the large MTU for IPoIB<br>

works well because the fabric is quite reliable.  When data is routed<br>

to another network with a smaller MTU software needs to assemble and disassemble the<br>

fragments.   Fragmentation can be expensive.  Dropped bits and fragmentation are<br>

a major performance hit.    Normal MTU discovery should make fragmentation go away.<br>

<br>

Ethernet jumbo packets (larger than 1500) are real on the wire.<br>

This is not the case on IB where the MTU is fixed.<br>

<br>

Is the NFS over UDP or TCP?<br>

What are the NFS read/write sizes set to?<br>

<br>

Double check routes (traceroute).  Dynamic routes and mixed MTUs are a tangle.<br>

The minimum MTU for a route can be discovered with ping and the do not fragment flag<br>

as long as ICMP packets are not filtered.<br>

<font color="#888888"><br>

--<br>

        T o m  M i t c h e l l<br>

        Found me a new hat, now what?<br>

<br>

</font></blockquote></div><br>