Thanks for the reply!

Good to know about the "true" MTU vs. the synthetic MTU. I wasn't aware of that.

The NFS is NFS over TCP, and the read and write sizes are both set to 32768.

I don't have any routes that I know of on the IB fabric; a traceroute seemed to verify this. I used tracepath to show me the MTU information between the two hosts. On the second attempt it looks like it "discovered" the correct MTU:

[root@headnode ~]# tracepath thumper1-ib
 1:  headnode (10.0.1.1)          0.133ms pmtu 65520
 1:  thumper1-ib (10.0.1.245)     0.161ms reached
     Resume: pmtu 2044 hops 1 back 1

[root@headnode ~]# tracepath thumper1-ib
 1:  headnode (10.0.1.1)          0.122ms pmtu 2044
 1:  thumper1-ib (10.0.1.245)     0.121ms reached
     Resume: pmtu 2044 hops 1 back 1
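
(In case it's useful for the archives: the same two things can be cross-checked from a client with something like the commands below. The -s value of 2016 is just my assumption of 2044 minus 28 bytes of IP+ICMP headers, and exact option names may vary by distro.)

    # probe the path MTU with ping and the don't-fragment flag (per your suggestion below)
    ping -c 3 -M do -s 2016 thumper1-ib

    # confirm proto=tcp and rsize/wsize=32768 as actually negotiated by the client
    nfsstat -m
    grep nfs /proc/mounts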

We rebooted the InfiniBand switch, which cleared up the NFS issues for now. The one thing I noticed after the reboot was that the Solaris storage servers were back in the multicast group (saquery -m). It's definitely an odd situation...
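
(By "back in the multicast group" I mean their port GIDs show up again in the member records. Roughly the kind of check I mean, for anyone else debugging something similar; the PortGid field name may differ between OFED releases, so treat the grep patterns as illustrative only.)

    # dump multicast member records from the subnet manager
    saquery -m > /tmp/mcmembers.txt

    # rough count of member ports (the PortGid field name may vary by OFED release)
    grep -ic portgid /tmp/mcmembers.txt

    # eyeball which port GIDs are members and how many groups each belongs to
    grep -i portgid /tmp/mcmembers.txt | sort | uniq -c | sort -rn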

Thanks again for your help.

On Thu, Aug 27, 2009 at 11:26 AM, Nifty Tom Mitchell <niftyompi@niftyegg.com> wrote:
> On Thu, Aug 27, 2009 at 08:30:52AM -0400, Aaron Knister wrote:
> >
> > Hi!
> >
> > I'm having some strange problems on an InfiniBand fabric at work. We
> > have upwards of 30 nodes running OFED 1.4 with DDR HCAs and a Cisco 7012
> > IB switch. There are also several Sun "thumpers" running Solaris that
> > are connected to the InfiniBand fabric; however, their HCAs are only
> > SDR. There are several 20-odd-terabyte NFS mounts exported from the
> > thumpers and mounted on the compute nodes over IPoIB (we're not using
> > NFS RDMA). OpenSM is running on the head node and all of the compute
> > nodes for redundancy's sake. Things were running OK until yesterday, when
> > a user crashed the head node by sucking up all of its memory, and at the
> > time the head node's subnet manager was in the master state. Well, a
> > different node quickly picked up subnet management until the head node
> > was rebooted, at which point the head node became the subnet master.
> >
> > Since logging back in to the cluster after rebooting the head node, the
> > NFS mounts from the thumpers have been hanging periodically all over the
> > place. I know that two of the thumpers and their NFS exports are being
> > hit with an aggregate of about 120 MB/s of NFS traffic from about 30 or
> > so compute nodes, so I'm sure that's not helping things; however, one of
> > the other thumpers, which has no active jobs hitting its exports,
> > periodically shows an NFS server "not responding" message on the
> > clients/compute nodes. I checked the log files for the past week; these
> > NFS server "not responding" messages all started with the head node crash
> > yesterday. From what I've been told, every time this happens the only
> > fix is to reboot the switch.
> >
> > Of course, any general debugging suggestions would be appreciated, but I
> > have a few specific questions regarding IPoIB and connected vs. datagram
> > mode. All of the compute nodes and the head node (running OFED 1.4) are
> > using "connected mode" for IPoIB:
> >
> > [root@headnode ~]# cat /sys/class/net/ib0/mode
> > connected
> >
> > and the MTU of the interface is 65520.
> >
> > I don't know how to determine whether the Solaris systems (the thumpers)
> > are using connected mode, but their MTUs are 2044, which leads me to
> > believe they're probably not. I cannot log into these machines as I don't
> > manage them, but is there a way to determine the IPoIB MTU using an ib*
> > utility? Or am I misunderstanding IPoIB such that this information
> > wouldn't be useful?
> >
> > And lastly, I recall that with TCP over Ethernet, if you have the MTU
> > set to, say, 9000 and try to sling data to a box with an MTU of 1500, you
> > get some weird performance hits. Is it likely that the compute nodes' use
> > of the larger MTU + connected mode, paired with the thumpers' much smaller
> > MTU + probably datagram mode, could be causing timeouts under heavy load?
> > Does anybody think that setting the compute/head nodes to datagram mode
> > and subsequently dropping the MTU to 2044 would help my situation?
> >
> > Again, any suggestions are greatly appreciated, and thanks in advance
> > for any replies!
>
> Look at the MTU choices again.
> With InfiniBand the "true" MTU is fixed at 2K (or 4K) and is often limited
> to 2K by the switch firmware. Larger MTUs are thus synthetic and force
> software to assemble and disassemble the transfers. On a fabric the large
> MTU for IPoIB works well because the fabric is quite reliable. When data is
> routed to another network with a smaller MTU, software needs to assemble
> and disassemble the fragments. Fragmentation can be expensive, and dropped
> bits plus fragmentation are a major performance hit. Normal MTU discovery
> should make fragmentation go away.
>
> Ethernet jumbo packets (larger than 1500) are real on the wire.
> This is not the case on IB, where the MTU is fixed.
>
> Is the NFS NFS over UDP or TCP?
> What are the NFS read/write sizes set to?
>
> Double-check routes (traceroute). Dynamic routes and mixed MTUs are a tangle.
> The minimum MTU for a route can be discovered with ping and the do-not-fragment
> flag, as long as ICMP packets are not filtered.
>
> --
> T o m   M i t c h e l l
> Found me a new hat, now what?
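
P.S. For anyone who finds this thread later: if we do end up trying the datagram-mode idea from my original mail, my understanding is that on the Linux/OFED 1.4 side it would look roughly like the commands below (untested on our cluster; ib0 is just the interface name in our setup, and the change is not persistent across reboots without also touching the openibd/ifcfg config):

    # switch IPoIB from connected to datagram mode on one interface
    echo datagram > /sys/class/net/ib0/mode

    # drop the IPoIB MTU to match the thumpers' 2044
    ifconfig ib0 mtu 2044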