[openib-general] thanks and a question

Ronald G Minnich rminnich at lanl.gov
Wed Apr 12 16:29:30 PDT 2006


I was working with someone and watching a 256-node bproc cluster boot 
on Friday. The openib folks have done a lot of very nice work. It booted 
quite well once we set hoq and slv to 17 in the Voltaire switch. It was 
really snappy coming up. It was actually as fast to boot as a Myrinet 
cluster, which was nice to see.

But a question. When hoq and slv were 16 in the Voltaire switch, we saw 
TCP sessions hanging. Thinking back on the tcpdump we watched (would 
that I had saved it), it almost seems that the sender thought it had 
gotten an ACK for a segment of 96 bytes and discarded it, whereas the 
receiver thought it had only gotten 32 of the 96 bytes and was sending 
back its idea of where the TCP stream was. So we sat and watched (via 
tcpdump on the receiver) the two hosts send each other differing ideas 
about the sequence numbers on the TCP connection.
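
A rough sketch of the disagreement described above, for anyone who wants 
to check a saved trace for the same pattern. The names (Observation, 
sender_next_seq, receiver_ack) and the starting sequence number 1000 are 
made up for illustration; only the 96 and 32 byte counts come from what 
we saw. It just measures how far the sender's idea of the stream is 
ahead of the receiver's, in 32-bit sequence space.

```python
from dataclasses import dataclass

SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def seq_diff(a: int, b: int) -> int:
    """Signed distance a - b in 32-bit sequence space (handles wraparound)."""
    return ((a - b + 2 ** 31) % SEQ_SPACE) - 2 ** 31


@dataclass
class Observation:
    # Hypothetical fields, read off a tcpdump by hand:
    sender_next_seq: int  # lowest sequence number the sender still sends from
    receiver_ack: int     # ACK number the receiver keeps sending back


def missing_bytes(obs: Observation) -> int:
    """Bytes the sender treats as delivered but the receiver never acknowledged."""
    return max(0, seq_diff(obs.sender_next_seq, obs.receiver_ack))


# Illustrative numbers: a 96-byte segment starting at an assumed seq 1000.
base = 1000
stuck = Observation(sender_next_seq=base + 96, receiver_ack=base + 32)
print(missing_bytes(stuck))
```

A nonzero result that never shrinks over time would match what we 
watched: the sender ignores the receiver's ACKs as stale, and the 
receiver treats everything the sender sends as lying beyond a gap.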

Is this at all possible? Could something happen below the TCP stack, 
given a switch with too-low hoq and slv settings, such that the sender 
would discard a segment that the receiver never saw? Is there any 
switch involvement that could cause this? The whole situation 
was really odd.

Finally, this was one sender, one receiver, and the problem was very, 
very repeatable -- until we bumped 16->17.
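
For a sense of scale, here is a small sketch of what 16 versus 17 would 
mean as time limits. It assumes that the switch's hoq and slv settings 
are the encoded exponents from the InfiniBand specification (HOQLife and 
the switch LifeTimeValue), where the limit is 4.096 usec * 2^value and 
values above 19 mean no limit. Whether the Voltaire settings map 
directly onto that encoding is an assumption, and lifetime_ms is just an 
illustrative helper name.

```python
from typing import Optional

BASE_USEC = 4.096   # time unit used by the encoded lifetime fields
MAX_FINITE = 19     # assumed: values above this mean "no limit"


def lifetime_ms(value: int) -> Optional[float]:
    """Lifetime in milliseconds for an encoded value, or None for no limit."""
    if value < 0:
        raise ValueError("encoded lifetime value must be non-negative")
    if value > MAX_FINITE:
        return None
    return BASE_USEC * (2 ** value) / 1000.0


for v in (16, 17):
    print(v, round(lifetime_ms(v), 1), "ms")
```

Under that assumption, 16 works out to roughly 268 ms and 17 to roughly 
537 ms, so the bump doubles how long a packet may wait in the switch 
before being dropped.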

Sorry I don't have more info.

thanks

ron


