[openib-general] thanks and a question
Ronald G Minnich
rminnich at lanl.gov
Wed Apr 12 16:29:30 PDT 2006
I was working with someone and watching a 256-node bproc cluster boot
on Friday. The openib folks have done a lot of very nice work. It booted
quite well once we set hoq and slv to 17 in the Voltaire switch. It was
really snappy coming up. It was actually as fast to boot as a Myrinet
cluster, which was nice to see.
But a question. When hoq and slv were 16 in the Voltaire switch, we saw
tcp sessions hanging. Thinking back on the tcpdump we watched (would
that I had saved it), it almost seems that the sender thought it had
gotten an ack for a segment of 96 bytes, and discarded it, whereas the
receiver thought it had only gotten 32 of the 96 bytes, and was sending
back its idea of where the tcp stream was. So we sat and watched (via
tcpdump on the receiver) the two hosts send each other differing ideas
about the sequence numbers on the tcp connection.
Is this at all possible? Could something happen below the tcp stack,
given a switch with too-low hoq and slv settings, such that the sender
would discard a segment that the receiver would not have ever seen? Is
there any switch involvement that could cause this? The whole situation
was really odd.
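
To make the disagreement concrete, here is a minimal sketch of the state
the two hosts appeared to be in. It is only a model of the description
above, not a reconstruction of the actual trace (which was not saved).
The names snd_una and rcv_nxt follow the usual tcp terminology; the
describe() helper and the relative sequence base of 0 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sender:
    snd_una: int  # oldest sequence number not yet acknowledged


@dataclass
class Receiver:
    rcv_nxt: int  # next sequence number the receiver expects


def describe(base: int, segment_len: int, received: int) -> dict:
    """Model the two endpoints' views, using relative sequence numbers."""
    # The sender believes the whole segment was acked and has discarded it.
    sender = Sender(snd_una=base + segment_len)
    # The receiver has only accepted the first `received` bytes.
    receiver = Receiver(rcv_nxt=base + received)
    return {
        "sender_snd_una": sender.snd_una,
        "receiver_ack": receiver.rcv_nxt,
        "bytes_in_dispute": sender.snd_una - receiver.rcv_nxt,
        # On a healthy connection the sender's acked point never runs
        # ahead of what the receiver has actually accepted.
        "consistent": sender.snd_una <= receiver.rcv_nxt,
    }


state = describe(base=0, segment_len=96, received=32)
print(state)
```

With the numbers from the description (96-byte segment, 32 bytes seen by
the receiver), 64 bytes are in dispute and the connection cannot make
progress, because the sender no longer holds the data the receiver keeps
asking for.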
Finally, this was one sender, one receiver, and the problem was very,
very repeatable -- until we bumped 16->17.
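
For reference, a rough sketch of what the 16->17 change might mean in
time units. This assumes the switch's hoq and slv settings map directly
onto the InfiniBand HOQLife and LifeTimeValue fields, which encode a
timeout of 4.096 usec * 2^value, with values of 20 and above meaning no
timeout. Whether the Voltaire settings use exactly this encoding is an
assumption, and lifetime_ms() is a hypothetical helper.

```python
BASE_USEC = 4.096   # time unit used by the InfiniBand lifetime encoding
INFINITE_FROM = 20  # assumed: codes at or above this disable the timeout


def lifetime_ms(code: int) -> float:
    """Convert a lifetime code to milliseconds under the assumed encoding."""
    if code < 0:
        raise ValueError("lifetime code must be non-negative")
    if code >= INFINITE_FROM:
        return float("inf")
    return BASE_USEC * (2 ** code) / 1000.0


for code in (16, 17):
    print(f"code {code}: about {lifetime_ms(code):.1f} ms")
```

Under that assumption, each step in the setting doubles the time a packet
may sit in the switch before being dropped, so 17 allows twice as long
as 16.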
Sorry I don't have more info.
thanks
ron