[openib-general] thanks and a question
Hal Rosenstock
halr at voltaire.com
Wed Apr 12 17:25:08 PDT 2006
Hi Ron,
On Wed, 2006-04-12 at 19:29, Ronald G Minnich wrote:
> I was working with someone and watching a 256-node bproc cluster boot
> friday. The openib folks have done a lot of very nice work. It booted
> quite well once we set hoq and slv to 17 in the voltaire switch.
hoq is HOQLife. Is slv the switch LifeTimeValue?
> It was
> really snappy coming up. It was actually as fast to boot as a myrinet
> cluster, which was nice to see.
Does that have anything to do with those settings?
> But a question. When hoq and slv were 16 in the voltaire switch, we saw
> tcp sessions hanging.
Truly hanging?
> Thinking back on the tcpdump we watched (would
> that i had saved it) it almost seems that the sender thought it had
> gotten an ack for a segment of 96 bytes, and discarded it; whereas the
> receiver thought it had only gotten 32 of the 96 bytes, and was sending
> back its idea of where the tcp stream was.
Switches might drop 64 bytes at a time based on those parameters.
> So we sat and watched (via
> tcpdump on the receiver) the two hosts send each other differing ideas
> about the sequence numbers on the tcp connection.
>
> is this at all possible? Could something happen below the tcp stack,
> given a switch with too-low hoq and slv settings, such that the sender
> would discard a segment that the receiver would not have ever seen?
Yes. The two directions are independent, so I think drops in one
direction could cause this.
> Is
> there any switch involvement that could cause this? The whole situation
> was really odd.
>
> Finally, this was one sender, one receiver, and the problem was very,
> very repeatable -- until we bumped 16->17.
That doubles the time before drops would occur, which probably
eliminated the drops, so you no longer saw this.
16 = 268.435 msec
17 = 536.871 msec
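For reference, here is a small sketch of how those numbers fall out, assuming the usual InfiniBand encoding in which the timeout is 4.096 usec times 2 to the power of the configured value. The function name lifetime_ms is only illustrative, and treating values above 19 as "never time out" is my recollection of the spec rather than something verified on this switch.

```python
# Sketch only: assumes timeout = 4.096 usec * 2**value (InfiniBand-style
# encoding for HOQLife / LifeTimeValue). lifetime_ms is a hypothetical helper.

BASE_USEC = 4.096          # base time unit in microseconds
INFINITE_ABOVE = 19        # assumption: values above this mean "no timeout"


def lifetime_ms(value: int) -> float:
    """Return the timeout in milliseconds for a given setting."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value > INFINITE_ABOVE:
        return float("inf")
    return BASE_USEC * (2 ** value) / 1000.0


if __name__ == "__main__":
    for v in (16, 17):
        print(f"{v} = {lifetime_ms(v):.3f} msec")
    # prints:
    # 16 = 268.435 msec
    # 17 = 536.871 msec
```

Each increment of the setting doubles the timeout, which is why going from 16 to 17 gives drops twice as long to be avoided.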
What doesn't make sense to me is the single flow. Are you sure there's
no other data traffic? If so, that doesn't hang together with the rest
of this scenario.
-- Hal
> Sorry I don't have more info.
>
> thanks
>
> ron