[openib-general] thanks and a question

Hal Rosenstock halr at voltaire.com
Wed Apr 12 17:25:08 PDT 2006


Hi Ron,

On Wed, 2006-04-12 at 19:29, Ronald G Minnich wrote:
> I was working with someone and watching a 256-node bproc cluster boot 
> Friday. The openib folks have done a lot of very nice work. It booted 
> quite well once we set hoq and slv to 17 in the Voltaire switch.

hoq is HOQLife. Is slv the switch LifeTimeValue ?

>  It was 
> really snappy coming up. It was actually as fast to boot as a myrinet 
> cluster, which was nice to see.

Does that have anything to do with those settings ?

> But a question. When hoq and slv were 16 in the voltaire switch, we saw 
> tcp sessions hanging.

Truly hanging ?

>  Thinking back on the tcpdump we watched (would 
> that i had saved it) it almost seems that the sender thought it had 
> gotten an ack for a segment of 96 bytes, and discarded it; whereas the 
> receiver thought it had only gotten 32 of the 96 bytes, and was sending 
> back its idea of where the tcp stream was.

Switches might drop 64 bytes at a time based on those parameters.

>  So we sat and watched (via 
> tcpdump on the receiver) the two hosts send each other differing ideas 
> about the sequence numbers on the tcp connection.
> 
> is this at all possible? Could something happen below the tcp stack, 
> given a switch with too-low hoq and slv settings, such that the sender 
> would discard a segment that the receiver would not have ever seen?

Yes. The two directions are independent, so I think that dropping in
one direction could cause this.

>  Is 
> there any switch involvement that could cause this? The whole situation 
> was really odd.
> 
> Finally, this was one sender, one receiver, and the problem was very, 
> very repeatable -- until we bumped 16->17.

That effectively doubles the time before the drops would occur, which
probably eliminated the drops, so you didn't see this.

16 = 268.435 msec
17 = 536.871 msec
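
For reference, a minimal sketch of where those two numbers come from. It
assumes the usual InfiniBand encoding for these fields, where the time is
4.096 usec * 2^value; the function name lifetime_msec is made up for
illustration and is not part of any openib tool. My recollection is that
values above 19 are treated as infinite, which the sketch does not model.

```python
# Assumption: both HOQLife and the switch LifeTimeValue encode a time of
# 4.096 usec * 2**value. Values above 19 (believed to mean "infinite")
# are not special-cased here.
BASE_USEC = 4.096


def lifetime_msec(value: int) -> float:
    """Return the encoded lifetime in milliseconds for a 5-bit field value."""
    if not 0 <= value <= 31:
        raise ValueError("value must fit in 5 bits (0..31)")
    return BASE_USEC * (2 ** value) / 1000.0


if __name__ == "__main__":
    for v in (16, 17):
        print(f"{v} = {lifetime_msec(v):.3f} msec")
```

Each step of 1 in the setting doubles the time, which is why 16->17 is
a bigger change than it looks.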

What doesn't make sense to me is the one flow. Are you sure there's no
other data traffic ? If so, that doesn't make sense to me and doesn't
hang together with the rest of this scenario.

-- Hal

> Sorry I don't have more info.
> 
> thanks
> 
> ron



