[openib-general] thanks and a question

Ronald G Minnich rminnich at lanl.gov
Wed Apr 12 20:46:24 PDT 2006


Hal Rosenstock wrote:

> hoq is HOQLife. Is slv the switch LifeTimeValue ?

I believe so.

> Does that have anything to do with those settings ?

it would not work until hoq and slv were 17.

> Truly hanging ?

yes, and it was the only real connection at that point, from the bproc 
daemon on the slave node to the bproc daemon on the master. There was 
only 1 host powered up at that point. It was very repeatable -- we tried 
to get it to boot many times. And, weirdly, it always hung at that same 
point.


> Switches might drop 64 bytes at a time based on those parameters.

But why does the sender think the segment has been acked, when the 
receiver has never seen those last 64 bytes? Where did the sender get 
that TCP-level ack?


> That effectively doubles the time before the drops would occur which
> probably eliminated the drops so you didn't see this.
> 
> 16 = 268.435 msec
> 17 = 536.871 msec

which leads to another question. This is 1/2 second. Does it really mean 
that you could end up buffering 1/2 second worth of flow on each port 
for all 256 ports?
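
For reference, the figures quoted above are consistent with a lifetime of 
4.096 usec * 2^value, which gives 268.435 msec for 16 and 536.871 msec for 
17. The sketch below just redoes that arithmetic and then, taking the 
question literally, works out how many bytes a port would have to hold for 
one full lifetime. The function names are made up for illustration, the 
8 Gbit/s data rate (a 4x SDR link) is an assumption and not something 
stated in this thread, and treating values above 19 as "no limit" is my 
reading of the spec rather than something confirmed here.

```python
# Rough arithmetic only; helper names and the link rate are assumptions.

BASE_USEC = 4.096  # base unit implied by the quoted 16 -> 268.435 msec figure


def lifetime_msec(value):
    """Lifetime in msec for a given hoq/slv setting.

    Returns None for values above 19, which are assumed here to mean
    that packets are never timed out.
    """
    if value > 19:
        return None
    return BASE_USEC * (2 ** value) / 1000.0


def bytes_per_lifetime(value, data_rate_bps=8e9):
    """Bytes one port could send in one lifetime at the assumed data rate.

    This is an upper bound on what "buffering a full lifetime of flow"
    would imply, not a statement about real switch buffer sizes.
    """
    msec = lifetime_msec(value)
    if msec is None:
        return None
    return msec / 1000.0 * data_rate_bps / 8.0


if __name__ == "__main__":
    for value in (16, 17):
        print(value, "=", round(lifetime_msec(value), 3), "msec,",
              int(bytes_per_lifetime(value)), "bytes at the assumed rate")
```

If those assumptions hold, the number of bytes involved is far larger than 
any per-port buffer I would expect, so the setting reads more like a cap on 
how long a packet may wait than a statement of how much gets buffered.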


> 
> What doesn't make sense to me is the one flow. Are you sure there's no
> other data traffic ? If so, that doesn't make sense to me or hang
> together with the rest of this scenario.

no other traffic that we could see, but there had been traffic prior to 
this.

Thanks, Hal!

ron


