[openib-general] thanks and a question

Hal Rosenstock halr at voltaire.com
Thu Apr 13 02:57:54 PDT 2006


Hi again Ron,

On Wed, 2006-04-12 at 23:46, Ronald G Minnich wrote:
> Hal Rosenstock wrote:
> 
> > hoq is HOQLife. Is slv the switch LifeTimeValue ?
> 
> I believe so.
> 
> > Does that have anything to do with those settings ?
> 
> it would not work until hoq and slv were 17.
> 
> > Truly hanging ?
> 
> yes, and it was the only real connection at that point, from the bproc 
> daemon on the slave node to the bproc daemon on the master. There was 
> only 1 host powered up at that point. It was very repeatable -- we tried 
> to get it to boot many times. And, weirdly, it always hung at that same 
> point.
> 
> 
> > Switches might drop 64 bytes at a time based on those parameters.
> 
> But why does the sender think the segment has been acked, when the 
> receiver has never seen that last 64 bytes? Where did the sender get 
> that TCP-level ack?

I don't know. It doesn't make sense.

Dropping a buffer (64 bytes) in a packet should cause a CRC error which
should mean the TCP packet is not valid. In any case, you should be able
to see the drops in the various Port (error) counters.

> > That effectively doubles the time before the drops would occur which
> > probably eliminated the drops so you didn't see this.
> > 
> > 16 = 268.435 msec
> > 17 = 536.871 msec
> 
> which leads to another question. This is 1/2 second. Does it really mean
> that you could end up buffering 1/2 second worth of flow on each port for
> all 256 ports?

It is limited by the number of buffers (per VL per port), which is
nowhere near this, so that could not occur.

The credits advertised on the link are reduced by the buffers in use so
the throughput would slow down on a congested port (meaning either
congestion or a slow receiver). 
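
For reference, here is a rough sketch of where the timeout numbers quoted
above come from, assuming the usual 4.096 usec * 2^value encoding (which
matches 16 = 268.435 msec). It also shows how much data a port could push
at full rate during one such window, to illustrate why the buffers (and
not the timeout) are the limit. The link rate is my assumption (4x SDR,
about 1e9 bytes/sec of data); it is not stated anywhere in this thread,
and the function names are just for illustration.

```python
# Sketch only: timeout = 4.096 usec * 2^value, for HOQLife / LifeTimeValue.
BASE_USEC = 4.096

# Assumption: 4x SDR link, roughly 1e9 bytes/sec of data (not from this thread).
ASSUMED_RATE_BYTES_PER_SEC = 1_000_000_000


def lifetime_msec(value):
    """Timeout in msec for a given setting (e.g. 16 or 17)."""
    return BASE_USEC * (2 ** value) / 1000.0


def bytes_in_window(value, rate_bytes_per_sec=ASSUMED_RATE_BYTES_PER_SEC):
    """Bytes a port could send at full rate during one timeout window."""
    return lifetime_msec(value) / 1000.0 * rate_bytes_per_sec


if __name__ == "__main__":
    for v in (16, 17):
        print("%d = %.3f msec, about %.0f MB at the assumed rate"
              % (v, lifetime_msec(v), bytes_in_window(v) / 1e6))
```

Going from 16 to 17 doubles the window. A switch port has far fewer
buffers than the hundreds of MB this would imply, so the credits run out
and the sender is throttled long before that much data could queue up.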

> > 
> > What doesn't make sense to me is the one flow. Are you sure there's no
> > other data traffic ? If so, that doesn't make sense to me and doesn't
> > hang together with the rest of this scenario.
> 
> no other traffic that we could see, but there had been traffic prior to 
> this.

I would recommend putting an IB analyzer on the last link towards that
slave node and capturing the data traffic.

-- Hal

> Thanks hal!
> 
> ron



