[openib-general] [PATCHv2] IPoIB CM Experimental support

Michael S. Tsirkin mst at mellanox.co.il
Mon Dec 11 11:41:11 PST 2006


>  >> BTW, Roland, could you give me some indication on whether this
>  >> has a chance getting into 2.6.20? If yes I'll stop writing new code
>  >> and focus on polishing this.
> 
> >I think we could probably merge it but maybe it's better to put it in
> >-mm for a cycle given that it's new and not too many people have
> >looked at it yet.  And I still haven't gotten comfortable with the way
> >CM is enabled.
> 
> >- R.
> 
> I think it might be good for others in the OFA community to try this out before
> we decide it is ready for the kernel. I tried it out over the weekend, running
> Intel MPI over IPoIB_CM, and with default MTU settings, it did not cause any
> problems on my small 2 node cluster. Might be good however for someone to load
> this up on a larger cluster and test it.

IMO, we still have the time after -rc1 to fix any bugs.
The feature *is* marked experimental after all, and has zero impact
on the code when disabled at compile time.
So if you want rock-stable, just turn it off.


> I did notice that unless I made the MTU
> really big (16K), there was not much benefit (if any) with the default MTU size.

Right. My observation too. The whole point of IPoIB CM
is to enable high MTU values. 64K is what works really well.
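
A rough way to see why the MTU matters so much is to count how many packets a fixed-size transfer needs at each MTU. This is a minimal sketch, not a benchmark. The 1 MiB transfer size is arbitrary, the 2044-byte figure is an assumed datagram-mode default that is not stated in this thread, and header overhead is ignored.

```python
def packets_needed(message_bytes: int, mtu_bytes: int) -> int:
    """Number of MTU-sized packets needed to carry message_bytes.

    Protocol header overhead is ignored to keep the arithmetic simple.
    """
    # Integer ceiling division: round up to a whole number of packets.
    return (message_bytes + mtu_bytes - 1) // mtu_bytes


MESSAGE_BYTES = 1 << 20  # 1 MiB, an arbitrary illustrative transfer size

# 2044 is an ASSUMED default MTU; 16K and 64K are the values from the thread.
for mtu in (2044, 16 * 1024, 64 * 1024):
    print(f"MTU {mtu:6d}: {packets_needed(MESSAGE_BYTES, mtu)} packets")
```

Every packet carries a roughly fixed per-packet processing cost, so under these assumptions going from the small MTU to 64K cuts the packet count for the same payload by more than an order of magnitude.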

> I also noticed that when I set the MTU to 16K and ran some stressful MPI tests,
> that my system seemed to get un-responsive like IPoIB was taking up too much
> kernel memory.

Could you enable debug and try again? Maybe you have send errors.

My guess is that you are getting RQ underruns, so QPs are getting closed
and reopened. And if DREQs are lost for some reason (which shouldn't happen
on a back-to-back setup, but seems to, due to some issue in our MAD layer),
we could be getting stale connections, which aren't currently cleaned up -
that's on my TODO.

I have a couple of ideas on how to fix this properly - e.g. detect RNR NACK
and cycle the QP through RTS/INIT/RTR/RTS -
but the simplest workaround for now would be to either use a high MTU
or increase the RX ring size via the IPoIB module option.

Can you try this too, and let me know?
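
Both workarounds attack the same quantity: how many receive buffers a burst uses up before the receiver gets a chance to repost any. The toy model below illustrates that and nothing more. It is not the driver's logic, and the burst size and ring sizes are hypothetical numbers picked for the example.

```python
def buffers_consumed(burst_bytes: int, mtu_bytes: int) -> int:
    """Receive buffers used by a burst, one buffer per MTU-sized packet."""
    return (burst_bytes + mtu_bytes - 1) // mtu_bytes


def underruns(burst_bytes: int, mtu_bytes: int, rx_ring_size: int) -> bool:
    """True if the burst needs more buffers than the RX ring holds.

    Simplifying assumption: the receiver reposts nothing during the burst.
    """
    return buffers_consumed(burst_bytes, mtu_bytes) > rx_ring_size


BURST_BYTES = 4 << 20  # hypothetical 4 MiB burst

# Hypothetical ring of 128 entries with a 16K MTU: 256 buffers needed.
print(underruns(BURST_BYTES, 16 * 1024, 128))
# Workaround 1: raise the MTU to 64K, so only 64 buffers are needed.
print(underruns(BURST_BYTES, 64 * 1024, 128))
# Workaround 2: keep the 16K MTU but grow the ring to 512 entries.
print(underruns(BURST_BYTES, 16 * 1024, 512))
```

In this model, either change alone is enough to absorb the burst. A real receiver reposts buffers concurrently, so actual thresholds will differ.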

> Thus, I think it best for others to play with this a bit before
> it is submitted upstream.
> 
> my 2 cents,
> woody

I don't know, really - it's an option after all.
Given that it doesn't cause problems for people who don't enable it,
keeping the code out of the kernel until it's totally robust seems wrong -
instead of debugging and fixing issues, I'll have to spend time
keeping the code up to date with upstream.

-- 
MST



