[ofw] [RFC] Generate IBAT path records in IPoIB

Fab Tillier ftillier at windows.microsoft.com
Wed Aug 6 14:55:52 PDT 2008


Hi Tzachi,

> Hi Fab,
>
> Please note that one of our goals is to have IPoIB reach bandwidth that is
> higher than 10 GbE.

Wouldn't the broadcast group be > 10Gbps in such a case, or do you mean to reach > 10GbE speeds in a heterogeneous SDR/DDR/QDR fabric?

> Xalex has already started working in this direction (and there is still
> a lot to do).
>
> I understand that you have a complicated task to solve in a short time,
> but please try to allow the existing mechanism to be used as it is
> today (for example, by using a registry key).

The problem with the existing mechanism is that it fails even for relatively small clusters (a 64-node cluster with 8 cores per node fails 100% of the time for me).  Ultimate performance is meaningless if the cluster is unusable under real application workloads.  I suppose I could use the local port speed to set the rate, at the risk of packet loss from flooding the target when the speeds aren't matched.

I don't know what the proper fix would be.  An SA cache has pitfalls too, and gets pretty complicated if you want to prevent duplicate queries from being issued when an MPI job starts up (your cache entries need to have a 'pending' state).  Then there's the issue of aging of entries, flushing entries when a GID goes out of service, proactively populating the cache when a GID goes into service, etc.
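
To make the 'pending' state concrete, here is a minimal sketch of such a cache in Python. It is an illustration under stated assumptions, not the IBAL or IPoIB code: the names PathCache, send_query, lookup, complete, flush, populate and ttl are all hypothetical, and the path record is just an opaque value. The point is that concurrent lookups for the same GID attach to one pending entry, so only one SA query goes out.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class State(Enum):
    PENDING = 1   # a query is outstanding; callers wait on it
    VALID = 2     # a path record is cached and may be handed out


@dataclass
class Entry:
    state: State
    path: Optional[object] = None
    expires_at: float = 0.0
    waiters: List[Callable] = field(default_factory=list)


class PathCache:
    """Hypothetical path-record cache keyed by destination GID."""

    def __init__(self, send_query, ttl=60.0, clock=time.monotonic):
        self._send_query = send_query  # assumed: issues one SA query for a GID
        self._ttl = ttl                # assumed aging interval, in seconds
        self._clock = clock
        self._entries = {}

    def lookup(self, dgid, on_done):
        entry = self._entries.get(dgid)
        if entry is not None and entry.state is State.VALID:
            if self._clock() < entry.expires_at:
                on_done(entry.path)        # cache hit, no SA traffic
                return
            del self._entries[dgid]        # entry aged out
            entry = None
        if entry is not None:              # PENDING: join the existing query
            entry.waiters.append(on_done)
            return
        self._entries[dgid] = Entry(State.PENDING, waiters=[on_done])
        self._send_query(dgid)             # the only query for this GID

    def complete(self, dgid, path):
        """Query finished; path is None on failure."""
        entry = self._entries.get(dgid)
        if entry is None or entry.state is not State.PENDING:
            return
        waiters, entry.waiters = entry.waiters, []
        if path is None:
            del self._entries[dgid]
        else:
            entry.state = State.VALID
            entry.path = path
            entry.expires_at = self._clock() + self._ttl
        for callback in waiters:
            callback(path)

    def flush(self, dgid):
        """GID went out of service: drop the entry and fail any waiters."""
        entry = self._entries.pop(dgid, None)
        if entry is not None:
            for callback in entry.waiters:
                callback(None)

    def populate(self, dgid, path):
        """GID came into service: install a record before anyone asks."""
        entry = self._entries.get(dgid)
        if entry is not None and entry.state is State.PENDING:
            self.complete(dgid, path)
            return
        self._entries[dgid] = Entry(State.VALID, path,
                                    self._clock() + self._ttl)
```

With this shape, eight ranks on a node resolving the same destination at job startup would produce one query rather than eight. What the sketch leaves out is locking, query timeouts and retries, and how the in-service and out-of-service notifications would actually be delivered.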

I think having a proper solution to the SM bottleneck should come before any performance optimizations for heterogeneous fabrics (the homogeneous fabric case is handled by my changes).

Do you agree?

-Fab



