[ofa-general] [PATCH 0/7][v1.2, v2.0] uDAPL patch set to enable scalability to 1000+ nodes/10000+ cores

Davis, Arlin R arlin.r.davis at intel.com
Fri Jun 20 11:41:23 PDT 2008


We managed to get access to a large cluster with 1000+ nodes and 10,000+
cores for testing/benchmarking. I am happy to say that uDAPL
successfully scaled out to more than 14,000 cores. However, when running
Intel MPI and uDAPL (OFED 1.2.5, mlx4 DDR) we discovered that the uDAPL
rdma_cm provider would not scale beyond 256 nodes, so we had to move back
to a socket cm provider to set up the QPs. This patch set brings back
socket cm (slightly redesigned) with some fixes and cleanup.

For the record, the basic reason for the rdma_cm scaling problems was path
record queries. Until there is consensus on an IB path record caching
solution that scales and is moved upstream, I am recommending that uDAPL
IB consumers needing large scale-out use the socket cm provider
(libdaplscm.so) in lieu of rdma_cm (libdaplcma.so). iWARP support will
remain via the uDAPL rdma_cm provider.

-arlin


