[openib-general] SA cache design

Rimmer, Todd trimmer at silverstorm.com
Mon Jan 16 14:28:14 PST 2006


> From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
> What was I thinking ...
> for (target = (myRank + 1) % numNodes; target != myRank;
>      target = (target + 1) % numNodes) {
> 	/* establish connection to node target */
> }
This can be even simpler for MPI.

Given that some nodes must listen and others must connect, use an approach such as having higher-rank processes connect to lower-rank processes.

Then it's simply:
	initiate listen on my endpoint /* could omit this for highest rank in job */

	for (target=(my_rank-1); target>=0; target--)	/* ranks are 0-based, so rank 0 is included */
		initiate connect to target
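
A minimal C sketch of the rule above, assuming ranks are 0-based (as in the quoted loop, which uses % numNodes). The function names connect_targets and must_listen are made up for illustration and are not from any MPI or CM library; the actual connect call is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill `targets` with the ranks that `my_rank` must
 * connect to (every lower rank, highest first) and return the count.
 * Ranks are assumed to be 0-based. */
size_t connect_targets(int my_rank, int *targets, size_t cap)
{
    size_t n = 0;

    assert(my_rank >= 0);
    for (int target = my_rank - 1; target >= 0; target--) {
        assert(n < cap);        /* caller must supply enough room */
        targets[n++] = target;  /* a real job would initiate the connect here */
    }
    return n;
}

/* Hypothetical helper: nobody connects to the highest rank, so it is
 * the only process that could skip the listen. */
int must_listen(int my_rank, int num_ranks)
{
    return my_rank < num_ranks - 1;
}
```

With this rule each pair of ranks is connected exactly once: rank r initiates r connects and accepts (num_ranks - 1 - r) of them.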

For even greater efficiency, the "initiate connect to target" step could be done in parallel batches: e.g., start 50 outbound connects, wait for some or all of them to complete, then start the next batch.  For example:

	for (target=(my_rank-1); target>=0; target--)	/* ranks are 0-based, so rank 0 is included */
		while (num_outstanding >= limit)	/* never exceed limit connects in flight */
			wait
		num_outstanding++
		initiate connect to target

Then the callback for completing a connection sequence could decrement num_outstanding and wake up the waiter (or the waiter could be a sleep/poll type loop).
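
A rough single-threaded C sketch of the batched variant, using the sleep/poll style of waiter. Everything here is a stand-in: struct connector, start_connect, poll_completions and on_connect_done are hypothetical names, and poll_completions simply pretends that one outstanding connect finishes per poll. A real implementation would post the connect request to the CM and have the CM's completion callback do the decrement. Ranks are assumed to be 0-based.

```c
#include <assert.h>

struct connector {
    int limit;            /* max connects allowed in flight */
    int num_outstanding;  /* connects started but not yet completed */
    int peak;             /* highest num_outstanding seen (for checking) */
    int completed;        /* total connects completed */
};

/* Stand-in for the connection-complete callback. */
void on_connect_done(struct connector *c)
{
    assert(c->num_outstanding > 0);
    c->num_outstanding--;
    c->completed++;
}

/* Stand-in for polling: pretend one outstanding connect completes. */
void poll_completions(struct connector *c)
{
    if (c->num_outstanding > 0)
        on_connect_done(c);
}

/* Stand-in for "initiate connect to target". */
void start_connect(struct connector *c, int target)
{
    (void)target;  /* a real implementation would post the request here */
    c->num_outstanding++;
    if (c->num_outstanding > c->peak)
        c->peak = c->num_outstanding;
}

/* Connect to every lower rank, keeping at most `limit` in flight. */
int connect_all(struct connector *c, int my_rank)
{
    assert(c->limit > 0);  /* a zero limit would never make progress */
    for (int target = my_rank - 1; target >= 0; target--) {
        while (c->num_outstanding >= c->limit)
            poll_completions(c);  /* the "wait" step */
        start_connect(c, target);
    }
    while (c->num_outstanding > 0)  /* drain the final batch */
        poll_completions(c);
    return c->completed;
}
```

For example, with limit set to 50 and my_rank 200, connect_all starts 200 connects and never has more than 50 outstanding at once.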

We have been successfully using the algorithms above for about 2-3 years now and they work very well.

Todd Rimmer


