[ewg] RE: [ofa-general] OFED Jan 14 meeting summary on RC2 readiness

Sun Jan 20 02:13:02 PST 2008

On Thu, Jan 17, 2008 at 05:25:14PM -0800, Roland Dreier wrote:
>  > Well, I can't speak for everyone, but in my opinion if someone wants to run
>  > MPI job so huge that XRC absolutely has to be used to be able to actually
>  > finish it then he should seriously rethink his application design.
> 
> But where do you think the crossover is where XRC starts to help MPI?
> In other words do I need a 10000 process job on 32-core systems for it
> to matter, or is there a significant advantage for running a 2048
> process job on 256 8-core systems?
Let's do the math:
N - number of processes
C - number of cores per node
QPS - QP size (assume 4K)
N/C - number of nodes

For the non XRC case each process creates a QP to every other process, so
the number of QPs created by each process is N (well, N - 1, but we don't
care), and the memory consumed by QPs on one node is:
N * C * QPS

For the XRC case each process creates a send QP for each node, and the node
holds a receive QP for each process (shared by the local processes), so the
memory consumed by QPs on one node is:
(N/C * C + N) * QPS = 2 * N * QPS

Looking at your two examples:
1. N=10000 C=32
non XRC memory consumption: 1250M
XRC memory consumption: 78.125M

2. N=2048 C=8
non XRC memory consumption: 64M
XRC memory consumption: 16M

As you can see, the benefit grows with the number of cores per node: the
ratio between the two is C/2, i.e. 16x in the first example and 4x in the
second.
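
For anyone who wants to plug in their own numbers, here is a rough Python
sketch of the two formulas above. The function name and the MIB constant
are made up for illustration, and the 4K QP size is the same assumption as
above, not a measured value.

```python
KIB = 1024
MIB = 1024 * KIB

def qp_memory_per_node(n_procs, cores, qp_size=4 * KIB):
    """Return (non_xrc_bytes, xrc_bytes) of QP memory on one node.

    n_procs is N, cores is C, qp_size is QPS (assumed 4K, not measured).
    """
    nodes = n_procs / cores
    # non XRC: each of the C local processes holds N QPs
    non_xrc = n_procs * cores * qp_size
    # XRC: C processes * N/C send QPs, plus N receive QPs on the node
    xrc = (nodes * cores + n_procs) * qp_size
    return non_xrc, xrc

for n, c in [(10000, 32), (2048, 8)]:
    non_xrc, xrc = qp_memory_per_node(n, c)
    print(f"N={n} C={c}: non XRC {non_xrc / MIB:g}M, "
          f"XRC {xrc / MIB:g}M, ratio {non_xrc / xrc:g}")
```

The ratio printed at the end is always C/2, since N and QPS cancel out.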

But it seems that applications that run at big scale rarely (if at all)
create all-to-all connections during their run. Just one fun observation:
let's assume that creating one connection takes 500ms. Then in your first
example, creating all the connections from one process to all other
processes will take about 1.4 hours.
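
The same back-of-the-envelope estimate as a small Python sketch. It assumes
one process sets up its connections strictly one after another, and the
500ms per connection is only the guess from above; the function name is
made up for illustration.

```python
def full_mesh_setup_hours(n_procs, seconds_per_connection=0.5):
    """Hours for one process to connect to all N others, one at a time.

    Uses N rather than N - 1, the same simplification as in the memory math.
    """
    return n_procs * seconds_per_connection / 3600

print(f"{full_mesh_setup_hours(10000):.2f} hours")
```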

Memory consumed by the QPs is not the only thing that limits scalability,
BTW. If each process communicates with all other processes, it had better
prepost enough receive buffers. With XRC, if a recv QP is shared by local
processes and one of them goes RNR, none of the other processes can receive
on this QP either. And with XRC/SRQ we pretty much rely on HW flow control,
so this scenario will happen. Thus if you want to minimize RNRs you should
prepost more buffers as the job grows.

--
			Gleb.