[ofa-general] Re: [ewg] OFED March 24 meeting summary on OFED 1.4 plans
Tang, Changqing
changquing.tang at hp.com
Fri Apr 4 08:26:50 PDT 2008
> > For example, in MPI, process A knows the HCA GUID on another node.
> > After running for some time, the switch is restarted for
> > some reason, and the whole fabric is re-configured.
>
>
> CQ,
>
> If by "the whole fabric is re-configured" you mean a case
> where the subnet prefix changes while a job runs and a process
> is detached from and reattached to the job, then adapting
> your design to handle it is over-engineering. Why do you want
> to do that?
>
My concern is the port LID change. It is always best if a process can find
the information it needs by itself; an SA query is the right way to do that and is in the IB spec.
It is possible to have the processes exchange the information (port LID) again, but
there are difficulties: in the middle of a long job run, it is hard to get two
processes to coordinate such an information exchange, and it requires a second channel
to do so. If the second channel is IPoIB, it is broken as well, and we would need to
re-establish it first.
I am just asking for the SA functionality. If that is not possible, we will have to use a very
complicated way to let HP-MPI survive a network failure.
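
To make the idea concrete, here is a minimal Python sketch of the recovery logic: a process
keeps the peer's port GUID (which stays stable), caches the LID, and re-resolves the LID on its
own when a send fails. Everything named here (LidCache, sa_lookup, send, PeerUnreachable) is
hypothetical and not part of any OFED or HP-MPI API; sa_lookup only stands in for whatever
SA query by port GUID would be made available.

```python
from typing import Callable, Dict


class PeerUnreachable(Exception):
    """Raised by the (hypothetical) send function when the cached LID is stale."""


class LidCache:
    """Maps a stable port GUID to its current LID, refreshed on demand."""

    def __init__(self, sa_lookup: Callable[[int], int]):
        # sa_lookup is an assumed stand-in for an SA query: GUID in, LID out.
        self._sa_lookup = sa_lookup
        self._lids: Dict[int, int] = {}

    def lid_for(self, guid: int, refresh: bool = False) -> int:
        # Query the SA only on first use or when the caller forces a refresh.
        if refresh or guid not in self._lids:
            self._lids[guid] = self._sa_lookup(guid)
        return self._lids[guid]


def send_with_reresolve(cache: LidCache, guid: int, payload: bytes,
                        send: Callable[[int, bytes], object]):
    """Send to the peer; if the LID is stale, re-resolve it once and retry."""
    lid = cache.lid_for(guid)
    try:
        return send(lid, payload)
    except PeerUnreachable:
        # The fabric was probably re-configured: ask the SA again, locally,
        # without any coordination with the remote process.
        lid = cache.lid_for(guid, refresh=True)
        return send(lid, payload)
```

The point of the sketch is that no second channel and no peer coordination is needed: each
process recovers independently as long as it can ask the SA for the new LID.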
--CQ
> Or.
>