[ofa-general] Re: IPoIB path caching
Or Gerlitz
ogerlitz at voltaire.com
Wed Jul 25 00:00:28 PDT 2007
Sean Hefty wrote:
>> Linux has a quite sophisticated mechanism to maintain / cache / probe
>> / invalidate / update the network stack L2 neighbour info.
> Path records are not just L2 info. They contain L4, L3, and L2 info
> together.
Maybe I was not clear enough: the neighbour cache keeps the stack's link-level
(=L2) info. The "IPoIB L2 info" (the neighbour HW address)
contains IB L3 (GID) & L4 (QPN) info and points to the IB L2 (AH) info.
So, bottom line, the stack considers the <flags|gid|qpn> creature to be L2
info, whereas in IB terms it contains L4/L3/L2 info.
>> For example, in the Voltaire gen1 stack we had an ib arp module which
>> was used by both IPoIB and native IB ULPs (SDP, iSER, Lustre, etc).
>> This module managed some sort of path cache, where IPoIB was always
>> asking for non-cached path and other ULPs were willing to get cached
>> path.
> IMO, using a cached AH is no different than using a cached path. You're
> simply mapping the PR data into another structure.
On the one hand, the stack can't afford to do L3 --> L2 (ARP)
resolution for each packet xmit; on the other hand, it has this
mechanism to probe / invalidate / etc. its L2 cache. So my basic claim is
that if the stack decides to renew its L2 info, it would be an incorrect
design to keep using cached IB L2 info.
> We're ignoring the problem here, and that is that a centralized SA
> doesn't scale. MPI stacks have largely ignored this problem by simply
> not doing path record queries. Path information is often hard-coded,
> with QPN data exchanged out of band over sockets (often over Ethernet).
I don't think that trying to separate the IPoIB flow from the MPI flow is
ignoring the problem. These are different settings: IPoIB is a network
device working under the net stack, which has its own design philosophy.
Native MPI implementations over IB are not tied to the stack; it's a
different case.
> We've seen problems running large MPI jobs without PR caching. I know
> that Silverstorm/QLogic did as well. And apparently Voltaire hit the
> same type of problem, since you added a caching module. (Did Mellanox
> and Topspin/Cisco create PR caches as well?) At least three companies
> working on IB came up with the same solution. What is the objection to
> the current patch set?
Again, as I stated above, in the Voltaire gen1 stack IPoIB was --not--
using cached IB L2 info, whereas MPI, Lustre, etc. did.
I am willing to go with the local sa module coming to serve large MPI
jobs, so you load it as a prerequisite to spawning a large all-to-all job.
But I think the default for IPoIB needs to be the use of non-cached PRs.
If you want to support the uncommon case of a huge-MPI-job-over-IPoIB, I
am fine with adding a param to IPoIB telling it to request cached PRs
from the ib_sa module.
Or.