[ofa-general] Re: IPoIB path caching

Or Gerlitz ogerlitz at voltaire.com
Wed Jul 25 00:00:28 PDT 2007


Sean Hefty wrote:
>> Linux has a quite sophisticated mechanism to maintain / cache / probe 
>> / invalidate / update the network stack L2 neighbour info.

> Path records are not just L2 info.  They contain L4, L3, and L2 info 
> together.

Maybe I was not clear enough: the neighbours cache keeps the stack Link 
(=L2) level info. The "IPoIB L2 info" (the neighbour HW address) 
contains IB L3 (GID) & L4 (QPN) info and points to the IB L2 (AH) info.

So, bottom line: the stack considers the <flags|gid|qpn> creature as L2 
info, whereas in IB terms it contains L4/L3/L2 info.
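
To make the point concrete, here is a minimal sketch of how the 20-byte 
IPoIB hardware address splits into its IB pieces. It assumes the 
RFC 4391 layout, where the on-wire order is 1 octet of flags, a 24-bit 
QPN, then the 16-octet GID. The struct and function names are made up 
for illustration and are not the kernel's.

```c
#include <stdint.h>
#include <string.h>

#define IPOIB_HWADDR_LEN 20  /* 1 (flags) + 3 (QPN) + 16 (GID) octets */

/* Hypothetical decoded view of an IPoIB hardware address. */
struct ipoib_hwaddr_view {
    uint8_t  flags;    /* reserved/flags octet                     */
    uint32_t qpn;      /* IB "L4": 24-bit queue pair number        */
    uint8_t  gid[16];  /* IB "L3": 128-bit global identifier       */
};

/* Split the opaque "L2" address seen by the net stack into IB fields. */
static void ipoib_hwaddr_parse(const uint8_t hw[IPOIB_HWADDR_LEN],
                               struct ipoib_hwaddr_view *out)
{
    out->flags = hw[0];
    /* QPN is stored big-endian in octets 1..3. */
    out->qpn = ((uint32_t)hw[1] << 16) | ((uint32_t)hw[2] << 8) | hw[3];
    memcpy(out->gid, hw + 4, sizeof(out->gid));
}
```

Note that the IB L2 info (LID, SL, and so on, i.e. what goes into the 
AH) is not in these 20 bytes at all; it has to be resolved from the GID 
through a path record query.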

>> For example, in the Voltaire gen1 stack we had an ib arp module which 
>> was used by both IPoIB and native IB ULPs (SDP, iSER, Lustre, etc). 
>> This module managed some sort of path cache, where IPoIB always 
>> asked for a non-cached path and other ULPs were willing to get a 
>> cached path.

> IMO, using a cached AH is no different than using a cached path.  You're 
> simply mapping the PR data into another structure.

On the one hand, the stack can't afford to do L3 --> L2 (ARP) 
resolution for each packet xmit, but on the other hand the stack has this 
mechanism to probe / invalidate / etc. its L2 cache. So my basic claim is 
that if the stack has decided to renew its L2 info, it would be an 
incorrect design to use cached IB L2 info.

> We're ignoring the problem here, and that is that a centralized SA 
> doesn't scale.  MPI stacks have largely ignored this problem by simply 
> not doing path record queries.  Path information is often hard-coded, 
> with QPN data exchanged out of band over sockets (often over Ethernet).

I don't think that trying to separate the IPoIB flow from the MPI flow is 
ignoring the problem. These are different settings: IPoIB is a network 
device working under the net stack, which has its own design philosophy. 
Native MPI implementations over IB are not tied to the stack, so the case 
is different.

> We've seen problems running large MPI jobs without PR caching.  I know 
> that Silverstorm/QLogic did as well.  And apparently Voltaire hit the 
> same type of problem, since you added a caching module.  (Did Mellanox 
> and Topspin/Cisco create PR caches as well?)  At least three companies 
> working on IB came up with the same solution.  What is the objection to 
> the current patch set?

Again, as I stated above, in the Voltaire gen1 stack IPoIB was --not-- 
using cached IB L2 info, whereas MPI, Lustre, etc. did.

I am willing to go with the local SA serving large MPI jobs, so that 
you load it as a prerequisite to spawning a large all-to-all job.

But I think the default for IPoIB needs to be the use of a non-cached PR.

If you want to support the uncommon case of huge-mpi-job-over-ipoib, I 
am fine with adding a param to IPoIB telling it to request a cached PR 
from the ib_sa module.
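
Roughly, the policy I have in mind looks like the sketch below. 
Everything in it is hypothetical: the toy path record, the cache, the 
sa_query() stand-in, and the ipoib_use_cached_pr knob are illustration 
only and are not the ib_sa API or an existing IPoIB parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy path record: just a destination LID and a service level. */
struct path_rec { uint16_t dlid; uint8_t sl; };

struct path_cache_entry {
    uint16_t        key;
    struct path_rec rec;
    bool            valid;
};

#define CACHE_SLOTS 8
static struct path_cache_entry cache[CACHE_SLOTS];
static unsigned sa_queries;  /* counts round trips to the (pretend) SA */

/* Stand-in for a real SA path record query. */
static struct path_rec sa_query(uint16_t dlid)
{
    struct path_rec rec = { .dlid = dlid, .sl = 0 };
    sa_queries++;
    return rec;
}

/* Caller decides whether a cached answer is acceptable. */
static struct path_rec path_lookup(uint16_t dlid, bool allow_cached)
{
    struct path_cache_entry *e = &cache[dlid % CACHE_SLOTS];

    if (allow_cached && e->valid && e->key == dlid)
        return e->rec;

    /* Fresh query; the result also refreshes the cache for others. */
    e->rec = sa_query(dlid);
    e->key = dlid;
    e->valid = true;
    return e->rec;
}

/* Hypothetical IPoIB knob: off by default, so IPoIB always asks the SA. */
static bool ipoib_use_cached_pr = false;

static struct path_rec ipoib_path_lookup(uint16_t dlid)
{
    return path_lookup(dlid, ipoib_use_cached_pr);
}

/* MPI-like consumers are happy with a cached path. */
static struct path_rec mpi_path_lookup(uint16_t dlid)
{
    return path_lookup(dlid, true);
}
```

With the default, a neighbour renewal in the stack always turns into a 
real query, while consumers that opted in are served from the cache.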

Or.



