[ofa-general] Re: IPoIB path caching

Tue Jul 24 02:39:50 PDT 2007

Sean Hefty wrote:
>> What I have in mind is that IPoIB must not use cached IB path info.

>> If the IB stack has path caching which is in the default flow of 
>> requesting a path record, it should provide an API (eg flag to the 
>> function through which one does path query) to request a non cached path.

> Argh!  This was the original design.  I believe the current design is a 
> better approach.  The ULP shouldn't care whether the PR is cached or not 
> - only that it's usable.

Linux has a quite sophisticated mechanism to maintain / cache / probe / 
invalidate / update the network stack L2 neighbour info.

Stating that it is correct by design for IPoIB to use cached IB L2 info, 
even when the neighbour cache state machine has decided to update/delete 
a neighbour, is moving too fast, I think. Some discussion is needed 
here.

My basic thought is that for IPoIB it's better to never use a cached path 
than to always use a cached path. But! Maybe there's a middle way 
here, let's think. This is what I was referring to when saying "almost 
always".

For example, in the Voltaire gen1 stack we had an ib arp module which 
was used by both IPoIB and native IB ULPs (SDP, iSER, Lustre, etc). This 
module managed some sort of path cache, where IPoIB was always asking for 
a non-cached path and other ULPs were willing to get a cached path.
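
A minimal sketch of that kind of split policy, where the caller decides 
whether a cached path is acceptable. Everything here is hypothetical and 
for illustration only: PathCache, get_path, allow_cached, the record 
fields and the fake SA callback are assumptions, not the Voltaire module 
or any OFED API.

```python
import time
from typing import Callable, Dict, Tuple


class PathCache:
    """Hypothetical path-record cache in front of an SA query function."""

    def __init__(self, query_sa: Callable[[str], dict],
                 max_age_s: float = 900.0,
                 clock: Callable[[], float] = time.monotonic):
        self._query_sa = query_sa      # assumed callback that asks the SA
        self._max_age_s = max_age_s    # assumed cache lifetime
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get_path(self, dgid: str, allow_cached: bool = True) -> dict:
        """Return a path record; bypass the cache if allow_cached is False."""
        now = self._clock()
        if allow_cached:
            entry = self._entries.get(dgid)
            if entry is not None and now - entry[0] < self._max_age_s:
                return entry[1]
        # Cache miss, stale entry, or caller insisted on a fresh path.
        record = self._query_sa(dgid)
        self._entries[dgid] = (now, record)
        return record

    def invalidate(self, dgid: str) -> None:
        """Drop an entry, e.g. when the neighbour is updated or deleted."""
        self._entries.pop(dgid, None)


sa_queries = []

def fake_sa(dgid: str) -> dict:
    # Stand-in for a real SA query; just counts how often it is called.
    sa_queries.append(dgid)
    return {"dgid": dgid, "dlid": len(sa_queries)}

cache = PathCache(fake_sa)
cache.get_path("fe80::1")                      # miss: goes to the SA
cache.get_path("fe80::1")                      # other ULPs: cached path
cache.get_path("fe80::1", allow_cached=False)  # IPoIB-style: fresh query
print(len(sa_queries))
```

With a flag like this the non-cached request still refreshes the cache, 
so ULPs that accept cached paths benefit from the IPoIB queries.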

>> The design I was thinking to suggest for IPoIB is to almost always use 
>> this API since this policy makes the implementation consistent with 
>> the decisions made by the network stack neighbour cache

> This defeats one of the benefits of caching, which is using a single 
> GetTable query, versus literally hundreds or thousands of Get queries. 
> Consider that constant all-to-all communication using IPoIB between 1024 
> ports, with a 15 minute ARP table timeout would hit the SA with close to 
> 600 queries per second.
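
The figure quoted above can be checked with a little arithmetic. The 
sketch below assumes one path query per unordered pair of ports per ARP 
timeout interval, which seems to be the assumption behind "close to 
600"; the function name and the per_pair switch are made up for 
illustration.

```python
def sa_query_rate(ports: int, arp_timeout_s: float,
                  per_pair: bool = True) -> float:
    """Average SA path queries per second for all-to-all traffic.

    Assumes every path is re-queried exactly once per ARP timeout.
    per_pair=True counts one query per unordered pair of ports,
    per_pair=False counts one query per direction.
    """
    paths = ports * (ports - 1)
    if per_pair:
        paths //= 2
    return paths / arp_timeout_s


rate_pairs = sa_query_rate(1024, 15 * 60)
rate_directed = sa_query_rate(1024, 15 * 60, per_pair=False)
print(f"{rate_pairs:.0f} queries/s per pair")          # about 582
print(f"{rate_directed:.0f} queries/s per direction")  # about 1164
```

If each port queries independently for each destination, the load on 
the SA is roughly double the quoted number.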

If the cache is meant to serve all-to-all MPI jobs: in practice, to get 
MPI performance over IB (specifically latency), people would --not-- be 
using IPoIB for their MPI jobs, since they want kernel AND net-stack 
bypass. So it does make sense to use a non-cached path in IPoIB, if we 
agree that design-wise it's the correct approach.

> While I agree that there's the potential for a problem, given that IPoIB 
> has always cached PRs and no one has reported problems, I think we're 
> overstating the likelihood of issues occurring in practice.  Even the SA 
> caches the path data -- getting a PR from the SA doesn't provide any 
> additional guarantees.

I am not with you... I would expect an SA implementation to invalidate / 
recompute the relevant data structures associated with each change in 
the fabric, and to get a trap for each change.

Or.