[ofa-general] RE: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache

Sean Hefty sean.hefty at intel.com
Mon Apr 23 11:14:49 PDT 2007


>1. What happens on e.g. a heterogeneous network? It seems that
>path to a specific GID might change e.g. MTU without GID
>going in/out of service. How would this be handled?

If the path parameters change without a GID going in/out of service, then the
cached path records would be stale.  A forced cache refresh would be needed.
An application (or an administrator, manually) could trigger one by writing to
the 'ib_local_sa/refresh' file.
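
As a rough illustration of the application-driven refresh, here is a minimal
userspace sketch.  The function name request_cache_refresh is hypothetical, the
caller must supply the full path (only the 'ib_local_sa/refresh' name is given
here, so the location is not assumed), and the value written ("1") is an
assumption, not something the module is known to require.

```c
#include <stdio.h>

/*
 * Ask for a forced cache refresh by writing to the refresh file.
 *
 * refresh_path: full path to the 'ib_local_sa/refresh' file, supplied by
 *               the caller (its location is not assumed here).
 * The written value "1" is an assumption made for illustration.
 *
 * Returns 0 on success, -1 if the file could not be opened or written.
 */
int request_cache_refresh(const char *refresh_path)
{
    FILE *f = fopen(refresh_path, "w");
    if (f == NULL)
        return -1;

    int ok = (fputs("1\n", f) >= 0);

    /* fclose() flushes the buffer, so a write error may only show up here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A monitoring tool that notices an MTU or rate change could call this once per
change, rather than relying on periodic refreshes.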

I should note that I removed the time-based updates that are in OFED.  This
seemed to be the most objectionable part of that implementation.  (It could be
implemented in userspace if needed.)  I also looked at the path record caching
behavior in ipoib as a starting point.  From what I could tell, ipoib caches a
path record per DGID, and requires bringing the device down/up to refresh the
cache.  (Someone tell me if my understanding is off here.)

>2. What will happen on a number of changes in the network?
>Would the SA not need to send a huge number of notices now?
>Should we be concerned?

This could be an issue, and I have a few thoughts on this.

First, I can make event registration tunable (yes/no) to avoid sending notices.
(The labs, which are requesting the caching, are only considering static
configurations anyway.)  Second, if an administrator is going to make a large
number of changes to the network, he would be better off disabling the cache
first, making the changes, then re-enabling the cache.  Finally, I don't think
it's unreasonable to expect an SA that claimed to support a subnet of size N to
support event registration to each node.

>3. Comments indicate that the main win from the patch is
>with all-to-all startup times on large MPI clusters. If that is so,
>and assuming a small number of MPI jobs is running on each node,
>isn't it true that the main win is not from *caching* as such
>(since all paths are requested at the beginning and never
>used after this), but rather from limiting the number of outstanding MADs to SA
>and from reusing multiple path queries in a single request.
>Could that be the case?

A definite benefit does come from using a GetTable query, versus a Get query.
However, the rdma_cm/socket-like interface doesn't readily lend itself to using
a GetTable query, so one could argue that the cache is what enables the use of a
GetTable query.

However, without the cache, you end up with duplicated SA queries between
processes.  Currently, each process issues one query for each <SGID,DGID> pair.
Even if a way were found to have each process use a GetTable query rather than a
Get query, we'd still have duplicated queries to the SA.  With the latest
systems, we're looking at 8 cores per node, which would likely result in the SA
processing 8 identical queries per node.  (Assuming 1 process per core,
all-to-all connection model.)
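
To put rough numbers on the duplication, the sketch below counts queries
reaching the SA under some simplifying assumptions of mine: one port per node,
one Get query per <SGID,DGID> pair per process, no query for a node's own GID,
and, in the cached case, a single GetTable query per node.  The function names
are hypothetical.

```c
#include <stdint.h>

/*
 * Get queries seen by the SA when every process resolves its own paths.
 * Each of the procs_per_node processes on each node queries every other
 * node once (all-to-all, one port per node assumed).
 */
uint64_t sa_queries_without_cache(uint64_t nodes, uint64_t procs_per_node)
{
    if (nodes == 0)
        return 0;
    return nodes * procs_per_node * (nodes - 1);
}

/*
 * Queries seen by the SA when each node fills one shared cache.
 * Assumes one GetTable query per node populates the whole cache.
 */
uint64_t sa_queries_with_cache(uint64_t nodes)
{
    return nodes;
}
```

Under these assumptions the per-process count grows with the square of the node
count times the number of cores, while the cached count grows linearly with the
node count.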

>4. Why do we need yet another API and yet another module to speed up just
>RDMA/CM path record queries?  We now get 2 ways to do this (with/without the
>cache).  Shouldn't there be just one?

I did consider this, but the cache operates synchronously, and the ib_sa
interface is asynchronous.  I tried to make the API make sense for the cache.
The rdma_cm doesn't really take advantage of the synchronous interface, but I
believe that ipoib could.  Converting the ib_local_sa to an asynchronous
interface requires adding registration calls, and an ability to cancel
operations.

One potential benefit with a single interface is adding the ability to populate
the cache on an as-needed basis, similar to how ipoib works.  Going this route
requires determining how long to maintain path records in the cache, and how to
configure the cache for this use.  I didn't explore this option in a lot of
detail because it didn't match up with the labs' use.

>5. How will the user guess the correct value for paths_per_dest tunable,
>besides disabling the cache? I notice it is currently set to a value
>of 0x7F. Where does this value come from?

This sets the NumbPath field in the path record query.  0x7f is the maximum
value.  Other useful values would be 0 - disable, 1 - one path to each DGID.
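
For illustration only, the sketch below shows one way the tunable could be
interpreted.  The helper names are hypothetical, and clamping out-of-range
values to 0x7f is my assumption; the module could just as well reject them.

```c
#include <stdint.h>

#define NUMB_PATH_MAX 0x7fu  /* maximum NumbPath value, as stated above */

/* A value of 0 disables the cache; any other value leaves it enabled. */
int cache_enabled(unsigned int paths_per_dest)
{
    return paths_per_dest != 0;
}

/*
 * Value to place in the NumbPath field of the path record query.
 * Requests above the maximum are clamped (an assumption; rejecting the
 * value would be an equally valid policy).
 */
uint8_t numb_path_for_query(unsigned int paths_per_dest)
{
    if (paths_per_dest > NUMB_PATH_MAX)
        return (uint8_t)NUMB_PATH_MAX;
    return (uint8_t)paths_per_dest;
}
```

With this reading, 1 asks the SA for a single path to each DGID, and the
default of 0x7f asks for as many paths as the field can express.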

>Since OFED includes a significantly different version of this code
>(without notices), and this is the first time the notices code
>makes an appearance, I think that targeting .23, and considering
>alternative options such as the above, would be more prudent.

I don't have any objections to waiting if suggestions cannot be incorporated by
the time 2.6.22 closes, or if we can't reach consensus.  But if all changes are
in by 2.6.22, there's not much to be gained by letting it sit out of tree an
extra release.  I can disable the cache by default, or mark it as experimental.

- Sean


