[ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache

Jason Gunthorpe jgunthorpe at obsidianresearch.com
Mon Apr 23 16:23:12 PDT 2007


On Tue, Apr 24, 2007 at 01:30:05AM +0300, Michael S. Tsirkin wrote:

> > For say, 10000 nodes you could compact an any-to-any path table into
> > around 20 megabytes.
> 
> I wonder how do you propose to compact the path records that drastically?
> Number of records seems to be  10000 * 10000 / 2 = 50 million.

I was imagining this kind of arrangement when I wrote the above (and I
only imagined it a little bit, so let's see if it works out):

struct DataBase
{
   unsigned int GIDS;
   in6_addr gid_table[GIDS];
   unsigned int PATHS;
   struct Path path_table[PATHS];
};

struct Path
{
   uint16_t gid_idx;
   uint16_t dlid;
   uint8_t sl;
  // etc
   uint8_t valid_sgid_map[GIDS/8];
};

Lookup:
  bool match_it(struct Path *p)
  {
     const unsigned int src_gid_idx = ...;  // look up in gid_table
     const unsigned int dest_gid_idx = ...; // look in gid_table
     if (p->gid_idx == dest_gid_idx &&
         p->valid_sgid_map[src_gid_idx/8] & (1 << (src_gid_idx % 8)))
         return true;
     return false;
  }

  possible_paths = bsearch(DataBase.path_table, match_it);

There would be at most NUM_LIDS*NUM_SL Path records (i.e. all valid
DLID/SL combinations in the subnet).

So for 10000 nodes (1 LID each) we have:
  GIDS = 10000
  PATHS = 10000*16 = 160000
  sizeof(gid_table) = 160000
  sizeof(struct Path) = 1258
  sizeof(path_table) = 201280000
  sizeof(DataBase) = 201440000 ~= 201 MB (worst case)

That worst case assumes all 16 SLs are populated; a fabric using a
single SL needs only PATHS = 10000 and about 12.6 MB, which is the
regime the ~20 MB figure above was aimed at.

Broadly, this compresses the 128-bit GIDs by substituting a 16-bit
unique identifier (via gid_table), and compacts the path records by
observing that in the N*N matrix there are only so many parameter
combinations, and they are shared based on the source GID. So we use a
bitmap to indicate which destination parameters are valid for each
source in the network. Path lookup is O(log(n)) since the path and gid
tables would be sorted and have fixed element size. The whole thing
could be moved with a single RDMA WQ.

I think rates and pkeys can be coded similarly, but the upper bound
moves up to NUM_LIDS*NUM_SL*NUM_PKEYS*NUM_RATES. 

Off the top of my head I'd say that the pkey coding and
valid_sgid_maps can probably be optimized further, but that would
require actual study :P

I'd also include a path trace database and some flags so that APM can
be made to work sanely, but that is just fluff stuff.

The only reason I mention this is because cache-coherency is a huge
PITA. Even if the spec were updated to add new MADs for coherency it
would still be a PITA. I think for true high-availability you need to
get this stuff right.

If the SM re-routes the fabric to cover for a broken link that data
needs to get pushed out, QPs need to be updated, APM needs to be
fiddled/etc. IB implementations don't do that today, and replication
at least provides a scalable foundation to support this in huge
clusters.

Also, routers have the same basic set of problems :< Routers really
want an accurate replica of the SA database to run well.

Jason


