[ofa-general] Multi-threaded diags (Was: Re: [PATCH 4/5] infiniband-diags/libibnetdisc: Introduce a context object.)

Ira Weiny weiny2 at llnl.gov
Thu Aug 27 09:48:10 PDT 2009


On Wed, 26 Aug 2009 18:24:20 -0600
Jason Gunthorpe <jgunthorpe at obsidianresearch.com> wrote:

> On Wed, Aug 26, 2009 at 04:40:26PM -0700, Ira Weiny wrote:
> 
> > Of course!  :-)  But first I would like to mention some numbers from the
> > prototype code I have.  When running on a small fabric the additional overhead
> > of thread creation actually slows down the scan.  :-(
> 
> It seems strange to me to thread something like this (and a lot of hard
> work)..
> 
> FSM multiplexing the recv path usually gives much better performance,
> something like net discovery is quite easy..

The original algorithm and data structures lent themselves to threading.
Now that I am neck deep in all this I have thought that rewriting it all might
be easier.

> main loop:
>  fill tx queue from next list
>  receive replies and correlate with next list

This would still need additional code (or additional synchronization in the
API to libibnetdisc) if you wanted a user app to be multi-threaded.  Someone
has to be in charge of receiving all replies on that ibmad_port object and
handing them to the proper owner.  Of course one could open multiple
ibmad_port objects but how is the app writer to know to do that?  Digging
through the code to find out that libibnetdisc is consuming all the replies?
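
To make the "someone has to be in charge" point concrete, here is a minimal
sketch of a reply dispatcher keyed on transaction ID.  Everything in it is
hypothetical (register_request, dispatch_reply, count_reply and the pending
table are illustration names, not libibmad or libibnetdisc API), and it is
single-threaded; a threaded app would still need a lock around the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16

/* Hypothetical callback type: invoked with the reply and its owner. */
typedef void (*reply_cb)(uint64_t tid, const void *reply, void *owner);

struct pending {
    int      used;
    uint64_t tid;    /* transaction ID the request was sent with */
    reply_cb cb;     /* who to hand the reply to */
    void    *owner;  /* opaque per-requester state */
};

static struct pending table[MAX_PENDING];

/* Record that `owner` is waiting for a reply with this tid.
 * Returns 0 on success, -1 if the table is full. */
int register_request(uint64_t tid, reply_cb cb, void *owner)
{
    int i;

    assert(cb != NULL);
    for (i = 0; i < MAX_PENDING; i++) {
        if (!table[i].used) {
            table[i].used = 1;
            table[i].tid = tid;
            table[i].cb = cb;
            table[i].owner = owner;
            return 0;
        }
    }
    return -1;
}

/* Called by the single receiver for every reply read from the port.
 * Returns 0 if an owner was found, -1 if nobody was waiting for it. */
int dispatch_reply(uint64_t tid, const void *reply)
{
    int i;

    for (i = 0; i < MAX_PENDING; i++) {
        if (table[i].used && table[i].tid == tid) {
            struct pending p = table[i];

            table[i].used = 0;        /* free the slot before calling out */
            p.cb(tid, reply, p.owner);
            return 0;
        }
    }
    return -1;
}

/* Example owner callback: counts replies in the int that owner points to. */
void count_reply(uint64_t tid, const void *reply, void *owner)
{
    (void)tid;
    (void)reply;
    (*(int *)owner)++;
}
```

The design question from the paragraph above remains: this table has to live
somewhere that every user of the port agrees on.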

This is what got me on this in the first place.  smp_query_via (_do_madrpc) is
not thread safe.  Threading was the easy way to deal with multiple blocking
queries on the fabric.  Changing _do_madrpc to be thread safe allowed a very
quick multithreaded implementation on top of the current algorithm which
blocked on multiple queries.  I did not have to form the queries myself, it
was easy...  (I had that working months ago.)  Given that we don't want to
change libibmad, things got more complicated, and your algorithm seems much
better... (except [see below])

Also, I feel that someone down the road might fall into the same trap that I
did, thinking that smp_query_via is thread safe, and I would like to fix that.

> 
> each entry:
>  add to next list additional ports
> 
> Repeat until dead.
> 
> Where a 'next list' would be a set of actions along the lines of
> 'query node' or 'query port' the action on a 'query node' completion
> is to generate 'query port' next list items for all the ports, and on
> 'query port' completion is to generate 'query node' items for all
> enabled ports..
> 
> libumad is nonblocking, parallel, etc...
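
The "next list" loop quoted above can be sketched as a plain work queue.
This is only an illustration under loud assumptions: the fabric is a
hypothetical in-memory table (fake_node, fabric), every "reply" completes
immediately, and no libumad or libibmad call is made.  A real version would
keep several requests outstanding and correlate replies as they arrive.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 8
#define MAX_PORTS 4
#define MAX_WORK  128   /* enough for every node and port query in this toy */

enum action { QUERY_NODE, QUERY_PORT };

struct work {
    enum action act;
    int node;
    int port;
};

/* Hypothetical stand-in for the fabric; not a libibnetdisc structure. */
struct fake_node {
    int nports;
    int peer[MAX_PORTS];  /* node index on the far end, -1 if link is down */
};

static struct fake_node fabric[MAX_NODES];
static int seen[MAX_NODES];
static struct work next_list[MAX_WORK];
static int head, tail;

static void push(enum action act, int node, int port)
{
    assert(tail < MAX_WORK);
    next_list[tail].act = act;
    next_list[tail].node = node;
    next_list[tail].port = port;
    tail++;
}

/* Walk the fake fabric from `start`; returns the number of nodes found. */
int discover(int start)
{
    int found = 0;
    int p;

    memset(seen, 0, sizeof(seen));
    head = tail = 0;
    push(QUERY_NODE, start, 0);

    while (head < tail) {               /* "repeat until dead" */
        struct work w = next_list[head++];

        if (w.act == QUERY_NODE) {
            if (seen[w.node])
                continue;               /* reached again via another link */
            seen[w.node] = 1;
            found++;
            /* 'query node' completion: queue a query for each port */
            for (p = 0; p < fabric[w.node].nports; p++)
                push(QUERY_PORT, w.node, p);
        } else {
            /* 'query port' completion: follow the link if it is up */
            int peer = fabric[w.node].peer[w.port];

            if (peer >= 0 && !seen[peer])
                push(QUERY_NODE, peer, 0);
        }
    }
    return found;
}
```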

Yes, and libibmad layers an easier interface on top of it for issuing common
queries.  Why should we ask the user to re-implement that code?

For example, mad_rpc now handles redirection.  My implementation does not yet.
So now I have to handle that on my own as well...  :-(

Ira

> 
> Jason


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov


