[ofa-general] [PATCH] Lock _do_madrpc for thread safety

Fri Jul 10 13:26:33 PDT 2009

On Fri, 10 Jul 2009 14:37:01 -0400
Hal Rosenstock <hal.rosenstock at gmail.com> wrote:

> Ira,
> 
> On 7/8/09, Ira Weiny <weiny2 at llnl.gov> wrote:
> > Sasha,
> >
> > I am working on making libibnetdisc a parallel implementation.  As a
> > result I have found that _do_madrpc is not thread safe.
> 
> Have you read Sasha's commit 51d25384626a7b4ba386c414ed56c647a7bf64df
> from 12/26/08? In it, he states "I think that it will be more robust
> for multithreaded application to use its own synchronization methods
> (pthread mutex or any other) for better control."

I missed that.  Thanks.
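
For context, the application-side synchronization that the commit message
describes could look roughly like the sketch below.  It assumes one
process-wide pthread mutex; rpc_fn, locked_rpc and count_call are hypothetical
names used only for illustration and are not part of libibmad.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a non-thread-safe RPC entry point.
 * The real libibmad calls have different signatures. */
typedef int (*rpc_fn)(void *ctx);

/* One lock shared by every thread that issues RPCs (an assumption). */
static pthread_mutex_t rpc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Run fn(ctx) while holding the lock so that only one request/response
 * exchange is in flight at a time. */
static int locked_rpc(rpc_fn fn, void *ctx)
{
    int rc;

    assert(fn != NULL);
    pthread_mutex_lock(&rpc_lock);
    rc = fn(ctx);
    pthread_mutex_unlock(&rpc_lock);
    return rc;
}

/* Trivial callback used to exercise the wrapper: counts invocations. */
static int count_call(void *ctx)
{
    ++*(int *)ctx;
    return 0;
}
```

A wrapper like this only helps if every caller in the process goes through it.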

However, I am starting to lean towards putting something in libibmad.  The
problem for me is that in the ibnetdisc library I can't control what the
application is doing.  And more importantly I can't allow the application to
do anything (at least at certain times).

I will wait for Sasha to weigh in on this but I would rather not implement
something in ibnetdisc to synchronize the calls of another library.  For
example I don't want to re-implement mad_rpc and/or mad_send_via.  However, we
must allow applications above ibnetdisc to make calls on the wire.

I have been thinking about this since I sent the last email and I have another
idea.  What if we made ibmad queue all responses it receives in _do_madrpc
that are not the response it is looking for?  When mad_receive_via or
subsequent _do_madrpc calls are made, they would first look in this queue for
messages and return them if they are there.  That would keep things in sync
and not lose messages.  Anyone could call into the mad layer any way they want.

Applications, or parts of applications, which want to do parallel queries
would still have to match up their own responses to queries but they would not
have to worry about losing messages in the ibmad layer.
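
A minimal sketch of the queueing idea, assuming responses are keyed by the
32-bit TID and held as opaque pointers.  All names here (struct pending,
pending_push, pending_take, pending_pop) are hypothetical; none of this exists
in libibmad, and a real version would need a lock around the queue.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One response that arrived while a caller was waiting for a different TID. */
struct pending {
    uint32_t tid;          /* low 32 bits of the transaction ID */
    void *umad;            /* opaque response buffer, owned by the caller */
    struct pending *next;
};

struct pending_queue {
    struct pending *head, *tail;
};

/* Append an unmatched response; returns 0 on success, -1 if out of memory. */
static int pending_push(struct pending_queue *q, uint32_t tid, void *umad)
{
    struct pending *p;

    assert(q != NULL);
    p = malloc(sizeof(*p));
    if (!p)
        return -1;
    p->tid = tid;
    p->umad = umad;
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    return 0;
}

/* RPC path: remove and return the oldest response with this TID, or NULL. */
static void *pending_take(struct pending_queue *q, uint32_t tid)
{
    struct pending **pp = &q->head, *prev = NULL;

    for (; *pp; prev = *pp, pp = &(*pp)->next) {
        if ((*pp)->tid == tid) {
            struct pending *p = *pp;
            void *umad = p->umad;

            *pp = p->next;          /* unlink the node */
            if (q->tail == p)
                q->tail = prev;     /* removed the last node */
            free(p);
            return umad;
        }
    }
    return NULL;
}

/* Receive path: remove and return the oldest queued response, or NULL. */
static void *pending_pop(struct pending_queue *q)
{
    struct pending *p = q->head;
    void *umad;

    if (!p)
        return NULL;
    umad = p->umad;
    q->head = p->next;
    if (!q->head)
        q->tail = NULL;
    free(p);
    return umad;
}
```

In this sketch an RPC would call pending_take() with its own TID before reading
from the wire and pending_push() for every response that is not its own, while
a plain receive would drain pending_pop() first.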

> 
> > The following patch fixes
> > this.  However, I don't know what you want to do...
> >
> > If one only uses mad_rpc and mad_rpc_rmpp then the patch works.  However,
> > if someone is using mad_send_via at the same time _do_madrpc will still
> > fail.  Is it by design that some responses will be lost while _do_madrpc
> > is looking for its response via the TID?
> >
> > Also, according to C13-18.1.1 and C13-19.1.1 you must use the SGID (or SLID)
> > and the MgmtClass in addition to the TID to determine the uniqueness of a
> > message.  The SGID (or SLID) is of course the same but should the MgmtClass
> > be checked here as well?
> >
> > Finally, why does _do_madrpc cast the transaction id to a 32-bit value?
> 
> The kernel uses the high 32 bits for an agent ID. See kernel
> Documentation/infiniband/user_mad.txt "Transaction IDs".

Ah got it, thanks,
Ira
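
To make the TID point concrete, here is a small sketch of splitting a 64-bit
transaction ID into the kernel-owned high half and the user-owned low half,
plus a match that also compares MgmtClass as asked about above.  The helper
names are hypothetical, and whether libibmad should check MgmtClass is still
an open question in this thread.

```c
#include <stdint.h>

/* High 32 bits: used by the kernel for the agent ID (per user_mad.txt). */
static uint32_t tid_agent_bits(uint64_t tid)
{
    return (uint32_t)(tid >> 32);
}

/* Low 32 bits: the part a userspace caller can use to match responses. */
static uint32_t tid_user_bits(uint64_t tid)
{
    return (uint32_t)(tid & 0xffffffffu);
}

/* Hypothetical match: same low 32 TID bits and same MgmtClass.
 * The SGID/SLID comparison is omitted since it is the same port here. */
static int response_matches(uint64_t sent_tid, uint8_t sent_class,
                            uint64_t recv_tid, uint8_t recv_class)
{
    return tid_user_bits(sent_tid) == tid_user_bits(recv_tid) &&
           sent_class == recv_class;
}
```

Comparing only the low half means a response still matches after the kernel
has overwritten the high 32 bits.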

> 
> -- Hal
> 
> > Confused,
> > Ira
> 
> <snip...>

-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov