[ofa-general] [PATCH] Lock _do_madrpc for thread safety
Ira Weiny
weiny2 at llnl.gov
Wed Jul 8 17:47:20 PDT 2009
Sasha,
I am working on making libibnetdisc a parallel implementation. As a result I
have found that _do_madrpc is not thread safe. The following patch fixes
this. However, I don't know you want to do...
If one only uses mad_rpc and mad_rpc_rmpp then the patch works. However, if
someone is using mad_send_via at the same time _do_madrpc will still fail. Is
it by design that some responses will be lost while _do_madrpc is looking for
it's response via the TID?
Also, according to C13-18.1.1 and C13-19.1.1 you must use the SGID (or SLID)
and the MgmtClass in addition to the TID to determine the uniqueness of a
message. The SGID (or SLID) is of course the same but should the MgmtClass be
checked here as well?
Finally, why does _do_madrpc cast the transaction id to a 32 bit value?
Confused,
Ira
From: Ira Weiny <weiny2 at llnl.gov>
Date: Wed, 8 Jul 2009 15:01:11 -0700
Subject: [PATCH] Lock _do_madrpc for thread safety
Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
libibmad/configure.in | 2 ++
libibmad/src/rpc.c | 18 +++++++++++++++---
2 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/libibmad/configure.in b/libibmad/configure.in
index b4f5c41..5d3e304 100644
--- a/libibmad/configure.in
+++ b/libibmad/configure.in
@@ -43,6 +43,8 @@ then
AC_CHECK_HEADER(infiniband/umad.h, [],
AC_MSG_ERROR([<infiniband/umad.h> not found. libibmad requires libibumad.])
)
+AC_CHECK_HEADER(pthread.h, [],
+ AC_MSG_ERROR([<pthread.h> not found. libibmad requires pthread.h]))
fi
dnl Checks for library functions
diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c
index 07b623d..83bd4f0 100644
--- a/libibmad/src/rpc.c
+++ b/libibmad/src/rpc.c
@@ -43,6 +43,7 @@
#include <infiniband/umad.h>
#include <infiniband/mad.h>
+#include <pthread.h>
#include "mad_internal.h"
@@ -58,6 +59,8 @@ static int madrpc_timeout = MAD_DEF_TIMEOUT_MS;
static void *save_mad;
static int save_mad_len = 256;
+static pthread_mutex_t snd_rcv_lock = PTHREAD_MUTEX_INITIALIZER;
+
#undef DEBUG
#define DEBUG if (ibdebug) IBWARN
#define ERRS(fmt, ...) do { \
@@ -127,6 +130,7 @@ static int
_do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
int timeout, int max_retries)
{
+ uint32_t trid_rcv;
uint32_t trid; /* only low 32 bits */
int retries;
int length, status;
@@ -150,6 +154,8 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
ERRS("retry %d (timeout %d ms)", retries, timeout);
length = len;
+
+ pthread_mutex_lock(&snd_rcv_lock);
if (umad_send(port_id, agentid, sndbuf, length, timeout, 0) < 0) {
IBWARN("send failed; %m");
return -1;
@@ -159,6 +165,7 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
/* send packet is lost somewhere. */
do {
if (umad_recv(port_id, rcvbuf, &length, timeout) < 0) {
+ pthread_mutex_unlock(&snd_rcv_lock);
IBWARN("recv failed: %m");
return -1;
}
@@ -168,9 +175,14 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len,
xdump(stderr, "rcv buf\n", umad_get_mad(rcvbuf),
IB_MAD_SIZE);
}
- } while ((uint32_t)
- mad_get_field64(umad_get_mad(rcvbuf), 0,
- IB_MAD_TRID_F) != trid);
+
+ trid_rcv = (uint32_t) mad_get_field64(umad_get_mad(rcvbuf),
+ 0, IB_MAD_TRID_F);
+ if (trid_rcv != trid)
+ IBWARN("trid_rcv(%x) != trid (%x)\n", trid_rcv, trid);
+
+ } while (trid_rcv != trid);
+ pthread_mutex_unlock(&snd_rcv_lock);
status = umad_status(rcvbuf);
if (!status)
--
1.5.4.5
More information about the general
mailing list