[ofa-general] [PATCH RFC] libibumad: give up cpu time slice if write() failed

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Thu Jan 29 07:26:37 PST 2009


Hi Sasha,

While running opensm on a single-core CPU I've noticed the following problem:
when the SA is stressed with many SA queries (such as when you run "osmtest -f f"
on a multi-core CPU machine), opensm sometimes fails to send responses.
It appears that the send buffer fills up and write() fails.
Since the CPU has a single core, when the sender thread retries sending the
same packet, it fails again and again, because the driver never gets a chance
to transmit anything from the send buffer.

Then I added a 20 msec sleep after the write() failure to make the thread give up
its CPU time slice, and I saw some improvement. When I added a sleep of several
seconds, the problem disappeared completely, but then, of course, the client's
transaction fails with a timeout unless you specifically increase the transaction
timeout on the client side.
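
To make the idea concrete, here is a rough sketch of a bounded retry that sleeps
between attempts. write_with_backoff() and its max_tries/delay_ms parameters are
hypothetical names used for illustration only; nothing like this exists in
libibumad, and it is not what the patch below does. It assumes the descriptor
accepts a buffer all-or-nothing, so that retrying a failed write cannot
duplicate data.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of libibumad.
 * Tries write() up to max_tries times, sleeping delay_ms between attempts
 * so that other work (here: the driver) gets a chance to run.
 * Returns len on success, or -1 with errno set once all attempts fail.
 */
static ssize_t write_with_backoff(int fd, const void *buf, size_t len,
				  int max_tries, long delay_ms)
{
	struct timespec delay;
	int saved_errno = EINVAL;	/* reported if max_tries <= 0 */
	int i;

	delay.tv_sec = delay_ms / 1000;
	delay.tv_nsec = (delay_ms % 1000) * 1000000L;

	for (i = 0; i < max_tries; i++) {
		ssize_t n;

		errno = 0;
		n = write(fd, buf, len);
		if (n >= 0 && (size_t)n == len)
			return n;

		/* Short write without an error code: report EIO. */
		saved_errno = errno ? errno : EIO;

		/* Give up the CPU before retrying, but not after the last attempt. */
		if (i + 1 < max_tries)
			nanosleep(&delay, NULL);
	}

	errno = saved_errno;
	return -1;
}
```

A caller could try something like write_with_backoff(fd, umad, size, 3, 20) to
get a few 20 msec pauses. The total delay (max_tries * delay_ms) would have to
stay well below the client's transaction timeout, for the reason given above.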

So, what do you think about the following patch?

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 libibumad/src/umad.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c
index 78b956d..a03a018 100644
--- a/libibumad/src/umad.c
+++ b/libibumad/src/umad.c
@@ -814,6 +814,8 @@ umad_send(int fd, int agentid, void *umad, int length,

 	DEBUG("write returned %d != sizeof umad %zu + length %d (%m)",
 	      n, umad_size(), length);
+	
+	usleep(20000);
 	if (!errno)
 		errno = EIO;
 	return -EIO;
-- 
1.5.1.4



