[ofa-general] ibsim: sim_read_pkt: write failed: Resource temporarily unavailable - pkt dropped
Vincent Ficet
jean-vincent.ficet at bull.net
Thu Sep 18 07:45:10 PDT 2008
Hello,
While simulating a large fabric with ibsim (roughly 3000 lines of
topology, 50 36-port switches, 576 HCAs), I get errors like the following:
sim_read_pkt: write failed: Resource temporarily unavailable - pkt dropped
The code is as follows (ibsim.c, function sim_read_pkt()):
    // reply
    ret = write(dcl->fd, buf, size);
    if (ret == size)
        return 0;
    if (ret < 0 && (errno == ECONNREFUSED || errno == ENOTCONN)) {
        IBWARN("client %u seems to be dead - disconnecting.",
               dcl->id);
        disconnect_client(dcl->id);
    }
    IBWARN("write failed: %m - pkt dropped");
The errno reported here is EAGAIN ("Resource temporarily unavailable"),
which this code does not handle at all.
When I kill opensm after seeing these errors, I see that the MADs were
not acknowledged by ibsim, e.g.:
OpenSM: Got signal 2 - exiting...
There are still 51 MADs out. Forcing the exit of the OpenSM application...
To address this issue, I modified the code as follows:
--- ibsim.c.ORIG 2008-09-18 14:30:07.000000000 +0200
+++ ibsim.c 2008-09-18 15:37:55.000000000 +0200
@@ -481,6 +481,8 @@
return -1;
}
for (;;) {
+ int retry_count = 0;
+
if ((size = read(fd, buf, sizeof(buf))) <= 0)
return size;
@@ -497,7 +499,14 @@
size, sizeof(struct sim_request), dcl->id, dcl->fd);
// reply
- ret = write(dcl->fd, buf, size);
+ do {
+ ret = write(dcl->fd, buf, size);
+ if (retry_count && (ret != size)) {
+ IBWARN("failed to send reply: ret = %d, retry_count =%d, errno = %d.",
+ ret, retry_count, errno);
+ }
+ } while ((retry_count++ < 20) && (ret == -1));
+
if (ret == size)
return 0;
Basically, it retries the write up to 20 times before giving up (and I
still get errors, although fewer).
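For comparison, a variant that waits for the descriptor to become writable
between attempts, instead of retrying immediately, could look roughly like
the sketch below. This is only an illustration: write_reply_retry(),
max_retries and timeout_ms are hypothetical names that do not exist in
ibsim, and it assumes the descriptor is non-blocking, that a reply must go
out in a single write(), and that poll() reports writability usefully for
this kind of socket.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of ibsim: send one reply on a non-blocking
 * descriptor. On EAGAIN/EWOULDBLOCK, wait up to timeout_ms for POLLOUT and
 * try again, at most max_retries extra times.
 * Returns 0 on success, -1 on failure (errno is left set).
 */
int write_reply_retry(int fd, const void *buf, size_t size,
                      int max_retries, int timeout_ms)
{
    int attempt;

    for (attempt = 0; attempt <= max_retries; attempt++) {
        ssize_t ret = write(fd, buf, size);
        struct pollfd pfd;

        if (ret >= 0)   /* full write is success, short write is an error */
            return ((size_t)ret == size) ? 0 : -1;
        if (errno == EINTR)
            continue;   /* interrupted by a signal, just try again */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;  /* real error, e.g. ECONNREFUSED or ENOTCONN */

        /* Receiver is not draining fast enough: wait until writable. */
        pfd.fd = fd;
        pfd.events = POLLOUT;
        pfd.revents = 0;
        if (poll(&pfd, 1, timeout_ms) < 0 && errno != EINTR)
            return -1;
    }
    errno = EAGAIN;     /* still not writable after all retries */
    return -1;
}
```

In sim_read_pkt() this would stand in for the bare write(), e.g.
write_reply_retry(dcl->fd, buf, size, 20, 100), with the existing
ECONNREFUSED/ENOTCONN check applied when it returns -1. The retry count and
timeout here are arbitrary values, not tuned ones.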
The question is: am I looking at the right thing here, or is the 'pkt
dropped' error hiding another problem elsewhere?
Note: both ibsim and opensm were built from the current git HEAD.
Thanks for your help,
Vincent