[ofa-general] ibsim: sim_read_pkt: write failed: Resource temporarily unavailable - pkt dropped

Vincent Ficet jean-vincent.ficet at bull.net
Thu Sep 18 07:45:10 PDT 2008


Hello,

While simulating a large fabric with ibsim (roughly 3000 lines of 
topology, 50 x 36-port switches, 576 HCAs), I get the following errors:

sim_read_pkt: write failed: Resource temporarily unavailable - pkt dropped

The code is as follows (ibsim.c, function sim_read_pkt()):

        // reply
        ret = write(dcl->fd, buf, size);
        if (ret == size)
            return 0;

        if (ret < 0 && (errno == ECONNREFUSED || errno == ENOTCONN)) {
            IBWARN("client %u seems to be dead - disconnecting.",
                   dcl->id);
            disconnect_client(dcl->id);
        }
        IBWARN("write failed: %m - pkt dropped");

The errno set here is EAGAIN ("Resource temporarily unavailable"), which 
this code does not handle at all: the packet is simply dropped.
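
For context, here is a minimal sketch of how write() can end up returning 
EAGAIN. It assumes the client fd behaves like a non-blocking AF_UNIX 
datagram socket whose peer is not reading fast enough; I have not confirmed 
that this is what happens inside ibsim. The function name 
count_writes_until_eagain() is made up for the illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical demo, not ibsim code: queue datagrams on a non-blocking
 * AF_UNIX socket pair without ever reading them. Returns how many writes
 * succeeded before write() failed with EAGAIN, or -1 on any other error.
 */
static int count_writes_until_eagain(size_t size)
{
    int sv[2];
    int flags;
    int count = 0;
    int result = -1;
    char buf[512];

    if (size > sizeof(buf))
        size = sizeof(buf);
    memset(buf, 0, sizeof(buf));

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    /* Only the sending end needs to be non-blocking for this demo. */
    flags = fcntl(sv[0], F_GETFL);
    if (flags < 0 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) < 0)
        goto out;

    /* Nobody reads sv[1], so the queue eventually fills up. */
    while (count < 1000000) {
        if (write(sv[0], buf, size) < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                result = count;
            break;
        }
        count++;
    }

out:
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

The number of writes that succeed depends on kernel queue limits, so the 
sketch only shows that the failure appears once the receiver falls behind.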

When I kill opensm after seeing these errors, I see that the MADs were 
not acknowledged by ibsim, e.g.:

OpenSM: Got signal 2 - exiting...
There are still 51 MADs out. Forcing the exit of the OpenSM application...

To address this issue, I modified the code as follows:

--- ibsim.c.ORIG    2008-09-18 14:30:07.000000000 +0200
+++ ibsim.c    2008-09-18 15:37:55.000000000 +0200
@@ -481,6 +481,8 @@
         return -1;
     }
     for (;;) {
+        int retry_count = 0;
+
         if ((size = read(fd, buf, sizeof(buf))) <= 0)
             return size;
 
@@ -497,7 +499,14 @@
              size, sizeof(struct sim_request), dcl->id, dcl->fd);
 
         // reply
-        ret = write(dcl->fd, buf, size);
+        do {
+            ret = write(dcl->fd, buf, size);
+            if (retry_count && (ret != size)) {
+                IBWARN("failed to send reply: ret = %d, retry_count =%d, errno = %d.",
+                    ret, retry_count, errno);
+            }
+        } while ((retry_count++ < 20) && (ret == -1));
+             
         if (ret == size)
             return 0;
 
Basically, it retries the write up to 20 times before giving up (and I 
still get errors, although fewer).

The question is: am I looking at the right thing here, or is the 'pkt 
dropped' error hiding another problem elsewhere?
Note: both ibsim and opensm were pulled from the git head branch.

Thanks for your help,

Vincent


