[ofa-general] QP connection healthy detection problem with fork/exec

Tang, Changqing changquing.tang at hp.com
Wed Mar 26 14:45:38 PDT 2008



Hi:
        I have a problem detecting connection health; here is what I do. Rank 0 and rank 1 set up a QP connection.
Rank 0 is waiting for a message from rank 1. During this time, rank 0 periodically sends a heart-beat message to
rank 1 to detect whether the connection is OK or rank 1 has died.

        The heart-beat is a zero-byte RDMA message:

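                sr.next = NULL;                        /* single work request, no chain */
                sr.wr_id = (uint64_t)(AULONG)rdmahdr;  /* handed back in the completion */
                sr.sg_list = &ssg;
                sr.num_sge = 0;                        /* no SGEs: zero-byte payload */
                sr.opcode = IBV_WR_RDMA_WRITE;
                sr.send_flags = IBV_SEND_INLINE|IBV_SEND_SIGNALED;  /* request a completion */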

        If this heart-beat message completes successfully, I assume the connection is OK and the peer process is alive.

        However, in rank 1, fork() is called and the parent calls exit(); the child sleeps for 5 minutes. But in rank 0,
the heart-beat message always succeeds until I kill rank 1's child.

        In a second test, rank 1 calls fork() and exits, and the child calls
execl("/bin/sleep", "sleep", "300", (char *)0);

        In rank 0, the heart-beat still succeeds until I kill the 'sleep' process.
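
        For reference, the rank 1 side of the fork/exec test can be sketched roughly as below. This is only a minimal sketch, not the actual HP-MPI code: the helper name spawn_sleeper() is hypothetical, and the QP setup with rank 0 that happens before the fork is omitted.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the real code): fork, and let the child
 * replace itself with /bin/sleep for the given number of seconds.
 * Returns the child's pid in the parent, or -1 if fork() failed.
 */
static pid_t spawn_sleeper(const char *seconds)
{
    pid_t pid = fork();

    if (pid == 0) {
        /* Child: the process image is replaced by 'sleep'. */
        execl("/bin/sleep", "sleep", seconds, (char *)0);
        _exit(127);  /* reached only if execl() itself fails */
    }

    /* Parent: child pid on success, -1 on fork() failure. */
    return pid;
}
```

        In the test, rank 1 would call spawn_sleeper("300") after the QP connection is established, and the parent would then call exit().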

        It is easy to understand that if only fork() is called, the child holds the QP resources inherited from the parent, so rank 0 can NOT detect
anything wrong. But if the child calls exec, everything in rank 1 should have been destroyed, so why can't rank 0 detect that the connection is broken?


        Thanks for any help.


--CQ Tang, HP-MPI


