[ofa-general] QP connection healthy detection problem with fork/exec
Tang, Changqing
changquing.tang at hp.com
Wed Mar 26 14:45:38 PDT 2008
Hi:
I have a problem detecting connection health. Here is what I do: rank 0 and rank 1 set up a QP connection.
Rank 0 is waiting for a message from rank 1. During this time, rank 0 periodically sends a heart-beat message to
rank 1 to detect whether the connection is OK or rank 1 has died.
The heart-beat is a zero-byte RDMA message:
sr.next = NULL;
sr.wr_id = (uint64_t)(AULONG)rdmahdr;
sr.sg_list = &ssg;
sr.num_sge = 0;
sr.opcode = IBV_WR_RDMA_WRITE;
sr.send_flags = IBV_SEND_INLINE|IBV_SEND_SIGNALED;
If this heart-beat message completes successfully, I assume the connection is OK and the peer process is alive.
However, rank 1 calls fork(), the parent calls exit(), and the child sleeps for 5 minutes. On rank 0,
the heart-beat message always succeeds until I kill rank 1's child.
In a further test, rank 1 calls fork() and exits, and the child calls
execl("/bin/sleep", "sleep", "300", (char *)0);
On rank 0, the heart-beat still succeeds until I kill the 'sleep' process.
It is easy to understand that if only fork() is called, the child holds the QP resources inherited from the parent, so rank 0 cannot detect
anything wrong. But if the child calls exec, everything in rank 1 has been destroyed, so why can't rank 0 detect that the connection is broken?
Thanks for any help.
--CQ Tang, HP-MPI