[Users] Linux kernel: Crash of IB peer in RC mode is not detected

Fabian Holler fabian.holler at profitbricks.com
Thu Oct 23 06:15:22 PDT 2014


Hello,

We are implementing Linux kernel modules that transfer data
with RDMA-Write operations over an RC connection between two hosts.

After the RDMA connection between the hosts is established, we trigger a
kernel Oops on one of them with "echo c > /proc/sysrq-trigger".

The other peer of the RC connection does not notice the crash.
RDMA-Write operations still complete successfully, with a WC event, 10 minutes
after the crash.
Our module has event handlers registered for:
- CQ ib_event_handler,
- QP ib_event_handler,
- device ib_event_handler,
- connection manager event handler.
However, we do not receive any event that indicates a connection abort.

I expected RDMA-Write operations to fail if the other peer crashes.
I also hoped that an event would be generated when a host crashes. The subnet
manager should notice it and notify every other device in the network.

Are we missing something in our modules?
Is there a way to determine that an RC peer has crashed without implementing a
ping-pong mechanism?
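
For reference, the bookkeeping behind such a ping-pong mechanism could look
roughly like the sketch below. It is a minimal userspace model in C, not code
from our modules and not a verbs API: the peer_liveness_* names are
hypothetical. It assumes the caller supplies a monotonic timestamp in
milliseconds and calls peer_liveness_pong() whenever a reply from the peer
arrives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a ping-pong liveness check; no kernel or verbs API. */
struct peer_liveness {
    uint64_t last_pong_ms; /* time the last reply from the peer arrived */
    uint64_t timeout_ms;   /* longer silence than this means "peer dead" */
};

/* Start tracking; the peer is assumed alive at now_ms. */
static void peer_liveness_init(struct peer_liveness *pl,
                               uint64_t now_ms, uint64_t timeout_ms)
{
    assert(timeout_ms > 0);
    pl->last_pong_ms = now_ms;
    pl->timeout_ms = timeout_ms;
}

/* Record that a reply (pong) from the peer arrived at now_ms. */
static void peer_liveness_pong(struct peer_liveness *pl, uint64_t now_ms)
{
    pl->last_pong_ms = now_ms;
}

/* Returns false once the peer has been silent for more than timeout_ms. */
static bool peer_liveness_alive(const struct peer_liveness *pl,
                                uint64_t now_ms)
{
    return now_ms - pl->last_pong_ms <= pl->timeout_ms;
}
```

A periodic timer would call peer_liveness_alive() and tear down the connection
when it returns false. This is the extra machinery we would like to avoid.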

Our setup:
- Linux 3.14.13
- Mellanox Technologies MT27500 Family [ConnectX-3],
  mlx4_core driver
- both peers are directly connected, no switch in between
- on both hosts OpenSM 3.2.6 is running


Thanks in advance,

Fabian