[Users] Linux kernel: Crash of IB peer in RC mode is not detected

Jack Wang xjtuwjp at gmail.com
Thu Oct 23 06:50:09 PDT 2014


CCing linux-rdma, which is a more appropriate list for this kind of question.

2014-10-23 15:15 GMT+02:00 Fabian Holler <fabian.holler at profitbricks.com>:
> Hello,
>
> we are implementing Linux kernel modules that transfer data with
> RDMA-Write operations over an RC connection between 2 hosts.
>
> After the RDMA connection between the hosts has been established, we cause a
> kernel Oops on one of them with "echo c > /proc/sysrq-trigger".
>
> The other peer of the RC connection does not notice the crash.
> RDMA-Write operations still complete successfully, each with a WC event,
> even 10 minutes after the crash.
> Our module has event handlers registered for:
> - CQ ib_event_handler,
> - QP ib_event_handler,
> - device ib_event_handler,
> - connection manager event handler.
> But we don't receive any events that indicate a connection abort.
>
> I expected that RDMA-Write operations would fail if the other peer crashes.
> I also hoped that an event would be generated when a host crashes. The subnet
> manager should notice it and notify every other device in the network.
>
> Are we missing something in our modules?
> Is there a way to determine that an RC peer has crashed without implementing a
> ping-pong mechanism?
>
> Our setup:
> - Linux 3.14.13
> - Mellanox Technologies MT27500 Family [ConnectX-3],
>   mlx4_core driver
> - both peers are directly connected, no switch in between
> - on both hosts OpenSM 3.2.6 is running
>
>
> thanks in advance
>
> Fabian
>
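
For reference, the bookkeeping behind the ping-pong mechanism mentioned in the
question could look roughly like the sketch below. It is plain userspace C, not
kernel code, and it only covers the timeout logic: how pings and pongs are
actually exchanged (for example as small Send/Recv messages on the same QP) is
left out. All names (peer_liveness, peer_liveness_init, and so on) and the
millisecond time base are assumptions made for illustration and do not come
from any existing kernel or verbs API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical liveness state for one RC peer; all times in milliseconds. */
struct peer_liveness {
    uint64_t last_pong_ms; /* time at which the last pong was seen */
    uint64_t timeout_ms;   /* silence longer than this counts as a crash */
};

/* Start tracking a peer; the peer is considered alive at now_ms. */
static void peer_liveness_init(struct peer_liveness *p,
                               uint64_t now_ms, uint64_t timeout_ms)
{
    assert(p != NULL);
    p->last_pong_ms = now_ms;
    p->timeout_ms = timeout_ms;
}

/* Call whenever a pong (or any other message from the peer) arrives. */
static void peer_liveness_on_pong(struct peer_liveness *p, uint64_t now_ms)
{
    assert(p != NULL);
    p->last_pong_ms = now_ms;
}

/* True if the peer has been silent for longer than the timeout. */
static bool peer_liveness_is_dead(const struct peer_liveness *p,
                                  uint64_t now_ms)
{
    assert(p != NULL);
    return (now_ms - p->last_pong_ms) > p->timeout_ms;
}
```

A periodic timer would send a ping and call peer_liveness_is_dead() with the
current time; once it returns true, the module can tear down the connection
itself instead of waiting for an event.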


