<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi!<br>
Recently we have run into a number of problems with the cluster software
(RHCS) after installing the latest OFED 2.0 software from Mellanox (<a
href="http://www.mellanox.com/page/mlnx_ofed_eula?mtag=linux_sw_drivers&mrequest=downloads&mtype=ofed&mver=MLNX_OFED-2.0-3.0.0&mname=MLNX_OFED_LINUX-2.0-3.0.0-rhel6.4-x86_64.tgz">MLNX_OFED_LINUX-2.0-3.0.0-rhel6.4-x86_64.tgz</a>).<br>
It seems that, in contrast to OFED 1.5, OFED 2.0 does not deliver
multicast messages to the sender's own receive queue, which prevents
some software (namely corosync) from functioning properly. A test case
application is attached to this message; it is cut from the corosync
stack and slightly modified.<br>
To compile: gcc -o test coropoll.c totemiba.c totemip.c util.c
-libverbs -lrdmacm -lrt<br>
To run: ./test -i &lt;ip_address&gt;, where ip_address is the ib_ipoib
address of the IB HCA.<br>
<br>
With the IB software distributed with RHEL 6.4:<br>
[root@ar03 ofed2test]# ./test -i 172.32.32.13<br>
addr: 172.32.32.13<br>
family 2<br>
initialize:<br>
pollrun<br>
mcast_bind<br>
mcast_rdma_event_fn ADDR_RESOLVED<br>
mcast_rdma_event_fn MULTICAST_JOIN<br>
iface_change_fn<br>
calling in send_fn<br>
in totemiba_mcast_flush_send: called ibv_post_send with res=0,
msg_len=9;, qp_num=117, qkey=1234567<br>
mcast_cq_send_event_fn, res of ibv_poll_cq=1<br>
in mcast_cq_recv_event_fn res=1<br>
in mcast_cq_recv_event_fn calling iba_deliver_fn, wc[0].byte_len=49<br>
IN iba_deliver_fn calling main_deliver_fn with bytes=49<br>
deliver_fn<br>
Received message asdfasdf<br>
<br>
<br>
With OFED 2.0:<br>
addr: 172.32.32.12<br>
family 2<br>
initialize:<br>
pollrun<br>
mcast_bind<br>
mcast_rdma_event_fn ADDR_RESOLVED<br>
mcast_rdma_event_fn MULTICAST_JOIN<br>
iface_change_fn<br>
calling in send_fn<br>
in totemiba_mcast_flush_send: called ibv_post_send with res=0,
msg_len=9;, qp_num=93, qkey=1234567<br>
mcast_cq_send_event_fn, res of ibv_poll_cq=1<br>
<br>
uname -a<br>
Linux ar02 2.6.32-358.18.1.el6.x86_64 #1 SMP Tue Aug 27 14:23:09 CDT
2013 x86_64 x86_64 x86_64 GNU/Linux<br>
HCA: Mellanox ConnectX-2, fw 2.9.1000<br>
<br>
It seems that recv_comp_channel->fd is never signalled and the message
is not delivered to the _sender_ (it is delivered to the other HCAs in
the multicast group, though).<br>
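<br>
One way to check this outside the corosync poll loop would be to wait
directly on the completion channel's file descriptor with a timeout
after posting the send. The following is only a minimal sketch using
plain poll(); the helper name wait_fd_readable is hypothetical and not
part of the attached test, and it assumes the receive CQ has already
been armed with ibv_req_notify_cq.<br>
<pre>
```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper (not part of the attached test case).
 * Returns 1 if fd becomes readable within timeout_ms milliseconds,
 * 0 if the timeout expires first, and -1 on error.
 */
int wait_fd_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    /* Retry if interrupted by a signal; the full timeout is restarted. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;              /* poll() itself failed */
    if (rc == 0)
        return 0;               /* nothing arrived in time */

    /* Readable data means an event is pending; anything else is an error. */
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```
</pre>
Calling wait_fd_readable(recv_comp_channel->fd, 5000) right after the
send would be expected to return 1 where the sender gets its own
message and 0 where it does not, if the observation above is right.<br>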
I could not find any documentation covering this case (delivery of a
multicast message to the sender's own receive queue), so I am not sure
which behaviour is correct.<br>
<br>
I would very much appreciate it if someone could run this test on
their machine and confirm or refute the problem. It would also be good
to know whether a multicast message is supposed to be queued to the
_sender's_ receive queue after sending.<br>
<br>
Regards,<br>
<br>
<pre class="moz-signature" cols="72">--
Ing. Yevheniy Demchenko
Senior Linux Administrator
UVT s.r.o. </pre>
</body>
</html>