[Users] Inconsistency in delivering MCAST messages in OFED 1.5 and 2.0
Yevheniy Demchenko
zheka at uvt.cz
Mon Sep 9 20:36:34 PDT 2013
Hi!
Recently we've run into a bunch of problems with cluster software (RHCS)
after installing the latest OFED 2.0 software from Mellanox
(MLNX_OFED_LINUX-2.0-3.0.0-rhel6.4-x86_64.tgz
<http://www.mellanox.com/page/mlnx_ofed_eula?mtag=linux_sw_drivers&mrequest=downloads&mtype=ofed&mver=MLNX_OFED-2.0-3.0.0&mname=MLNX_OFED_LINUX-2.0-3.0.0-rhel6.4-x86_64.tgz>).
It seems that, in contrast to OFED 1.5, OFED 2.0 does not deliver multicast
messages to the sender's own receive queue, which prevents some software
(namely corosync) from functioning properly. A test case application is
attached to this message; it is cut from the corosync stack and
modified a bit.
To compile: gcc -o test coropoll.c totemiba.c totemip.c util.c
-libverbs -lrdmacm -lrt
To run: ./test -i <ip_address>, where ip_address is the IPoIB (ib_ipoib) address of the IB HCA.
On RHEL 6.4 with the distribution-supplied IB software:
[root@ar03 ofed2test]# ./test -i 172.32.32.13
addr: 172.32.32.13
family 2
initialize:
pollrun
mcast_bind
mcast_rdma_event_fn ADDR_RESOLVED
mcast_rdma_event_fn MULTICAST_JOIN
iface_change_fn
calling in send_fn
in totemiba_mcast_flush_send: called ibv_post_send with res=0,
msg_len=9;, qp_num=117, qkey=1234567
mcast_cq_send_event_fn, res of ibv_poll_cq=1
in mcast_cq_recv_event_fn res=1
in mcast_cq_recv_event_fn calling iba_deliver_fn, wc[0].byte_len=49
IN iba_deliver_fn calling main_deliver_fn with bytes=49
deliver_fn
Received message asdfasdf
On OFED 2.0:
addr: 172.32.32.12
family 2
initialize:
pollrun
mcast_bind
mcast_rdma_event_fn ADDR_RESOLVED
mcast_rdma_event_fn MULTICAST_JOIN
iface_change_fn
calling in send_fn
in totemiba_mcast_flush_send: called ibv_post_send with res=0,
msg_len=9;, qp_num=93, qkey=1234567
mcast_cq_send_event_fn, res of ibv_poll_cq=1
uname -a
Linux ar02 2.6.32-358.18.1.el6.x86_64 #1 SMP Tue Aug 27 14:23:09 CDT
2013 x86_64 x86_64 x86_64 GNU/Linux
HCA: Mellanox ConnectX-2, fw 2.9.1000
It seems that recv_comp_channel->fd is never signalled and the message is not
delivered to the _sender_ (it is delivered to the other HCAs in the multicast
group, though).
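
One way to check whether a completion-channel fd ever becomes readable is to poll() it with a timeout right after the send completes. The following is only a minimal sketch, not code from the attached test: wait_readable() and pipe_self_check() are hypothetical helper names, and a pipe stands in for recv_comp_channel->fd so that the snippet builds without an HCA or libibverbs.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 if poll() failed or only an
 * error/hangup condition was reported.
 * Assumption: in the real test, fd would be recv_comp_channel->fd.
 */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    /* poll() ignores negative fds, which would look like a timeout. */
    assert(fd >= 0);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    do {
        /* Note: a retry after EINTR restarts the full timeout. */
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;                 /* nothing arrived in time */
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/*
 * Stand-in check with a pipe (hypothetical, for illustration only):
 * the read end must time out before a write and be readable after it.
 * Returns 0 if both observations hold, -1 otherwise.
 */
static int pipe_self_check(void)
{
    int p[2];
    int before, after;

    if (pipe(p) != 0)
        return -1;

    before = wait_readable(p[0], 50);      /* nothing written yet */
    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    after = wait_readable(p[0], 50);       /* one byte is pending */

    close(p[0]);
    close(p[1]);
    return (before == 0 && after == 1) ? 0 : -1;
}
```

In the test program, a call such as wait_readable(recv_comp_channel->fd, 1000) after the send completion would be the equivalent check; a return of 0 there would be consistent with the OFED 2.0 log above, where mcast_cq_recv_event_fn is never reached. After a readable result, ibv_get_cq_event() and ibv_ack_cq_events() would be the usual next calls.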
I could not find any documentation regarding this case (delivering an mcast
message to the sender's receive queue), so I am not sure which behaviour is
correct.
I'd very much appreciate it if someone could run this test on their machine and
confirm or refute the problem. Also, it would be nice to know whether
a multicast message has to be queued to the _sender's_ receive queue after
sending.
Regards,
--
Ing. Yevheniy Demchenko
Senior Linux Administrator
UVT s.r.o.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: temp-multicast-ofed2.tar.gz
Type: application/x-gzip
Size: 41899 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/users/attachments/20130910/3286ac65/attachment.bin>