[Users] Inconsistency in delivering MCAST messages in OFED 1.5 and 2.0

Yevheniy Demchenko zheka at uvt.cz
Mon Sep 9 20:36:34 PDT 2013


Hi!
We recently ran into a number of problems with our cluster software (RHCS)
after installing the latest OFED 2.0 release from Mellanox
(MLNX_OFED_LINUX-2.0-3.0.0-rhel6.4-x86_64.tgz 
<http://www.mellanox.com/page/mlnx_ofed_eula?mtag=linux_sw_drivers&mrequest=downloads&mtype=ofed&mver=MLNX_OFED-2.0-3.0.0&mname=MLNX_OFED_LINUX-2.0-3.0.0-rhel6.4-x86_64.tgz>).
It seems that, in contrast to OFED 1.5, OFED 2.0 does not deliver multicast
messages to the sender's own receive queue, which prevents some software
(namely corosync) from functioning properly. A test-case application is
attached to this message; it was cut out of the corosync stack and
modified slightly.
To compile: gcc -o test coropoll.c totemiba.c totemip.c util.c -libverbs -lrdmacm -lrt
To run: ./test -i <ip_address>, where ip_address is the IPoIB (ib_ipoib) address of the IB HCA.

With the IB software distributed with RHEL 6.4:
[root@ar03 ofed2test]# ./test -i 172.32.32.13
addr: 172.32.32.13
family 2
initialize:
pollrun
mcast_bind
mcast_rdma_event_fn ADDR_RESOLVED
mcast_rdma_event_fn MULTICAST_JOIN
iface_change_fn
calling in send_fn
in totemiba_mcast_flush_send: called ibv_post_send with res=0, msg_len=9;, qp_num=117, qkey=1234567
mcast_cq_send_event_fn, res of ibv_poll_cq=1
in mcast_cq_recv_event_fn res=1
in mcast_cq_recv_event_fn calling iba_deliver_fn, wc[0].byte_len=49
IN iba_deliver_fn calling main_deliver_fn with bytes=49
deliver_fn
Received message asdfasdf


With OFED 2.0:
addr: 172.32.32.12
family 2
initialize:
pollrun
mcast_bind
mcast_rdma_event_fn ADDR_RESOLVED
mcast_rdma_event_fn MULTICAST_JOIN
iface_change_fn
calling in send_fn
in totemiba_mcast_flush_send: called ibv_post_send with res=0, msg_len=9;, qp_num=93, qkey=1234567
mcast_cq_send_event_fn, res of ibv_poll_cq=1

uname -a
Linux ar02 2.6.32-358.18.1.el6.x86_64 #1 SMP Tue Aug 27 14:23:09 CDT 2013 x86_64 x86_64 x86_64 GNU/Linux
HCA: Mellanox ConnectX-2, fw 2.9.1000

With OFED 2.0, recv_comp_channel->fd never seems to become readable, and the
message is not delivered to the _sender_ (it is delivered to the other HCAs
in the multicast group, though).
I could not find any documentation covering this case (delivery of a
multicast message to the sender's own receive queue), so I am not sure which
behaviour is correct.
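
For anyone who wants to check the "fd is never touched" observation in isolation, the following is a minimal sketch of polling a file descriptor with a timeout. It assumes recv_comp_channel is a struct ibv_comp_channel, whose fd member can be passed to poll(). The helper names wait_readable and demo_with_pipe are hypothetical and not part of the attached test; the demo uses an ordinary pipe as a stand-in for the completion channel so that it runs without any IB hardware.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 * In the test program, fd would be recv_comp_channel->fd (assumption:
 * the channel is a struct ibv_comp_channel with a pollable fd member).
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    assert(timeout_ms >= 0); /* a negative timeout would block forever */
    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;           /* poll() itself failed */
    if (rc == 0)
        return 0;            /* timed out: nothing arrived */
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/*
 * Stand-in demonstration with a pipe instead of a completion channel:
 * *before_write models "no completion event arrived" and
 * *after_write models "an event is pending on the channel".
 * Returns 0 on success, -1 if the pipe could not be set up or written.
 */
int demo_with_pipe(int *before_write, int *after_write)
{
    int fds[2];

    if (pipe(fds) != 0)
        return -1;
    *before_write = wait_readable(fds[0], 100);
    if (write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    *after_write = wait_readable(fds[0], 100);
    close(fds[0]);
    close(fds[1]);
    return 0;
}
```

In the test program, a result of 0 from wait_readable(recv_comp_channel->fd, ...) after a successful ibv_post_send would match the OFED 2.0 output above, where mcast_cq_recv_event_fn is never called.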

I would very much appreciate it if someone could run this test on their
machine and confirm or refute the problem. It would also be good to know
whether a multicast message is supposed to be queued to the _sender's_ own
receive queue after it is sent.

Regards,

-- 
Ing. Yevheniy Demchenko
Senior Linux Administrator
UVT s.r.o.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: temp-multicast-ofed2.tar.gz
Type: application/x-gzip
Size: 41899 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/users/attachments/20130910/3286ac65/attachment.bin>

