[ofw] Re: Completion with bad status: IBV_WC_EXC_RETRY_EXC_ERROR
Fab Tillier
ftillier at windows.microsoft.com
Wed Nov 21 09:14:00 PST 2007
Hi Diego,
It sounds like you're still having a QP configuration issue, and that you're not yet at the point where RDMA operations would work. Have you tried send/receive operations to isolate potential rkey issues? I suspect these won't work either.
My current theory is an endianness issue somewhere in your application. If you look at the ib_qp_mod_t structure in ib_types.h, which is the structure passed as input to the ib_modify_qp function, you will see many fields declared as 'ib_net32'. The drivers treat these fields as network order, and the 'ib_netxx' types (or simply 'netxx') exist to mark which fields are network order and which are host order.
Here's the list of fields that you need to treat in network order on Windows. I don't know how they're handled in Linux:
->INIT: qkey
->RTR: rq_psn, dest_qp, primary_av.dlid
->RTS: sq_psn
It sounds like you have the DLID issue handled correctly, but do you set the destination QP and PSNs properly?
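To make the list above concrete, here is a minimal sketch of converting host-order values before they go into the QP transitions. The struct and function names are hypothetical stand-ins, not the real ib_qp_mod_t layout from ib_types.h, and the sketch uses the POSIX htonl/htons calls (on Windows the same functions are declared in winsock2.h):

```c
#include <arpa/inet.h>  /* htonl, htons */
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the fields listed above. This is NOT the real
 * ib_qp_mod_t; it only mirrors the fields that must be in network order. */
struct qp_net_fields {
    uint32_t qkey;     /* ->INIT */
    uint32_t rq_psn;   /* ->RTR  */
    uint32_t dest_qp;  /* ->RTR  */
    uint16_t dlid;     /* ->RTR, primary_av.dlid */
    uint32_t sq_psn;   /* ->RTS  */
};

/* Take every value in host order and return it in network order. */
static struct qp_net_fields
qp_fields_to_net(uint32_t qkey, uint32_t rq_psn, uint32_t dest_qp,
                 uint16_t dlid, uint32_t sq_psn)
{
    struct qp_net_fields f;

    /* QP numbers and PSNs are 24-bit quantities in InfiniBand. */
    assert((dest_qp & 0xFF000000u) == 0);
    assert((rq_psn & 0xFF000000u) == 0);
    assert((sq_psn & 0xFF000000u) == 0);

    f.qkey    = htonl(qkey);
    f.rq_psn  = htonl(rq_psn);
    f.dest_qp = htonl(dest_qp);
    f.dlid    = htons(dlid);
    f.sq_psn  = htonl(sq_psn);
    return f;
}
```

If a value was received from the peer already in network order, it should be copied as is rather than passed through this helper; swapping twice gives back the host-order value.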
-Fab
-----Original Message-----
From: Diego Guella [mailto:diego.guella at sircomtech.com]
Sent: Wednesday, November 21, 2007 6:11 AM
To: Fab Tillier
Cc: ofw at lists.openfabrics.org
Subject: Re: [ofw] Re: Completion with bad status: IBV_WC_EXC_RETRY_EXC_ERROR
Hi Fab,
Thanks for your answer.
Please see my replies inline.
----- Original Message -----
From: "Fab Tillier" <ftillier at windows.microsoft.com>
>
>When you exchange the rkey, are you keeping track of endianness? The
>Windows drivers treat rkeys in network order. I think the Linux stack
>does this in host order, and this could cause your problems. I would have
>expected a different error than a retry exceeded error, though.
No, I didn't change endianness of the rkey.
So I made a test changing endianness of the rkey, but the error is always
the same.
I too would have expected a different error, say an IB_WCS_REM_ACCESS_ERR,
instead of this retry exceeded.
>For the LIDs, you need to swap it on the Windows side, not the Linux side -
>this could be the cause for the retry error.
You said (or perhaps Tzachi said) that Windows treats the LID in network
order.
So in my "CM" protocol I am exchanging the LID in network order: Windows
sends (and receives) the LID _as is_, while Linux sends it applying ntohs
before the send (and applying htons after receive).
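A minimal sketch of such an exchange, assuming the message carries the QP number, PSN, rkey and LID, with every field big-endian on the wire. The field set, the 14-byte layout and all names here are hypothetical and are not taken from the posted program:

```c
#include <arpa/inet.h>  /* htonl, htons, ntohl, ntohs */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CONN_INFO_WIRE_LEN 14  /* 4 + 4 + 4 + 2 bytes, no padding */

/* Hypothetical connection parameters, all held in HOST order. */
struct conn_info {
    uint32_t qpn;
    uint32_t psn;
    uint32_t rkey;
    uint16_t lid;
};

/* Serialize host-order values into a network-order byte buffer. */
static void conn_info_pack(const struct conn_info *ci,
                           uint8_t buf[CONN_INFO_WIRE_LEN])
{
    uint32_t v32;
    uint16_t v16;

    assert(ci != NULL && buf != NULL);
    v32 = htonl(ci->qpn);  memcpy(buf + 0,  &v32, sizeof v32);
    v32 = htonl(ci->psn);  memcpy(buf + 4,  &v32, sizeof v32);
    v32 = htonl(ci->rkey); memcpy(buf + 8,  &v32, sizeof v32);
    v16 = htons(ci->lid);  memcpy(buf + 12, &v16, sizeof v16);
}

/* Parse a network-order byte buffer back into host-order values. */
static void conn_info_unpack(const uint8_t buf[CONN_INFO_WIRE_LEN],
                             struct conn_info *ci)
{
    uint32_t v32;
    uint16_t v16;

    assert(ci != NULL && buf != NULL);
    memcpy(&v32, buf + 0,  sizeof v32); ci->qpn  = ntohl(v32);
    memcpy(&v32, buf + 4,  sizeof v32); ci->psn  = ntohl(v32);
    memcpy(&v32, buf + 8,  sizeof v32); ci->rkey = ntohl(v32);
    memcpy(&v16, buf + 12, sizeof v16); ci->lid  = ntohs(v16);
}
```

A side whose API already hands out a value in network order would copy those bytes into the buffer unchanged instead of calling htons or htonl on them.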
>Is there any reason you don't use the IB CM or RDMA CM for connection
>establishment? On the Windows side, you'll need to deal with the RDMA CM
>private data format yourself, but at least it will take care of the QP
>settings for you.
I have taken the example in WinIB 1.3, and slightly modified it (removed
some parts and added support to RDMA READ/WRITE tests).
This program works well in a Windows/Windows test.
Then I ported this program to Linux, modified again to use verbs instead of
ib_al. It works well in a Linux/Linux test.
The problem only arises when I try to use a Windows daemon and a Linux
client, and vice versa.
I posted the source code of these programs in older emails; I can resend it
to you if you wish.
Thanks,
Diego