[Iwg-arbitration-committee] Updated arbitration request for Intel QLE7340 & QLE7342 HCAs (Jan 2015 OFA Interop Logo Event)
Dave Wyman
dwyman at iol.unh.edu
Tue Jun 9 11:27:30 PDT 2015
Hi all,
I've reset the cluster to OFED 3.12-1 and rerun rsockets with the
updated librdmacm-1.0.21 installed. Everything passed heterogeneously.
I'll now do the same for NFSoRDMA with the hacked-together Connectathon
plan and move this arbitration request forward.
Please let me know if you have any questions.
Thanks,
Dave
On 6/5/15 4:36 PM, Dave Wyman wrote:
> Hello,
>
> I've done a test run with the updated librdmacm-1.0.21 installed on
> hosts with QLE7340 and QLE7342 against each other and the set of
> Mellanox devices included in the January Logo Event with each host
> acting as both server and client. The installed OFED was 3.18-rc2.
> All passed per the test plan (rstream -T [sabn] -S all). This is
> great news for the upcoming logo event given the fix has been
> submitted for 3.18. I plan to reset the hosts to 3.12-1 and again
> update the Intel hosts with librdmacm-1.0.21 and test again to verify
> this against the pending arbitration request.
>
> Thanks,
> Dave
>
> On Jun 3, 2015, at 3:10 PM, "Calciano, Jess" <jess.calciano at intel.com> wrote:
>
>> Hello,
>> Since the original arbitration request was submitted, there’s been
>> some further discussion about the RSockets failure. With the fix for
>> librdmacm described in the original request, rstream ran successfully
>> for most message sizes, but still hung with -S 1024.
>> Additional investigation traced the new problem to an incompatibility
>> between the qib driver and the ibv_create_qp() function. A workaround
>> (described below) is available for the current OFED version and a
>> permanent fix to librdmacm will be included in the next OFED 3.18
>> release.
>> Details:
>> The ultimate issue is still related to the qib driver being
>> non-compliant with the ibv_create_qp() definition:
>>
>> The function ibv_create_qp() will update the qp_init_attr->cap
>> struct with the actual QP values of the QP that was created;
>> *** the values will be greater than or equal to the values requested. ***
>>
>> Specifically, the qib driver will return an inline size that is
>> smaller than that requested. Rsockets has code to trap for this, but
>> the rsockets code looks like this:
>>
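>> inline_size = SOME_DEFAULT_LIKE_64;     /* default, set before the QP exists */
>> rs_init_bufs(...);                      /* buffers sized and registered here */
>> ...
>> rs_create_qp(...);
>> inline_size = qp_cap->max_inline_size;  /* actual value, may be smaller */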
>>
>> The issue is that rs_init_bufs(), which allocates the buffers and
>> registers the memory, uses the default inline size. The net result
>> is that rsockets ends up referencing memory that is outside of the
>> registered memory region when sending credit updates. The lost
>> credit update is causing the hang that you see.
>>
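
To make the ordering problem concrete, here is a minimal sketch in C. It
is a model only, not librdmacm code: every name and size in it
(cap_sketch, create_qp_sketch, registered_bytes, the 16-byte control
message, and so on) is hypothetical, and it assumes that control messages
which do not fit inline need extra room in the registered buffer.

```c
#include <stddef.h>

/* All names and sizes below are hypothetical; this is not librdmacm code. */
#define DEFAULT_INLINE   64u    /* stand-in for the default inline size */
#define CTRL_MSG_SIZE    16u    /* assumed size of one credit-update message */
#define CTRL_MSG_SLOTS    4u    /* assumed number of control message slots */
#define DATA_BUF_SIZE  4096u    /* assumed data buffer size */

struct cap_sketch {
    unsigned max_inline;        /* in: requested, out: granted */
};

/* Models a provider that grants at most `limit` bytes of inline data,
 * which may be less than what was requested. */
void create_qp_sketch(struct cap_sketch *cap, unsigned limit)
{
    if (cap->max_inline > limit)
        cap->max_inline = limit;
}

/* Bytes to allocate and register. Control messages that do not fit
 * inline must be sent from registered memory, so they need extra room. */
size_t registered_bytes(unsigned inline_size)
{
    size_t total = DATA_BUF_SIZE;

    if (inline_size < CTRL_MSG_SIZE)
        total += (size_t)CTRL_MSG_SIZE * CTRL_MSG_SLOTS;
    return total;
}

/* Ordering described above: buffers are sized from the default, then the
 * QP is created and may grant less. The region can end up too small. */
size_t setup_buffers_first(unsigned limit)
{
    struct cap_sketch cap = { DEFAULT_INLINE };
    size_t reg = registered_bytes(cap.max_inline);

    create_qp_sketch(&cap, limit);
    return reg;
}

/* Reordered: create the QP first, then size buffers from the granted value. */
size_t setup_qp_first(unsigned limit)
{
    struct cap_sketch cap = { DEFAULT_INLINE };

    create_qp_sketch(&cap, limit);
    return registered_bytes(cap.max_inline);
}
```

With a provider limit of 0 in this model, setup_buffers_first(0) registers
4096 bytes while setup_qp_first(0) registers 4160, so only the reordered
version leaves room for control messages that cannot be sent inline.
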
>> A quick check shows that I can move the rs_init_bufs() call after the
>> qp has been created and have the test work. You should also be able
>> to override the inline_size by writing the value 0 into a config
>> file. This will set the inline_size to 0 as the default. To do
>> this, you need to write a 0 into /etc/rdma/rsocket/inline_default.
>> (The actual path will depend on your configuration, so it could be
>> under /usr/etc/rdma/... for example.) Updating the config file
>> should work with the current version.
>>
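
For anyone scripting the workaround from a test harness, a minimal sketch
in C of writing the value into the config file follows. The helper name
write_inline_default is hypothetical; the path is passed in by the caller
because, as noted above, it depends on the installation.

```c
#include <stdio.h>

/* Hypothetical helper: writes `value` followed by a newline to `path`.
 * Returns 0 on success, -1 on failure. */
int write_inline_default(const char *path, unsigned value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fprintf(f, "%u\n", value) > 0;
    if (fclose(f) != 0)     /* a failed close can mean the write was lost */
        ok = 0;
    return ok ? 0 : -1;
}
```

A call such as write_inline_default("/etc/rdma/rsocket/inline_default", 0)
would typically need root privileges, and the path should be adjusted to
match the local configuration.
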
>> I will provide an update to the librdmacm to handle this. That
>> update will find its way into the 3.18 release.
>>
>> Thanks,
>> Jess Calciano
>> From: Calciano, Jess
>> Sent: Wednesday, April 08, 2015 2:39 PM
>> To: iwg-arbitration-committee at openfabrics.org
>> Cc: OFA Lab Mailing List; Dave Wyman; Rupert Dance
>> <rsdance at soft-forge.com>; Cole, Cliff; Mascarenhas, Edward;
>> Sharma, Karun; Thete, Swapna; Hefty, Sean; Yan, Philip W;
>> Flores, Jose F
>> Subject: Arbitration request for Intel QLE7340 & QLE7342 HCAs (Jan
>> 2015 OFA Interop Logo Event)
>> Hello,
>> Intel would like to file an arbitration request for the January 2015
>> OFA Interop Logo Event results for the Intel QLE7340 and QLE7342 HCAs.
>> The provided report (attached for reference) shows two failing tests:
>> 1) TI NFS over RDMA
>> 2) TI RSockets
>> The Intel team has investigated these results and determined that the
>> failures are due to bugs in non-Intel components.
>> NFSoRDMA:
>> The failure is due to a known Connectathon issue, documented here:
>> http://www.spinics.net/lists/linux-nfs/msg16460.html
>> RSockets:
>>
>> The issue is that ibv_modify_qp() is failing. The problem is that an
>> incorrect bit is set in the qp_attr_mask, which is returned from the
>> kernel. With Intel, bit 21 of the qp_attr_mask is set. This is not
>> the case for a Mellanox HCA.
>>
>> Bit 21 is not defined for userspace. However, it was defined in the
>> kernel as IB_QP_SMAC.
>>
>> If the librdmacm is modified to mask out this bit, the call succeeds
>> and rstream runs successfully.
>>
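
The masking step can be sketched in C as below. This is an illustration
only, not the actual librdmacm change: the macro and function names are
hypothetical, and the only facts taken from the analysis above are that
bit 21 is the offending bit and that it is undefined for userspace.

```c
/* Hypothetical name for the bit that userspace does not define
 * (the kernel uses bit 21 for IB_QP_SMAC, per the analysis above). */
#define QP_ATTR_MASK_BIT21 (1 << 21)

/* Returns the attribute mask with bit 21 cleared and every other bit
 * left as it was, so the mask can be passed on to ibv_modify_qp(). */
int strip_bit21(int qp_attr_mask)
{
    return qp_attr_mask & ~QP_ATTR_MASK_BIT21;
}
```

A mask that never had bit 21 set passes through unchanged, so the same
call is harmless on hosts that do not show the problem.
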
>> Please let me know if the arbitration committee needs any additional
>> information on the analysis.
>> Thanks,
>> Jess Calciano
>