[Openib-windows] [BUG] Errors with multiple HCAs in a single system

Leonid Keller leonid at mellanox.co.il
Wed Oct 19 05:12:42 PDT 2005


SB

> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at silverstorm.com]
> Sent: Monday, October 17, 2005 9:16 PM
> To: openib-windows at openib.org
> Subject: [Openib-windows] [BUG] Errors with multiple HCAs in a single system
> 
> 
> I am trying to run both a PCI-X and PCI-Express HCA in a single server

I guess you are the first one to try that... :(

> to compare performance, and cannot enable both devices simultaneously.
> Whichever HCA starts second fails with the following output to the
> debugger:
> 
> <1> MDT(1): prepare_intr_resources: Failed to MOSAL_ISR_set MOSAL ret=-1
> <1> MDT(4): THH_eventp_create: Cannot set interrupt resources.
> <1> MDT(1): THH_hob_open_hca: could not create eventp (-255)
> <1> MDT(1): THH_hob_close_hca: Device already closed
> 
> It looks like the device lookup in MOSAL_ISR_set uses a generic name
> "InfiniHost" and thus matches the first entry in the table.

Yes, it's a bug in prepare_intr_resources(), which calls MOSAL_ISR_set() with
the hard-coded name "InfiniHost" instead of the device's real name. Try
changing it to eventp->hob.dev_name.

> 
> Why is this doing a lookup?  Doesn't the code know exactly which device is
> being initialized?  Why put the device in the MOSAL_dev_db and then search
> for it?

MOSAL was born on Linux as a separate module that performs all OS-dependent
operations, relying only on the parameters passed to its functions.
When I started porting MOSAL to Windows, I saw that in some cases (like
MOSAL_ISR_set()) MOSAL can't do that, because it needs information that
only the driver possesses.
That's why I added the MOSAL DB, which duplicates some of the driver's device
information.

There is no MOSAL in the new driver, so this problem won't exist there...


> 
> I set a breakpoint and manually helped the code find the right entry, and
> things moved forward from here but quickly failed again when creating a
> special QP:
> 
> ~1:mlnx_query_qp() !ERROR!: completes with ERROR status IB_INVALID_QP_HANDLE
> ~1:mlnx_create_spl_qp() !ERROR!: completes with ERROR status IB_INVALID_QP_HANDLE
> ~1:al:create_spl_qp_svc() !ERROR!: ib_get_spl_qp failed, IB_INVALID_QP_HANDLE
> ~1:al:create_spl_qp_svc() ]
> 
> Now, the fact that create_spl_qp can return IB_INVALID_QP_HANDLE seems
> flawed to me.  To make things worse, the HCA driver doesn't clean up the QP
> it successfully created internally, and that resource is thus leaked.
> 
> Apparently, the same issue happens with normal QP creation, judging by the
> output from IPoIB trying to load:
> 
> ~0:mlnx_query_qp() !ERROR!: completes with ERROR status IB_INVALID_QP_HANDLE
> ~0:mlnx_create_qp() !ERROR!: completes with ERROR status IB_INVALID_QP_HANDLE
> ~0:__ib_mgr_init() !ERROR!: ib_create_qp returned IB_INVALID_QP_HANDLE
> ~0:__ib_mgr_init() ]
> ~0:__port_init() !ERROR!: __ib_mgr_init returned IB_INVALID_QP_HANDLE
> ~0:__port_init() ]
> ~0:ipoib_create_port() !ERROR!: ipoib_port_init returned IB_INVALID_QP_HANDLE.
> ~0:ipoib_create_port() ]
> 
> IPoIB then tries to tear things down and the code asserts because PD
> destruction failed:
> 
> ~0:mlnx_deallocate_pd() !ERROR!: completes with ERROR status IB_RESOURCE_BUSY
> 
> The resource is busy due to the leak of the QP when mlnx_create_qp failed.
> 
> - Fab
> 
> _______________________________________________
> openib-windows mailing list
> openib-windows at openib.org
> http://openib.org/mailman/listinfo/openib-windows
> 

