[Openib-windows] Race between destruction and AL verbs

Yossi Leybovich sleybo at mellanox.co.il
Sun Nov 26 03:46:00 PST 2006


Fab/Alex

If you have the time and still remember the details,
I hope you will be able to help me with the following questions.

I am testing HA in the SRP driver (this also reproduces the scenarios
that Alex addressed with his patch).

When the SRP driver tries to move from one path to another, I get an
INVALID_CQ error returned from poll_cq.
I guess this is because the disconnect flow calls ib_close_ca while
IBAL/MTHCA callbacks are still being invoked during the destruction.

There are a few problems/issues that I think we should address:

1. I think that SRP, as a kernel driver, should exit gracefully, step by
step:
- Close the QPs: move each QP to the error/reset state and wait for all
  WQEs to return.
- Close the CQs.
- Close the PDs.
- Wait for outstanding queries (if any).
- Then close the CA.
(Similar to the IPoIB driver.)
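
The ordering in item 1 can be sketched roughly as below. This is only a
self-contained C model, not IBAL or SRP code: session_t, the step
functions, and session_teardown are hypothetical names, and each step
just records that it ran. The asserts express the invariant that a later
step must not run while an earlier one still has outstanding work.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of one SRP session's resources (not IBAL types). */
typedef struct {
    int outstanding_wqes;     /* work requests still posted on the QP */
    int outstanding_queries;  /* queries still in flight */
    char log[64];             /* order in which teardown steps ran */
} session_t;

static void record(session_t *s, const char *step)
{
    strncat(s->log, step, sizeof(s->log) - strlen(s->log) - 1);
}

/* Move the QP to error/reset, then wait until every WQE has returned. */
static void close_qp(session_t *s)
{
    while (s->outstanding_wqes > 0)
        s->outstanding_wqes--;   /* stands in for reaping flushed WQEs */
    record(s, "QP,");
}

static void close_cq(session_t *s)
{
    assert(s->outstanding_wqes == 0);   /* nothing may still complete */
    record(s, "CQ,");
}

static void close_pd(session_t *s)
{
    record(s, "PD,");
}

static void wait_queries(session_t *s)
{
    s->outstanding_queries = 0;  /* stands in for blocking on completion */
    record(s, "QUERY,");
}

static void close_ca(session_t *s)
{
    assert(s->outstanding_queries == 0); /* no query callback can fire */
    record(s, "CA");
}

/* Run the steps in the proposed order and return the recorded order. */
const char *session_teardown(session_t *s)
{
    close_qp(s);
    close_cq(s);
    close_pd(s);
    wait_queries(s);
    close_ca(s);
    return s->log;
}
```

Calling session_teardown on a session with pending WQEs and queries
drains both before the CA step is reached, so no completion or query
callback is left to run against a closed CA.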

2. In any case, there shouldn't be a race between CQ callbacks and
destruction.
IBAL should take a reference and check the object's state before calling
the user callback function.
This is to prevent cases where the callback is invoked while the CQ is
being destroyed.

And of course MTHCA should also take a reference on its object before
calling the IBAL function (you can see that mthca already does that for
async events). Leonid checked this against the Linux code.

I found this problem all across IBAL, for example:
- calling ib_query_qp while another thread is trying to destroy the QP
- calling ib_modify_qp while another thread is trying to destroy the QP
- invoking a CQ callback while the CQ is being destroyed
- delivering a QP event while the QP is being destroyed

I found cases where the AL does try to protect against destruction, like
query_ca (it first calls acquire_ca, which increments the ref count).
But the code still misses checking the object's state after taking the
reference.
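
To make the "take a ref, then check the state" idea in item 2 concrete,
here is a minimal sketch using C11 atomics. It is not the IBAL object
model: al_obj_t, obj_acquire, obj_release, obj_dispatch_cb,
obj_begin_destroy, and count_cb are all hypothetical names chosen for
illustration, and it assumes the object's memory stays valid for as long
as any caller can still reach it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { OBJ_ALIVE, OBJ_DESTROYING };      /* hypothetical object states */

/* Hypothetical stand-in for an AL object such as a CQ or QP. */
typedef struct {
    atomic_int ref_cnt;
    atomic_int state;
    void (*user_cb)(void *ctx);
    void *ctx;
} al_obj_t;

void obj_init(al_obj_t *obj, void (*cb)(void *), void *ctx)
{
    assert(cb != NULL);
    atomic_init(&obj->ref_cnt, 0);
    atomic_init(&obj->state, OBJ_ALIVE);
    obj->user_cb = cb;
    obj->ctx = ctx;
}

/* Take the reference first, then check the state. If destruction has
 * already started, drop the reference again and report failure. */
bool obj_acquire(al_obj_t *obj)
{
    atomic_fetch_add(&obj->ref_cnt, 1);
    if (atomic_load(&obj->state) != OBJ_ALIVE) {
        atomic_fetch_sub(&obj->ref_cnt, 1);
        return false;
    }
    return true;
}

void obj_release(al_obj_t *obj)
{
    atomic_fetch_sub(&obj->ref_cnt, 1);
}

/* Callback path: the user callback runs only while a reference is held
 * on an object that was still alive when the reference was taken. */
bool obj_dispatch_cb(al_obj_t *obj)
{
    if (!obj_acquire(obj))
        return false;
    obj->user_cb(obj->ctx);
    obj_release(obj);
    return true;
}

/* Destroy path: mark the object first, then report whether it is
 * already safe to free. A real implementation would wait here until
 * the count reaches zero instead of just returning. */
bool obj_begin_destroy(al_obj_t *obj)
{
    atomic_store(&obj->state, OBJ_DESTROYING);
    return atomic_load(&obj->ref_cnt) == 0;
}

/* Example user callback: counts how many times it was invoked. */
void count_cb(void *ctx)
{
    ++*(int *)ctx;
}
```

A verb such as a query or modify call would wrap its body in the same
obj_acquire/obj_release pair and return an error when obj_acquire fails.
Because the destroyer sets the state before reading the count, and the
caller bumps the count before reading the state, either the caller sees
the object as being destroyed or the destroyer sees the outstanding
reference.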

3. Wouldn't it be more efficient to create the new session first and
connect it to the new path,
and then handle destruction of the session on the old path?
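
The ordering asked about in item 3 might look like the sketch below. It
is a toy model only: srp_session_t, srp_target_t,
session_create_connected, session_destroy, and target_failover are
hypothetical names, and "connecting" is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical session bound to one path. */
typedef struct {
    int path_id;
    bool connected;
} srp_session_t;

/* Hypothetical target that owns the currently active session. */
typedef struct {
    srp_session_t *active;
    int destroyed;            /* number of sessions torn down so far */
} srp_target_t;

static srp_session_t *session_create_connected(int path_id)
{
    srp_session_t *s = malloc(sizeof(*s));
    if (s == NULL)
        return NULL;
    s->path_id = path_id;
    s->connected = true;      /* stands in for connecting on the path */
    return s;
}

static void session_destroy(srp_target_t *t, srp_session_t *s)
{
    free(s);                  /* stands in for the full teardown */
    t->destroyed++;
}

/* Bring up the new session first, switch over, and only then destroy
 * the old one. If the new session cannot be created, the old session
 * is left untouched. */
bool target_failover(srp_target_t *t, int new_path_id)
{
    assert(t != NULL);
    srp_session_t *fresh = session_create_connected(new_path_id);
    if (fresh == NULL)
        return false;

    srp_session_t *old = t->active;
    t->active = fresh;        /* new I/O uses the new path from here on */
    if (old != NULL)
        session_destroy(t, old);
    return true;
}
```

With this ordering the target always has a session to send I/O on, and
the old-path teardown no longer sits on the failover's critical path.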

Yossi 





