[openib-general] APM: QP migration state change when failover triggered by hw

Jack Morgenstein jackm at mellanox.co.il
Tue Aug 1 06:10:02 PDT 2006


On Tuesday 01 August 2006 05:19, Venkatesh Babu wrote:
> Configuration 2: Node 1 and Node 2 connected through two switches, one for
> each port.
>  Node1, port1 -> switch1 -> Node2, port1
>  Node1, port2 -> switch2 -> Node2, port2
>
> Node 1:
> 1. Call ib_cm_listen() to wait for connection requests.
> 2. When a REQ message arrives, create an RC QP and establish a connection.
> 3. Set up callback handlers to receive packets.
> 4. Receive packets, verify them, and drop them.
> 5. Event IB_MIG_MIGRATED received.
> 6. Stopped receiving packets.
>
> Node 2:
> 1. Create an RC QP.
> 2. Send a REQ message to Node 1 to establish the connection (loading both
> primary and alternate paths).
> 3. Continuously send packets.
> 4. Simulate a port failure by unplugging the IB cable.
> 5. Event IB_MIG_MIGRATED received.
>
> But with Configuration 2, when an IB_EVENT_PORT_ERR event occurs on
> Node 1, failover to the alternate path doesn't work and the traffic stops,
> because Node 1 doesn't know when the IB_EVENT_PORT_ERR event occurred on
> Node 2.

We have not seen these problems here.  We have regression tests which check 
APM, and they have run without problems.  These tests have scripts which 
bring the HCA port down (equivalent to pulling the cable) to check that the 
migration occurs automatically.
(You should NOT need to call ib_modify_qp for the migration to work in the 
case of a port error.)

Note, though, that these tests use the ibv_verbs layer directly.  We have not 
tested APM over the CM.  There may be a bug here regarding setting up 
the alternate path properly when creating the connection (although this does 
seem strange, since you indicate that the MIGRATED event is received on both 
sides!).

Please send us your test code so that we may reproduce the problem here.

- Jack