[ofa-general] Missing IB_EVENT_PATH_MIG events

Dotan Barak dotanb at dev.mellanox.co.il
Wed Oct 17 00:14:53 PDT 2007


Hi.

The value of the timeout that you sent me means that the QP will wait 
for ~0.25 second before any retry,
so it might take some time for the QPs to start doing the APM (depending 
on the retry_cnt value).
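
A rough sketch of the arithmetic behind that figure, assuming the usual
InfiniBand local ACK timeout encoding (4.096 usec * 2^timeout, where 0
means "no timeout"). The second helper assumes, as a simplification, that
migration is only attempted once retry_cnt timeouts have expired; real
hardware may wait longer. Both function names are made up for illustration.

```c
#include <stdio.h>

/*
 * Local ACK timeout in seconds for a given QP "timeout" attribute,
 * assuming the encoding 4.096 usec * 2^timeout.
 * Returns -1.0 for 0 (no timeout) or for values outside the 5-bit field.
 */
static double ack_timeout_seconds(unsigned int timeout)
{
    if (timeout == 0 || timeout > 31)
        return -1.0;
    return 4.096e-6 * (double)(1ULL << timeout);
}

/*
 * Hypothetical lower bound on the delay before a QP gives up on the
 * primary path: retry_cnt expirations of the ACK timeout (assumption,
 * not taken from any driver or firmware documentation).
 */
static double min_delay_before_migration(unsigned int timeout,
                                         unsigned int retry_cnt)
{
    double t = ack_timeout_seconds(timeout);

    return t < 0.0 ? -1.0 : t * (double)retry_cnt;
}
```

With timeout = 16, ack_timeout_seconds(16) is about 0.268 second, and with
retry_cnt = 7 the estimated delay is about 1.88 seconds.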

From your description I believe that not all of the QPs have started to 
move to the alternate path.
Did you make sure that the APM state machine (in every QP) is ARMED 
before bringing the port down?
(Data packets must be sent between the local/remote QPs in order to put 
the QP into this state.)


(If you can send me the code that handles this scenario I will try to 
reproduce it here, in our lab)


thanks
Dotan

lbt wrote:
> Thanks for your reply Dotan!
>
> The timeout is set to 16.
>
> Here is some more info. Please let me know if there is any other info 
> I can provide.
> Setup:
> - 2 Nodes, each has a dual-port HCA (board_id: MT_0150000001, 
> InfiniHost III firmware 25218, v. 5.2.0) - this is the latest Mellanox 
> firmware I believe
> - port 1 of each node is connected to one IB switch, and likewise for 
> port 2 --> thus have 2 separate IB subnets, providing 2 possible paths 
> between the 2 nodes
> - IB switch is InfiniScale MT43132 **
> - Using OFED 1.2 driver stack
>
> Our software creates RCQPs between 2 nodes, with primary and alternate 
> path specified.
> Test does the following: Using 10 RCQPs
> 1. Hardware triggered migration by bringing down the port of the 
> primary path (haven't ever seen a problem with the hardware triggered 
> migrations)
> 2. Restore the port --> reloads alternate path
>     - Local QPs send LAP
>     - Remote QPs reply with APR
> 3. Redistributes RCQP's across both ports for load balancing using 
> software triggered migrations for the RCQPs selected for migration.
> a. Local QPs: use ib_modify_qp to trigger migration --> get 
> IB_EVENT_PATH_MIG on local QPs
> b. Remote QPs: IB_EVENT_PATH_MIG
> c. Local QPs: after software-triggered migration completes, reloads 
> alternate path by sending LAP
> d. Remote QPs: reply with APR
>
> Keep doing this in a loop. The issue is that in 3b, not all the remote 
> QP's report an IB_EVENT for the path migration triggered in 3a. I 
> noticed that when this happens it's usually in the first and/or second 
> cycle (subsequent cycles don't manifest this issue), and it occurs on 
> the last RCQP's that were migrated in 3a.
>
> BTW: Do you know if there is a way I can determine/dump which 
> events are in the Event Queue?
>
> Thanks again!
> Lan
>
> On 10/15/07, *Dotan Barak* <dotanb at dev.mellanox.co.il> wrote:
>
>     Hi.
>
>     lbt wrote:
>     > Hi,
>     >
>     > I'm trying out APM with OFED 1.2 , using Mellanox dual-port HCA
>     > (ib_mthca driver).  When I have several RCQP's that I am trying to
>     > migrate (software triggered migration using ib_modify_qp), I've
>     > noticed that sometimes 1 or 2 of the remote QP's never generate an
>     > IB_EVENT_PATH_MIG or even an IB_EVENT_PATH_MIG_ERR ... it seems that
>     > it just gets lost. I looked through some of the ib_mthca patches in
>     > git.kernel.org/?p=linux/kernel/git/roland/infiniband.git
>     > <http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git>, and
>     > incorporated the mmiowb patch for ib_mthca commands
>     > (http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=76d7cc0345a037e8eea426f8abc710abd22946dd).
>     > But still seeing same issue. I have a test case that repeats
>     > software-triggered migrations + rearming in a loop, and this problem
>     > usually occurs in the first few cycles, but is not too frequent. If
>     > anyone has any ideas on what might be wrong, or tips on  where I can
>     > look/do to debug this, that would be very much appreciated!
>     >
>     > For example, this is the console output I will see (printed out by our
>     > rcqp event handler):
>     > On the local end - initiates software triggered migration, using
>     > ib_modify_qp:
>     > Event IB_EVENT_PATH_MIG occurred on QP#1043
>     > Event IB_EVENT_PATH_MIG occurred on QP#1040
>     > Event IB_EVENT_PATH_MIG occurred on QP#1033
>     >
>     > On the remote end:
>     > Event IB_EVENT_PATH_MIG occurred on QP#1040
>     > Event IB_EVENT_PATH_MIG occurred on QP#1043
>     Is the timeout value (in the QP attributes) 0?
>     If the answer is no, can you please supply some more details on this?
>
>
>     thanks
>     Dotan
>
>
