[ofa-general] Missing IB_EVENT_PATH_MIG events

lbt transter at gmail.com
Tue Oct 16 10:40:38 PDT 2007


Thanks for your reply Dotan!

The timeout is set to 16.

Here is some more info. Please let me know if there is any other info I can
provide.
Setup:
- 2 nodes, each with a dual-port HCA (board_id: MT_0150000001, InfiniHost III
(MT25218), firmware v. 5.2.0); I believe this is the latest Mellanox firmware
- port 1 of each node is connected to one IB switch, and port 2 of each node
is connected to a second switch
--> thus there are 2 separate IB subnets, providing 2 possible paths between
the 2 nodes
- IB switch is InfiniScale MT43132
- Using OFED 1.2 driver stack

Our software creates RCQPs between the 2 nodes, with both a primary and an
alternate path specified.
The test does the following, using 10 RCQPs:
1. Hardware-triggered migration, by bringing down the port of the primary
path (I have never seen a problem with the hardware-triggered migrations)
2. Restore the port --> reload the alternate path
    - Local QPs send LAP
    - Remote QPs reply with APR
3. Redistribute the RCQPs across both ports for load balancing, using
software-triggered migration for the RCQPs selected for migration:
a. Local QPs: use ib_modify_qp to trigger migration --> get
IB_EVENT_PATH_MIG on the local QPs
b. Remote QPs: get IB_EVENT_PATH_MIG
c. Local QPs: after the software-triggered migration completes, reload the
alternate path by sending LAP
d. Remote QPs: reply with APR

The test repeats this in a loop. The issue is that in 3b, not all of the
remote QPs report an IB_EVENT for the path migration triggered in 3a. When
this happens, it is usually in the first and/or second cycle (later cycles
don't show the issue), and it affects the last RCQPs that were migrated in 3a.

BTW: Do you know if there is a way I can determine/dump which events are in
the Event Queue?

Thanks again!
Lan

On 10/15/07, Dotan Barak <dotanb at dev.mellanox.co.il> wrote:
>
> Hi.
>
> lbt wrote:
> > Hi,
> >
> > I'm trying out APM with OFED 1.2, using a Mellanox dual-port HCA
> > (ib_mthca driver). When I have several RCQPs that I am trying to
> > migrate (software-triggered migration using ib_modify_qp), I've
> > noticed that sometimes 1 or 2 of the remote QPs never generate an
> > IB_EVENT_PATH_MIG or even an IB_EVENT_PATH_MIG_ERR; the event seems
> > to just get lost. I looked through some of the ib_mthca patches in
> > git.kernel.org/?p=linux/kernel/git/roland/infiniband.git and
> > incorporated the mmiowb patch for ib_mthca commands
> > (http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=76d7cc0345a037e8eea426f8abc710abd22946dd),
> > but I am still seeing the same issue. I have a test case that repeats
> > software-triggered migrations + rearming in a loop, and this problem
> > usually occurs in the first few cycles, but it is not very frequent.
> > If anyone has any ideas on what might be wrong, or tips on where to
> > look or what to do to debug this, that would be very much appreciated!
> >
> > For example, this is the console output I see (printed by our
> > rcqp event handler):
> > On the local end, which initiates the software-triggered migration
> > using ib_modify_qp:
> > Event IB_EVENT_PATH_MIG occurred on QP#1043
> > Event IB_EVENT_PATH_MIG occurred on QP#1040
> > Event IB_EVENT_PATH_MIG occurred on QP#1033
> >
> > On the remote end:
> > Event IB_EVENT_PATH_MIG occurred on QP#1040
> > Event IB_EVENT_PATH_MIG occurred on QP#1043
> Is the timeout value (in the QP attributes) 0?
> If the answer is no, can you please supply some more details on this?
>
>
> thanks
> Dotan
>