[ofa-general] proper way to recover from poll CQ failed error
    Dotan Barak 
    dotanb at dev.mellanox.co.il
       
    Thu Mar 13 07:02:26 PDT 2008
    
    
  
Can you run ibv_devinfo again after you get this error?
(I'm trying to reproduce this error in our lab, without any success so far.)
Can you also please send me the output of "uname -a" and "cat /proc/cpuinfo"?
Dotan
Murray Smigel wrote:
> Dotan Barak wrote:
>
>> Hi.
>>
>> The fact that ibv_poll_cq failed indicates that something bad happened.
>> Usually this failure should not cause any wider problem; only the
>> process that hit the failure is affected by it.
>>
>> I personally think that the ib_* performance tools are better suited
>> for checking the performance of your subnet.
>
> **
> Which programs are these?
>
>>
>> I would appreciate it if you could answer the following questions:
>> Is this error consistent?
>
> **
> Yes. From this morning...
> murray at nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 1 -n 1000
> ...
> 2000 bytes in 0.00 seconds = 3.56 Mbit/sec
> 1000 iters in 0.00 seconds = 4.49 usec/iter
>
> murray at nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 100 -n 1000
> ...
> 200000 bytes in 0.00 seconds = 357.46 Mbit/sec
> 1000 iters in 0.00 seconds = 4.48 usec/iter
>
> murray at nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 200 -n 1000
> ...
> 400000 bytes in 0.00 seconds = 711.74 Mbit/sec
> 1000 iters in 0.00 seconds = 4.50 usec/iter
>
> murray at nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 500 -n 1000
> ...
> 1000000 bytes in 0.00 seconds = 1752.08 Mbit/sec
> 1000 iters in 0.00 seconds = 4.57 usec/iter
>
> murray at nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 1000 -n 1000
> ...
> 2000000 bytes in 0.01 seconds = 2973.98 Mbit/sec
> 1000 iters in 0.01 seconds = 5.38 usec/iter
>
> murray at nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 2000 -n 1000
> ...
> 4000000 bytes in 0.01 seconds = 5226.20 Mbit/sec
> 1000 iters in 0.01 seconds = 6.12 usec/iter
>
> murray at nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 5000 -n 1000
> ...
> poll CQ failed -2
>
> Now, this used to work:
> murray at nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 1 -n 1000
> ...
> poll CQ failed -2
>
> As did this:
> murray at nasnu2:/usr/local/bin$ ./ibv_rc_pingpong -s 1 -n 1000
>  local address:  LID 0x0003, QPN 0xa9004a, PSN 0x540960
>  remote address: LID 0x0004, QPN 0xaf004a, PSN 0x787c46
> poll CQ failed -2
>
>> Can you please send me the output of the ibv_devinfo of your machines?
>
> murray at nasnu2:/usr/local/bin$ ./ibv_devinfo
> hca_id: mlx4_0
>        fw_ver:                         2.3.000
>        node_guid:                      0002:c903:0000:c51c
>        sys_image_guid:                 0002:c903:0000:c51f
>        vendor_id:                      0x02c9
>        vendor_part_id:                 25418
>        hw_ver:                         0xA0
>        board_id:                       MT_04A0120002
>        phys_port_cnt:                  2
>                port:   1
>                        state:                  PORT_ACTIVE (4)
>                        max_mtu:                2048 (4)
>                        active_mtu:             2048 (4)
>                        sm_lid:                 3
>                        port_lid:               3
>                        port_lmc:               0x00
>
>                port:   2
>                        state:                  PORT_DOWN (1)
>                        max_mtu:                2048 (4)
>                        active_mtu:             2048 (4)
>                        sm_lid:                 0
>                        port_lid:               0
>                        port_lmc:               0x00
>
>
> murray at nasnu3:~$ cd /usr/local/bin
> murray at nasnu3:/usr/local/bin$ ./ibv_devinfo
> hca_id: mlx4_0
>        fw_ver:                         2.3.000
>        node_guid:                      0002:c903:0000:c474
>        sys_image_guid:                 0002:c903:0000:c477
>        vendor_id:                      0x02c9
>        vendor_part_id:                 25418
>        hw_ver:                         0xA0
>        board_id:                       MT_04A0120002
>        phys_port_cnt:                  2
>                port:   1
>                        state:                  PORT_ACTIVE (4)
>                        max_mtu:                2048 (4)
>                        active_mtu:             2048 (4)
>                        sm_lid:                 3
>                        port_lid:               4
>                        port_lmc:               0x00
>
>                port:   2
>                        state:                  PORT_DOWN (1)
>                        max_mtu:                2048 (4)
>                        active_mtu:             2048 (4)
>                        sm_lid:                 0
>                        port_lid:               0
>                        port_lmc:               0x00
>
>
>> Did you see any error messages in /var/log/messages when you hit
>> this error?
>
>
>
> nasnu2:/usr/local/bin# tail /var/log/messages
> Mar 13 05:30:05 nasnu2 -- MARK --
> Mar 13 05:50:05 nasnu2 -- MARK --
> Mar 13 06:10:05 nasnu2 -- MARK --
> Mar 13 06:30:05 nasnu2 -- MARK --
> Mar 13 06:50:05 nasnu2 -- MARK --
> Mar 13 07:10:05 nasnu2 -- MARK --
> Mar 13 07:30:05 nasnu2 -- MARK --
> Mar 13 07:36:09 nasnu2 syslogd 1.4.1#18: restart.
> Mar 13 07:50:05 nasnu2 -- MARK --
> Mar 13 08:10:05 nasnu2 -- MARK --
>
> nasnu2:/usr/local/bin# tail /var/log/opensm.log
> Mar 13 08:16:46 024587 [43806960] 0x02 -> SUBNET UP
> Mar 13 08:16:56 024615 [43806960] 0x02 -> SUBNET UP
> Mar 13 08:17:06 024596 [43806960] 0x02 -> SUBNET UP
> Mar 13 08:17:16 024549 [43806960] 0x02 -> SUBNET UP
> Mar 13 08:17:26 024664 [43806960] 0x02 -> SUBNET UP
> Mar 13 08:17:36 024626 [43806960] 0x02 -> SUBNET UP
> Mar 13 08:17:46 024627 [43806960] 0x02 -> SUBNET UP
> Mar 13 08:17:56 024611 [43806960] 0x02 -> SUBNET UP
> Mar 13 08:18:06 024607 [43806960] 0x02 -> SUBNET UP
>
> Thanks for your help,
> murray smigel
>
>
>>
>> thanks
>> Dotan
>>
>> Murray Smigel wrote:
>>
>>> Hi,
>>> I am running OFED-3.0 using ConnectX adapters, with two machines
>>> connected directly to each other.
>>> Most of the various pingpong tests seem ok, but when I run
>>> ibv_srq_pingpong -s 500 -n 1000
>>>
>>> I get "poll CQ failed -2" when I start up the client side.  Smaller 
>>> values of -s worked fine.
>>> Once this happens, no other pingpong tests seem to work.
>>> I then unloaded all the ib_*, mlx4_* and iw_* modules and reloaded
>>> them, but things still fail. I have to reboot the machines to get
>>> things back.
>>>
>>> 1) Is there a cleaner way to recover from this situation?
>>> 2) Is the initial failure an indication that something else is wrong?
>>> 3) Is the -s 1 latency I see with ibv_rc_pingpong of ~7 microseconds 
>>> reasonable?
>>>
>>> Thanks,
>>> murray smigel
>>>
>>>
>>>
>>
>
>
    
    