[ofa-general] proper way to recover from poll CQ failed error
Murray Smigel
murray at tradeworx.com
Thu Mar 13 05:28:36 PDT 2008
Dotan Barak wrote:
> Hi.
>
> The fact that ibv_poll_cq failed indicates that something bad happened.
> Usually this failure should not create any wider problem: only the
> process that hit the error is affected by it.
>
> I personally think that the ib_* performance tools are better to check
> the performance of your subnet.
**
Which programs are these?
>
> I will be happy if you'll answer the following questions:
> Is this error consistent?
**
Yes. From this morning...
murray@nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 1 -n 1000
...
2000 bytes in 0.00 seconds = 3.56 Mbit/sec
1000 iters in 0.00 seconds = 4.49 usec/iter
murray@nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 100 -n 1000
...
200000 bytes in 0.00 seconds = 357.46 Mbit/sec
1000 iters in 0.00 seconds = 4.48 usec/iter
murray@nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 200 -n 1000
...
400000 bytes in 0.00 seconds = 711.74 Mbit/sec
1000 iters in 0.00 seconds = 4.50 usec/iter
murray@nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 500 -n 1000
...
1000000 bytes in 0.00 seconds = 1752.08 Mbit/sec
1000 iters in 0.00 seconds = 4.57 usec/iter
murray@nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 1000 -n 1000
...
2000000 bytes in 0.01 seconds = 2973.98 Mbit/sec
1000 iters in 0.01 seconds = 5.38 usec/iter
murray@nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 2000 -n 1000
...
4000000 bytes in 0.01 seconds = 5226.20 Mbit/sec
1000 iters in 0.01 seconds = 6.12 usec/iter
murray@nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 5000 -n 1000
...
poll CQ failed -2
Now, this used to work:
murray@nasnu2:/usr/local/bin$ ./ibv_srq_pingpong -s 1 -n 1000
...
poll CQ failed -2
As did this:
murray@nasnu2:/usr/local/bin$ ./ibv_rc_pingpong -s 1 -n 1000
local address: LID 0x0003, QPN 0xa9004a, PSN 0x540960
remote address: LID 0x0004, QPN 0xaf004a, PSN 0x787c46
poll CQ failed -2
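A note on reading the successful runs above: the two summary lines appear to be linked by simple arithmetic. The sketch below assumes the byte count is size * iters * 2 (one message in each direction per iteration) and that the rate is bits divided by elapsed microseconds, which matches the printed figures. The function name `pingpong_stats` is hypothetical, and the elapsed time is reconstructed from the printed usec/iter because the output only shows it rounded to "0.00 seconds".

```python
def pingpong_stats(size, iters, usec):
    """Recompute the pingpong summary figures (assumed formulas, see lead-in)."""
    total_bytes = size * iters * 2          # assumption: one message each way per iteration
    mbit_per_sec = total_bytes * 8 / usec   # bits per microsecond equals Mbit/sec
    usec_per_iter = usec / iters
    return total_bytes, mbit_per_sec, usec_per_iter

# The "-s 1 -n 1000" run: 4.49 usec/iter over 1000 iterations is about 4490 usec.
total, mbit, per_iter = pingpong_stats(size=1, iters=1000, usec=4.49 * 1000)
print(f"{total} bytes = {mbit:.2f} Mbit/sec, {per_iter:.2f} usec/iter")
```

With these inputs the sketch reproduces the 2000 bytes, 3.56 Mbit/sec and 4.49 usec/iter of the first run. It also shows why "0.00 seconds" is printed: about 4490 usec is 0.0045 seconds, which rounds to 0.00 at two decimal places.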
> Can you please send me the output of the ibv_devinfo of your machines?
murray@nasnu2:/usr/local/bin$ ./ibv_devinfo
hca_id: mlx4_0
    fw_ver:          2.3.000
    node_guid:       0002:c903:0000:c51c
    sys_image_guid:  0002:c903:0000:c51f
    vendor_id:       0x02c9
    vendor_part_id:  25418
    hw_ver:          0xA0
    board_id:        MT_04A0120002
    phys_port_cnt:   2
        port: 1
            state:       PORT_ACTIVE (4)
            max_mtu:     2048 (4)
            active_mtu:  2048 (4)
            sm_lid:      3
            port_lid:    3
            port_lmc:    0x00
        port: 2
            state:       PORT_DOWN (1)
            max_mtu:     2048 (4)
            active_mtu:  2048 (4)
            sm_lid:      0
            port_lid:    0
            port_lmc:    0x00
murray@nasnu3:~$ cd /usr/local/bin
murray@nasnu3:/usr/local/bin$ ./ibv_devinfo
hca_id: mlx4_0
    fw_ver:          2.3.000
    node_guid:       0002:c903:0000:c474
    sys_image_guid:  0002:c903:0000:c477
    vendor_id:       0x02c9
    vendor_part_id:  25418
    hw_ver:          0xA0
    board_id:        MT_04A0120002
    phys_port_cnt:   2
        port: 1
            state:       PORT_ACTIVE (4)
            max_mtu:     2048 (4)
            active_mtu:  2048 (4)
            sm_lid:      3
            port_lid:    4
            port_lmc:    0x00
        port: 2
            state:       PORT_DOWN (1)
            max_mtu:     2048 (4)
            active_mtu:  2048 (4)
            sm_lid:      0
            port_lid:    0
            port_lmc:    0x00
> Did you see any error messages in /var/log/messages when this error
> occurred?
nasnu2:/usr/local/bin# tail /var/log/messages
Mar 13 05:30:05 nasnu2 -- MARK --
Mar 13 05:50:05 nasnu2 -- MARK --
Mar 13 06:10:05 nasnu2 -- MARK --
Mar 13 06:30:05 nasnu2 -- MARK --
Mar 13 06:50:05 nasnu2 -- MARK --
Mar 13 07:10:05 nasnu2 -- MARK --
Mar 13 07:30:05 nasnu2 -- MARK --
Mar 13 07:36:09 nasnu2 syslogd 1.4.1#18: restart.
Mar 13 07:50:05 nasnu2 -- MARK --
Mar 13 08:10:05 nasnu2 -- MARK --
nasnu2:/usr/local/bin# tail /var/log/opensm.log
Mar 13 08:16:46 024587 [43806960] 0x02 -> SUBNET UP
Mar 13 08:16:56 024615 [43806960] 0x02 -> SUBNET UP
Mar 13 08:17:06 024596 [43806960] 0x02 -> SUBNET UP
Mar 13 08:17:16 024549 [43806960] 0x02 -> SUBNET UP
Mar 13 08:17:26 024664 [43806960] 0x02 -> SUBNET UP
Mar 13 08:17:36 024626 [43806960] 0x02 -> SUBNET UP
Mar 13 08:17:46 024627 [43806960] 0x02 -> SUBNET UP
Mar 13 08:17:56 024611 [43806960] 0x02 -> SUBNET UP
Mar 13 08:18:06 024607 [43806960] 0x02 -> SUBNET UP
Thanks for your help,
murray smigel
>
> thanks
> Dotan
>
> Murray Smigel wrote:
>
>> Hi,
>> I am running OFED-3.0 with ConnectX adapters in a two-machine,
>> direct-connect setup.
>> Most of the various pingpong tests seem ok, but when I run
>> ibv_srq_pingpong -s 500 -n 1000
>>
>> I get "poll CQ failed -2" when I start up the client side. Smaller
>> values of -s worked fine.
>> Once this happens, no other pingpong tests seem to work.
>> I then unloaded all the ib_*, mlx4_* and iw_* modules and reloaded
>> them, but things still
>> fail. I have to reboot the machines to get things back.
>>
>> 1) Is there a cleaner way to recover from this situation?
>> 2) Is the initial failure an indication that something else is wrong?
>> 3) Is the -s 1 latency I see with ibv_rc_pingpong of ~7 microseconds
>> reasonable?
>>
>> Thanks,
>> murray smigel