[openib-general] Question on the best approach to debug an infiniband connection problem

Hal Rosenstock halr at voltaire.com
Wed Aug 24 21:01:57 PDT 2005


Hi Sean,
 
Sorry for the slow response. I was in transit today until now.
 
>  I was wondering if there is a "best practices" method to debug a
>  possible infiniband connection. 
 
There are FAQs on the OpenIB wiki: https://openib.org/tiki/tiki-index.php
There is also one specifically for IPoIB: http://www.openib.org/docs/ipoib_faq.txt
 
> I am currently trying to send a message
>  over infiniband ib0 interface and I continue to get transmit errors.
>  Minus going through and seeing if the port state is active, 
 
Is the port state active ?
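
One quick way to check this is to read the port state from sysfs. The following is only a minimal sketch, assuming a Linux host that exposes /sys/class/infiniband/<device>/ports/<port>/state with contents such as "4: ACTIVE". The device name mthca0 and port 1 are taken from the osm.log excerpt in this message; the function names are illustrative and not part of any OpenIB tool.

```python
from pathlib import Path


def parse_port_state(text):
    """Split a sysfs state string such as '4: ACTIVE' into (4, 'ACTIVE')."""
    number, _, name = text.strip().partition(":")
    return int(number), name.strip()


def read_port_state(device="mthca0", port=1):
    # Assumed sysfs layout; adjust device and port to match your HCA.
    path = Path("/sys/class/infiniband") / device / "ports" / str(port) / "state"
    return parse_port_state(path.read_text())


if __name__ == "__main__":
    try:
        code, name = read_port_state()
    except OSError as exc:
        print(f"could not read port state: {exc}")
    else:
        print(f"port state: {name} ({code})")
```

If the reported state is anything other than ACTIVE, the transmit errors on ib0 are likely a symptom of the link or SM state rather than of IPoIB itself.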
 
What are you running ? Is this OpenSM and IPoIB off the trunk or something else ?
 
> I am at a loss to find out what the problem is. I did notice a lot of errors in
>  the /var/log/osm.log which I have listed below for today:

Aug 24 08:19:10 [42FFF960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0001
GID:0xfe80000000000000,0x0005ad000003d269
Aug 24 08:19:10 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
Aug 24 08:19:10 [42FFF960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method =
SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083,
expected comp mask = 0x00000000000130c7.

It appears that a join is failing for some reason. The log does not say which
group (MGID) this is; I will add that to the log message.

The SM is receiving a join rather than a create request for
a new multicast group. That might be OK depending on which group it is.
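
To see exactly which components the request left out, the two masks from the ERR 1B11 line can be compared bit by bit. This is a small sketch of that arithmetic only; the helper names are made up for illustration, and the mapping from bit positions to MCMemberRecord fields is deliberately not included here and should be taken from the IBA specification or the SA headers.

```python
def set_bits(mask):
    """Return the positions of all bits set in mask, lowest first."""
    return [bit for bit in range(mask.bit_length()) if mask >> bit & 1]


def compare_masks(received, expected):
    """Return (missing, unexpected) bit positions relative to expected."""
    missing = expected & ~received      # required by the SM, absent in the request
    unexpected = received & ~expected   # present in the request, not required
    return set_bits(missing), set_bits(unexpected)


# Values copied from the osm_mcmr_rcv_join_mgrp error line above.
received = 0x0000000000010083
expected = 0x00000000000130C7

missing, unexpected = compare_masks(received, expected)
print("received bits:  ", set_bits(received))
print("missing bits:   ", missing)
print("unexpected bits:", unexpected)
```

For the values in this log the request carries bits 0, 1, 7 and 16, and bits 2, 6, 12 and 13 are the ones the SM expected but did not get.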

Aug 24 08:19:10 [42FFF960] -> osm_vendor_send: RMPP 0 length 256
Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
Aug 24 08:19:14 [42FFF960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0001
GID:0xfe80000000000000,0x0005ad000003d269
Aug 24 08:19:14 [42FFF960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0001
GID:0xfe80000000000000,0x0005ad000003d269
Aug 24 08:19:16 [447FF960] -> umad_receiver: recv error Interrupted
system call
Aug 24 08:22:05 [AB441140] -> OpenSM Rev:openib-1.0.0
Aug 24 08:22:05 [AB441140] -> osm_opensm_init: Forcing single threaded
dispatcher.

It looks like OpenSM restarted here. For now, whenever OpenSM is restarted, the
IPoIB interface needs to be brought down and then back up, because client
reregistration is not yet supported.
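
A rough way to spot such restarts is to scan osm.log for the startup banner. This sketch assumes that the "OpenSM Rev:" line is printed once per OpenSM start and that log lines keep the "Mon DD HH:MM:SS [thread] -> message" layout seen in this excerpt; the function name and the regular expression are illustrative, not part of OpenSM.

```python
import re

# Matches lines such as:
#   Aug 24 08:22:05 [AB441140] -> OpenSM Rev:openib-1.0.0
BANNER = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \[[0-9A-Fa-f]+\] -> "
    r"OpenSM Rev:(?P<rev>\S+)"
)


def find_restarts(lines):
    """Return (timestamp, revision) for every OpenSM startup banner found."""
    restarts = []
    for line in lines:
        match = BANNER.match(line)
        if match:
            restarts.append((match.group("stamp"), match.group("rev")))
    return restarts


if __name__ == "__main__":
    try:
        with open("/var/log/osm.log", errors="replace") as log:
            for stamp, rev in find_restarts(log):
                print(f"OpenSM ({rev}) started at {stamp}")
    except OSError as exc:
        print(f"could not read osm.log: {exc}")
```

Any IPoIB interface that was already up before one of the reported timestamps is a candidate for being brought down and back up.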


Aug 24 08:22:05 [AB441140] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0000
GID:0xfe80000000000000,0x0000000000000000
Aug 24 08:22:05 [AB441140] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0000
GID:0xfe80000000000000,0x0000000000000000
Aug 24 08:22:05 [AB441140] -> osm_vendor_get_all_port_attr: assign CA
mthca0 port 1 guid (0x5ad000003d269) as the default port.
Aug 24 08:22:05 [AB441140] -> osm_vendor_bind: Binding to port
0x5ad000003d269.
Aug 24 08:22:05 [AB441140] -> osm_vendor_bind: Binding to port
0x5ad000003d269.
Aug 24 08:22:05 [42FFF960] -> __osm_trap_rcv_process_request: Received
Generic Notice type:0x01 num:128 Producer:2 from LID:0x0002
TID:0x0000000000000000
Aug 24 08:22:05 [42FFF960] -> osm_report_notice: Reporting Generic
Notice type:1 num:128 from LID:0x0002
GID:0xfe80000000000000,0x0002c9010bec5320
Aug 24 08:22:06 [42FFF960] -> __osm_trap_rcv_process_request: Received
Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001
TID:0x0000000000000000
Aug 24 08:22:06 [42FFF960] -> osm_report_notice: Reporting Generic
Notice type:4 num:144 from LID:0x0001
GID:0xfe80000000000000,0x0005ad000003d269
Aug 24 08:22:12 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
Aug 24 08:22:12 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
Aug 24 08:22:12 [42FFF960] -> osm_vendor_send: RMPP 0 length 112

-- Hal



