[openib-general] Question on the best approach to debug an infiniband connection problem

Sean Hubbell shubbell at dbresearch.net
Thu Aug 25 05:41:44 PDT 2005


>Is the port state active ?
>
The port is active for port 1 and down for port 2. Port 2 is not connected.

>What are you running ? Is this OpenSM and IPoIB off the trunk or something else ?
>
>>I am at a loss to find out what the problem is. I did notice a lot of errors in
>> the /var/log/osm.log which I have listed below for today:
>
Yes, I guess I should have mentioned that. I am running cAos 2.0 with 
the openib package along with the opensm that comes with openib. I am 
also trying to run over IPoIB.

>Aug 24 08:19:10 [42FFF960] -> osm_report_notice: Reporting Generic
>Notice type:3 num:67 from LID:0x0001
>GID:0xfe80000000000000,0x0005ad000003d269
>Aug 24 08:19:10 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
>Aug 24 08:19:10 [42FFF960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method =
>SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083,
>expected comp mask = 0x00000000000130c7.
>It appears that a join is failing for some reason. It doesn't say which group
>(MGID) this is.  (I will add that into the log).
>
>The SM is receiving a join rather than a create request for
>a new multicast group. That might be OK depending on which group it is.
>
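To see which fields the failing join left out, the two masks in the ERR 1B11 line above can be compared bit by bit. This is only a minimal Python sketch; the helper name missing_bits is my own and is not anything from OpenSM:

```python
def missing_bits(received, expected):
    """Return the bit positions set in `expected` but clear in `received`."""
    missing = expected & ~received
    return [bit for bit in range(missing.bit_length()) if (missing >> bit) & 1]


# Values copied from the ERR 1B11 log line.
received = 0x0000000000010083   # component mask sent with the join
expected = 0x00000000000130C7   # component mask the SM expected

print(hex(expected & ~received))          # 0x3044
print(missing_bits(received, expected))   # [2, 6, 12, 13]
```

So bits 2, 6, 12 and 13 are set in the expected mask but not in the request. I have not mapped those bit positions to MCMemberRecord field names here; that would need checking against the SA component mask definitions.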
>Aug 24 08:19:10 [42FFF960] -> osm_vendor_send: RMPP 0 length 256
>Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
>Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
>Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
>Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
>Aug 24 08:19:14 [42FFF960] -> osm_report_notice: Reporting Generic
>Notice type:3 num:67 from LID:0x0001
>GID:0xfe80000000000000,0x0005ad000003d269
>Aug 24 08:19:14 [42FFF960] -> osm_report_notice: Reporting Generic
>Notice type:3 num:67 from LID:0x0001
>GID:0xfe80000000000000,0x0005ad000003d269
>Aug 24 08:19:16 [447FF960] -> umad_receiver: recv error Interrupted
>system call
>Aug 24 08:22:05 [AB441140] -> OpenSM Rev:openib-1.0.0
>Aug 24 08:22:05 [AB441140] -> osm_opensm_init: Forcing single threaded
>dispatcher.
>
>It looks like OpenSM restarted here. Whenever OpenSM is restarted, the IPoIB
>interface has to be brought down and back up, since client reregistration is not
>currently supported.
>
Yes, after the 4.5 hours I spent looking yesterday, and from looking at 
the arp table, this makes sense. What I ended up doing to fix it was to 
bring down ib0 and then bring it back up. After a little while, when I 
started to ping again, things were back to working. I have to say 
that I was very concerned about running our applications over IPoIB, but 
after what you mentioned and what I saw, I think we will be ok.
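
For anyone who wants to script the workaround after an OpenSM restart, here is a rough Python sketch of the down/up cycle. The use of `ip link set` is an assumption (the same thing can be done with ifconfig), the function names are made up for illustration, and it needs root to actually change the interface:

```python
import subprocess


def bounce_commands(iface="ib0"):
    """Build the commands that take `iface` down and bring it back up."""
    return [
        ["ip", "link", "set", "dev", iface, "down"],
        ["ip", "link", "set", "dev", iface, "up"],
    ]


def bounce(iface="ib0", run=subprocess.check_call):
    """Run the down/up commands in order; `run` can be swapped out for a dry run."""
    for cmd in bounce_commands(iface):
        run(cmd)


# Dry run: print the commands instead of executing them.
bounce("ib0", run=print)
```

Calling bounce("ib0") with the default runner executes the two commands. As noted above, it can still take a little while after the interface comes back before pings go through.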


Thank you for your response.

Sean


