[openib-general] Question on the best approach to debug an infiniband connection problem
Sean Hubbell
shubbell at dbresearch.net
Thu Aug 25 05:41:44 PDT 2005
>Is the port state active ?
>
>
The port is active for port 1 and down for port 2. Port 2 is not connected.
>
>What are you running ? Is this OpenSM and IPoIB off the trunk or something else ?
>
>
>
>>I am at a loss to find out what the problem is. I did notice a lot of errors in
>> the /var/log/osm.log which I have listed below for today:
>>
>>
>
>
>
Yes, I guess I should have mentioned that. I am running cAos 2.0 with
the openib package along with the opensm that comes with openib. I am
also trying to run over IPoIB.
>Aug 24 08:19:10 [42FFF960] -> osm_report_notice: Reporting Generic
>Notice type:3 num:67 from LID:0x0001
>GID:0xfe80000000000000,0x0005ad000003d269
>Aug 24 08:19:10 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
>Aug 24 08:19:10 [42FFF960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method =
>SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083,
>expected comp mask = 0x00000000000130c7.
>It appears that a join is failing for some reason. It doesn't say which group
>(MGID) this is. (I will add that into the log).
>
>The SM is receiving a join rather than a create request for
>a new multicast group. That might be OK depending on which group it is.
>
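The two masks in the ERR 1B11 line can be compared bit by bit to see which components the request left out. What follows is only a minimal Python sketch: the component names and their bit order are an assumption based on the MCMemberRecord component mask layout and should be checked against the InfiniBand spec or the OpenSM headers, and the helper names are made up for illustration.

```python
# Component bits of the SA MCMemberRecord, lowest bit first.
# Assumption: this order follows the MCMemberRecord component mask layout;
# verify it against the spec or the OpenSM headers before trusting the names.
MCMR_COMPONENTS = [
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU",
    "TClass", "P_Key", "RateSelector", "Rate",
    "PacketLifeTimeSelector", "PacketLifeTime", "SL", "FlowLabel",
    "HopLimit", "Scope", "JoinState", "ProxyJoin",
]


def mask_names(mask):
    """Return the names of the components whose bits are set in mask."""
    return [name for bit, name in enumerate(MCMR_COMPONENTS) if mask & (1 << bit)]


def missing_components(received, expected):
    """Return the components set in expected but not in received."""
    return mask_names(expected & ~received)


if __name__ == "__main__":
    received = 0x0000000000010083  # "component mask" from the ERR 1B11 line
    expected = 0x00000000000130C7  # "expected comp mask" from the same line
    print("received:", mask_names(received))
    print("missing: ", missing_components(received, expected))
```

Under that assumed bit order, the request carried MGID, PortGID, P_Key and JoinState but not Q_Key, TClass, SL or FlowLabel, which would fit a plain join arriving where the SM expected the fields of a create.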
>Aug 24 08:19:10 [42FFF960] -> osm_vendor_send: RMPP 0 length 256
>Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
>Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
>Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
>Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112
>Aug 24 08:19:14 [42FFF960] -> osm_report_notice: Reporting Generic
>Notice type:3 num:67 from LID:0x0001
>GID:0xfe80000000000000,0x0005ad000003d269
>Aug 24 08:19:14 [42FFF960] -> osm_report_notice: Reporting Generic
>Notice type:3 num:67 from LID:0x0001
>GID:0xfe80000000000000,0x0005ad000003d269
>Aug 24 08:19:16 [447FF960] -> umad_receiver: recv error Interrupted
>system call
>Aug 24 08:22:05 [AB441140] -> OpenSM Rev:openib-1.0.0
>Aug 24 08:22:05 [AB441140] -> osm_opensm_init: Forcing single threaded
>dispatcher.
>
>It looks like OpenSM restarted here. If OpenSM is restarted currently, the IPoIB
>interface needs to be downed and then upped as client reregistration is not currently
>supported.
>
>
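Since a restart shows up in the log as a new "OpenSM Rev:" banner after earlier activity, it can be spotted mechanically. The sketch below is a rough illustration only: the regular expression is modelled on the log lines quoted in this thread (unwrapped, without the leading ">"), and find_restarts is a hypothetical helper, not part of OpenSM.

```python
import re

# Matches the startup banner as it appears in the log quoted in this thread:
#   "Aug 24 08:22:05 [AB441140] -> OpenSM Rev:openib-1.0.0"
BANNER = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d) \[[0-9A-Fa-f]+\] -> "
    r"OpenSM Rev:(?P<rev>\S+)"
)


def find_restarts(lines):
    """Return (timestamp, revision) for each banner that follows earlier log lines."""
    restarts = []
    seen_activity = False
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        match = BANNER.match(line)
        if match and seen_activity:
            restarts.append((match.group("stamp"), match.group("rev")))
        seen_activity = True
    return restarts


if __name__ == "__main__":
    sample = [
        "Aug 24 08:19:14 [42FFF960] -> osm_vendor_send: RMPP 0 length 112",
        "Aug 24 08:19:16 [447FF960] -> umad_receiver: recv error Interrupted system call",
        "Aug 24 08:22:05 [AB441140] -> OpenSM Rev:openib-1.0.0",
    ]
    for stamp, rev in find_restarts(sample):
        print("OpenSM (re)started at", stamp, "revision", rev)
```

A banner at the very top of a freshly rotated log is not counted, because there is nothing before it to suggest the SM had already been running.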
Yes, from the 4.5 hours I spent looking yesterday and from looking at
the ARP table, this makes sense. What I ended up doing to fix it was to
bring down ib0 and then bring it back up. After a little while, when I
started trying to ping, things were back to working. I will have to say
that I was very concerned with our applications running using IPoIB, but
after you mentioned this and after what I saw, I think we will be ok.
Thank you for your response.
Sean
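For anyone wanting to script the workaround described above (take ib0 down, then bring it back up after OpenSM restarts), a small Python sketch follows. The exact commands used are not stated in this thread; the iproute2 "ip link set" form is an assumption here, and "ifconfig ib0 down" followed by "ifconfig ib0 up" would be the older equivalent. The function names are illustrative only.

```python
import subprocess
import time


def bounce_commands(iface="ib0"):
    """Build the down/up command pair for an interface (iproute2 syntax)."""
    return [
        ["ip", "link", "set", "dev", iface, "down"],
        ["ip", "link", "set", "dev", iface, "up"],
    ]


def bounce(iface="ib0", pause=1.0, run=subprocess.check_call):
    """Take iface down, wait briefly, then bring it back up. Needs root."""
    down, up = bounce_commands(iface)
    run(down)          # raises CalledProcessError if the command fails
    time.sleep(pause)  # short gap between down and up
    run(up)
```

Calling bounce("ib0") as root runs both commands in order. The run parameter exists only so the sequence can be exercised without touching a real interface. As noted above, it can still take a little while after the interface comes back before pings succeed.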