[openib-general] Infiniband on Debian etch RC1

Diego Guella diego.guella at sircomtech.com
Tue Nov 21 05:28:17 PST 2006


>> If you are using OFED 1.1, then you should use the source RPM for
>> OpenSM. There was one patch on the list found with Debian for a stack
>> smashing issue with osm_helper.c.
>
> The SM which is currently running (on SuSE 9.3) is the one included in
> OFED-1.0.
> Should I migrate to OFED-1.1, or can I build opensm from the OFED-1.0
> source RPM?
>
>
>>> I added a line with "ib_ipoib" to /etc/modules.
>>>
>>> So now I think I have to configure 2 new devices (MHES28 has 2 ports) in
>>> /etc/network/interfaces.
>>>
>>> I added 2 devices named ib0 and ib1, and I configured them to have
>>> static IP addresses, just like a normal ethernet device.
>>>
>>> ifconfig shows they are up; one has the attribute "RUNNING" too, the
>>> other does not (I think this is because one has the cable plugged in,
>>> the other not).
>>>
>>> All this is done on server PE1950
>>>
>>> Now, that cable goes to the other server, a PE2850, which has an SM
>>> running.
>>>
>>> I try to ping that server, using the IP address of the infiniband IPoIB
>>> interface, but I get "destination unreachable".
>>
>> I presume the two machines are on the same IPoIB subnet. Are there any
>> errors in the OpenSM log ?
>>
> Yes, my Ethernet subnet is 192.168.200.0/255 and my Infiniband IPoIB
> subnet is 193.168.200.0/255; this is the same on all the machines.
>
> I opened /var/log/osm.log for the first time now and (apart from the log
> size - 31MB!) there is this error, which has been repeating every 10
> seconds since June 26 (the date when I installed OFED-1.0) until today:
> -----
> Nov 21 13:00:22 438727 [42003960] -> __osm_sm_state_mgr_signal_error: ERR
> 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
> IB_SMINFO_STATE_DISCOVERING
> Nov 21 13:00:32 441056 [0000] -> SM port is down
> -----
>
> I want to point out that I have another system, a desktop, with installed
> SuSE 9.3 and OFED-1.0 and a MHES14 card, and that works fine, IPoIB, SDP,
> RDMA, all the features are OK.
>
>
>
Since the log was very big, I renamed it and rebooted the machine (machine:
PE2850). The other machine (PE1950) was on when PE2850 rebooted.
The log now is:
-----
Nov 21 13:59:36 134700 [AB467140] -> OpenSM Rev:openib-1.2.1 OpenIB svn
Exported revision
Nov 21 13:59:36 134833 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported
revision

Nov 21 13:59:36 135958 [AB467140] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0000
GID:0xfe80000000000000,0x0000000000000000
Nov 21 13:59:36 136014 [AB467140] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0000
GID:0xfe80000000000000,0x0000000000000000
Nov 21 13:59:36 139847 [AB467140] -> osm_vendor_bind: Binding to port
0x2c9020021c9f1
Nov 21 13:59:36 142549 [AB467140] -> osm_vendor_bind: Binding to port
0x2c9020021c9f1
Nov 21 13:59:36 144291 [0000] -> Entering MASTER state

Nov 21 13:59:36 201383 [0000] -> SUBNET UP

Nov 21 13:59:36 510056 [41802960] -> __osm_trap_rcv_process_request:
Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002
TID:0x0000000000000000
Nov 21 13:59:36 510245 [41802960] -> osm_report_notice: Reporting Generic
Notice type:4 num:144 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 13:59:37 150118 [41802960] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 13:59:37 151069 [41001960] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 13:59:37 151501 [42804960] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 13:59:37 153744 [42003960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
0x0000000000000016 from port 0x0002c9020021c9f1
Nov 21 13:59:38 860675 [41802960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
expected comp mask = 0x00000000000130c7, MGID: 0xff12601bffff0000 :
0x0000000000000002 from port 0x0002c9020021c9f1
-----

Then, I rebooted PE1950, with PE2850 still on.
These lines were added to the log:

-----
Nov 21 14:04:19 597836 [42003960] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:19 604180 [41802960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
0x0000000000000016 from port 0x0002c9020021c9fd
Nov 21 14:04:27 141302 [42804960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:27 141393 [42804960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:29 631497 [42804960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:29 631571 [42804960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:29 631621 [41802960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:29 631710 [41802960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:29 634132 [41802960] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:29 635458 [41001960] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:29 636439 [41802960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
0x0000000000000016 from port 0x0002c9020021c9f1
Nov 21 14:04:30 636442 [42003960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
0x0000000000000016 from port 0x0002c9020021c9f1
Nov 21 14:04:36 183733 [0000] -> SM port is down

Nov 21 14:04:36 183915 [42003960] -> osm_report_notice: Reporting Generic
Notice type:3 num:65 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:36 183937 [42003960] -> Removed port with
GUID:0x0002c9020021c9fd LID range [0x4,0x4] of node:MT25218 InfiniHostEx
Mellanox Technologies
Nov 21 14:04:36 183960 [42003960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:36 183974 [42003960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:36 183986 [42003960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:36 183997 [42003960] -> osm_report_notice: Reporting Generic
Notice type:3 num:65 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:04:36 184004 [42003960] -> Removed port with
GUID:0x0002c9020021c9f1 LID range [0x2,0x2] of node:server19 HCA-1
Nov 21 14:04:46 185503 [0000] -> SM port is down

Nov 21 14:04:46 185665 [42003960] -> __osm_sm_state_mgr_signal_error: ERR
3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
IB_SMINFO_STATE_DISCOVERING
Nov 21 14:04:56 186974 [0000] -> SM port is down

Nov 21 14:04:56 187118 [41001960] -> __osm_sm_state_mgr_signal_error: ERR
3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
IB_SMINFO_STATE_DISCOVERING
Nov 21 14:05:06 188262 [0000] -> SM port is down

Nov 21 14:05:06 188412 [41001960] -> __osm_sm_state_mgr_signal_error: ERR
3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
IB_SMINFO_STATE_DISCOVERING
Nov 21 14:05:16 189833 [0000] -> SM port is down

Nov 21 14:05:16 189985 [41001960] -> __osm_sm_state_mgr_signal_error: ERR
3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
IB_SMINFO_STATE_DISCOVERING
Nov 21 14:05:26 191250 [0000] -> SM port is down

Nov 21 14:05:26 191394 [42003960] -> __osm_sm_state_mgr_signal_error: ERR
3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
IB_SMINFO_STATE_DISCOVERING
Nov 21 14:05:36 192765 [0000] -> SM port is down

Nov 21 14:05:36 192910 [42003960] -> __osm_sm_state_mgr_signal_error: ERR
3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
IB_SMINFO_STATE_DISCOVERING
Nov 21 14:05:46 194216 [0000] -> SM port is down

Nov 21 14:05:46 194359 [42003960] -> __osm_sm_state_mgr_signal_error: ERR
3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
IB_SMINFO_STATE_DISCOVERING
Nov 21 14:05:56 195696 [0000] -> SM port is down

Nov 21 14:05:56 195824 [42003960] -> __osm_sm_state_mgr_signal_error: ERR
3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
IB_SMINFO_STATE_DISCOVERING
Nov 21 14:06:06 197568 [0000] -> Entering MASTER state

Nov 21 14:06:06 202672 [0000] -> SUBNET UP

Nov 21 14:06:06 204086 [41802960] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:06:06 205041 [42003960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
0x0000000000000016 from port 0x0002c9020021c9f1
Nov 21 14:06:06 205228 [41001960] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:06:06 205758 [41802960] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:06:07 199438 [42003960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:06:07 199494 [42003960] -> osm_report_notice: Reporting Generic
Notice type:3 num:67 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:06:07 201049 [41001960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
0x0000000000000016 from port 0x0002c9020021c9f1
Nov 21 14:06:07 202116 [41802960] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:06:07 203033 [42804960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
0x0000000000000016 from port 0x0002c9020021c9f1
Nov 21 14:06:07 316711 [42804960] -> osm_report_notice: Reporting Generic
Notice type:3 num:66 from LID:0x0002
GID:0xfe80000000000000,0x0002c9020021c9f1
Nov 21 14:06:07 322157 [42003960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
expected comp mask = 0x00000000000130c7, MGID: 0xff12601bffff0000 :
0x0000000000000016 from port 0x0002c9020021c9fd
Nov 21 14:06:07 330160 [41802960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
0x0000000000000016 from port 0x0002c9020021c9fd
Nov 21 14:06:09 294120 [42804960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
expected comp mask = 0x00000000000130c7, MGID: 0xff12601bffff0000 :
0x0000000000000002 from port 0x0002c9020021c9fd
Nov 21 14:06:09 741992 [42003960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
expected comp mask = 0x00000000000130c7, MGID: 0xff12601bffff0000 :
0x0000000000000016 from port 0x0002c9020021c9fd
-----

Then I tried, from PE1950, to ping PE2850, but no new lines appeared in
the log.

I don't know what the log means. Is this log 'normal', or are these
errors critical?
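
Since the old log grew to 31MB, a summary may be easier to read than the
raw file. The following is only a rough sketch in Python that counts how
often each "ERR <code>" appears per reporting function. It assumes the
"-> function: ERR code:" shape seen in the excerpts above; the names
count_osm_errors and summarize are made up for this illustration and are
not part of OpenSM or OFED.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpts above:
#   "... -> osm_mcmr_rcv_join_mgrp: ERR 1B11: ..."
# \s+ also matches a line break, so mail-wrapped copies still match.
ERR_RE = re.compile(r"->\s+(\w+):\s+ERR\s+([0-9A-Fa-f]{4}):")


def count_osm_errors(text):
    """Return a Counter keyed by (function name, error code)."""
    return Counter(ERR_RE.findall(text))


def summarize(path="/var/log/osm.log"):
    """Print one line per (function, code) pair, most frequent first."""
    with open(path, errors="replace") as fh:
        counts = count_osm_errors(fh.read())
    for (func, code), n in counts.most_common():
        print(f"{n:8d}  ERR {code}  {func}")
```

Calling summarize() with the path of the renamed log would show whether
ERR 3207 and ERR 1B11 are the only error codes in it, and how the counts
compare.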

Thanks,
Diego




