[openib-general] Infiniband on Debian etch RC1

Hal Rosenstock halr at voltaire.com
Tue Nov 21 05:38:42 PST 2006


On Tue, 2006-11-21 at 08:28, Diego Guella wrote:
> >> If you are using OFED 1.1, then you should use the source RPM for
> >> OpenSM. There was one patch on the list found with Debian for a stack
> >> smashing issue with osm_helper.c.
> >
> > The SM which is currently running (on SuSE 9.3) is the one included in
> > OFED-1.0.
> > Should I migrate to OFED-1.1 or can I build opensm from the OFED-1.0
> > source RPM?
> >
> >
> >>> I added a line with "ib_ipoib" to /etc/modules.
> >>>
> >>> So now I think I have to configure 2 new devices (MHES28 has 2 ports) in
> >>> /etc/network/interfaces.
> >>>
> >>> I added 2 devices named ib0 and ib1, and I configured them to have
> >>> static IP addresses, just like a normal ethernet device.
> >>>
> >>> ifconfig shows they are up, one has the attribute "RUNNING" too, the
> >>> other not (I think this is because one has the cable plugged, the
> >>> other not).
> >>>
> >>> All this is done on server PE1950
> >>>
> >>> Now, that cable goes to the other server, a PE2850, which has a SM
> >>> running.
> >>>
> >>> I try to ping that server, using the IP address of the infiniband IPoIB
> >>> interface, but I get "destination unreachable".
> >>
> >> I presume the two machines are on the same IPoIB subnet. Are there any
> >> errors in the OpenSM log?
> >>
> > Yes, my Ethernet subnet is 192.168.200.0/255 and my Infiniband IPoIB
> > subnet is 193.168.200.0/255, this is the same on all the machines.
> >
> > I opened /var/log/osm.log for the first time now and (apart from the log
> > size - 31MB!) there is this error, which has been repeating every 10
> > seconds since June 26 (the date when I installed OFED-1.0) until today:
> > -----
> > Nov 21 13:00:22 438727 [42003960] -> __osm_sm_state_mgr_signal_error: ERR
> > 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
> > IB_SMINFO_STATE_DISCOVERING
> > Nov 21 13:00:32 441056 [0000] -> SM port is down
> > -----
> >
> > I want to point out that I have another system, a desktop, with installed
> > SuSE 9.3 and OFED-1.0 and a MHES14 card, and that works fine, IPoIB, SDP,
> > RDMA, all the features are OK.
> >
> >
> >
> Since the log was very big, I renamed it and rebooted the machine (machine:
> PE2850). The other machine (PE1950) was on when PE2850 rebooted.
> The log now is:
> -----
> Nov 21 13:59:36 134700 [AB467140] -> OpenSM Rev:openib-1.2.1 OpenIB svn
> Exported revision
> Nov 21 13:59:36 134833 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported
> revision
> 
> Nov 21 13:59:36 135958 [AB467140] -> osm_report_notice: Reporting Generic
> Notice type:3 num:66 from LID:0x0000
> GID:0xfe80000000000000,0x0000000000000000
> Nov 21 13:59:36 136014 [AB467140] -> osm_report_notice: Reporting Generic
> Notice type:3 num:66 from LID:0x0000
> GID:0xfe80000000000000,0x0000000000000000
> Nov 21 13:59:36 139847 [AB467140] -> osm_vendor_bind: Binding to port
> 0x2c9020021c9f1
> Nov 21 13:59:36 142549 [AB467140] -> osm_vendor_bind: Binding to port
> 0x2c9020021c9f1
> Nov 21 13:59:36 144291 [0000] -> Entering MASTER state
> 
> Nov 21 13:59:36 201383 [0000] -> SUBNET UP
> 
> Nov 21 13:59:36 510056 [41802960] -> __osm_trap_rcv_process_request:
> Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002
> TID:0x0000000000000000
> Nov 21 13:59:36 510245 [41802960] -> osm_report_notice: Reporting Generic
> Notice type:4 num:144 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 13:59:37 150118 [41802960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:66 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 13:59:37 151069 [41001960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:66 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 13:59:37 151501 [42804960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:66 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 13:59:37 153744 [42003960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
> expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
> 0x0000000000000016 from port 0x0002c9020021c9f1
> Nov 21 13:59:38 860675 [41802960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
> expected comp mask = 0x00000000000130c7, MGID: 0xff12601bffff0000 :
> 0x0000000000000002 from port 0x0002c9020021c9f1
> -----
> 
> Then, I rebooted PE1950, with PE2850 still on.
> These lines were added to the log:
> 
> -----
> Nov 21 14:04:19 597836 [42003960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:66 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:19 604180 [41802960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
> expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
> 0x0000000000000016 from port 0x0002c9020021c9fd
> Nov 21 14:04:27 141302 [42804960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:67 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:27 141393 [42804960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:67 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:29 631497 [42804960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:67 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:29 631571 [42804960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:67 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:29 631621 [41802960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:67 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:29 631710 [41802960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:67 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:29 634132 [41802960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:66 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:29 635458 [41001960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:66 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:29 636439 [41802960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
> expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
> 0x0000000000000016 from port 0x0002c9020021c9f1
> Nov 21 14:04:30 636442 [42003960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
> expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
> 0x0000000000000016 from port 0x0002c9020021c9f1
> Nov 21 14:04:36 183733 [0000] -> SM port is down

Why is the SM port down?

> Nov 21 14:04:36 183915 [42003960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:36 183937 [42003960] -> Removed port with
> GUID:0x0002c9020021c9fd LID range [0x4,0x4] of node:MT25218 InfiniHostEx
> Mellanox Technologies
> Nov 21 14:04:36 183960 [42003960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:67 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:36 183974 [42003960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:67 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:36 183986 [42003960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:67 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:36 183997 [42003960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:04:36 184004 [42003960] -> Removed port with
> GUID:0x0002c9020021c9f1 LID range [0x2,0x2] of node:server19 HCA-1
> Nov 21 14:04:46 185503 [0000] -> SM port is down
> 
> Nov 21 14:04:46 185665 [42003960] -> __osm_sm_state_mgr_signal_error: ERR
> 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
> IB_SMINFO_STATE_DISCOVERING
> Nov 21 14:04:56 186974 [0000] -> SM port is down
> 
> Nov 21 14:04:56 187118 [41001960] -> __osm_sm_state_mgr_signal_error: ERR
> 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
> IB_SMINFO_STATE_DISCOVERING
> Nov 21 14:05:06 188262 [0000] -> SM port is down
> 
> Nov 21 14:05:06 188412 [41001960] -> __osm_sm_state_mgr_signal_error: ERR
> 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
> IB_SMINFO_STATE_DISCOVERING
> Nov 21 14:05:16 189833 [0000] -> SM port is down
> 
> Nov 21 14:05:16 189985 [41001960] -> __osm_sm_state_mgr_signal_error: ERR
> 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
> IB_SMINFO_STATE_DISCOVERING
> Nov 21 14:05:26 191250 [0000] -> SM port is down
> 
> Nov 21 14:05:26 191394 [42003960] -> __osm_sm_state_mgr_signal_error: ERR
> 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
> IB_SMINFO_STATE_DISCOVERING
> Nov 21 14:05:36 192765 [0000] -> SM port is down
> 
> Nov 21 14:05:36 192910 [42003960] -> __osm_sm_state_mgr_signal_error: ERR
> 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
> IB_SMINFO_STATE_DISCOVERING
> Nov 21 14:05:46 194216 [0000] -> SM port is down
> 
> Nov 21 14:05:46 194359 [42003960] -> __osm_sm_state_mgr_signal_error: ERR
> 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
> IB_SMINFO_STATE_DISCOVERING
> Nov 21 14:05:56 195696 [0000] -> SM port is down
> 
> Nov 21 14:05:56 195824 [42003960] -> __osm_sm_state_mgr_signal_error: ERR
> 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
> IB_SMINFO_STATE_DISCOVERING
> Nov 21 14:06:06 197568 [0000] -> Entering MASTER state
> 
> Nov 21 14:06:06 202672 [0000] -> SUBNET UP
> 
> Nov 21 14:06:06 204086 [41802960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:66 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:06:06 205041 [42003960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
> expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
> 0x0000000000000016 from port 0x0002c9020021c9f1
> Nov 21 14:06:06 205228 [41001960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:66 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:06:06 205758 [41802960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:66 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:06:07 199438 [42003960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:67 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:06:07 199494 [42003960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:67 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:06:07 201049 [41001960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
> expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
> 0x0000000000000016 from port 0x0002c9020021c9f1
> Nov 21 14:06:07 202116 [41802960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:66 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:06:07 203033 [42804960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
> expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
> 0x0000000000000016 from port 0x0002c9020021c9f1
> Nov 21 14:06:07 316711 [42804960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:66 from LID:0x0002
> GID:0xfe80000000000000,0x0002c9020021c9f1
> Nov 21 14:06:07 322157 [42003960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
> expected comp mask = 0x00000000000130c7, MGID: 0xff12601bffff0000 :
> 0x0000000000000016 from port 0x0002c9020021c9fd
> Nov 21 14:06:07 330160 [41802960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
> expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 :
> 0x0000000000000016 from port 0x0002c9020021c9fd
> Nov 21 14:06:09 294120 [42804960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
> expected comp mask = 0x00000000000130c7, MGID: 0xff12601bffff0000 :
> 0x0000000000000002 from port 0x0002c9020021c9fd
> Nov 21 14:06:09 741992 [42003960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083,
> expected comp mask = 0x00000000000130c7, MGID: 0xff12601bffff0000 :
> 0x0000000000000016 from port 0x0002c9020021c9fd
> -----
> 
> Then I tried, from PE1950, to ping PE2850, but no new lines appeared in
> the log.
> 
> I don't know what the log means. Is this log 'normal', or are these
> errors critical?

The join failures are for IP multicast groups that are likely not needed
(224.0.0.2 and 224.0.0.22).
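
For anyone wanting to check which IP multicast group an MGID in the ERR 1B11
lines refers to, here is a minimal Python sketch. It assumes the MGID layout
from RFC 4391 (IP over InfiniBand): a 16-bit signature of 0x401b for IPv4 or
0x601b for IPv6, followed by the P_Key, with the group address in the low
bits. The function name decode_ipoib_mgid is made up for illustration and is
not part of OpenSM. For IPv6 the sketch rebuilds the address from the scope
nibble and the low 80 bits only, assuming the remaining bits are zero.

```python
import ipaddress

IPV4_SIGNATURE = 0x401B  # IPv4-over-InfiniBand signature (RFC 4391)
IPV6_SIGNATURE = 0x601B  # IPv6-over-InfiniBand signature (RFC 4391)


def decode_ipoib_mgid(high, low):
    """Decode an IPoIB MGID given as two 64-bit halves, as printed by OpenSM.

    Returns a tuple (family, group_address, pkey).
    """
    scope = (high >> 48) & 0xF          # multicast scope nibble
    signature = (high >> 32) & 0xFFFF   # 0x401b or 0x601b
    pkey = (high >> 16) & 0xFFFF        # partition key
    if signature == IPV4_SIGNATURE:
        # The low 28 bits of the MGID are the low 28 bits of the IPv4 group.
        group = ipaddress.IPv4Address(0xE0000000 | (low & 0x0FFFFFFF))
        return "IPv4", str(group), pkey
    if signature == IPV6_SIGNATURE:
        # The low 80 bits of the MGID are the low 80 bits of the IPv6 group.
        # Assumption: IPv6 flags and the upper group-ID bits are zero.
        group_id = ((high & 0xFFFF) << 64) | low
        group = ipaddress.IPv6Address(((0xFF00 | scope) << 112) | group_id)
        return "IPv6", str(group), pkey
    raise ValueError("not an IPoIB MGID: signature 0x%04x" % signature)


if __name__ == "__main__":
    # MGIDs copied from the ERR 1B11 lines in the log above.
    for high, low in [
        (0xFF12401BFFFF0000, 0x0000000000000016),
        (0xFF12601BFFFF0000, 0x0000000000000002),
        (0xFF12601BFFFF0000, 0x0000000000000016),
    ]:
        print(decode_ipoib_mgid(high, low))
```

Under this mapping the 0x401b MGID decodes to 224.0.0.22, while the two
0x601b MGIDs carry the IPv6 signature and decode to ff02::2 and ff02::16
rather than to IPv4 groups. Either way these are link-local housekeeping
groups.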

The key issue, as I see it, is why the SM port is down. That needs to be
resolved first, and then we can go from there...
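
One way to start on the port question is to look at what the kernel reports
for each HCA port on the SM host. The sketch below is only an illustration
and assumes the standard sysfs layout, /sys/class/infiniband/<device>/ports/<n>/state,
where each state file holds a string such as "4: ACTIVE" or "1: DOWN". The
function name ib_port_states is hypothetical.

```python
from pathlib import Path


def ib_port_states(sysfs_root="/sys/class/infiniband"):
    """Return {(device, port): state_string} for every InfiniBand port found.

    An empty dict means no InfiniBand devices are registered under sysfs_root.
    """
    states = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return states
    # Each port exposes its logical link state in <device>/ports/<n>/state.
    for state_file in sorted(root.glob("*/ports/*/state")):
        device = state_file.parts[-4]
        port = state_file.parts[-2]
        states[(device, port)] = state_file.read_text().strip()
    return states


if __name__ == "__main__":
    found = ib_port_states()
    if not found:
        print("no InfiniBand ports found")
    for (device, port), state in found.items():
        print(f"{device} port {port}: {state}")
```

Running this on the PE2850 while the log shows "SM port is down" would show
whether the port the SM is bound to has actually left the ACTIVE state.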

-- Hal

> Thanks,
> Diego
> 




