[Users] Weird IPoIB issue
Robert LeBlanc
robert_leblanc at byu.edu
Fri Nov 15 13:53:04 PST 2013
Hal,
Thanks for all your help! I have working IPoIB on my Dell blades now!
I switched over to my own install of OpenSM with use_mfttop FALSE and rebooted the
switches, and IPoIB started working. I then switched back to Oracle's SM, rebooted
the switches again, and it kept working.
Thanks for all your support and holding my hand through this issue.
Robert LeBlanc
OIT Infrastructure & Virtualization Engineer
Brigham Young University
On Fri, Nov 15, 2013 at 11:35 AM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:
> It just means their version is based on 3.3.5, which is really old and
> moldy. They've made a few changes internally. If they say you can try a
> stock OpenSM, go for it and hopefully things will work properly. You can
> even try with this config file. The transaction timeout is lengthened from
> 200 msec to 2 sec.
>
> I do not recall which version started supporting use_mfttop, and I don't have
> the time to figure that out. I doubt that 3.3.5 does, as the option doesn't show up
> in the config file, unless I missed it. You can just try it and see if your
> results change (for the better). The opensm log will show a complaint about
> an unknown config option if it's not supported.
>
> Note that once a switch has MulticastFDBTop set to 0xbfff, I don't think
> OpenSM resets it to 0 when use_mfttop is FALSE; it just ignores the field.
> The switches need to be reset so that this field is 0. Start the process with
> that and verify using smpquery si on all your switches.
>
> -- Hal
>
>
> On Fri, Nov 15, 2013 at 12:49 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>
>> Hal,
>>
>> From what I can tell, the start up script that starts opensm in Xsigo
>> only specifies the following command line parameters:
>>
>> "-t 2000 -L 100 -y -q loopback -P /tmp/osmpart.conf -F
>> /opt/xsigo/xsigos/current/ofed/etc/opensm.opts"
>>
>> The opensm.opts contains:
>> # SA database file name
>> sa_db_file /var/log/opensm-sa.dump
>>
>> # If TRUE causes OpenSM to dump SA database at the end of
>> # every light sweep, regardless of the verbosity level
>> sa_db_dump TRUE
>>
>> # The directory to hold the file OpenSM dumps
>> dump_files_dir /var/log/
>>
>> And the osmpart.conf contains:
>> Default=0x7fff,ipoib: ALL=full ;
>>
>> They are running OpenSM 3.3.5 so it seems that it is pretty vanilla.
>> However, I know that we set the priority of the SMs in their management
>> tool, so I'm wondering if they are passing some additional parameters
>> through the loopback interface. I guess they could have patched the OpenSM
>> code, but I'm not sure they have done that.
>>
>> I logged into the opensm console and dumped the config. disable_multicast
>> is set to FALSE. It looks like MulticastFDBTop was implemented back in
>> 2009, so this version should support it. Can I set use_mfttop with this
>> version? If not, do you know which version supports it?
>>
>> In my testing with ibsim, the LIDs between the real environment and
>> simulated environment appeared to be the same as well as the routing, so I
>> don't believe that I'd run into a problem moving to OpenSM as a primary SM.
>> Do you see anything in the running config that would be concerning to you
>> that should be configured with OpenSM? The differences that I see that I
>> think may drastically change the network behavior are transaction_timeout
>> and babbling_port_policy, but I'm not 100% sure.
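A minimal sketch of how the two configurations could be compared mechanically, assuming both are available as text in the "key value" format that dump_conf prints. The function names and the short "stock" snippet are hypothetical; only the 2000 msec and 200 msec transaction timeouts come from this thread.

```python
def parse_opensm_conf(text):
    """Parse 'key value' lines; skip blanks, '#' comments and mail quote markers."""
    options = {}
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        options[key] = value.strip()
    return options


def diff_conf(left, right):
    """Return {key: (left_value, right_value)} for every option that differs."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


# Values taken from the dump in this thread.
xsigo = parse_opensm_conf("""
# The maximum time in [msec] allowed for a transaction to complete
transaction_timeout 2000
sm_priority 5
""")

# Hypothetical stand-in for a stock configuration (200 msec per Hal's note).
stock = parse_opensm_conf("""
transaction_timeout 200
sm_priority 5
""")

changes = diff_conf(xsigo, stock)
for key, (ours, theirs) in changes.items():
    print(f"{key}: {ours} -> {theirs}")
```

Feeding it the full dump_conf output from both subnet managers would list every differing option, not only the two suspected here.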
>>
>> OpenSM $ dump_conf
>> #
>> # DEVICE ATTRIBUTES OPTIONS
>> #
>> # The port GUID on which the OpenSM is running
>> guid 0x0000000000000000
>>
>> # M_Key value sent to all ports qualifying all Set(PortInfo)
>> m_key 0x0000000000000000
>>
>> # The lease period used for the M_Key on this subnet in [sec]
>> m_key_lease_period 0
>>
>> # SM_Key value of the SM used for SM authentication
>> sm_key 0x0000000000000001
>>
>> # SM_Key value to qualify rcv SA queries as 'trusted'
>> sa_key 0x0000000000000001
>>
>> # Note that for both values above (sm_key and sa_key)
>> # OpenSM version 3.2.1 and below used the default value '1'
>> # in a host byte order, it is fixed now but you may need to
>> # change the values to interoperate with old OpenSM running
>> # on a little endian machine.
>>
>> # Subnet prefix used on this subnet
>> subnet_prefix 0xfe80000000000000
>>
>> # The LMC value used on this subnet
>> lmc 0
>>
>> # lmc_esp0 determines whether LMC value used on subnet is used for
>> # enhanced switch port 0. If TRUE, LMC value for subnet is used for
>> # ESP0. Otherwise, LMC value for ESP0s is 0.
>> lmc_esp0 FALSE
>>
>> # sm_sl determines SMSL used for SM/SA communication
>> sm_sl 0
>>
>> # The code of maximal time a packet can live in a switch
>> # The actual time is 4.096usec * 2^<packet_life_time>
>> # The value 0x14 disables this mechanism
>> packet_life_time 0x12
>>
>> # The number of sequential packets dropped that cause the port
>> # to enter the VLStalled state. The result of setting this value to
>> # zero is undefined.
>> vl_stall_count 0x07
>>
>> # The number of sequential packets dropped that cause the port
>> # to enter the VLStalled state. This value is for switch ports
>> # driving a CA or router port. The result of setting this value
>> # to zero is undefined.
>> leaf_vl_stall_count 0x07
>>
>> # The code of maximal time a packet can wait at the head of
>> # transmission queue.
>> # The actual time is 4.096usec * 2^<head_of_queue_lifetime>
>> # The value 0x14 disables this mechanism
>> head_of_queue_lifetime 0x12
>>
>> # The maximal time a packet can wait at the head of queue on
>> # switch port connected to a CA or router port
>> leaf_head_of_queue_lifetime 0x10
>>
>> # Limit the maximal operational VLs
>> max_op_vls 5
>>
>> # Force PortInfo:LinkSpeedEnabled on switch ports
>> # If 0, don't modify PortInfo:LinkSpeedEnabled on switch port
>> # Otherwise, use value for PortInfo:LinkSpeedEnabled on switch port
>> # Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 "PortInfo")
>> # 1: 2.5 Gbps
>> # 3: 2.5 or 5.0 Gbps
>> # 5: 2.5 or 10.0 Gbps
>> # 7: 2.5 or 5.0 or 10.0 Gbps
>> # 2,4,6,8-14 Reserved
>> # Default 15: set to PortInfo:LinkSpeedSupported
>> force_link_speed 15
>>
>> # The subnet_timeout code that will be set for all the ports
>> # The actual timeout is 4.096usec * 2^<subnet_timeout>
>> subnet_timeout 18
>>
>> # Threshold of local phy errors for sending Trap 129
>> local_phy_errors_threshold 0x08
>>
>> # Threshold of credit overrun errors for sending Trap 130
>> overrun_errors_threshold 0x08
>>
>> #
>> # PARTITIONING OPTIONS
>> #
>> # Partition configuration file to be used
>> partition_config_file /tmp/osmpart.conf
>>
>> # Disable partition enforcement by switches
>> no_partition_enforcement FALSE
>>
>> #
>> # SWEEP OPTIONS
>> #
>> # The number of seconds between subnet sweeps (0 disables it)
>> sweep_interval 10
>>
>> # If TRUE cause all lids to be reassigned
>> reassign_lids FALSE
>>
>> # If TRUE forces every sweep to be a heavy sweep
>> force_heavy_sweep FALSE
>>
>> # If TRUE every trap will cause a heavy sweep.
>> # NOTE: successive identical traps (>10) are suppressed
>> sweep_on_trap TRUE
>>
>> #
>> # ROUTING OPTIONS
>> #
>> # If TRUE count switches as link subscriptions
>> port_profile_switch_nodes FALSE
>>
>> # Name of file with port guids to be ignored by port profiling
>> port_prof_ignore_file (null)
>>
>> # The file holding routing weighting factors per output port
>> hop_weights_file (null)
>>
>> # Routing engine
>> # Multiple routing engines can be specified separated by
>> # commas so that specific ordering of routing algorithms will
>> # be tried if earlier routing engines fail.
>> # Supported engines: minhop, updn, file, ftree, lash, dor
>> routing_engine (null)
>>
>> # Connect roots (use FALSE if unsure)
>> connect_roots FALSE
>>
>> # Use unicast routing cache (use FALSE if unsure)
>> use_ucast_cache FALSE
>>
>> # Lid matrix dump file name
>> lid_matrix_dump_file (null)
>>
>> # LFTs file name
>> lfts_file (null)
>>
>> # The file holding the root node guids (for fat-tree or Up/Down)
>> # One guid in each line
>> root_guid_file (null)
>>
>> # The file holding the fat-tree compute node guids
>> # One guid in each line
>> cn_guid_file (null)
>>
>> # The file holding the fat-tree I/O node guids
>> # One guid in each line
>> io_guid_file (null)
>>
>> # Number of reverse hops allowed for I/O nodes
>> # Used for connectivity between I/O nodes connected to Top Switches
>> max_reverse_hops 0
>>
>> # The file holding the node ids which will be used by Up/Down algorithm instead
>> # of GUIDs (one guid and id in each line)
>> ids_guid_file (null)
>>
>> # The file holding guid routing order guids (for MinHop and Up/Down)
>> guid_routing_order_file (null)
>>
>> # Do mesh topology analysis (for LASH algorithm)
>> do_mesh_analysis FALSE
>>
>> # Starting VL for LASH algorithm
>> lash_start_vl 0
>>
>> # SA database file name
>> sa_db_file /var/log/opensm-sa.dump
>>
>> # If TRUE causes OpenSM to dump SA database at the end of
>> # every light sweep, regardless of the verbosity level
>> sa_db_dump TRUE
>>
>> #
>> # HANDOVER - MULTIPLE SMs OPTIONS
>> #
>> # SM priority used for deciding who is the master
>> # Range goes from 0 (lowest priority) to 15 (highest).
>> sm_priority 5
>>
>> # If TRUE other SMs on the subnet should be ignored
>> ignore_other_sm FALSE
>>
>> # Timeout in [msec] between two polls of active master SM
>> sminfo_polling_timeout 10000
>>
>> # Number of failing polls of remote SM that declares it dead
>> polling_retry_number 4
>>
>> # If TRUE honor the guid2lid file when coming out of standby
>> # state, if such file exists and is valid
>> honor_guid2lid_file FALSE
>>
>> #
>> # TIMING AND THREADING OPTIONS
>> #
>> # Maximum number of SMPs sent in parallel
>> max_wire_smps 4
>>
>> # The maximum time in [msec] allowed for a transaction to complete
>> transaction_timeout 2000
>>
>> # The maximum number of retries allowed for a transaction to complete
>> transaction_retries 3
>>
>> # Maximal time in [msec] a message can stay in the incoming message queue.
>> # If there is more than one message in the queue and the last message
>> # stayed in the queue more than this value, any SA request will be
>> # immediately returned with a BUSY status.
>> max_msg_fifo_timeout 10000
>>
>> # Use a single thread for handling SA queries
>> single_thread FALSE
>>
>> #
>> # MISC OPTIONS
>> #
>> # Daemon mode
>> daemon FALSE
>>
>> # SM Inactive
>> sm_inactive FALSE
>>
>> # Babbling Port Policy
>> babbling_port_policy FALSE
>>
>> #
>> # Performance Manager Options
>> #
>> # perfmgr enable
>> perfmgr FALSE
>>
>> # perfmgr redirection enable
>> perfmgr_redir TRUE
>>
>> # sweep time in seconds
>> perfmgr_sweep_time_s 180
>>
>> # Max outstanding queries
>> perfmgr_max_outstanding_queries 500
>>
>> #
>> # Event DB Options
>> #
>> # Dump file to dump the events to
>> event_db_dump_file (null)
>>
>> #
>> # Event Plugin Options
>> #
>> event_plugin_name (null)
>>
>> #
>> # Node name map for mapping node's to more descriptive node descriptions
>> # (man ibnetdiscover for more information)
>> #
>> node_name_map_name (null)
>>
>> #
>> # DEBUG FEATURES
>> #
>> # The log flags used
>> log_flags 0x03
>>
>> # Force flush of the log file after each log message
>> force_log_flush FALSE
>>
>> # Log file to be used
>> log_file /var/log/opensm.log
>>
>> # Limit the size of the log file in MB. If overrun, log is restarted
>> log_max_size 100
>>
>> # If TRUE will accumulate the log over multiple OpenSM sessions
>> accum_log_file TRUE
>>
>> # The directory to hold the file OpenSM dumps
>> dump_files_dir /var/log/
>>
>> # If TRUE enables new high risk options and hardware specific quirks
>> enable_quirks FALSE
>>
>> # If TRUE disables client reregistration
>> no_clients_rereg FALSE
>>
>> # If TRUE OpenSM should disable multicast support and
>> # no multicast routing is performed if TRUE
>> disable_multicast FALSE
>>
>> # If TRUE opensm will exit on fatal initialization issues
>> exit_on_fatal FALSE
>>
>> # console [off|local|loopback|socket]
>> console loopback
>>
>> # Telnet port for console (default 10000)
>> console_port 10000
>>
>> #
>> # QoS OPTIONS
>> #
>> # Enable QoS setup
>> qos FALSE
>>
>> # QoS policy file to be used
>> qos_policy_file /opt/xsigo/xsigos/current/ofed/etc/opensm/qos-policy.conf
>>
>> # QoS default options
>> qos_max_vls 0
>> qos_high_limit -1
>> qos_vlarb_high (null)
>> qos_vlarb_low (null)
>> qos_sl2vl (null)
>>
>> # QoS CA options
>> qos_ca_max_vls 0
>> qos_ca_high_limit -1
>> qos_ca_vlarb_high (null)
>> qos_ca_vlarb_low (null)
>> qos_ca_sl2vl (null)
>>
>> # QoS Switch Port 0 options
>> qos_sw0_max_vls 0
>> qos_sw0_high_limit -1
>> qos_sw0_vlarb_high (null)
>> qos_sw0_vlarb_low (null)
>> qos_sw0_sl2vl (null)
>>
>> # QoS Switch external ports options
>> qos_swe_max_vls 0
>> qos_swe_high_limit -1
>> qos_swe_vlarb_high (null)
>> qos_swe_vlarb_low (null)
>> qos_swe_sl2vl (null)
>>
>> # QoS Router ports options
>> qos_rtr_max_vls 0
>> qos_rtr_high_limit -1
>> qos_rtr_vlarb_high (null)
>> qos_rtr_vlarb_low (null)
>> qos_rtr_sl2vl (null)
>>
>> # Prefix routes file name
>> prefix_routes_file /opt/xsigo/xsigos/current/ofed/etc/opensm/prefix-routes.conf
>>
>> #
>> # IPv6 Solicited Node Multicast (SNM) Options
>> #
>> consolidate_ipv6_snm_req FALSE
>>
>> OpenSM $
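Several options in the dump above are timeout codes rather than times; the comments give the conversion as 4.096 usec * 2^code. A small sketch of that arithmetic, using the values from the dump (the function name is my own):

```python
def code_to_seconds(code):
    """Convert an IB timeout code to seconds: 4.096 usec * 2**code."""
    return 4.096e-6 * (2 ** code)


# Codes copied from the dump_conf output in this thread.
dump_values = {
    "packet_life_time": 0x12,
    "head_of_queue_lifetime": 0x12,
    "leaf_head_of_queue_lifetime": 0x10,
    "subnet_timeout": 18,
}

for name, code in dump_values.items():
    print(f"{name}: code {code} -> {code_to_seconds(code):.3f} s")
```

Code 18 (0x12) works out to roughly 1.07 s and code 16 (0x10) to roughly 0.27 s. Per the comments in the dump, 0x14 is a special value that disables the packet lifetime and head-of-queue mechanisms, so it should not be converted this way for those two options.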
>>
>> Thanks again for all your help.
>>
>>
>> Robert LeBlanc
>> OIT Infrastructure & Virtualization Engineer
>> Brigham Young University
>>
>>
>> On Wed, Nov 13, 2013 at 12:27 PM, Robert LeBlanc <robert_leblanc at byu.edu> wrote:
>>
>>> They told me in the past that we could use our own external subnet
>>> manager or the one built into their box.
>>>
>>>
>>> Robert LeBlanc
>>> OIT Infrastructure & Virtualization Engineer
>>> Brigham Young University
>>>
>>>
>>> On Wed, Nov 13, 2013 at 12:25 PM, Hal Rosenstock <
>>> hal.rosenstock at gmail.com> wrote:
>>>
>>>> Yes but I'm not sure what the Xsigo SM "special sauce" is so those
>>>> boxes may not function properly.
>>>>
>>>>
>>>> On Wed, Nov 13, 2013 at 2:13 PM, Robert LeBlanc <robert_leblanc at byu.edu
>>>> > wrote:
>>>>
>>>>> The front-line Oracle tech is giving me some hogwash that it is a
>>>>> problem with Dell and Mellanox and that the subnet manager is not at fault
>>>>> (although they are passing the request to engineering). I think I'm just
>>>>> going to run OpenSM on this test node (after reducing the priority on the
>>>>> Oracle SM) and see if the problem clears up.
>>>>>
>>>>>
>>>>> Robert LeBlanc
>>>>> OIT Infrastructure & Virtualization Engineer
>>>>> Brigham Young University
>>>>>
>>>>>
>>>>> On Wed, Nov 13, 2013 at 12:08 PM, Hal Rosenstock <
>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>
>>>>>> Yes, IPoIB uses multicast groups for the IP broadcast group and any
>>>>>> IP multicast groups. You can see those with saquery -g. But depending on
>>>>>> the locations of the ports running IPoIB and your topology, a multicast
>>>>>> group may or may not be routed via a particular switch.
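For reference when reading saquery -g output, here is a sketch of how the IPv4 broadcast group GID is put together, as I read RFC 4391 (cited further down in this thread). The function name is mine, and the link-local scope value 0x2 is an assumption about the usual default; the partition key 0x7fff is the one from osmpart.conf.

```python
def ipoib_ipv4_broadcast_mgid(pkey, scope=0x2):
    """Build the IPoIB IPv4 broadcast MGID as a colon-separated hex string.

    Layout per my reading of RFC 4391: 0xFF, flags 0x1, scope, the IPv4
    signature 0x401B, the P_Key with the full-membership bit set, 48 zero
    bits, then 32 one bits.
    """
    full_pkey = pkey | 0x8000  # high-order bit marks full membership
    groups = [0xFF10 | (scope & 0xF), 0x401B, full_pkey, 0, 0, 0, 0xFFFF, 0xFFFF]
    return ":".join(f"{g:04x}" for g in groups)


print(ipoib_ipv4_broadcast_mgid(0x7FFF))
```

If the ports on both ends have joined, a group with this MGID should appear in the saquery -g listing.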
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 13, 2013 at 2:06 PM, Robert LeBlanc <
>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>
>>>>>>> IPoIB uses multicast, right? I'm guessing that is why I can't get
>>>>>>> IPoIB to work on our blades while it works on our rack servers.
>>>>>>>
>>>>>>> Robert LeBlanc
>>>>>>> Virtualization and Server Engineer
>>>>>>> Brigham Young University
>>>>>>>
>>>>>>> Sent from a mobile device, please excuse any typos.
>>>>>>> On Nov 13, 2013 12:03 PM, "Hal Rosenstock" <hal.rosenstock at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> That should be fine. 7.4.3000 looks like the latest.
>>>>>>>>
>>>>>>>> This looks like an SM issue: it is mis-setting that parameter in the switch,
>>>>>>>> assuming that there are some MC groups routed through that switch.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 13, 2013 at 1:55 PM, Robert LeBlanc <
>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>
>>>>>>>>> [root at desxi003 ~]# flint -d /dev/mst/SW_MT48438_0x2c90200448e28_lid-0x0034 q
>>>>>>>>> Image type: FS2
>>>>>>>>> FW Version: 7.4.0
>>>>>>>>> Device ID: 48438
>>>>>>>>> Description: Node Sys image
>>>>>>>>> GUIDs: 0002c90200448e28 0002c90200448e2b
>>>>>>>>> Board ID: n/a (DEL08D0110003)
>>>>>>>>> VSD: n/a
>>>>>>>>> PSID: DEL08D0110003
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Robert LeBlanc
>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>> Brigham Young University
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 13, 2013 at 11:52 AM, Hal Rosenstock <
>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> What's the latest firmware version ?
>>>>>>>>>>
>>>>>>>>>> Can you determine the firmware version of the switches ? vendstat
>>>>>>>>>> -N <switch lid> might work to show this.
>>>>>>>>>>
>>>>>>>>>> This is important...
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> -- Hal
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 13, 2013 at 1:46 PM, Robert LeBlanc <
>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for all the help so far, this is a great community! I've
>>>>>>>>>>> fed all this info back to Oracle and I'll have to see what they say.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 13, 2013 at 11:40 AM, Hal Rosenstock <
>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, this is the cause of the issues.
>>>>>>>>>>>>
>>>>>>>>>>>> smpdump (and smpquery) merely query (get) and don't set
>>>>>>>>>>>> parameters and anyhow, the SM would overwrite it when it thought it needed
>>>>>>>>>>>> to update it. It's an SM and/or firmware issue.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 13, 2013 at 1:38 PM, Robert LeBlanc <
>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We are on the latest version of firmware for all of our
>>>>>>>>>>>>> switches (as of last month). I guess I'll have to check with Oracle and see
>>>>>>>>>>>>> if they are setting this parameter in their subnet manager. Just to
>>>>>>>>>>>>> confirm, using smpdump (or similar) to change the value won't do any good
>>>>>>>>>>>>> because the subnet manager will just change it back?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think this is the cause of the problems, now to get it fixed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 13, 2013 at 11:34 AM, Hal Rosenstock <
>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> In general, MulticastFDBTop should be 0 or some value above
>>>>>>>>>>>>>> 0xc000.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Indicates the upper bound of the range of the multicast
>>>>>>>>>>>>>> forwarding table. Packets received with MLIDs greater
>>>>>>>>>>>>>> than MulticastFDBTop are considered to be outside the
>>>>>>>>>>>>>> range of the Multicast Forwarding Table (see 18.2.4.3.3
>>>>>>>>>>>>>> Required Multicast Relay on page 1072). A valid
>>>>>>>>>>>>>> MulticastFDBTop is less than MulticastFDBCap + 0xC000.
>>>>>>>>>>>>>> This component applies only to switches that implement
>>>>>>>>>>>>>> the optional multicast forwarding service. A switch
>>>>>>>>>>>>>> shall ignore the MulticastFDBTop component if it has
>>>>>>>>>>>>>> the value zero. The initial value for MulticastFDBTop
>>>>>>>>>>>>>> shall be set to zero. A value of 0xBFFF means there are
>>>>>>>>>>>>>> no MulticastForwardingTable entries.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It is set by OpenSM. There is a parameter to disable its use
>>>>>>>>>>>>>> (use_mfttop) which can be set to FALSE. This may depend on which OpenSM
>>>>>>>>>>>>>> version you are running. In order to get out of this state, you may need to
>>>>>>>>>>>>>> reset any switches which have this parameter set like this.
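The rules in the quoted spec text can be written down as a small check. This is only a sketch of my reading of that paragraph: the function name and the returned labels are made up, and treating 0xC000 (the first multicast LID) as the lowest meaningful non-special value is my assumption.

```python
MLID_BASE = 0xC000  # first multicast LID


def describe_mft_top(top, mcast_fdb_cap):
    """Classify a MulticastFDBTop value against the quoted spec rules."""
    if top == 0:
        return "ignored"    # switch ignores the component
    if top == 0xBFFF:
        return "empty"      # no MulticastForwardingTable entries
    if MLID_BASE <= top < mcast_fdb_cap + MLID_BASE:
        return "bounded"    # MLIDs above top fall outside the table
    return "invalid"


# McastFdbCap is 4096 and MulticastFDBTop is 0xbfff on the switch at LID 52.
print(describe_mft_top(0xBFFF, 4096))
print(describe_mft_top(0, 4096))
```

With the values reported for the switch at LID 52 this returns "empty", which matches the symptom: the switch behaves as if it has no multicast forwarding entries, so the IPoIB broadcast traffic goes nowhere.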
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any idea on the firmware versions in your various switches ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Hal
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 13, 2013 at 1:16 PM, Robert LeBlanc <
>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sorry to take so long, I've been busy with other things.
>>>>>>>>>>>>>>> Here is the output:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [root at desxi003 ~]# smpquery si 52
>>>>>>>>>>>>>>> # Switch info: Lid 52
>>>>>>>>>>>>>>> LinearFdbCap:....................49152
>>>>>>>>>>>>>>> RandomFdbCap:....................0
>>>>>>>>>>>>>>> McastFdbCap:.....................4096
>>>>>>>>>>>>>>> LinearFdbTop:....................189
>>>>>>>>>>>>>>> DefPort:.........................0
>>>>>>>>>>>>>>> DefMcastPrimPort:................255
>>>>>>>>>>>>>>> DefMcastNotPrimPort:.............255
>>>>>>>>>>>>>>> LifeTime:........................18
>>>>>>>>>>>>>>> StateChange:.....................0
>>>>>>>>>>>>>>> OptSLtoVLMapping:................1
>>>>>>>>>>>>>>> LidsPerPort:.....................0
>>>>>>>>>>>>>>> PartEnforceCap:..................32
>>>>>>>>>>>>>>> InboundPartEnf:..................1
>>>>>>>>>>>>>>> OutboundPartEnf:.................1
>>>>>>>>>>>>>>> FilterRawInbound:................1
>>>>>>>>>>>>>>> FilterRawOutbound:...............1
>>>>>>>>>>>>>>> EnhancedPort0:...................0
>>>>>>>>>>>>>>> MulticastFDBTop:.................0xbfff
>>>>>>>>>>>>>>> [root at desxi003 ~]# smpquery pi 52 0
>>>>>>>>>>>>>>> # Port info: Lid 52 port 0
>>>>>>>>>>>>>>> Mkey:............................0x0000000000000000
>>>>>>>>>>>>>>> GidPrefix:.......................0xfe80000000000000
>>>>>>>>>>>>>>> Lid:.............................52
>>>>>>>>>>>>>>> SMLid:...........................49
>>>>>>>>>>>>>>> CapMask:.........................0x42500848
>>>>>>>>>>>>>>> IsTrapSupported
>>>>>>>>>>>>>>> IsSLMappingSupported
>>>>>>>>>>>>>>> IsSystemImageGUIDsupported
>>>>>>>>>>>>>>> IsVendorClassSupported
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> IsCapabilityMaskNoticeSupported
>>>>>>>>>>>>>>> IsClientRegistrationSupported
>>>>>>>>>>>>>>> IsMulticastFDBTopSupported
>>>>>>>>>>>>>>> DiagCode:........................0x0000
>>>>>>>>>>>>>>> MkeyLeasePeriod:.................0
>>>>>>>>>>>>>>> LocalPort:.......................1
>>>>>>>>>>>>>>> LinkWidthEnabled:................1X or 4X
>>>>>>>>>>>>>>> LinkWidthSupported:..............1X or 4X
>>>>>>>>>>>>>>> LinkWidthActive:.................4X
>>>>>>>>>>>>>>> LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
>>>>>>>>>>>>>>> LinkState:.......................Active
>>>>>>>>>>>>>>> PhysLinkState:...................LinkUp
>>>>>>>>>>>>>>> LinkDownDefState:................Polling
>>>>>>>>>>>>>>> ProtectBits:.....................0
>>>>>>>>>>>>>>> LMC:.............................0
>>>>>>>>>>>>>>> LinkSpeedActive:.................10.0 Gbps
>>>>>>>>>>>>>>> LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
>>>>>>>>>>>>>>> NeighborMTU:.....................4096
>>>>>>>>>>>>>>> SMSL:............................0
>>>>>>>>>>>>>>> VLCap:...........................VL0
>>>>>>>>>>>>>>> InitType:........................0x00
>>>>>>>>>>>>>>> VLHighLimit:.....................0
>>>>>>>>>>>>>>> VLArbHighCap:....................0
>>>>>>>>>>>>>>> VLArbLowCap:.....................0
>>>>>>>>>>>>>>> InitReply:.......................0x00
>>>>>>>>>>>>>>> MtuCap:..........................4096
>>>>>>>>>>>>>>> VLStallCount:....................0
>>>>>>>>>>>>>>> HoqLife:.........................0
>>>>>>>>>>>>>>> OperVLs:.........................VL0
>>>>>>>>>>>>>>> PartEnforceInb:..................0
>>>>>>>>>>>>>>> PartEnforceOutb:.................0
>>>>>>>>>>>>>>> FilterRawInb:....................0
>>>>>>>>>>>>>>> FilterRawOutb:...................0
>>>>>>>>>>>>>>> MkeyViolations:..................0
>>>>>>>>>>>>>>> PkeyViolations:..................0
>>>>>>>>>>>>>>> QkeyViolations:..................0
>>>>>>>>>>>>>>> GuidCap:.........................1
>>>>>>>>>>>>>>> ClientReregister:................0
>>>>>>>>>>>>>>> McastPkeyTrapSuppressionEnabled:.0
>>>>>>>>>>>>>>> SubnetTimeout:...................18
>>>>>>>>>>>>>>> RespTimeVal:.....................20
>>>>>>>>>>>>>>> LocalPhysErr:....................0
>>>>>>>>>>>>>>> OverrunErr:......................0
>>>>>>>>>>>>>>> MaxCreditHint:...................0
>>>>>>>>>>>>>>> RoundTrip:.......................0
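To check this field on many switches, output like the above can be parsed rather than read by eye. A minimal sketch, assuming the "Name:....value" layout shown in this thread; the function name is mine, and the sample is a few lines copied from the smpquery si output above.

```python
import re

# Field name, a colon, a run of filler dots, then the value.
FIELD_RE = re.compile(r"^(\w+):\.*(.*)$")


def parse_smpquery(text):
    """Turn 'Name:....value' lines into a dict; other lines are skipped."""
    fields = {}
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()  # tolerate mail quote markers
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


sample = """\
# Switch info: Lid 52
McastFdbCap:.....................4096
LinearFdbTop:....................189
MulticastFDBTop:.................0xbfff
"""

info = parse_smpquery(sample)
top = int(info["MulticastFDBTop"], 0)  # base 0 accepts the 0x prefix
print(hex(top), "needs reset" if top == 0xBFFF else "ok")
```

Running the real command for each switch LID and passing its output to parse_smpquery would show quickly which switches still carry 0xbfff.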
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From what I've read in the Mellanox release notes,
>>>>>>>>>>>>>>> MulticastFDBTop=0xBFFF is supposed to discard MC traffic. The
>>>>>>>>>>>>>>> question is, how do I set this value to something else, and what should it
>>>>>>>>>>>>>>> be set to?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Oct 30, 2013 at 12:28 PM, Hal Rosenstock <
>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Determine the LID of the switch (in the commands below, say the
>>>>>>>>>>>>>>>> switch is LID x). Then:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> smpquery si x
>>>>>>>>>>>>>>>> (of interest are McastFdbCap and MulticastFDBTop)
>>>>>>>>>>>>>>>> smpquery pi x 0
>>>>>>>>>>>>>>>> (of interest is CapMask)
>>>>>>>>>>>>>>>> ibroute -M x
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Oct 29, 2013 at 3:56 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Both ports show up in the "saquery MCMR" results with a
>>>>>>>>>>>>>>>>> JoinState of 0x1.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> How can I dump the parameters of a non-managed switch so
>>>>>>>>>>>>>>>>> that I can confirm that multicast is not turned off on the Dell chassis IB
>>>>>>>>>>>>>>>>> switches?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 5:04 PM, Coulter, Susan K <
>>>>>>>>>>>>>>>>> skc at lanl.gov> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> /sys/class/net should give you the details on your
>>>>>>>>>>>>>>>>>> devices, like this:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -bash-4.1# cd /sys/class/net
>>>>>>>>>>>>>>>>>> -bash-4.1# ls -l
>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 12:59 eth0 -> ../../devices/pci0000:00/0000:00:02.0/0000:04:00.0/net/eth0
>>>>>>>>>>>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 12:59 eth1 -> ../../devices/pci0000:00/0000:00:02.0/0000:04:00.1/net/eth1
>>>>>>>>>>>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib0 -> ../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib0
>>>>>>>>>>>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib1 -> ../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib1
>>>>>>>>>>>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib2 -> ../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib2
>>>>>>>>>>>>>>>>>> lrwxrwxrwx 1 root root 0 Oct 23 15:42 ib3 -> ../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib3
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Then use "lspci | grep Mell" to get the pci device
>>>>>>>>>>>>>>>>>> numbers.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 47:00.0 Network controller: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>>>>>>>>>>>>>>>>>> c7:00.0 Network controller: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In this example, ib0 and ib1 reference the device
>>>>>>>>>>>>>>>>>> at 47:00.0,
>>>>>>>>>>>>>>>>>> and ib2 and ib3 reference the device at c7:00.0.
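The same mapping can be derived from the symlink targets directly. A sketch using the four ib targets from the listing above; the function name is mine, and on a live host the targets could be read with os.readlink("/sys/class/net/<ifname>") instead of being typed in.

```python
from collections import defaultdict


def pci_address(link_target):
    """Return the PCI address component that precedes '/net/<ifname>'."""
    parts = link_target.split("/")
    return parts[parts.index("net") - 1]


# Symlink targets copied from the ls -l output in this thread.
links = {
    "ib0": "../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib0",
    "ib1": "../../devices/pci0000:40/0000:40:0c.0/0000:47:00.0/net/ib1",
    "ib2": "../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib2",
    "ib3": "../../devices/pci0000:c0/0000:c0:0c.0/0000:c7:00.0/net/ib3",
}

by_device = defaultdict(list)
for ifname, target in sorted(links.items()):
    by_device[pci_address(target)].append(ifname)

for device, ifnames in by_device.items():
    print(device, ifnames)
```

Interfaces that share a PCI address are ports of the same card, which is the check being described here.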
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That said, if you only have one card - this is probably
>>>>>>>>>>>>>>>>>> not the problem.
>>>>>>>>>>>>>>>>>> Additionally, since the arp requests are being seen going
>>>>>>>>>>>>>>>>>> out ib0, your emulation appears to be working.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If those arp requests are not being seen on the other
>>>>>>>>>>>>>>>>>> end, it seems like a problem with the mgids.
>>>>>>>>>>>>>>>>>> Like maybe the port you are trying to reach is not in the
>>>>>>>>>>>>>>>>>> IPoIB multicast group?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You can look at all the multicast member records with
>>>>>>>>>>>>>>>>>> "saquery MCMR".
>>>>>>>>>>>>>>>>>> Or - you can grep for mcmr_rcv_join_mgrp references in
>>>>>>>>>>>>>>>>>> your SM logs …
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Oct 28, 2013, at 1:08 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I can ibping between both hosts just fine.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [root at desxi003 ~]# ibping 0x37
>>>>>>>>>>>>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.111 ms
>>>>>>>>>>>>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.189 ms
>>>>>>>>>>>>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.189 ms
>>>>>>>>>>>>>>>>>> Pong from desxi004.(none) (Lid 55): time 0.179 ms
>>>>>>>>>>>>>>>>>> ^C
>>>>>>>>>>>>>>>>>> --- desxi004.(none) (Lid 55) ibping statistics ---
>>>>>>>>>>>>>>>>>> 4 packets transmitted, 4 received, 0% packet loss, time 3086 ms
>>>>>>>>>>>>>>>>>> rtt min/avg/max = 0.111/0.167/0.189 ms
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [root at desxi004 ~]# ibping 0x2d
>>>>>>>>>>>>>>>>>> Pong from desxi003.(none) (Lid 45): time 0.156 ms
>>>>>>>>>>>>>>>>>> Pong from desxi003.(none) (Lid 45): time 0.175 ms
>>>>>>>>>>>>>>>>>> Pong from desxi003.(none) (Lid 45): time 0.176 ms
>>>>>>>>>>>>>>>>>> ^C
>>>>>>>>>>>>>>>>>> --- desxi003.(none) (Lid 45) ibping statistics ---
>>>>>>>>>>>>>>>>>> 3 packets transmitted, 3 received, 0% packet loss, time 2302 ms
>>>>>>>>>>>>>>>>>> rtt min/avg/max = 0.156/0.169/0.176 ms
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When I do a regular IP ping to the IPoIB address,
>>>>>>>>>>>>>>>>>> tcpdump only shows the outgoing ARP request.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [root at desxi003 ~]# tcpdump -i ib0
>>>>>>>>>>>>>>>>>> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
>>>>>>>>>>>>>>>>>> listening on ib0, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
>>>>>>>>>>>>>>>>>> 19:00:08.950320 ARP, Request who-has 192.168.9.4 tell 192.168.9.3, length 56
>>>>>>>>>>>>>>>>>> 19:00:09.950320 ARP, Request who-has 192.168.9.4 tell 192.168.9.3, length 56
>>>>>>>>>>>>>>>>>> 19:00:10.950307 ARP, Request who-has 192.168.9.4 tell 192.168.9.3, length 56
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When I run tcpdump on the rack servers, I don't see the ARP
>>>>>>>>>>>>>>>>>> request arrive there, though I should.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> From what I've read, ib0 should be mapped to the first
>>>>>>>>>>>>>>>>>> port and ib1 should be mapped to the second port. We have one IB card with
>>>>>>>>>>>>>>>>>> two ports. The modprobe is the default installed with the Mellanox drivers.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [root at desxi003 etc]# cat modprobe.d/ib_ipoib.conf
>>>>>>>>>>>>>>>>>> # install ib_ipoib modprobe --ignore-install ib_ipoib && /sbin/ib_ipoib_sysctl load
>>>>>>>>>>>>>>>>>> # remove ib_ipoib /sbin/ib_ipoib_sysctl unload ; modprobe -r --ignore-remove ib_ipoib
>>>>>>>>>>>>>>>>>> alias ib0 ib_ipoib
>>>>>>>>>>>>>>>>>> alias ib1 ib_ipoib
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Can you give me some pointers on digging into the
>>>>>>>>>>>>>>>>>> device layer to make sure IPoIB is connected correctly? Would I look in
>>>>>>>>>>>>>>>>>> /sys or /proc for that?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Dell has not been able to replicate the problem in
>>>>>>>>>>>>>>>>>> their environment and they only support Red Hat and won't work with my
>>>>>>>>>>>>>>>>>> CentOS live CD. These blades don't have internal hard drives so it makes it
>>>>>>>>>>>>>>>>>> hard to install any OS. I don't know if I can engage Mellanox since they
>>>>>>>>>>>>>>>>>> build the switch hardware and driver stack we are using.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I really appreciate all the help you guys have given
>>>>>>>>>>>>>>>>>> thus far, I'm learning a lot as this progresses. I'm reading through
>>>>>>>>>>>>>>>>>> https://tools.ietf.org/html/rfc4391 trying to understand
>>>>>>>>>>>>>>>>>> IPoIB from top to bottom.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:53 PM, Coulter, Susan K <
>>>>>>>>>>>>>>>>>> skc at lanl.gov> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If you are not seeing any packets leave the ib0
>>>>>>>>>>>>>>>>>>> interface, it sounds like the emulation layer is not connected to the right
>>>>>>>>>>>>>>>>>>> device.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If the ib_ipoib kernel module is loaded, and a simple
>>>>>>>>>>>>>>>>>>> native IB test (like ib_read_bw) works between those blades, you need to
>>>>>>>>>>>>>>>>>>> dig into the device layer and ensure ipoib is "connected" to the right
>>>>>>>>>>>>>>>>>>> device.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Do you have more than 1 IB card?
>>>>>>>>>>>>>>>>>>> What does your modprobe config look like for ipoib?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Oct 28, 2013, at 12:38 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> These ESX hosts (2 blade servers and 2 rack servers)
>>>>>>>>>>>>>>>>>>> are booted into a CentOS 6.2 Live CD that I built. Right now everything I'm
>>>>>>>>>>>>>>>>>>> trying to get working is CentOS 6.2. All of our other hosts are running
>>>>>>>>>>>>>>>>>>> ESXi and have IPoIB interfaces, but none of them are configured and I'm not
>>>>>>>>>>>>>>>>>>> trying to get those working right now.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ideally, we would like our ESX hosts to communicate
>>>>>>>>>>>>>>>>>>> with each other for vMotion and protected VM traffic as well as with our
>>>>>>>>>>>>>>>>>>> Commvault backup servers (Windows) over IPoIB (or Oracle's PVI which is
>>>>>>>>>>>>>>>>>>> very similar).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:33 PM, Hal Rosenstock <
>>>>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Are those ESXi IPoIB interfaces ? Do some of these work
>>>>>>>>>>>>>>>>>>>> and others not ? Are there normal Linux IPoIB interfaces ? Do they work ?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 2:24 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Yes, I can not ping them over the IPoIB interface. It
>>>>>>>>>>>>>>>>>>>>> is a very simple network set-up.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> desxi003
>>>>>>>>>>>>>>>>>>>>> 8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UP qlen 256
>>>>>>>>>>>>>>>>>>>>>     link/infiniband 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:d1 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>>>>>>>>>>>>>>>>>>>     inet 192.168.9.3/24 brd 192.168.9.255 scope global ib0
>>>>>>>>>>>>>>>>>>>>>     inet6 fe80::f24d:a290:9778:e7d1/64 scope link
>>>>>>>>>>>>>>>>>>>>>        valid_lft forever preferred_lft forever
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> desxi004
>>>>>>>>>>>>>>>>>>>>> 8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UP qlen 256
>>>>>>>>>>>>>>>>>>>>>     link/infiniband 80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:15 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>>>>>>>>>>>>>>>>>>>     inet 192.168.9.4/24 brd 192.168.9.255 scope global ib0
>>>>>>>>>>>>>>>>>>>>>     inet6 fe80::f24d:a290:9778:e715/64 scope link
>>>>>>>>>>>>>>>>>>>>>        valid_lft forever preferred_lft forever
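
[Editor's sketch] The 20-byte link/infiniband addresses in the two listings can be split, following the layout in RFC 4391, into a flags byte, a 24-bit QPN, and a 16-byte GID. For the broadcast address the GID is the broadcast MGID, which carries the P_Key in its bytes 4 and 5. The Python below is only an illustration under those assumptions; the name parse_ipoib_hwaddr is made up for this note and does not belong to any tool mentioned in the thread.

```python
import ipaddress


def parse_ipoib_hwaddr(text):
    """Split a colon-separated 20-byte IPoIB hardware address.

    Returns (flags, qpn, gid): one flags byte, a 24-bit queue pair
    number, and the 16-byte GID as an ipaddress.IPv6Address.
    """
    raw = bytes.fromhex(text.replace(":", ""))
    if len(raw) != 20:
        raise ValueError("IPoIB hardware addresses are 20 bytes long")
    flags = raw[0]
    qpn = int.from_bytes(raw[1:4], "big")
    gid = ipaddress.IPv6Address(raw[4:])
    return flags, qpn, gid


# Unicast address of desxi003 and the broadcast address, as listed above.
unicast = "80:20:00:54:fe:80:00:00:00:00:00:00:f0:4d:a2:90:97:78:e7:d1"
broadcast = "00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff"

u_flags, u_qpn, u_gid = parse_ipoib_hwaddr(unicast)
b_flags, b_qpn, b_gid = parse_ipoib_hwaddr(broadcast)

# P_Key embedded in the broadcast MGID (bytes 4-5 of the GID).
b_pkey = int.from_bytes(b_gid.packed[4:6], "big")

print(hex(u_qpn), u_gid)                # 0x200054 fe80::f04d:a290:9778:e7d1
print(hex(b_qpn), b_gid, hex(b_pkey))   # 0xffffff ff12:401b:ffff::ffff:ffff 0xffff
```

The broadcast GID decoded this way is the MGID that saquery reports for MLID 0xC000 earlier in the thread (further down in this quoted text), so both interfaces at least agree on which group carries ARP.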
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:22 PM, Hal Rosenstock <
>>>>>>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> So these 2 hosts have trouble talking IPoIB to each
>>>>>>>>>>>>>>>>>>>>>> other ?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 2:16 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I was just wondering about that. It seems reasonable
>>>>>>>>>>>>>>>>>>>>>>> that the broadcast traffic would go over multicast, but effectively
>>>>>>>>>>>>>>>>>>>>>>> channels would be created for node to node communication, otherwise the
>>>>>>>>>>>>>>>>>>>>>>> entire multicast group would be limited to 10 Gbps (in this instance) for
>>>>>>>>>>>>>>>>>>>>>>> the whole group. That doesn't scale very well.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The things I've read about IPoIB performance
>>>>>>>>>>>>>>>>>>>>>>> tuning seem pretty vague, and the changes most people recommend seem to be
>>>>>>>>>>>>>>>>>>>>>>> already in place on the systems I'm using. Some people said, try using a
>>>>>>>>>>>>>>>>>>>>>>> newer version of Ubuntu, but ultimately, I have very little control over
>>>>>>>>>>>>>>>>>>>>>>> VMware. Once I can get the Linux machines to communicate IPoIB between the
>>>>>>>>>>>>>>>>>>>>>>> racks and blades, then I'm going to turn my attention over to performance
>>>>>>>>>>>>>>>>>>>>>>> optimization. It doesn't seem to make much sense to spend time there when
>>>>>>>>>>>>>>>>>>>>>>> it is not working at all for most machines.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I've done ibtracert between the two nodes, is that
>>>>>>>>>>>>>>>>>>>>>>> what you mean by walking the route?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> [root at desxi003 ~]# ibtracert -m 0xc000 0x2d 0x37
>>>>>>>>>>>>>>>>>>>>>>> From ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>>>>>>>>>>>>>>>>>>>> [1] -> switch 0x2c90200448ec8[17] lid 51 "Infiniscale-IV Mellanox Technologies"
>>>>>>>>>>>>>>>>>>>>>>> [18] -> ca 0xf04da2909778e714[1] lid 55 "localhost HCA-1"
>>>>>>>>>>>>>>>>>>>>>>> To ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> [root at desxi004 ~]# ibtracert -m 0xc000 0x37 0x2d
>>>>>>>>>>>>>>>>>>>>>>> From ca 0xf04da2909778e714 port 1 lid 55-55 "localhost HCA-1"
>>>>>>>>>>>>>>>>>>>>>>> [1] -> switch 0x2c90200448ec8[18] lid 51 "Infiniscale-IV Mellanox Technologies"
>>>>>>>>>>>>>>>>>>>>>>> [17] -> ca 0xf04da2909778e7d0[1] lid 45 "localhost HCA-1"
>>>>>>>>>>>>>>>>>>>>>>> To ca 0xf04da2909778e7d0 port 1 lid 45-45 "localhost HCA-1"
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> As you can see, the route is on the same switch,
>>>>>>>>>>>>>>>>>>>>>>> the blades are right next to each other.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:05 PM, Hal Rosenstock <
>>>>>>>>>>>>>>>>>>>>>>> hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Which mystery is explained ? The 10 Gbps is a
>>>>>>>>>>>>>>>>>>>>>>>> multicast only limit and does not apply to unicast. The BW limitation
>>>>>>>>>>>>>>>>>>>>>>>> you're seeing is due to other factors. There's been much written about
>>>>>>>>>>>>>>>>>>>>>>>> IPoIB performance.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> If all the MC members are joined and routed, then
>>>>>>>>>>>>>>>>>>>>>>>> the IPoIB connectivity issue is some other issue. Are you sure this is the
>>>>>>>>>>>>>>>>>>>>>>>> case ? Did you walk the route between 2 nodes where you have a connectivity
>>>>>>>>>>>>>>>>>>>>>>>> issue ?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:58 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Well, that explains one mystery, now I need to
>>>>>>>>>>>>>>>>>>>>>>>>> figure out why it seems the Dell blades are not passing the traffic.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 11:51 AM, Hal Rosenstock
>>>>>>>>>>>>>>>>>>>>>>>>> <hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, that's the IPoIB IPv4 broadcast group for
>>>>>>>>>>>>>>>>>>>>>>>>>> the default (0xffff) partition. 0x80 part of mtu and rate just means "is
>>>>>>>>>>>>>>>>>>>>>>>>>> equal to". mtu 0x04 is 2K (2048) and rate 0x3 is 10 Gb/sec. These are
>>>>>>>>>>>>>>>>>>>>>>>>>> indeed the defaults.
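
[Editor's sketch] The decoding described above can be written out in a few lines of Python. It assumes the MCMemberRecord layout in which the top two bits of the MTU and Rate fields are a selector and the low six bits are an encoded value. The tables cover only the common encodings, and the helper names are invented for this note rather than taken from infiniband-diags.

```python
# Selector and value encodings for MCMemberRecord MTU/Rate fields (subset).
SELECTORS = {0: "greater than", 1: "less than", 2: "exactly", 3: "largest available"}
MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}
RATE_GBPS = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120}


def split_selector(field):
    """Split an 8-bit MTU or Rate field into (selector, value)."""
    return (field >> 6) & 0x3, field & 0x3F


def describe_mtu(field):
    """Render an MTU field such as 0x84 as readable text."""
    selector, value = split_selector(field)
    return f"{SELECTORS[selector]} {MTU_BYTES.get(value, 'unknown')} bytes"


def describe_rate(field):
    """Render a Rate field such as 0x83 as readable text."""
    selector, value = split_selector(field)
    return f"{SELECTORS[selector]} {RATE_GBPS.get(value, 'unknown')} Gb/s"


print(describe_mtu(0x84))   # exactly 2048 bytes
print(describe_rate(0x83))  # exactly 10 Gb/s
```

Read this way, 0x84 and 0x83 are not the decimal quantities 132 and 131 but "selector 2, value 4" and "selector 2, value 3".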
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:45 PM, Robert LeBlanc <
>>>>>>>>>>>>>>>>>>>>>>>>>> robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> The info for that MGID is:
>>>>>>>>>>>>>>>>>>>>>>>>>>> MCMemberRecord group dump:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> MGID....................ff12:401b:ffff::ffff:ffff
>>>>>>>>>>>>>>>>>>>>>>>>>>> Mlid....................0xC000
>>>>>>>>>>>>>>>>>>>>>>>>>>> Mtu.....................0x84
>>>>>>>>>>>>>>>>>>>>>>>>>>> pkey....................0xFFFF
>>>>>>>>>>>>>>>>>>>>>>>>>>> Rate....................0x83
>>>>>>>>>>>>>>>>>>>>>>>>>>> SL......................0x0
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't understand the MTU and Rate (132 and
>>>>>>>>>>>>>>>>>>>>>>>>>>> 131 dec). I run iperf between the two hosts over IPoIB in connected
>>>>>>>>>>>>>>>>>>>>>>>>>>> mode with MTU 65520. I've tried multiple threads, but the sum is still 10
>>>>>>>>>>>>>>>>>>>>>>>>>>> Gbps.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 11:40 AM, Hal
>>>>>>>>>>>>>>>>>>>>>>>>>>> Rosenstock <hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> saquery -g should show what MGID is mapped to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> MLID 0xc000 and the group parameters.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> When you say 10 Gbps max, is that multicast
>>>>>>>>>>>>>>>>>>>>>>>>>>>> or unicast ? That limit is only on the multicast.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 1:28 PM, Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Well, that can explain why I'm only able to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> get 10 Gbps max from the two hosts that are working.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have tried updn and dnup and they didn't
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> help either. I think the only thing that will help is Automatic Path
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Migration, as it tries very hard to route the alternative LIDs through
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different systemguids. I suspect it would require re-LIDing everything
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which would mean an outage. I'm still trying to get answers from Oracle if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that is even a possibility. I've tried seeding some of the algorithms with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> information like root nodes, etc, but none of them worked better.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The MLID 0xc000 exists and I can see all the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> nodes joined to the group using saquery. I've checked the route using
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ibtracert specifying the MLID. The only thing I'm not sure how to check is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the group parameters. What tool would I use for that?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 11:16 AM, Hal
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Rosenstock <hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Xsigo's SM is not "straight" OpenSM. They
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> have some proprietary enhancements and it may be based on an old vintage of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> OpenSM. You will likely need to work with them/Oracle now on issues.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Lack of a partitions file does mean default
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partition and default rate (10 Gbps) so from what I saw all ports had
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sufficient rate to join MC group.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> There are certain topology requirements for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> running various routing algorithms. Did you try updn or dnup ?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The key is determining whether the IPoIB
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> broadcast group is set up correctly. What MLID is the group built on
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (usually 0xc000) ? What are the group parameters (rate, MTU) ? Are all
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> members that are running IPoIB joined ? Is the group routed to all such
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> members ? There are infiniband-diags for all of this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 12:19 PM, Robert
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LeBlanc <robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> OpenSM (the SM runs on Xsigo so they manage
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it) is using minhop. I've loaded the ibnetdiscover output into ibsim and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> run all the different routing algorithms against it with and without
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> scatter ports. Minhop had 50% of our hosts running all paths through a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> single IS5030 switch (at least the LIDs we need which represent Ethernet
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and Fibre Channel cards the hosts should communicate with). Ftree, dor, and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dfsssp fell back to minhop; the others routed more paths through the same
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> IS5030 in some cases increasing our host count with single point of failure
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to 75%.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As far as I can tell there is no
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitions.conf file so I assume we are using the default partition. There
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is an opensm.opts file, but it only specifies logging information.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # SA database file name
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sa_db_file /var/log/opensm-sa.dump
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # If TRUE causes OpenSM to dump SA database at the end of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # every light sweep, regardless of the verbosity level
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sa_db_dump TRUE
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # The directory to hold the file OpenSM dumps
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dump_files_dir /var/log/
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The SM node is:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> xsigoa:/opt/xsigo/xsigos/current/ofed/etc# ibaddr
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> GID fe80::13:9702:100:979 LID start 0x1 end 0x1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We do have Switch-X in two of the Dell
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> m1000e chassis but the cards, ports 17-32, are FDR10 (the switch may be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> straight FDR, but I'm not 100% sure). The IS5030 are QDR which the Switch-X
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are connected to, the switches in the Xsigo directors are QDR, but the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ethernet and Fibre Channel cards are DDR. The DDR cards will not be running
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> IPoIB (at least to my knowledge they don't have the ability), only the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> hosts should be leveraging IPoIB. I hope that clears up some of your
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questions. If you have more, I will try to answer them.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization Engineer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 28, 2013 at 9:57 AM, Hal
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Rosenstock <hal.rosenstock at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> What routing algorithm is configured in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> OpenSM ? What does your partitions.conf file look like ? Which node is your
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> OpenSM ?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Also, I only see QDR and DDR links although
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you have Switch-X so I assume all FDR ports are connected to slower (QDR)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> devices. I don't see any FDR-10 ports but maybe they're also connected to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> QDR ports so show up as QDR in the topology.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> There are DDR CAs in Xsigo box but not sure
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> whether or not they run IPoIB.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -- Hal
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Sun, Oct 27, 2013 at 9:46 PM, Robert
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> LeBlanc <robert_leblanc at byu.edu> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Since you guys are amazingly helpful, I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thought I would pick your brains on a new problem.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We have two Xsigo directors cross
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connected to four Mellanox IS5030 switches. Connected to those we have four
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Dell m1000e chassis each with two IB switches (two chassis have QDR and two
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> have FDR10). We have 9 dual-port rack servers connected to the IS5030
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switches. For testing purposes we have an additional Dell m1000e QDR
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> chassis connected to one Xsigo director and two dual-port FDR10 rack
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> servers connected to the other Xsigo director.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I can get IPoIB to work between the two
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> test rack servers connected to the one Xsigo director. But I can not get
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> IPoIB to work between any blades either right next to each other or to the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> working rack servers. I'm using the same exact live CentOS ISO on all four
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> servers. I've checked opensm and the blades have joined the multicast group
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 0xc000 properly. tcpdump basically says that traffic is not leaving the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> blades. tcpdump also shows no traffic entering the blades from the rack
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> servers. An ibtracert using 0xc000 mlid shows that routing exists between
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> hosts.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've read about
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MulticastFDBTop=0xBFFF but I don't know how to set it and I doubt it would
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> have been set by default.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Anyone have some ideas on
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> troubleshooting steps to try? I think Google is tired of me asking
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> questions about it.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> OIT Infrastructure & Virtualization
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Engineer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Brigham Young University
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Users at lists.openfabrics.org
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ====================================
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Susan Coulter
>>>>>>>>>>>>>>>>>>> HPC-3 Network/Infrastructure
>>>>>>>>>>>>>>>>>>> 505-667-8425
>>>>>>>>>>>>>>>>>>> Increase the Peace...
>>>>>>>>>>>>>>>>>>> An eye for an eye leaves the whole world blind
>>>>>>>>>>>>>>>>>>> ====================================
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>