[ewg] OpenSM Configuration

Atul Yadav atulyadavtech at gmail.com
Sat Apr 12 08:29:26 PDT 2014


Hi,

Yes, I am able to ping all the nodes connected to the InfiniBand switch.
For more details, please go through the attachment.



Thanks
Atul Yadav


On Sat, Apr 12, 2014 at 7:28 PM, Hal Rosenstock <hal at dev.mellanox.co.il> wrote:

> On 4/12/2014 6:59 AM, Atul Yadav wrote:
> > Hi,
> >
> > Thanks for replying.
> > In this architecture, when we run ibv_rc_pingpong between two nodes
> > connected to the same switch, we get a result. But when we use two
> > nodes connected across 2 switches, we get an error.
> >
> > Success:-
> > [root at oss1 ~]# ibv_rc_pingpong
> >   local address:  LID 0x001e, QPN 0x2c004a, PSN 0x554863, GID ::
> >   remote address: LID 0x0022, QPN 0x20004a, PSN 0x7c9dc2, GID ::
> > 8192000 bytes in 0.01 seconds = 6992.74 Mbit/sec
> > 1000 iters in 0.01 seconds = 9.37 usec/iter
> > [root at oss1 ~]#
> >
> > [root at mds1 ~]# ibv_rc_pingpong 173.16.1.52
> >   local address:  LID 0x0022, QPN 0x20004a, PSN 0x7c9dc2, GID ::
> >   remote address: LID 0x001e, QPN 0x2c004a, PSN 0x554863, GID ::
> > 8192000 bytes in 0.01 seconds = 7084.97 Mbit/sec
> > 1000 iters in 0.01 seconds = 9.25 usec/iter
> > [root at mds1 ~]#
> >
> >
> >
> >
> > Error
> > [root at nalanda mvapich2-1.9]# ibv_rc_pingpong
> >   local address:  LID 0x0001, QPN 0x56004e, PSN 0x704d51
> >   remote address: LID 0x0022, QPN 0x1c004a, PSN 0x07a0b2
> >
> > [root at mds1 ~]# ibv_rc_pingpong 173.16.1.1
> >   local address:  LID 0x0022, QPN 0x1c004a, PSN 0x07a0b2, GID ::
> > client read: Success
> > Couldn't read remote address
> > [root at mds1 ~]#
>
> Looking at libibverbs/examples/rc_pingpong.c:
>
> static struct pingpong_dest *pp_client_exch_dest(const char *servername,
>                                                  int port,
>                                                  const struct pingpong_dest *my_dest)
> {
> ...
>         gid_to_wire_gid(&my_dest->gid, gid);
>         sprintf(msg, "%04x:%06x:%06x:%s", my_dest->lid, my_dest->qpn,
>                                                         my_dest->psn, gid);
>         if (write(sockfd, msg, sizeof msg) != sizeof msg) {
>                 fprintf(stderr, "Couldn't send local address\n");
>                 goto out;
>         }
>
>
>         if (read(sockfd, msg, sizeof msg) != sizeof msg) {
>                 perror("client read");
>                 fprintf(stderr, "Couldn't read remote address\n");
>                 goto out;
>         }
>
> This read is failing for some reason. The message exchange happens over
> an IP network (for example, IPoIB or Ethernet).
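The exchange Hal describes can be sketched roughly as follows, in Python rather than the C of rc_pingpong.c. The field layout comes from the sprintf format quoted above. The 32-hex-digit GID width, the helper names, and the use of socketpair() in place of the real TCP connection are assumptions made for illustration. The sketch also shows one way perror() can end up printing "Success": when the peer closes the connection or sends fewer bytes than expected, the read returns a short count without any OS error, so errno is still 0.

```python
import socket

# Field layout taken from the sprintf format quoted above: LID:QPN:PSN:GID.
# The 32-hex-digit GID width is an assumption for this sketch, and the
# trailing NUL that the C program also sends is left out for simplicity.
MSG_LEN = len("0000:000000:000000:" + "0" * 32)


def pack_dest(lid, qpn, psn, gid_hex="0" * 32):
    """Build the text message one side sends over the socket."""
    return ("%04x:%06x:%06x:%s" % (lid, qpn, psn, gid_hex)).encode("ascii")


def unpack_dest(msg):
    """Parse a complete message back into (lid, qpn, psn, gid_hex)."""
    lid, qpn, psn, gid_hex = msg.decode("ascii").split(":")
    return int(lid, 16), int(qpn, 16), int(psn, 16), gid_hex


def read_dest(sock):
    """Read one message; return None on a short read, where the C code gives up."""
    data = sock.recv(MSG_LEN)
    if len(data) != MSG_LEN:
        return None  # short read or EOF: not an OS error, so errno stays 0
    return unpack_dest(data)


# Normal case: the peer sends a full-size message.
a, b = socket.socketpair()
b.sendall(pack_dest(0x0022, 0x1C004A, 0x07A0B2))
full = read_dest(a)

# Failure case: the peer closes without answering, so recv() returns b"".
b.close()
short = read_dest(a)
a.close()
```

In the failing run quoted above, the server output on nalanda prints no GID field while the client output on mds1 does, so it may be worth checking whether both hosts run the same ibv_rc_pingpong version. A peer that sends a shorter message would cause this kind of short read. That is only a guess; nothing in this thread confirms it.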
>
> >
> > And how can we test whether our ftree topology is working correctly?
> >
> > Please go through the attachment.
>
> It looks like LIDs are assigned, but I can't tell about routing from the
> info supplied. The topology looks relatively simple (5 switches,
> homogeneous 4x QDR links). Is the OpenSM log clean? Are there any
> fat-tree related messages? This is likely not an SM issue.
>
> The next issues are end-node related (probably with IPoIB configuration).
> Can you ping between the nodes that fail rc_pingpong? If not,
>
> -- Hal
>
> >
> > Thank You
> > Atul Yadav
> >
> >
> > On Sat, Apr 12, 2014 at 12:14 AM, Hal Rosenstock <hal at dev.mellanox.co.il
> > <mailto:hal at dev.mellanox.co.il>> wrote:
> >
> >     On 4/11/2014 2:21 PM, Atul Yadav wrote:
> >     > Dear Team,
> >     >
> >     > We are trying to build a fat-tree topology.
> >     > The details are given below:
> >     > Unmanaged 36-port switches, quantity 5
> >     > According to a blog post, we need to modify the opensm.conf file,
> >     > but we are unable to work out some parameters, such as:
> >     >  root_guid_file    ???????
> >
> >     Fat tree routing will try to autodetect the roots but this may not
> >     work and it is better to specify the root GUIDs. In your case, they
> >     are the GUIDs for switches A and B.
> >
> >     The root GUID file is then provided to OpenSM either via the conf
> >     file or command line parameters. The command line parameter is [-a |
> >            --root_guid_file <path to file>]
> >
> >     OpenSM man page says:
> >
> >            -a, --root_guid_file <file name>
> >                   Set the root nodes for the Up/Down or Fat-Tree routing
> >                   algorithm to the guids provided in the given file (one
> >                   to a line).
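As a rough illustration of the file format the man page describes (one GUID per line), here is a Python sketch that writes and re-reads such a file. The two GUIDs are made-up placeholders, not the real GUIDs of switches A and B, and the helper names are hypothetical. The 0x-prefixed 64-bit hex spelling is an assumption modelled on the guid value shown in the OpenSM log attached to this message.

```python
import os
import re
import tempfile

# Placeholder GUIDs for illustration only; substitute the node GUIDs of
# your own root switches (for example as reported by ibswitches).
ROOT_GUIDS = ["0x0002c90300aaaa01", "0x0002c90300aaaa02"]

# Assumed spelling: "0x" followed by 16 hex digits (a 64-bit GUID).
GUID_RE = re.compile(r"^0x[0-9a-fA-F]{16}$")


def write_root_guid_file(path, guids):
    """Write one GUID per line, rejecting anything that is not 64-bit hex."""
    for guid in guids:
        if not GUID_RE.match(guid):
            raise ValueError("not a 64-bit hex GUID: %r" % guid)
    with open(path, "w") as f:
        f.write("\n".join(guids) + "\n")


def read_root_guid_file(path):
    """Return the GUIDs listed in the file, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


# Written to a temporary directory here; the log in this thread shows the
# real file configured as root_guid_file = /etc/rdma/guid.
demo_path = os.path.join(tempfile.mkdtemp(), "root_guid.conf")
write_root_guid_file(demo_path, ROOT_GUIDS)
roots = read_root_guid_file(demo_path)
```

The resulting file would then be passed to OpenSM with -a / --root_guid_file, or named in opensm.conf through the root_guid_file option.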
> >
> >     It also says:
> >
> >            If the root guid file is not provided ('-a' or
> >            '--root_guid_file' options), the topology has to be pure
> >            fat-tree that complies with the following rules:
> >              - Tree rank should be between two and eight (inclusively)
> >              - Switches of the same rank should have the same number
> >                of UP-going port groups*, unless they are root switches,
> >                in which case they shouldn't have UP-going ports at all.
> >              - Switches of the same rank should have the same number
> >                of DOWN-going port groups, unless they are leaf switches.
> >              - Switches of the same rank should have the same number
> >                of ports in each UP-going port group.
> >              - Switches of the same rank should have the same number
> >                of ports in each DOWN-going port group.
> >              - All the CAs have to be at the same tree level (rank).
> >
> >            If the root guid file is provided, the topology doesn't have
> >            to be pure fat-tree, and it should only comply with the
> >            following rules:
> >              - Tree rank should be between two and eight (inclusively)
> >              - All the Compute Nodes** have to be at the same tree level
> >                (rank). Note that non-compute node CAs are allowed here
> >                to be at different tree ranks.
> >
> >            *  ports that are connected to the same remote switch are
> >            referenced as port group.
> >
> >            ** list of compute nodes (CNs) can be specified by -u or
> >            --cn_guid_file OpenSM options.
> >
> >     -- Hal
> >
> >     >
> >     > We need your input on this.
> >     >
> >     >
> >     >
> >     >
> >     > Thank You
> >     > Atul Yadav
> >     >
> >     >
> >     >
> >     >
> >     > _______________________________________________
> >     > ewg mailing list
> >     > ewg at lists.openfabrics.org <mailto:ewg at lists.openfabrics.org>
> >     > http://lists.openfabrics.org/mailman/listinfo/ewg
> >
> >
>
>
-------------- next part --------------
-------------------------------------------------
OpenSM 3.3.5
 Reading Cached Option File: /etc/rdma/opensm.conf
 Loading Cached Option:guid = 0x0002c9030042e421
 Loading Cached Option:sweep_interval = 120
 Loading Cached Option:routing_engine = ftree
 Loading Cached Option:use_ucast_cache = TRUE
 Loading Cached Option:root_guid_file = /etc/rdma/guid
Command Line Arguments:
 Daemon mode
 Log File: /var/log/opensm.log
-------------------------------------------------
OpenSM 3.3.5

Apr 12 20:44:15 982804 [67F8C700] 0x80 -> OpenSM 3.3.5
Entering DISCOVERING state

Apr 12 20:44:15 984514 [67F8C700] 0x02 -> osm_vendor_init: 1000 pending umads specified
Apr 12 20:44:15 984702 [67F8C700] 0x80 -> Entering DISCOVERING state
Entering MASTER state

Apr 12 20:44:15 984761 [67F8C700] 0x02 -> osm_vendor_bind: Binding to port 0x2c9030042e421
Apr 12 20:44:16 027506 [67F8C700] 0x02 -> osm_vendor_bind: Binding to port 0x2c9030042e421
Apr 12 20:44:16 027558 [67F8C700] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0002c9030042e421
Apr 12 20:44:16 069014 [5CB78700] 0x80 -> Entering MASTER state
SUBNET UP

Apr 12 20:44:16 075363 [5CB78700] 0x02 -> fabric_dump_general_info: General fabric topology info
Apr 12 20:44:16 075368 [5CB78700] 0x02 -> fabric_dump_general_info: ============================
Apr 12 20:44:16 075371 [5CB78700] 0x02 -> fabric_dump_general_info:   - FatTree rank (roots to leaf switches): 2
Apr 12 20:44:16 075372 [5CB78700] 0x02 -> fabric_dump_general_info:   - FatTree max switch rank: 1
Apr 12 20:44:16 075374 [5CB78700] 0x02 -> fabric_dump_general_info:   - Fabric has 39 CAs, 39 CA ports (39 of them CNs), 5 switches
Apr 12 20:44:16 075376 [5CB78700] 0x02 -> fabric_dump_general_info:   - Fabric has 2 switches at rank 0 (roots)
Apr 12 20:44:16 075378 [5CB78700] 0x02 -> fabric_dump_general_info:   - Fabric has 3 switches at rank 1 (3 of them leafs)
Apr 12 20:44:16 075511 [5CB78700] 0x02 -> osm_ucast_mgr_process: ftree tables configured on all switches
Apr 12 20:44:16 098151 [5CB78700] 0x80 -> SUBNET UP
Apr 12 20:44:16 277047 [6077E700] 0x01 -> log_trap_info: Received Generic Notice type:4 num:144 (CapabilityMask, NodeDescription, Link [Width|Speed] Enabled, SM priority changed) Producer:1 (Channel Adapter) from LID:1 TID:0x000000000000003c
Apr 12 20:44:16 277090 [6077E700] 0x02 -> trap_rcv_process_request: Trap 144 Node description update
Apr 12 20:44:16 277105 [6077E700] 0x02 -> log_notice: Reporting Generic Notice type:4 num:144 (CapabilityMask, NodeDescription, Link [Width|Speed] Enabled, SM priority changed) from LID:1 GID:fe80::2:c903:42:e421
Apr 12 20:44:16 298708 [5CB78700] 0x02 -> osm_ucast_cache_process: Configuring switch tables using cached routing
Apr 12 20:44:16 299811 [5CB78700] 0x02 -> SUBNET UP
Apr 12 20:44:18 072914 [61B80700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:401b:ffff::ffff:ffff
Apr 12 20:44:18 073968 [5E97B700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff85:6d01
Apr 12 20:44:18 074165 [5F37C700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e421
Apr 12 20:44:18 074791 [64384700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:401b:ffff::1
Apr 12 20:44:18 074855 [67589700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff85:cc21
Apr 12 20:44:18 074905 [61B80700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff85:cc31
Apr 12 20:44:18 074940 [6117F700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff52:498f
Apr 12 20:44:18 075018 [5DF7A700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff52:4983
Apr 12 20:44:18 075062 [62F82700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:401b:ffff::fb
Apr 12 20:44:18 075126 [5DF7A700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff41:f801
Apr 12 20:44:18 075561 [5FD7D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff52:4993
Apr 12 20:44:18 075653 [62581700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1
Apr 12 20:44:18 076163 [5E97B700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:3a11
Apr 12 20:44:18 076192 [5E97B700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff52:499b
Apr 12 20:44:18 076299 [6117F700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e301
Apr 12 20:44:18 076331 [6117F700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::202
Apr 12 20:44:18 076354 [6117F700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff41:f841
Apr 12 20:44:18 076419 [65786700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e321
Apr 12 20:44:18 076851 [61B80700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e331
Apr 12 20:44:18 076932 [62F82700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff85:cc61
Apr 12 20:44:18 077120 [62581700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e441
Apr 12 20:44:18 077399 [66B88700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e511
Apr 12 20:44:18 077635 [64D85700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:6
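
One rough way to check the points Hal raises (is the log clean, are there fat-tree messages) is to scan opensm.log for the fat-tree summary lines. The sketch below is Python, matches only the message texts that appear in this log excerpt, and uses a hypothetical helper name. Other OpenSM versions may word these lines differently.

```python
import re

# Patterns copied from the fabric_dump_general_info lines in this log.
RANK_RE = re.compile(r"FatTree rank \(roots to leaf switches\): (\d+)")
ROOTS_RE = re.compile(r"Fabric has (\d+) switches at rank 0 \(roots\)")
TOTAL_RE = re.compile(r"Fabric has (\d+) CAs, (\d+) CA ports .*?, (\d+) switches")
DONE_TEXT = "ftree tables configured on all switches"


def summarize_ftree(log_lines):
    """Collect the fat-tree facts OpenSM reported; missing ones stay None."""
    info = {"rank": None, "roots": None, "cas": None, "switches": None,
            "tables_configured": False}
    for line in log_lines:
        m = RANK_RE.search(line)
        if m:
            info["rank"] = int(m.group(1))
        m = ROOTS_RE.search(line)
        if m:
            info["roots"] = int(m.group(1))
        m = TOTAL_RE.search(line)
        if m:
            info["cas"] = int(m.group(1))
            info["switches"] = int(m.group(3))
        if DONE_TEXT in line:
            info["tables_configured"] = True
    return info


# Sample lines shortened from the log above (timestamps removed).
sample = [
    "fabric_dump_general_info:   - FatTree rank (roots to leaf switches): 2",
    "fabric_dump_general_info:   - Fabric has 39 CAs, 39 CA ports (39 of them CNs), 5 switches",
    "fabric_dump_general_info:   - Fabric has 2 switches at rank 0 (roots)",
    "osm_ucast_mgr_process: ftree tables configured on all switches",
]
summary = summarize_ftree(sample)
```

For the log in this message the summary comes out as rank 2, 2 roots, 39 CAs, 5 switches, with the ftree tables configured. That is consistent with Hal's view that this is likely not an SM issue.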


-------------- next part --------------
A non-text attachment was scrubbed...
Name: ibdiagnet-report.tar.gz
Type: application/x-gzip
Size: 13608 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/ewg/attachments/20140412/f65818e5/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: opensm.conf
Type: application/octet-stream
Size: 8436 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/ewg/attachments/20140412/f65818e5/attachment.obj>

