[ewg] OpenSM configuration
Atul Yadav
atulyadavtech at gmail.com
Sat Apr 12 08:29:26 PDT 2014
Hi,
Yes, I am able to ping all the nodes connected to the InfiniBand switch.
For more details, please go through the attachment.
Thanks
Atul Yadav
On Sat, Apr 12, 2014 at 7:28 PM, Hal Rosenstock <hal at dev.mellanox.co.il> wrote:
> On 4/12/2014 6:59 AM, Atul Yadav wrote:
> > Hi,
> >
> > Thanks for replying.
> > In this architecture, when we run ibv_rc_pingpong between two nodes
> > connected to the same switch, we get a result. But when the two
> > nodes are connected through 2 switches, we get an error.
> >
> > Success:
> > [root@oss1 ~]# ibv_rc_pingpong
> > local address: LID 0x001e, QPN 0x2c004a, PSN 0x554863, GID ::
> > remote address: LID 0x0022, QPN 0x20004a, PSN 0x7c9dc2, GID ::
> > 8192000 bytes in 0.01 seconds = 6992.74 Mbit/sec
> > 1000 iters in 0.01 seconds = 9.37 usec/iter
> > [root@oss1 ~]#
> >
> > [root@mds1 ~]# ibv_rc_pingpong 173.16.1.52
> > local address: LID 0x0022, QPN 0x20004a, PSN 0x7c9dc2, GID ::
> > remote address: LID 0x001e, QPN 0x2c004a, PSN 0x554863, GID ::
> > 8192000 bytes in 0.01 seconds = 7084.97 Mbit/sec
> > 1000 iters in 0.01 seconds = 9.25 usec/iter
> > [root@mds1 ~]#
> >
> >
> >
> >
> > Error:
> > [root@nalanda mvapich2-1.9]# ibv_rc_pingpong
> > local address: LID 0x0001, QPN 0x56004e, PSN 0x704d51
> > remote address: LID 0x0022, QPN 0x1c004a, PSN 0x07a0b2
> >
> > [root@mds1 ~]# ibv_rc_pingpong 173.16.1.1
> > local address: LID 0x0022, QPN 0x1c004a, PSN 0x07a0b2, GID ::
> > client read: Success
> > Couldn't read remote address
> > [root@mds1 ~]#
>
> Looking at libibverbs/examples/rc_pingpong.c:
>
> static struct pingpong_dest *pp_client_exch_dest(const char *servername,
>                                                  int port,
>                                                  const struct pingpong_dest *my_dest)
> {
>     ...
>     gid_to_wire_gid(&my_dest->gid, gid);
>     sprintf(msg, "%04x:%06x:%06x:%s", my_dest->lid, my_dest->qpn,
>             my_dest->psn, gid);
>     if (write(sockfd, msg, sizeof msg) != sizeof msg) {
>         fprintf(stderr, "Couldn't send local address\n");
>         goto out;
>     }
>
>     if (read(sockfd, msg, sizeof msg) != sizeof msg) {
>         perror("client read");
>         fprintf(stderr, "Couldn't read remote address\n");
>         goto out;
>     }
>
> This read is failing for some reason. It is a message exchange over a
> socket on some IP network (for example, IPoIB or Ethernet), not over verbs.
>
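A note on the "client read: Success" line in the failing run: perror() printing "Success" suggests that read() did not return -1, so it most likely returned fewer bytes than sizeof msg, typically 0 because the server side closed the socket. The following is a minimal sketch, not code from rc_pingpong.c, of how the return value of that read could be classified and how the "lid:qpn:psn:gid" message could be parsed. The names exch_status, wire_dest, classify_read and parse_wire_dest are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical names for illustration; none of these come from rc_pingpong.c. */
enum exch_status { EXCH_OK, EXCH_PEER_CLOSED, EXCH_SHORT_READ, EXCH_ERROR };

struct wire_dest {
    unsigned lid, qpn, psn;
    char gid[33];               /* 32 hex digits plus terminating NUL */
};

/* Classify the return value of read() against the expected message size. */
static enum exch_status classify_read(ssize_t n, size_t expected)
{
    if (n < 0)
        return EXCH_ERROR;        /* errno is set, so perror() is meaningful */
    if (n == 0)
        return EXCH_PEER_CLOSED;  /* EOF: errno untouched, perror() may say "Success" */
    if ((size_t)n != expected)
        return EXCH_SHORT_READ;   /* partial message */
    return EXCH_OK;
}

/* Parse "lid:qpn:psn:gid" (all hex). Returns 0 on success, -1 if malformed. */
static int parse_wire_dest(const char *msg, struct wire_dest *out)
{
    assert(msg != NULL && out != NULL);
    if (sscanf(msg, "%x:%x:%x:%32s",
               &out->lid, &out->qpn, &out->psn, out->gid) != 4)
        return -1;
    return strlen(out->gid) == 32 ? 0 : -1;
}
```

In a client, classify_read(read(sockfd, msg, sizeof msg), sizeof msg) would separate "peer closed the connection" from a genuine socket error, which narrows down whether the problem is on the server side or in the IP path between the two hosts.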
> >
> > And how can we test that our ftree topology is working fine?
> >
> > Please go through the attachment.
>
> It looks like LIDs are assigned, but I can't tell about routing from the
> info supplied. The topology looks relatively simple (5 switches,
> homogeneous 4x QDR links). Is the OpenSM log clean? Any fat tree related
> messages? This is likely not an SM issue.
>
> The next issues are end node related (probably with the IPoIB configuration).
> Can you ping between the nodes which fail rc_pingpong? If not,
>
> -- Hal
>
> >
> > Thank You
> > Atul Yadav
> >
> >
> > On Sat, Apr 12, 2014 at 12:14 AM, Hal Rosenstock <hal at dev.mellanox.co.il> wrote:
> >
> > On 4/11/2014 2:21 PM, Atul Yadav wrote:
> > > Dear Team,
> > >
> > > We are trying to build a fat-tree topology.
> > > The details are given below:
> > > We have 5 unmanaged 36-port switches.
> > > According to a blog post, we need to modify the opensm.conf file,
> > > but we are unable to work out some of the parameters, such as:
> > > root_guid_file
> >
> > Fat tree routing will try to autodetect the roots but this may not
> > work and it is better to specify the root GUIDs. In your case, they
> > are the GUIDs for switches A and B.
> >
> > The root GUID file is then provided to OpenSM either via the conf
> > file or command line parameters. The command line parameter is [-a |
> > --root_guid_file <path to file>]
> >
> > OpenSM man page says:
> >
> > -a, --root_guid_file <file name>
> >     Set the root nodes for the Up/Down or Fat-Tree routing algorithm
> >     to the guids provided in the given file (one to a line).
> >
> > It also says:
> >
> > If the root guid file is not provided ('-a' or '--root_guid_file'
> > options), the topology has to be pure fat-tree that complies with the
> > following rules:
> > - Tree rank should be between two and eight (inclusively)
> > - Switches of the same rank should have the same number
> > of UP-going port groups*, unless they are root switches,
> > in which case they shouldn't have UP-going ports at all.
> > - Switches of the same rank should have the same number
> > of DOWN-going port groups, unless they are leaf switches.
> > - Switches of the same rank should have the same number
> > of ports in each UP-going port group.
> > - Switches of the same rank should have the same number
> > of ports in each DOWN-going port group.
> > - All the CAs have to be at the same tree level (rank).
> >
> > If the root guid file is provided, the topology doesn't have to be pure
> > fat-tree, and it should only comply with the following rules:
> > - Tree rank should be between two and eight (inclusively)
> > - All the Compute Nodes** have to be at the same tree level (rank).
> > Note that non-compute node CAs are allowed here to be at different
> > tree ranks.
> >
> > * ports that are connected to the same remote switch are referenced as
> > port group.
> >
> > ** list of compute nodes (CNs) can be specified by -u or
> > --cn_guid_file OpenSM options.
> >
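Since the man page excerpt above asks for a file with one GUID per line, a quick sanity check of such a file before handing it to OpenSM may help. This is a minimal sketch under the assumption that each line holds a single hexadecimal 64-bit GUID with an optional 0x prefix; parse_guid_line is a hypothetical helper and does not describe how OpenSM itself parses the file.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse one line of a root GUID file.
 * Assumes one hexadecimal GUID per line, optionally prefixed with "0x".
 * Returns 0 and stores the value on success, -1 on a malformed line.
 */
static int parse_guid_line(const char *line, uint64_t *guid)
{
    char *end;
    unsigned long long v;

    assert(line != NULL && guid != NULL);

    /* Skip leading blanks and reject a sign, which strtoull would accept. */
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '-' || *line == '+')
        return -1;

    errno = 0;
    v = strtoull(line, &end, 16);   /* base 16 accepts an optional "0x" */
    if (end == line || errno == ERANGE)
        return -1;                  /* no digits, or more than 64 bits */

    /* Only whitespace or a line ending may follow the number. */
    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
        end++;
    if (*end != '\0')
        return -1;

    *guid = (uint64_t)v;
    return 0;
}
```

A caller would read the file with fgets() and run each line through parse_guid_line(), reporting the first line that fails. The root switch GUIDs themselves (switches A and B) are not given in this thread and have to be taken from the fabric.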
> > -- Hal
> >
> > >
> > > We need your input on this.
> > >
> > >
> > >
> > >
> > > Thank You
> > > Atul Yadav
> > >
> > >
> > >
> > >
> >
> >
>
>
-------------- next part --------------
-------------------------------------------------
OpenSM 3.3.5
Reading Cached Option File: /etc/rdma/opensm.conf
Loading Cached Option:guid = 0x0002c9030042e421
Loading Cached Option:sweep_interval = 120
Loading Cached Option:routing_engine = ftree
Loading Cached Option:use_ucast_cache = TRUE
Loading Cached Option:root_guid_file = /etc/rdma/guid
Command Line Arguments:
Daemon mode
Log File: /var/log/opensm.log
-------------------------------------------------
OpenSM 3.3.5
Apr 12 20:44:15 982804 [67F8C700] 0x80 -> OpenSM 3.3.5
Entering DISCOVERING state
Apr 12 20:44:15 984514 [67F8C700] 0x02 -> osm_vendor_init: 1000 pending umads specified
Apr 12 20:44:15 984702 [67F8C700] 0x80 -> Entering DISCOVERING state
Entering MASTER state
Apr 12 20:44:15 984761 [67F8C700] 0x02 -> osm_vendor_bind: Binding to port 0x2c9030042e421
Apr 12 20:44:16 027506 [67F8C700] 0x02 -> osm_vendor_bind: Binding to port 0x2c9030042e421
Apr 12 20:44:16 027558 [67F8C700] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0002c9030042e421
Apr 12 20:44:16 069014 [5CB78700] 0x80 -> Entering MASTER state
SUBNET UP
Apr 12 20:44:16 075363 [5CB78700] 0x02 -> fabric_dump_general_info: General fabric topology info
Apr 12 20:44:16 075368 [5CB78700] 0x02 -> fabric_dump_general_info: ============================
Apr 12 20:44:16 075371 [5CB78700] 0x02 -> fabric_dump_general_info: - FatTree rank (roots to leaf switches): 2
Apr 12 20:44:16 075372 [5CB78700] 0x02 -> fabric_dump_general_info: - FatTree max switch rank: 1
Apr 12 20:44:16 075374 [5CB78700] 0x02 -> fabric_dump_general_info: - Fabric has 39 CAs, 39 CA ports (39 of them CNs), 5 switches
Apr 12 20:44:16 075376 [5CB78700] 0x02 -> fabric_dump_general_info: - Fabric has 2 switches at rank 0 (roots)
Apr 12 20:44:16 075378 [5CB78700] 0x02 -> fabric_dump_general_info: - Fabric has 3 switches at rank 1 (3 of them leafs)
Apr 12 20:44:16 075511 [5CB78700] 0x02 -> osm_ucast_mgr_process: ftree tables configured on all switches
Apr 12 20:44:16 098151 [5CB78700] 0x80 -> SUBNET UP
Apr 12 20:44:16 277047 [6077E700] 0x01 -> log_trap_info: Received Generic Notice type:4 num:144 (CapabilityMask, NodeDescription, Link [Width|Speed] Enabled, SM priority changed) Producer:1 (Channel Adapter) from LID:1 TID:0x000000000000003c
Apr 12 20:44:16 277090 [6077E700] 0x02 -> trap_rcv_process_request: Trap 144 Node description update
Apr 12 20:44:16 277105 [6077E700] 0x02 -> log_notice: Reporting Generic Notice type:4 num:144 (CapabilityMask, NodeDescription, Link [Width|Speed] Enabled, SM priority changed) from LID:1 GID:fe80::2:c903:42:e421
Apr 12 20:44:16 298708 [5CB78700] 0x02 -> osm_ucast_cache_process: Configuring switch tables using cached routing
Apr 12 20:44:16 299811 [5CB78700] 0x02 -> SUBNET UP
Apr 12 20:44:18 072914 [61B80700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:401b:ffff::ffff:ffff
Apr 12 20:44:18 073968 [5E97B700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff85:6d01
Apr 12 20:44:18 074165 [5F37C700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e421
Apr 12 20:44:18 074791 [64384700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:401b:ffff::1
Apr 12 20:44:18 074855 [67589700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff85:cc21
Apr 12 20:44:18 074905 [61B80700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff85:cc31
Apr 12 20:44:18 074940 [6117F700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff52:498f
Apr 12 20:44:18 075018 [5DF7A700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff52:4983
Apr 12 20:44:18 075062 [62F82700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:401b:ffff::fb
Apr 12 20:44:18 075126 [5DF7A700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff41:f801
Apr 12 20:44:18 075561 [5FD7D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff52:4993
Apr 12 20:44:18 075653 [62581700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1
Apr 12 20:44:18 076163 [5E97B700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:3a11
Apr 12 20:44:18 076192 [5E97B700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff52:499b
Apr 12 20:44:18 076299 [6117F700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e301
Apr 12 20:44:18 076331 [6117F700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::202
Apr 12 20:44:18 076354 [6117F700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff41:f841
Apr 12 20:44:18 076419 [65786700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e321
Apr 12 20:44:18 076851 [61B80700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e331
Apr 12 20:44:18 076932 [62F82700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff85:cc61
Apr 12 20:44:18 077120 [62581700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e441
Apr 12 20:44:18 077399 [66B88700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff42:e511
Apr 12 20:44:18 077635 [64D85700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:6
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ibdiagnet-report.tar.gz
Type: application/x-gzip
Size: 13608 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/ewg/attachments/20140412/f65818e5/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: opensm.conf
Type: application/octet-stream
Size: 8436 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/ewg/attachments/20140412/f65818e5/attachment.obj>