[ewg] (no subject)

Hal Rosenstock hal.rosenstock at gmail.com
Sat Oct 17 04:34:55 PDT 2009


On Fri, Oct 16, 2009 at 2:20 PM, Mahmoud Hanafi <mhanafi at csc.com> wrote:
> We have a Linux cluster running RH5.3 with ofed1.4 using Mellanox MT25418.
> The cluster is attached to a Sun solaris10.7 thumper box. The thumper box
> exports a zfs filesystem via NFS. Linux clients mount the filesystem via
> IPoIB.
>
> Under filesystem I/O load the subnet manager gets repeated path record
> requests from the sun solaris box.

Do the path records all look the same or different in terms of
destinations (and sources) ?
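
One way to answer that from the OpenSM log is to tally the source and destination port GUIDs per request. The following is only a minimal sketch: it assumes each log entry sits on a single line and contains the "Src port ..., Dst port ..." text seen in the trace quoted below. The function name count_path_pairs is made up for illustration.

```python
import re
from collections import Counter

# Assumption: each PathRecord request logs one line containing
# "Src port 0x..., Dst port 0x..." as in the quoted trace.
PAIR_RE = re.compile(r"Src port (0x[0-9a-fA-F]+), Dst port (0x[0-9a-fA-F]+)")


def count_path_pairs(lines):
    """Count (source GUID, destination GUID) pairs across log lines."""
    pairs = Counter()
    for line in lines:
        match = PAIR_RE.search(line)
        if match:
            src, dst = match.group(1).lower(), match.group(2).lower()
            pairs[(src, dst)] += 1
    return pairs


# Hypothetical usage with one line copied from the trace, repeated twice.
sample = [
    "Oct 15 19:37:59 952382 [41E02960] 0x08 -> _osm_pr_rcv_get_port_pair_paths: "
    "Src port 0x0003ba000100d0a5, Dst port 0x00237dffff949819",
] * 2

for (src, dst), n in count_path_pairs(sample).most_common():
    print(src, "->", dst, n)
```

If one pair dominates the counts, the requests are repeats for the same path. A wide spread of destinations would point at something else.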

Is the source GUID (0x0003ba000100d0a5) the Solaris thumper port GUID
(00-03-BA (hex) Sun Microsystems Inc.) ? The destination appears to be
some HP device (00-23-7D (hex) Hewlett Packard).
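
The vendor guess above comes from the top 24 bits of each 64-bit port GUID, which carry the IEEE OUI. A small sketch of that extraction (the helper name guid_oui is just for illustration):

```python
def guid_oui(guid):
    """Return the top 24 bits of a 64-bit GUID formatted as 'XX-XX-XX'."""
    value = int(guid, 16) if isinstance(guid, str) else guid
    oui = (value >> 40) & 0xFFFFFF
    return "-".join(f"{(oui >> shift) & 0xFF:02X}" for shift in (16, 8, 0))


print(guid_oui("0x0003ba000100d0a5"))  # 00-03-BA
print(guid_oui("0x00237dffff949819"))  # 00-23-7D
```

The resulting prefix still has to be looked up in the IEEE OUI registry to get the vendor name.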

> This can bring the SM and the fabric down.

Are you referring to the load due to path requests or something else ?
Running OpenSM at the logging level you appear to be using would
certainly slow things down greatly, so I presume that was only done to
look further into what was going on.

> Has anyone else had issues with Solaris IB <-> Linux IB?

I haven't run Solaris <-> Linux IB in several years. It used to work,
but there have been a lot of changes since then.

> Any insight into what could be causing the issue?

Could you elaborate on the below ? I see one PathRecord response trace
and an ibdiagnet run which shows a bad link at direct route 1,11,23
from where that was run. You might want to debug the issue with that
link.
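
For reading the trace, the path parameters in the response (mtu = 4, rate = 6, packet lifetime = 18) are encoded values rather than raw units. The sketch below decodes them using the standard InfiniBand encodings as I recall them. The table and function names are made up for illustration and are not OpenSM code.

```python
# InfiniBand encodings for PathRecord MTU and rate fields.
MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}
RATE_GBPS = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120}


def decode_path_params(mtu, rate, pkt_life):
    """Translate encoded path parameters into bytes, Gbps and seconds."""
    return {
        "mtu_bytes": MTU_BYTES[mtu],
        "rate_gbps": RATE_GBPS[rate],
        # Packet lifetime is 4.096 microseconds * 2^pkt_life.
        "pkt_life_seconds": 4.096e-6 * (2 ** pkt_life),
    }


print(decode_path_params(mtu=4, rate=6, pkt_life=18))
```

With the values from the trace this gives a 2048-byte MTU, a 20 Gbps rate and a packet lifetime of roughly one second. The MTU and the rate match the 2048Byte MTU and the 20Gbps lowest member rate in the ibdiagnet output.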

-- Hal

>
> Thanks,
> Mahmoud
>
> ----
>
> Oct 15 19:37:59 952368 [41E02960] 0x08 -> PathRecord dump:
>     service id ..............0x0000000000000000
>     dgid ....................0xfe80000000000000 : 0x00237dffff949819
>     sgid ....................0xfe80000000000000 : 0x0003ba000100d0a5
>     dlid ....................0
>     slid ....................0
>     hop_flow_raw ............0x0
>     tclass ..................0x0
>     num_path_revers .........0x81
>     pkey ....................0x0
>     qos_class ...............0x0
>     sl ......................0x0
>     mtu .....................0x0
>     rate ....................0x0
>     pkt_life ................0x0
>     preference ..............0x0
>     resv2 ...................0x0
>     resv3 ...................0x0
> Oct 15 19:37:59 952376 [41E02960] 0x08 -> osm_pr_rcv_process: Unicast destination requested
> Oct 15 19:37:59 952382 [41E02960] 0x08 -> _osm_pr_rcv_get_port_pair_paths: Src port 0x0003ba000100d0a5, Dst port 0x00237dffff949819
> Oct 15 19:37:59 952388 [41E02960] 0x08 -> _osm_pr_rcv_get_port_pair_paths: Src LIDs [2-2], Dest LIDs [67-67]
> Oct 15 19:37:59 952393 [41E02960] 0x08 -> _osm_pr_rcv_get_lid_pair_path: Src LID 2, Dest LID 67
> Oct 15 19:37:59 952399 [41E02960] 0x08 -> _osm_pr_rcv_get_path_parms: Path min MTU = 4, min rate = 6
> Oct 15 19:37:59 952408 [41E02960] 0x08 -> _osm_pr_rcv_get_path_parms: Path params: mtu = 4, rate = 6, packet lifetime = 18, pkey = 0xFFFF, sl = 0
> Oct 15 19:37:59 952417 [41E02960] 0x08 -> _osm_pr_rcv_get_path_parms: Path min MTU = 4, min rate = 6
> Oct 15 19:37:59 952423 [41E02960] 0x08 -> _osm_pr_rcv_get_path_parms: Path params: mtu = 4, rate = 6, packet lifetime = 18, pkey = 0xFFFF, sl = 0
> Oct 15 19:37:59 952428 [41E02960] 0x08 -> osm_sa_respond: Returning 1 records
> Oct 15 19:37:59 952433 [41E02960] 0x08 -> osm_vendor_get: Acquiring UMAD for p_madw = 0x2a9567f2c8, size = 120
> Oct 15 19:37:59 952439 [41E02960] 0x08 -> osm_vendor_get: Acquired UMAD 0x2a9567f390, size = 120
> Oct 15 19:37:59 952455 [41E02960] 0x08 -> osm_vendor_put: Retiring UMAD 0x2a9567f390
> Oct 15 19:37:59 952460 [41E02960] 0x08 -> osm_vendor_send: Completed sending response or unsolicited p_madw = 0x2a9567f2b0
> Oct 15 19:37:59 952466 [41E02960] 0x08 -> osm_vendor_put: Retiring UMAD 0x724520
>
> ===============
> Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
> -W- Topology file is not specified.
>     Reports regarding cluster links will use direct routes.
> Loading IBDM from: /usr/lib64/ibdm1.2
> -I- Using port 1 as the local port.
> -I- Discovering ... 103 nodes (7 Switches & 96 CA-s) discovered.
> -I---------------------------------------------------
> -I- Bad Guids/LIDs Info
> -I---------------------------------------------------
> -I- No bad Guids were found
> -I---------------------------------------------------
> -I- Links With Logical State = INIT
> -I---------------------------------------------------
> -I- No bad Links (with logical state = INIT) were found
> -I---------------------------------------------------
> -I- PM Counters Info
> -I---------------------------------------------------
> -I- No illegal PM counters values were found
> -I---------------------------------------------------
> -I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
> -I---------------------------------------------------
> -I- PKey:0x7fff Hosts:97 full:97 partial:0
> -I---------------------------------------------------
> -I- IPoIB Subnets Check
> -I---------------------------------------------------
> -I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
> -W- Suboptimal rate for group. Lowest member rate:20Gbps > group-rate:10Gbps
> -I---------------------------------------------------
> -I- Bad Links Info
> -I- Errors have occurred on the following links
>     (for errors details, look in log file /tmp/ibdiagnet.log):
> -I---------------------------------------------------
> Link at the end of direct route "1,11,23"
> ----------------------------------------------------------------
> -I- Stages Status Report:
>     STAGE                          Errors Warnings
>     Bad GUIDs/LIDs Check           0      0
>     Link State Active Check        0      0
>     Performance Counters Report    0      0
>     Partitions Check               0      0
>     IPoIB Subnets Check            0      1
>     Link Errors Check              0      0
>
> Please see /tmp/ibdiagnet.log for complete log
> -I- Done. Run time was 6 seconds.
>
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>


