From vlad at lists.openfabrics.org  Sat Nov  1 03:06:32 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat,  1 Nov 2008 03:06:32 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20081101-0200 daily build status
Message-ID: <20081101100633.1A5A9E60D7F@openfabrics.org>

This email was generated automatically; please do not reply.


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:
Build failed on x86_64 with linux-2.6.25
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c: In function 'ioremap_wc':
/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: error: implicit declaration of function '__ioremap'
/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: warning: return makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.25_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.25_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.25'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.26
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c: In function 'ioremap_wc':
/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: error: implicit declaration of function '__ioremap'
/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: warning: return makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.26_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.26_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.26'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.24
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:218: error: 'cpu_data' undeclared (first use in this function)
/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:218: error: (Each undeclared identifier is reported only once
/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:218: error: for each function it appears in.)
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.24_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.24_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.24'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-42.ELsmp
Log:
patching file drivers/infiniband/hw/ipath/ipath_init_chip.c
Hunk #1 succeeded at 529 (offset 135 lines).
Hunk #2 succeeded at 537 (offset 135 lines).
Hunk #3 succeeded at 848 (offset 135 lines).
patching file drivers/infiniband/hw/ipath/ipath_sysfs.c
patching file drivers/infiniband/hw/ipath/ipath_user_pages.c
Patch ipath_0110_2.6.9.patch does not apply (enforce with -f)

Failed executing /usr/bin/quilt
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-55.ELsmp
Log:
patching file drivers/infiniband/hw/ipath/ipath_init_chip.c
Hunk #1 succeeded at 529 (offset 135 lines).
Hunk #2 succeeded at 537 (offset 135 lines).
Hunk #3 succeeded at 848 (offset 135 lines).
patching file drivers/infiniband/hw/ipath/ipath_sysfs.c
patching file drivers/infiniband/hw/ipath/ipath_user_pages.c
Patch ipath_0110_2.6.9.patch does not apply (enforce with -f)

Failed executing /usr/bin/quilt
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.27
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c: In function 'ioremap_wc':
/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: error: implicit declaration of function '__ioremap'
/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: warning: return makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.27_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081101-0200_linux-2.6.27_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.27'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
patching file drivers/infiniband/hw/ipath/ipath_init_chip.c
Hunk #1 succeeded at 529 (offset 135 lines).
Hunk #2 succeeded at 537 (offset 135 lines).
Hunk #3 succeeded at 848 (offset 135 lines).
patching file drivers/infiniband/hw/ipath/ipath_sysfs.c
patching file drivers/infiniband/hw/ipath/ipath_user_pages.c
Patch ipath_0110_2.6.9.patch does not apply (enforce with -f)

Failed executing /usr/bin/quilt
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
patching file drivers/infiniband/hw/ipath/ipath_init_chip.c
Hunk #1 succeeded at 529 (offset 135 lines).
Hunk #2 succeeded at 537 (offset 135 lines).
Hunk #3 succeeded at 848 (offset 135 lines).
patching file drivers/infiniband/hw/ipath/ipath_sysfs.c
patching file drivers/infiniband/hw/ipath/ipath_user_pages.c
Patch ipath_0110_2.6.9.patch does not apply (enforce with -f)

Failed executing /usr/bin/quilt
----------------------------------------------------------------------------------


From rdreier at cisco.com  Sat Nov  1 11:15:10 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 01 Nov 2008 11:15:10 -0700
Subject: [ofa-general] [PATCH] RDMA/cxgb3: Fix too-big reserved field zeroing
	in iwch_post_zb_read()
Message-ID: <ada1vxvxqhd.fsf@cisco.com>

The array wqe->read.reserved has only two entries, but
iwch_post_zb_read() sets [0], [1], and [2], which is one too many.
This is harmless since it runs into the next field, rem_stag, which is
initialized correctly immediately after, but we might as well get
things right, especially since it makes the code smaller.

This was spotted by the Coverity checker (CID 2475).

Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
I'll queue this up unless someone tells me I'm misreading things and
goofed up here...

 drivers/infiniband/hw/cxgb3/iwch_qp.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
index 3e4585c..19661b2 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -745,7 +745,6 @@ int iwch_post_zb_read(struct iwch_qp *qhp)
 	wqe->read.rdmaop = T3_READ_REQ;
 	wqe->read.reserved[0] = 0;
 	wqe->read.reserved[1] = 0;
-	wqe->read.reserved[2] = 0;
 	wqe->read.rem_stag = cpu_to_be32(1);
 	wqe->read.rem_to = cpu_to_be64(1);
 	wqe->read.local_stag = cpu_to_be32(1);
-- 
1.6.0.2
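
A minimal sketch of why the extra store was harmless but wrong, assuming a
simplified stand-in struct (demo_read_wr and the demo_* helpers are
hypothetical names, not the real cxgb3 definitions): the first byte past a
two-entry reserved[] array is where the next field begins, and clearing the
array with sizeof() avoids hard-coded indices altogether.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified layout; NOT the real cxgb3 work request. */
struct demo_read_wr {
	uint8_t  rdmaop;
	uint8_t  local_inv;
	uint8_t  reserved[2];	/* valid indices are 0 and 1 only */
	uint32_t rem_stag;	/* starts right after reserved[] here */
};

/* Zero the whole reserved array without naming any index. */
static inline void demo_clear_reserved(struct demo_read_wr *wr)
{
	memset(wr->reserved, 0, sizeof(wr->reserved));
}

/* Byte offset one past the end of reserved[], i.e. where index 2 would land. */
static inline size_t demo_reserved_end(void)
{
	return offsetof(struct demo_read_wr, reserved) +
	       sizeof(((struct demo_read_wr *)0)->reserved);
}
```

In this stand-in layout, demo_reserved_end() equals the offset of rem_stag,
which matches the observation above that the stray store ran into the next
field.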


From rdreier at cisco.com  Sat Nov  1 11:44:13 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Sat, 01 Nov 2008 11:44:13 -0700
Subject: [ofa-general] Suspicious code in schedule_nes_timer()
Message-ID: <adawsfnwaki.fsf@cisco.com>

schedule_nes_timer() starts as follows.  Observe a couple of things:

	int schedule_nes_timer(struct nes_cm_node *cm_node, struct sk_buff *skb,
			enum nes_timer_type type, int send_retrans,
			int close_when_complete)
	{
		unsigned long  flags;
		struct nes_cm_core *cm_core = cm_node->cm_core;

>>> cm_node is directly dereferenced here...

		struct nes_timer_entry *new_send;
		int ret = 0;
		u32 was_timer_set;
	
		if (!cm_node)
			return -EINVAL;

>>> and then later tested for NULL...

so if cm_node is NULL, then the code will oops before it hits the return
-EINVAL.  It seems that callers must guarantee that cm_node isn't NULL,
so it would make sense to delete the "if (!cm_node)" test, right?

 - R.
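
For illustration only, a sketch of the other consistent ordering, which would
apply if callers could legitimately pass NULL: test the pointer first and
dereference it afterwards. The demo_* types and function are hypothetical
stand-ins, not the nes driver code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the connection-manager structures. */
struct demo_core {
	int timers_scheduled;
};

struct demo_node {
	struct demo_core *core;
};

static inline int demo_schedule_timer(struct demo_node *node)
{
	struct demo_core *core;

	/* The NULL test has to come before the first dereference. */
	if (!node)
		return -EINVAL;

	core = node->core;	/* safe: node is known to be non-NULL here */
	core->timers_scheduled++;
	return 0;
}
```

Either ordering is consistent: keep the test ahead of the dereference as
sketched, or drop the test if callers already guarantee a non-NULL node.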


From swise at opengridcomputing.com  Sat Nov  1 12:47:11 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 01 Nov 2008 14:47:11 -0500
Subject: [ofa-general] Re: [PATCH] RDMA/cxgb3: Fix too-big reserved field
	zeroing in iwch_post_zb_read()
In-Reply-To: <ada1vxvxqhd.fsf@cisco.com>
References: <ada1vxvxqhd.fsf@cisco.com>
Message-ID: <490CB23F.2070709@opengridcomputing.com>

Acked-by: Steve Wise <swise at opengridcomputing.com>


From sashak at voltaire.com  Sat Nov  1 12:58:07 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 1 Nov 2008 21:58:07 +0200
Subject: [ofa-general] [PATCH] Encode agent id in request transaction id.
Message-ID: <20081101195807.GA12081@sashak.voltaire.com>


For requests, the agent id is now encoded into bits 32-47 of the MAD
transaction id (ibsim now uses the higher bits, 48-63, for client id
encoding). A response's agent id is therefore decoded from the MAD rather
than resolved from the management class value. This simulates the behavior
of the kernel's user_mad layer.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 ibsim/ibsim.c       |    2 +-
 ibsim/sim_mad.c     |    2 +-
 umad2sim/umad2sim.c |   13 ++++++++++++-
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/ibsim/ibsim.c b/ibsim/ibsim.c
index c050be1..149b6b9 100644
--- a/ibsim/ibsim.c
+++ b/ibsim/ibsim.c
@@ -668,7 +668,7 @@ int disconnect_client(int id)
 
 static Client *client_by_trid(Port *port, uint64_t trid)
 {
-	unsigned i = (unsigned)(trid >> 32);
+	unsigned i = (unsigned)(trid >> 48);
 	if (i < IBSIM_MAX_CLIENTS && clients[i].pid &&
 	    clients[i].port->portguid == port->portguid)
 		return &clients[i];
diff --git a/ibsim/sim_mad.c b/ibsim/sim_mad.c
index fbe81aa..c49f4cc 100644
--- a/ibsim/sim_mad.c
+++ b/ibsim/sim_mad.c
@@ -108,7 +108,7 @@ static uint64_t update_trid(uint8_t *mad, unsigned response, Client *cl)
 {
 	uint64_t trid = mad_get_field64(mad, 0, IB_MAD_TRID_F);
 	if (!response) {
-		trid = (trid&0xffffffffULL)|(((uint64_t)cl->id)<<32);
+		trid = (trid&0xffffffffffffULL)|(((uint64_t)cl->id)<<48);
 		mad_set_field64(mad, 0, IB_MAD_TRID_F, trid);
 	}
 	return trid;
diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c
index 2b37a8d..f896540 100644
--- a/umad2sim/umad2sim.c
+++ b/umad2sim/umad2sim.c
@@ -406,7 +406,12 @@ static ssize_t umad2sim_read(struct umad2sim_dev *dev, void *buf, size_t count)
 		mgmt_class = 0;
 	}
 
-	umad->agent_id = dev->agent_idx[mgmt_class];
+	if (mad_get_field(req.mad, 0, IB_MAD_RESPONSE_F)) {
+		uint64_t trid = mad_get_field64(req.mad, 0, IB_MAD_TRID_F);
+		umad->agent_id = (trid >> 32) & 0xffff;
+	} else
+		umad->agent_id = dev->agent_idx[mgmt_class];
+
 	umad->status = ntohl(req.status);
 	umad->timeout_ms = 0;
 	umad->retries = 0;
@@ -476,6 +481,12 @@ static ssize_t umad2sim_write(struct umad2sim_dev *dev,
 
 	req.length = htonll(cnt);
 
+	if (!mad_get_field(req.mad, 0, IB_MAD_RESPONSE_F)) {
+		uint64_t trid = mad_get_field64(req.mad, 0, IB_MAD_TRID_F);
+		trid = (trid&0xffff0000ffffffffULL)|(((uint64_t)umad->agent_id)<<32);
+		mad_set_field64(req.mad, 0, IB_MAD_TRID_F, trid);
+	}
+
 	cnt = write(dev->sim_client.fd_pktout, (void *)&req, sizeof(req));
 	if (cnt < 0) {
 		ERROR("umad2sim_write: cannot write\n");
-- 
1.6.0.3.517.g759a
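
The bit layout described in this patch can be sketched as standalone helpers.
The masks and shifts are taken from the diff above; the demo_* function names
are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Store a 16-bit agent id in bits 32-47, leaving all other bits untouched. */
static inline uint64_t demo_set_agent(uint64_t trid, uint16_t agent_id)
{
	return (trid & 0xffff0000ffffffffULL) | ((uint64_t)agent_id << 32);
}

/* Store a 16-bit client id in bits 48-63, keeping the low 48 bits. */
static inline uint64_t demo_set_client(uint64_t trid, uint16_t client_id)
{
	return (trid & 0xffffffffffffULL) | ((uint64_t)client_id << 48);
}

/* Recover the agent id from bits 32-47. */
static inline unsigned demo_get_agent(uint64_t trid)
{
	return (unsigned)((trid >> 32) & 0xffff);
}

/* Recover the client id from bits 48-63. */
static inline unsigned demo_get_client(uint64_t trid)
{
	return (unsigned)(trid >> 48);
}
```

With this layout the low 32 bits of the transaction id stay available to the
sender, and the two ids can be set in either order without clobbering each
other.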


From vlad at lists.openfabrics.org  Sun Nov  2 03:15:39 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun,  2 Nov 2008 03:15:39 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081102-0200 daily build status
Message-ID: <20081102111539.796CCE60E18@openfabrics.org>

This email was generated automatically; please do not reply.


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:
Build failed on x86_64 with linux-2.6.26
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c: In function 'ioremap_wc':
/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: error: implicit declaration of function '__ioremap'
/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: warning: return makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.26_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.26_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.26'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.25
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c: In function 'ioremap_wc':
/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: error: implicit declaration of function '__ioremap'
/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: warning: return makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.25_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.25_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.25'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.24
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:218: error: 'cpu_data' undeclared (first use in this function)
/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:218: error: (Each undeclared identifier is reported only once
/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:218: error: for each function it appears in.)
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.24_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.24_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.24'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-55.ELsmp
Log:
patching file drivers/infiniband/hw/ipath/ipath_init_chip.c
Hunk #1 succeeded at 529 (offset 135 lines).
Hunk #2 succeeded at 537 (offset 135 lines).
Hunk #3 succeeded at 848 (offset 135 lines).
patching file drivers/infiniband/hw/ipath/ipath_sysfs.c
patching file drivers/infiniband/hw/ipath/ipath_user_pages.c
Patch ipath_0110_2.6.9.patch does not apply (enforce with -f)

Failed executing /usr/bin/quilt
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-42.ELsmp
Log:
patching file drivers/infiniband/hw/ipath/ipath_init_chip.c
Hunk #1 succeeded at 529 (offset 135 lines).
Hunk #2 succeeded at 537 (offset 135 lines).
Hunk #3 succeeded at 848 (offset 135 lines).
patching file drivers/infiniband/hw/ipath/ipath_sysfs.c
patching file drivers/infiniband/hw/ipath/ipath_user_pages.c
Patch ipath_0110_2.6.9.patch does not apply (enforce with -f)

Failed executing /usr/bin/quilt
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.27
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c: In function 'ioremap_wc':
/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: error: implicit declaration of function '__ioremap'
/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: warning: return makes pointer from integer without a cast
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.27_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081102-0200_linux-2.6.27_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.27'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-78.ELsmp
Log:
patching file drivers/infiniband/hw/ipath/ipath_init_chip.c
Hunk #1 succeeded at 529 (offset 135 lines).
Hunk #2 succeeded at 537 (offset 135 lines).
Hunk #3 succeeded at 848 (offset 135 lines).
patching file drivers/infiniband/hw/ipath/ipath_sysfs.c
patching file drivers/infiniband/hw/ipath/ipath_user_pages.c
Patch ipath_0110_2.6.9.patch does not apply (enforce with -f)

Failed executing /usr/bin/quilt
----------------------------------------------------------------------------------
Build failed on x86_64 with linux-2.6.9-67.ELsmp
Log:
patching file drivers/infiniband/hw/ipath/ipath_init_chip.c
Hunk #1 succeeded at 529 (offset 135 lines).
Hunk #2 succeeded at 537 (offset 135 lines).
Hunk #3 succeeded at 848 (offset 135 lines).
patching file drivers/infiniband/hw/ipath/ipath_sysfs.c
patching file drivers/infiniband/hw/ipath/ipath_user_pages.c
Patch ipath_0110_2.6.9.patch does not apply (enforce with -f)

Failed executing /usr/bin/quilt
----------------------------------------------------------------------------------


From jgarzik at pobox.com  Sun Nov  2 05:20:26 2008
From: jgarzik at pobox.com (Jeff Garzik)
Date: Sun, 02 Nov 2008 08:20:26 -0500
Subject: [ofa-general][PATCH 1/3]mlx4: Multiple completion vectors support
In-Reply-To: <aday7047jos.fsf@cisco.com>
References: <4907348E.7060508@mellanox.co.il> <490A8FA9.7080802@pobox.com>
	<aday7047jos.fsf@cisco.com>
Message-ID: <490DA91A.1030703@pobox.com>

Roland Dreier wrote:
>  > Roland, OK for me to merge this via net-next (the standard avenue
>  > for drivers/net patches during -rc)?
> 
> Actually please let me review this and merge it through my tree, since
> it has a bigger impact on the IB side of mlx4.

It seems most appropriate to get an Acked-by from you, and merge through
my tree, IMO.  While it clearly has IB impact, most of the changes are
in the lower-level mlx4_en driver.

But if you feel strongly...

	Jeff


From ogerlitz at voltaire.com  Sun Nov  2 06:55:58 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 02 Nov 2008 16:55:58 +0200
Subject: [ofa-general][PATCH 1/3]mlx4: Multiple completion vectors support
In-Reply-To: <490DA91A.1030703@pobox.com>
References: <4907348E.7060508@mellanox.co.il>
	<490A8FA9.7080802@pobox.com>	<aday7047jos.fsf@cisco.com>
	<490DA91A.1030703@pobox.com>
Message-ID: <490DBF7E.70506@voltaire.com>

Jeff Garzik wrote:
> Roland Dreier wrote:
>> Actually please let me review this and merge it through my tree, 
>> since  it has a bigger impact on the IB side of mlx4.
> It seems most appropriate to get an Acked-by from you, and merge
> through my tree, IMO.  While it clearly has IB impact, most of the
> changes are in the lower-level mlx4_en driver.


Hi Jeff,

Given the importance and influence of this patch set on the IB stack, I
believe the correct way to go would be to let Roland manage the review
and integration, as he has all the relevant views in mind (net, rdma, and
in the future the storage stack would also use this driver...).

Or.


From monis at Voltaire.COM  Sun Nov  2 07:44:37 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Sun, 02 Nov 2008 17:44:37 +0200
Subject: [ofa-general] [PATCH] ipoib: Fix loss of connectivity after
	bonding failover on both sides
In-Reply-To: <490B448C.5080306@Voltaire.COM>
References: <490B448C.5080306@Voltaire.COM>
Message-ID: <490DCAE5.6010608@Voltaire.COM>

Yossi Etigin wrote:
> Fix bonding failover in the case both peers have failover and gratuitous
> arp is lost.


The patch was tested and seems to fix the problem.

To reproduce with a simulation of a lost gratuitous ARP:
Host A constantly pings Host B. Both hosts have a bonding interface
(with ib0 and ib1 as slaves).

Host B: ifconfig ib0 down
Host B: ifconfig ib1 down
Host A: ifconfig ib0 down
Host A: ifconfig ib1 down
Host B: ifconfig ib0 up
Host B: ifconfig ib1 up
Host A: ifconfig ib0 up
Host A: ifconfig ib1 up

Now, even though all interfaces are up and functioning, the ping gets no replies.


From kliteyn at dev.mellanox.co.il  Sun Nov  2 07:54:47 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 02 Nov 2008 17:54:47 +0200
Subject: [ofa-general] [PATCH] opensm/osm_ucast_cache.c: fixing wrong memset
	size
Message-ID: <490DCD47.3000303@dev.mellanox.co.il>

Fixing wrong memset size in osm_ucast_cache.c

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_ucast_cache.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c
index cfbc49a..9db8d59 100644
--- a/opensm/opensm/osm_ucast_cache.c
+++ b/opensm/opensm/osm_ucast_cache.c
@@ -118,7 +118,8 @@ static cache_switch_t *__cache_sw_new(uint16_t lid_ho)
 		return NULL;
 	}

-	memset(p_cache_sw->ports, 0, sizeof(*p_cache_sw->ports));
+	memset(p_cache_sw->ports, 0,
+	       sizeof(cache_port_t) * (CACHE_SW_PORTS + 1));
 	p_cache_sw->num_ports = CACHE_SW_PORTS + 1;

 	/* port[0] fields represent this switch details - lid and type */
-- 
1.5.1.4
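
A minimal sketch of the pitfall this patch addresses, using hypothetical
stand-in names (demo_cache_port, demo_clear_ports): for a pointer to an
array, sizeof(*ptr) is the size of one element, so a memset() sized that way
clears only the first entry. Multiplying by the element count covers the
whole allocation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the cached port entry. */
struct demo_cache_port {
	unsigned short remote_lid_ho;
	unsigned char  remote_port_num;
};

/* Size of a single entry, which is all that sizeof(*ports) ever yields. */
static inline size_t demo_one_entry_size(const struct demo_cache_port *ports)
{
	return sizeof(*ports);
}

/* Clear all n entries: element size multiplied by the element count. */
static inline void demo_clear_ports(struct demo_cache_port *ports, size_t n)
{
	memset(ports, 0, n * sizeof(*ports));
}
```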


From rdreier at cisco.com  Sun Nov  2 08:03:46 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 02 Nov 2008 08:03:46 -0800
Subject: [ofa-general][PATCH 1/3]mlx4: Multiple completion vectors support
In-Reply-To: <490DA91A.1030703@pobox.com> (Jeff Garzik's message of "Sun, 02
	Nov 2008 08:20:26 -0500")
References: <4907348E.7060508@mellanox.co.il> <490A8FA9.7080802@pobox.com>
	<aday7047jos.fsf@cisco.com> <490DA91A.1030703@pobox.com>
Message-ID: <adaprlew1wd.fsf@cisco.com>

 > It seems most appropriate to get an Acked-by from you, and merge
 > through my tree, IMO.  While it clearly has IB impact, most of the
 > changes are in the lower-level mlx4_en driver.

Actually most of the changes are in mlx4_core, which is the common HW
driver that both mlx4_en and mlx4_ib use, and which I've been
maintaining up till now:

 >  drivers/infiniband/hw/mlx4/cq.c   |    2 +-
 >  drivers/infiniband/hw/mlx4/main.c |    2 +-
 >  drivers/net/mlx4/cq.c             |   14 ++++++++--
 >  drivers/net/mlx4/en_cq.c          |    9 ++++--
 >  drivers/net/mlx4/en_main.c        |    4 +-
 >  drivers/net/mlx4/eq.c             |   47 ++++++++++++++++++++++++------------
 >  drivers/net/mlx4/main.c           |   14 ++++++----
 >  drivers/net/mlx4/mlx4.h           |    4 +-
 >  include/linux/mlx4/device.h       |    4 ++-

Not that it's a huge change anywhere, but the only mlx4_en changes are
in en_cq.c and en_main.c, i.e. 13 out of 100 changed lines.

In general I think I have a bigger chance of merging more mlx4_core
stuff through my tree, so it will probably be smoother in terms of
conflicts etc. if I carry this patch.

 - R.


From jgarzik at pobox.com  Sun Nov  2 08:17:00 2008
From: jgarzik at pobox.com (Jeff Garzik)
Date: Sun, 02 Nov 2008 11:17:00 -0500
Subject: [ofa-general][PATCH 1/3]mlx4: Multiple completion vectors support
In-Reply-To: <adaprlew1wd.fsf@cisco.com>
References: <4907348E.7060508@mellanox.co.il> <490A8FA9.7080802@pobox.com>
	<aday7047jos.fsf@cisco.com> <490DA91A.1030703@pobox.com>
	<adaprlew1wd.fsf@cisco.com>
Message-ID: <490DD27C.4070109@pobox.com>

Roland Dreier wrote:
> In general I think I have a bigger chance of merging more mlx4_core
> stuff through my tree, so it will probably be smoother in terms of
> conflicts etc. if I carry this patch.


Fine by me...

	Jeff


From sashak at voltaire.com  Sun Nov  2 10:16:51 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 2 Nov 2008 20:16:51 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_cache.c: fixing wrong
	memset size
In-Reply-To: <490DCD47.3000303@dev.mellanox.co.il>
References: <490DCD47.3000303@dev.mellanox.co.il>
Message-ID: <20081102181651.GP7502@sashak.voltaire.com>

Hi Yevgeny,

On 17:54 Sun 02 Nov     , Yevgeny Kliteynik wrote:
> Fixing wrong memset size in osm_ucast_cache.c
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  opensm/opensm/osm_ucast_cache.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c
> index cfbc49a..9db8d59 100644
> --- a/opensm/opensm/osm_ucast_cache.c
> +++ b/opensm/opensm/osm_ucast_cache.c
> @@ -118,7 +118,8 @@ static cache_switch_t *__cache_sw_new(uint16_t lid_ho)
>  		return NULL;
>  	}
> 
> -	memset(p_cache_sw->ports, 0, sizeof(*p_cache_sw->ports));
> +	memset(p_cache_sw->ports, 0,
> +	       sizeof(cache_port_t) * (CACHE_SW_PORTS + 1));
>  	p_cache_sw->num_ports = CACHE_SW_PORTS + 1;
> 
>  	/* port[0] fields represent this switch details - lid and type */

Then you will obviously also need to fix similar things (memset() and
memcpy() sizes) in the __cache_add_port() function, where the ports array
is reallocated.

So why not make it simpler and use just a single allocation, based on the
switch's *known* number of ports? Like below.

If that is fine with you I will push it out.

Sasha
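
The single-allocation idea can be sketched in isolation as follows. This is
only an illustration under assumptions: the demo_* names are hypothetical,
the sketch uses the C99 flexible array member spelling (ports[]) and
calloc() rather than the exact code in the patch, and the bounds check on
num_ports is an addition of the sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the cached port entry. */
struct demo_fam_port {
	uint16_t remote_lid_ho;
	uint8_t  remote_port_num;
};

struct demo_fam_switch {
	uint8_t num_ports;
	struct demo_fam_port ports[];	/* flexible array member, must be last */
};

/* One allocation holds the header and all port entries, already zeroed. */
static inline struct demo_fam_switch *demo_fam_switch_new(uint16_t lid_ho,
							  unsigned num_ports)
{
	struct demo_fam_switch *sw;

	/* num_ports is stored in a uint8_t, so reject values that do not fit */
	if (num_ports == 0 || num_ports > UINT8_MAX)
		return NULL;

	sw = calloc(1, sizeof(*sw) + num_ports * sizeof(sw->ports[0]));
	if (!sw)
		return NULL;

	sw->num_ports = (uint8_t)num_ports;
	sw->ports[0].remote_lid_ho = lid_ho;	/* entry 0 describes the switch */
	return sw;
}
```

A single free() on the returned pointer releases everything, so there is no
second buffer whose size the memset() and memcpy() calls could get wrong.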


>From c7e9e41cdea3164a07f9cbf47f68a8836f096524 Mon Sep 17 00:00:00 2001
From: Sasha Khapyorsky <sashak at voltaire.com>
Date: Sun, 2 Nov 2008 20:02:37 +0200
Subject: [PATCH] opensm/osm_ucast_cache: simplify cached links allocation code

Simplify cached links allocation code, fix related memset(), memcpy()
bugs.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_ucast_cache.c |  101 ++++++++++++---------------------------
 1 files changed, 31 insertions(+), 70 deletions(-)

diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c
index cfbc49a..b142a14 100644
--- a/opensm/opensm/osm_ucast_cache.c
+++ b/opensm/opensm/osm_ucast_cache.c
@@ -70,11 +70,11 @@ typedef struct cache_switch {
 	cl_map_item_t map_item;
 	boolean_t dropped;
 	uint16_t max_lid_ho;
-	uint8_t num_ports;
-	cache_port_t *ports;
 	uint16_t num_hops;
 	uint8_t **hops;
 	uint8_t *lft;
+	uint8_t num_ports;
+	cache_port_t ports[0];
 } cache_switch_t;
 
 /**********************************************************************
@@ -104,22 +104,17 @@ static void __cache_sw_set_leaf(cache_switch_t * p_sw)
 /**********************************************************************
  **********************************************************************/
 
-static cache_switch_t *__cache_sw_new(uint16_t lid_ho)
+static cache_switch_t *__cache_sw_new(uint16_t lid_ho, unsigned num_ports)
 {
-	cache_switch_t *p_cache_sw = malloc(sizeof(cache_switch_t));
+	cache_switch_t *p_cache_sw = malloc(sizeof(cache_switch_t) +
+					    num_ports * sizeof(cache_port_t));
 	if (!p_cache_sw)
 		return NULL;
 
-	memset(p_cache_sw, 0, sizeof(*p_cache_sw));
+	memset(p_cache_sw, 0,
+	       sizeof(*p_cache_sw) + num_ports * sizeof(cache_port_t));
 
-	p_cache_sw->ports = malloc(sizeof(cache_port_t) * (CACHE_SW_PORTS + 1));
-	if (!p_cache_sw->ports) {
-		free(p_cache_sw);
-		return NULL;
-	}
-
-	memset(p_cache_sw->ports, 0, sizeof(*p_cache_sw->ports));
-	p_cache_sw->num_ports = CACHE_SW_PORTS + 1;
+	p_cache_sw->num_ports = num_ports;
 
 	/* port[0] fields represent this switch details - lid and type */
 	p_cache_sw->ports[0].remote_lid_ho = lid_ho;
@@ -161,79 +156,48 @@ static cache_switch_t *__cache_get_sw(osm_ucast_mgr_t * p_mgr, uint16_t lid_ho)
 
 /**********************************************************************
  **********************************************************************/
-
-static cache_switch_t *__cache_get_or_add_sw(osm_ucast_mgr_t * p_mgr,
-					     uint16_t lid_ho)
-{
-	cache_switch_t *p_cache_sw = __cache_get_sw(p_mgr, lid_ho);
-	if (!p_cache_sw) {
-		p_cache_sw = __cache_sw_new(lid_ho);
-		if (p_cache_sw)
-			cl_qmap_insert(&p_mgr->cache_sw_tbl, lid_ho,
-				       &p_cache_sw->map_item);
-	}
-	return p_cache_sw;
-}
-
-/**********************************************************************
- **********************************************************************/
-
-static void __cache_add_port(osm_ucast_mgr_t * p_mgr, uint16_t lid_ho,
-			     uint8_t port_num, uint16_t remote_lid_ho,
-			     boolean_t is_ca)
+static void __cache_add_sw_link(osm_ucast_mgr_t * p_mgr, osm_physp_t *p,
+				uint16_t remote_lid_ho, boolean_t is_ca)
 {
 	cache_switch_t *p_cache_sw;
+	uint16_t lid_ho = cl_ntoh16(osm_node_get_base_lid(p->p_node, 0));
 
 	OSM_LOG_ENTER(p_mgr->p_log);
 
-	if (!lid_ho || !remote_lid_ho || !port_num)
+	if (!lid_ho || !remote_lid_ho || !p->port_num)
 		goto Exit;
 
 	OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
 		"Caching switch port: lid %u [port %u] -> lid %u (%s)\n",
-		lid_ho, port_num, remote_lid_ho, (is_ca) ? "CA/RTR" : "SW");
+		lid_ho, p->port_num, remote_lid_ho, (is_ca) ? "CA/RTR" : "SW");
 
-	p_cache_sw = __cache_get_or_add_sw(p_mgr, lid_ho);
+	p_cache_sw = __cache_get_sw(p_mgr, lid_ho);
 	if (!p_cache_sw) {
-		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
-			"ERR AD01: Out of memory - cache is invalid\n");
-		osm_ucast_cache_invalidate(p_mgr);
-		goto Exit;
-	}
-
-	if (port_num >= p_cache_sw->num_ports) {
-		/* calculate new size of ports array, rounded
-		   up to a multiple of CACHE_SW_PORTS */
-		uint8_t new_size = CACHE_SW_PORTS *
-		    ((port_num + CACHE_SW_PORTS) / CACHE_SW_PORTS);
-		cache_port_t *ports =
-		    malloc(sizeof(cache_port_t) * (new_size + 1));
-		if (!ports) {
+		p_cache_sw = __cache_sw_new(lid_ho, p->p_node->sw->num_ports);
+		if (!p_cache_sw) {
 			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
-				"ERR AD02: Out of memory - cache is invalid\n");
+				"ERR AD01: Out of memory - cache is invalid\n");
 			osm_ucast_cache_invalidate(p_mgr);
 			goto Exit;
 		}
+		cl_qmap_insert(&p_mgr->cache_sw_tbl, lid_ho,
+			       &p_cache_sw->map_item);
+	}
 
-		memset(ports, 0, sizeof(*ports));
-
-		if (p_cache_sw->ports) {
-			memcpy(ports, p_cache_sw->ports,
-			       sizeof(*p_cache_sw->ports));
-			free(p_cache_sw->ports);
-		}
-
-		p_cache_sw->ports = ports;
-		p_cache_sw->num_ports = new_size + 1;
+	if (p->port_num >= p_cache_sw->num_ports) {
+		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
+			"ERR AD02: Wrong switch? - cache is invalid\n");
+		osm_ucast_cache_invalidate(p_mgr);
+		goto Exit;
 	}
 
 	if (is_ca)
 		__cache_sw_set_leaf(p_cache_sw);
 
-	if (p_cache_sw->ports[port_num].remote_lid_ho == 0) {
+	if (p_cache_sw->ports[p->port_num].remote_lid_ho == 0) {
 		/* cache this link only if it hasn't been already cached */
-		p_cache_sw->ports[port_num].remote_lid_ho = remote_lid_ho;
-		p_cache_sw->ports[port_num].is_leaf = is_ca;
+		p_cache_sw->ports[p->port_num].remote_lid_ho = remote_lid_ho;
+		p_cache_sw->ports[p->port_num].is_leaf = is_ca;
 	}
 Exit:
 	OSM_LOG_EXIT(p_mgr->p_log);
@@ -962,16 +926,13 @@ void osm_ucast_cache_add_link(osm_ucast_mgr_t * p_mgr,
 		lid_ho_2 = cl_ntoh16(osm_node_get_base_lid(p_node_2, 0));
 
 		/* lost switch-2-switch link - cache both sides */
-		__cache_add_port(p_mgr, lid_ho_1, p_physp1->port_num,
-				 lid_ho_2, FALSE);
-		__cache_add_port(p_mgr, lid_ho_2, p_physp2->port_num,
-				 lid_ho_1, FALSE);
+		__cache_add_sw_link(p_mgr, p_physp1, lid_ho_2, FALSE);
+		__cache_add_sw_link(p_mgr, p_physp2, lid_ho_1, FALSE);
 	} else {
 		lid_ho_2 = cl_ntoh16(osm_physp_get_base_lid(p_physp2));
 
 		/* lost link to CA/RTR - cache only switch side */
-		__cache_add_port(p_mgr, lid_ho_1, p_physp1->port_num,
-				 lid_ho_2, TRUE);
+		__cache_add_sw_link(p_mgr, p_physp1, lid_ho_2, TRUE);
 	}
 
 Exit:
-- 
1.6.0.3.517.g759a


From kliteyn at dev.mellanox.co.il  Sun Nov  2 12:59:12 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 02 Nov 2008 22:59:12 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_cache.c: fixing wrong
	memset size
In-Reply-To: <20081102181651.GP7502@sashak.voltaire.com>
References: <490DCD47.3000303@dev.mellanox.co.il>
	<20081102181651.GP7502@sashak.voltaire.com>
Message-ID: <490E14A0.8080105@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 17:54 Sun 02 Nov     , Yevgeny Kliteynik wrote:
>> Fixing wrong memset size in osm_ucast_cache.c
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
>>  opensm/opensm/osm_ucast_cache.c |    3 ++-
>>  1 files changed, 2 insertions(+), 1 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c
>> index cfbc49a..9db8d59 100644
>> --- a/opensm/opensm/osm_ucast_cache.c
>> +++ b/opensm/opensm/osm_ucast_cache.c
>> @@ -118,7 +118,8 @@ static cache_switch_t *__cache_sw_new(uint16_t lid_ho)
>>  		return NULL;
>>  	}
>>
>> -	memset(p_cache_sw->ports, 0, sizeof(*p_cache_sw->ports));
>> +	memset(p_cache_sw->ports, 0,
>> +	       sizeof(cache_port_t) * (CACHE_SW_PORTS + 1));
>>  	p_cache_sw->num_ports = CACHE_SW_PORTS + 1;
>>
>>  	/* port[0] fields represent this switch details - lid and type */
> 
> Then you will obviously also need to fix the similar problems (memset() and
> memcpy() sizes) in the __cache_add_port() function, where the ports array is
> reallocated.
> 
> So why not make it simpler: a single allocation sized by the switch's
> *known* number of ports? Like below.
> 
> If it is fine for you I will push it out.

Sure, this one is better.
Please apply.

-- Yevgeny



From rdreier at cisco.com  Sun Nov  2 21:34:24 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 02 Nov 2008 21:34:24 -0800
Subject: [ofa-general] Re: [PATCH 07/10] rdma/nes: reindent mis-indented
	spinlocks
In-Reply-To: <Pine.LNX.4.64.0810301336130.7072@wrl-59.cs.helsinki.fi> ("Ilpo
	=?utf-8?Q?J=C3=A4rvinen=22's?= message of "Thu,
	30 Oct 2008 13:39:43 +0200 (EET)")
References: <Pine.LNX.4.64.0810301307160.7072@wrl-59.cs.helsinki.fi>
	<Pine.LNX.4.64.0810301336130.7072@wrl-59.cs.helsinki.fi>
Message-ID: <adak5blwexr.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Sun Nov  2 21:41:17 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 02 Nov 2008 21:41:17 -0800
Subject: [ofa-general] Re: [PATCH v3] RDMA/nes: Mitigate compatibility issue
	regarding PCI write credits
In-Reply-To: <20081031183943.GA7376@ctung-MOBL> (Chien Tung's message of "Fri, 
	31 Oct 2008 13:39:43 -0500")
References: <20081031183943.GA7376@ctung-MOBL>
Message-ID: <adafxm9wema.fsf@cisco.com>

thanks, applied all three.


From rdreier at cisco.com  Sun Nov  2 21:47:49 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 02 Nov 2008 21:47:49 -0800
Subject: [ofa-general] Re: [PATCH] ipoib: fix hang in ipoib_flush_paths
In-Reply-To: <490B0040.3040802@Voltaire.COM> (Yossi Etigin's message of "Fri, 
	31 Oct 2008 14:55:28 +0200")
References: <490B0040.3040802@Voltaire.COM>
Message-ID: <adabpwxwebe.fsf@cisco.com>

thanks, applied.


From jackm at dev.mellanox.co.il  Mon Nov  3 01:39:27 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 3 Nov 2008 11:39:27 +0200
Subject: [ofa-general] poll CQ failed -2 with connectX
In-Reply-To: <200810271838.48510.ricklist@microway.com>
References: <200810271838.48510.ricklist@microway.com>
Message-ID: <200811031139.28122.jackm@dev.mellanox.co.il>

Rick,

Your problem was that you had a SUSE-packaged OFED driver set
(named ofed-kmp-default) installed on all your
machines (maybe automatically, as part of the OpenSuse install?).

For example, on one of your hosts, I ran
#> rpm -qi ofed-kmp-default
Name        : ofed-kmp-default             Relocations: (not relocatable)
Version     : 1.2.5_2.6.22.18_0.2               Vendor: SUSE LINUX Products GmbH, Nuernberg, Germany
Release     : 18.1                          Build Date: Mon Jun  9 12:42:40 2008
Install Date: Wed Jul 30 18:26:56 2008      Build Host: kalman.suse.de
Group       : System/Base                   Source RPM: ofed-1.2.5-18.1.src.rpm
Size        : 3359904                          License: GPL v2 or later
Signature   : DSA/SHA1, Mon Jun  9 12:47:02 2008, Key ID a84edae89c800aca
Packager    : http://bugs.opensuse.org
URL         : http://www.openfabrics.org
Summary     : Infiniband Kernel Modules

The SUSE-rpm driver set is based on OFED 1.2.5.
This RPM installs the OFED drivers under directory /lib/modules/<kernel version>/updates/drivers.

When you then installed the OFED 1.3.1 and OFED 1.4 drivers, these new drivers were installed under
/lib/modules/<kernel version>/updates/kernel/drivers, but the SUSE drivers were not uninstalled.

Both sets were present on the hosts.

When you started up the infiniband driver (/etc/init.d/openibd start), the older OFED 1.2.5 driver
was loaded into the kernel.

However, the userspace drivers used were indeed from OFED 1.3.1 and/or OFED 1.4, resulting in a mismatch
between kernel-space and userspace.

Specifically, ConnectX cards support XRC (Extended RC) in OFED 1.3.1 and OFED 1.4 (XRC was not present
in OFED 1.2.5).  The 1.3.1 / 1.4 userspace libraries identified some of the QPs created by the OFED 1.2.5
kernel modules as XRC QPs and returned an error as a result (correctly indicating that these "XRC" QPs
did not exist as XRC QPs).

In any event, uninstalling the SUSE RPMs fixed the problem.

Finally, the OFED installation script now checks for the SUSE-packaged drivers as well, so that if they are
present, they will be uninstalled when installing the OFED-packaged drivers. (This fix will be in
OFED 1.4-rc4, to be released this week.)

- Jack

On Tuesday 28 October 2008 00:38, Rick Warner wrote:
> Hi all,
> 
> I am configuring an opteron cluster with connectX Infiniband.  I have a 
> problem that if I run one of the NAS tests, it works the first, and maybe 2nd 
> time, but after that the jobs instantly fail with messages like this-
> 
> [Rank 44][cm.c: line 860]poll CQ failed -2
> [Rank 51][cm.c: line 860]poll CQ failed -2
> [Rank 119][cm.c: line 860]poll CQ failed -2
> [Rank 85][cm.c: line 860]poll CQ failed -2
> [Rank 0][cm.c: line 860]poll CQ failed -2
> [Rank 9][cm.c: line 860]poll CQ failed -2
> [Rank 26][cm.c: line 860]poll CQ failed -2[Rank 43][cm.c: line 860]
> poll CQ failed -2
> [Rank 94][cm.c: line 860]poll CQ failed -2
> [Rank 111][cm.c: line 860]poll CQ failed -2
> 
> I can easily reproduce this with only 2 systems using a 16 process LU job, 
> class B.
> 
> Here are the configs I've tried-
> Suse 11 with distro provided IB driver and libraries,etc, using mvapich as 
> provided by ohio state
> Suse 11 with distro driver, using OFED 1.3.1 libraries and mvapich
> Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3
> 
> They all have the same basic problem.  I think one of them reported "Error 
> polling CQ" instead of "poll CQ failed".
> 
> If I replace the connectX cards with regular DDR cards the problem goes away.
> 
> I'm getting quite stumped at this point and would appreciate any suggestions 
> or patches.
> 
> Thanks,
> Rick


From dimitar.dimitrov at markit.com  Mon Nov  3 02:54:40 2008
From: dimitar.dimitrov at markit.com (Dimitar Dimitrov)
Date: Mon, 03 Nov 2008 11:54:40 +0100
Subject: [ofa-general] Mellanox OFED package vs. RedHat IB packages
Message-ID: <1225709680.15893.9.camel@wks-ubuntu.ops.marketxs.com>

Hello,

As a beginner with InfiniBand technology I would like to ask the
following:

Would there be any advantage in using the Mellanox provided OFED
packages over the already supplied ones from the RedHat repository? We
are using "Mellanox Technologies MT25208 InfiniHost III Ex" adapters. I
noticed a slight difference in the name of some of the applications
provided, but would expect the functionality to be the same. So which
one "is better"?

-- 
Regards,
Dimitar


The content of this e-mail is confidential and may be privileged. It may be read, copied and used only by the intended recipient and may not be disclosed, copied or distributed. If you received this email in error, please contact the sender immediately by return e-mail or by telephoning +44 20 7260 2000, delete it and do not disclose its contents to any person. You should take full responsibility for checking this email for viruses. Markit reserves the right to monitor all e-mail communications through its network.
Markit and its affiliated companies make no warranty as to the accuracy or completeness of any information contained in this message and hereby exclude any liability of any kind for the information contained herein. Any opinions expressed in this message are those of the author and do not necessarily reflect the opinions of Markit.
For full details about Markit, its offerings and legal terms and conditions, please see Markit’s website at http://www.markit.com <http://www.markit.com/> .


From vlad at lists.openfabrics.org  Mon Nov  3 03:22:40 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon,  3 Nov 2008 03:22:40 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081103-0200 daily build status
Message-ID: <20081103112240.D0743E60E77@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From constantine.gavrilov at gmail.com  Mon Nov  3 03:09:17 2008
From: constantine.gavrilov at gmail.com (Constantine Gavrilov)
Date: Mon, 03 Nov 2008 13:09:17 +0200
Subject: [ofa-general] patch: support long (above 14 bytes) HW addresses in
	arp_ioctl
Message-ID: <490EDBDD.1030104@gmail.com>

While working with the OFED InfiniBand stack, which uses 20-byte HW
addresses for IP over IB, I noticed the following arp_ioctl problem.

The ioctl uses a data structure that limits the length of a HW address to 14
bytes. The IP stack and the arp cache code do not have that limitation.
This leads to the following problems:

* arp_ioctl cannot be used to set, get, or delete arp entries for those 
adapters that have HW addresses longer than 14 bytes
* arp_ioctl will corrupt kernel and user memory when this ioctl is
used on adapters that have HW addresses longer than 14 bytes.  This
is because when copying the HW address, the arp_ioctl code copies 
dev->addr_len bytes without checking that addr_len is not above 14 
bytes. This is done both for copy_to_user() and memcpy() calls on kernel 
data structures allocated on stack. The memcpy() call in particular, 
will corrupt kernel stack.
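
The second problem can be illustrated with a small standalone sketch. It is
not the code from the attached patch: struct demo_sockaddr only mirrors the
14-byte sa_data field of struct sockaddr, and demo_copy_hwaddr() is a
hypothetical helper. It shows the kind of length check whose absence lets a
20-byte IPoIB address overrun a 14-byte field.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define DEMO_SA_DATA_LEN 14	/* size of sa_data in struct sockaddr */
#define DEMO_IPOIB_ALEN  20	/* IPoIB HW address length mentioned above */

/* Hypothetical mirror of struct sockaddr, kept local to stay portable. */
struct demo_sockaddr {
	unsigned short sa_family;
	char sa_data[DEMO_SA_DATA_LEN];
};

/* Copy a HW address into the fixed-size field. Oversized addresses are
 * rejected instead of being written past the end of sa_data. */
static int demo_copy_hwaddr(struct demo_sockaddr *dst,
			    const unsigned char *addr, size_t addr_len)
{
	assert(dst != NULL && addr != NULL);
	if (addr_len > sizeof(dst->sa_data))
		return -EINVAL;
	memset(dst->sa_data, 0, sizeof(dst->sa_data));
	memcpy(dst->sa_data, addr, addr_len);
	return 0;
}
```

A 6-byte Ethernet address fits and is copied; a 20-byte address is refused.
Rejecting the request only prevents the corruption. It does not make the
ioctl usable for long addresses, which needs a larger structure.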

Attached please find the patch that fixes both problems. In addition,
the patch changes the maximal number of bytes for HW address that will
be seen in /proc/net/arp from ~10 to ~30. Without the last change,
the output of /proc/net/arp truncates the large MAC entries, which makes
the arp utility useless.

The patch does not change the existing ABI but extends it.  The kernel 
structure used in arp_ioctl calls is changed to support larger 
addresses, while the user-space structure is extended by appending
extra space to the end of the structure if ATF_NEWARPCTL (a new flag)
is set in arp_flags of the existing user-space structure. This allows
avoiding big changes to the existing code while preserving the ABI 
compatibility.
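
As a rough illustration of this kind of flag-gated extension (the attachment
itself is not reproduced here, so everything below is an assumption: the
demo_* structures, the field names, the 32-byte bound and the flag value are
all hypothetical), the old layout stays untouched at the start and the flag
tells the handler whether the appended space is present:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ATF_NEWARPCTL 0x1000	/* hypothetical value, for illustration only */
#define DEMO_OLD_HA_LEN    14	/* limit of the existing structure */
#define DEMO_MAX_HA_LEN    32	/* hypothetical bound for long addresses */

/* Stand-in for the existing user-space request; its layout must not change. */
struct demo_arpreq {
	int arp_flags;
	char arp_ha[DEMO_OLD_HA_LEN];
};

/* Extended form: the old structure first, extra space appended at the end. */
struct demo_arpreq_ext {
	struct demo_arpreq base;
	unsigned char ha_len;
	char long_ha[DEMO_MAX_HA_LEN];
};

/* Number of bytes a handler should read from the caller for this request. */
static size_t demo_request_size(const struct demo_arpreq *req)
{
	assert(req != NULL);
	return (req->arp_flags & DEMO_ATF_NEWARPCTL) ?
	    sizeof(struct demo_arpreq_ext) : sizeof(struct demo_arpreq);
}
```

Old binaries never set the flag, so they keep passing the short structure
and are handled exactly as before.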

-- 
----------------------------------------
Constantine Gavrilov
Kernel Developer
Platform Group
XIV, an IBM global brand 
1 Azrieli Center, Tel-Aviv
Phone: +972-3-6074672
Fax:   +972-3-6959749
----------------------------------------


-------------- next part --------------
A non-text attachment was scrubbed...
Name: arp_ioctl.patch
Type: text/x-patch
Size: 5244 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081103/85c796c9/attachment.bin>

From vlad at dev.mellanox.co.il  Mon Nov  3 06:56:21 2008
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Mon, 03 Nov 2008 16:56:21 +0200
Subject: [ofa-general] [ewg] OFED meeting agenda for today (Nov 3)
Message-ID: <490F1115.9040705@dev.mellanox.co.il>

Agenda for OFED meeting today on OFED 1.4 status:

1. OFED 1.4 status:
	- Updated MPI packages: mvapich-1.1.0-3103.src.rpm, mvapich2-trunk-3103.src.rpm
	- Close RC4 date (originally planned for Nov 4)

2. Bugs review:
Id    Sev       Pri  OS       Assignee                      Status    Summary
1221  major     P2   SLES 10  Jeffrey.C.Becker at nasa.gov  NEW       SLES10 sp2: remote logins via ssh fail due to rpcbind and automounter failures
1298  major     P3   RHEL 5   Jeffrey.C.Becker at nasa.gov  NEW       nfsrdma rh5.1 causes kernel panic
1299  major     P3   RHEL 5   Jeffrey.C.Becker at nasa.gov  NEW       nfs module is missing symbols in rh5.1
1283  blocker   P1   RHEL 5   jeremy.brown at qlogic.com    NEW       Intel MPI fails on Qlogc HCA
1326  blocker   P1   RHEL 4   jeremy.brown at qlogic.com    NEW       ipath driver fails to build on IA64 in the 10/28/08 daily build
1335  major     P3   Other    monis at voltaire.com         NEW       Bonding: packet lost during failover
1301  major     P3   RHEL 4   olgas at voltaire.com         NEW       Can not load rds module on RH4 up7
1323  blocker   P1   All      stefan.roscher at de.ibm.com  REOPENED  IB/ehca: possibillity of kernel panic under certain circumstances
1242  critical  P2   RHEL 4   yannick.cote at qlogic.com    NEW       kernel panic while running mpi2007 against ofed1.4 -- ib_ipath: ipath_sdma_verbs_send
1336  critical  P1   RHEL 5   bugzilla at openib.org        NEW       Can't to unloading the mlx4_ib module on ppc64

Regards,
Vladimir


From kliteyn at dev.mellanox.co.il  Mon Nov  3 07:05:56 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 03 Nov 2008 17:05:56 +0200
Subject: [ofa-general] [PATCH] opensm/osm_ucast_cache: fixing coredump
Message-ID: <490F1354.7060305@dev.mellanox.co.il>

Following the recent changes in ports allocation - fixing core dump.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_ucast_cache.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c
index b142a14..13dee11 100644
--- a/opensm/opensm/osm_ucast_cache.c
+++ b/opensm/opensm/osm_ucast_cache.c
@@ -135,8 +135,6 @@ static void __cache_sw_destroy(cache_switch_t * p_sw)
 		free(p_sw->lft);
 	if (p_sw->hops)
 		free(p_sw->hops);
-	if (p_sw->ports)
-		free(p_sw->ports);
 	free(p_sw);
 }

-- 
1.5.1.4
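
For readers following the thread, a minimal standalone sketch of why the
removed free() was wrong once ports became a trailing array. The sketch_*
names are hypothetical and this is not code from opensm. The array is part
of the switch allocation, so its address was never returned by malloc(),
and the old "if (p_sw->ports)" test is always true for an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct sketch_port {
	uint16_t remote_lid_ho;
} sketch_port_t;

typedef struct sketch_switch {
	uint8_t *lft;		/* separate allocation, needs its own free() */
	uint8_t num_ports;
	sketch_port_t ports[];	/* part of the switch allocation itself */
} sketch_switch_t;

static sketch_switch_t *sketch_sw_new(unsigned num_ports)
{
	sketch_switch_t *sw =
	    calloc(1, sizeof(*sw) + num_ports * sizeof(sw->ports[0]));

	if (!sw)
		return NULL;
	sw->num_ports = (uint8_t)num_ports;
	sw->lft = calloc(16, 1);	/* arbitrary size; may be NULL on failure */
	return sw;
}

/* Returns 1 if ports lies inside the block that holds sw itself. */
static int sketch_ports_embedded(const sketch_switch_t *sw)
{
	return (const char *)sw->ports ==
	    (const char *)sw + offsetof(sketch_switch_t, ports);
}

static void sketch_sw_destroy(sketch_switch_t *sw)
{
	if (!sw)
		return;
	assert(sketch_ports_embedded(sw));
	free(sw->lft);	/* free(NULL) is a no-op */
	/* No free(sw->ports): that address was never returned by malloc(). */
	free(sw);
}
```

Passing an interior pointer such as sw->ports to free() is undefined
behavior, which is consistent with the core dump this patch removes.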


From kliteyn at dev.mellanox.co.il  Mon Nov  3 07:07:27 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 03 Nov 2008 17:07:27 +0200
Subject: [ofa-general] [PATCH] opensm/osm_sa.c: adding missing include
Message-ID: <490F13AF.3040303@dev.mellanox.co.il>

Hi Sasha,

Adding missing include to fix compilation warning.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_sa.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_sa.c b/opensm/opensm/osm_sa.c
index 6c02d5d..185557f 100644
--- a/opensm/opensm/osm_sa.c
+++ b/opensm/opensm/osm_sa.c
@@ -48,6 +48,7 @@
 #include <string.h>
 #include <ctype.h>
 #include <errno.h>
+#include <stdlib.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <complib/cl_qmap.h>
-- 
1.5.1.4


From sashak at voltaire.com  Mon Nov  3 07:24:28 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 3 Nov 2008 17:24:28 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_cache: fixing coredump
In-Reply-To: <490F1354.7060305@dev.mellanox.co.il>
References: <490F1354.7060305@dev.mellanox.co.il>
Message-ID: <20081103152428.GG31856@sashak.voltaire.com>

On 17:05 Mon 03 Nov     , Yevgeny Kliteynik wrote:
> Following the recent changes in ports allocation - fixing core dump.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  opensm/opensm/osm_ucast_cache.c |    2 --
>  1 files changed, 0 insertions(+), 2 deletions(-)
> 
> diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c
> index b142a14..13dee11 100644
> --- a/opensm/opensm/osm_ucast_cache.c
> +++ b/opensm/opensm/osm_ucast_cache.c
> @@ -135,8 +135,6 @@ static void __cache_sw_destroy(cache_switch_t * p_sw)
>  		free(p_sw->lft);
>  	if (p_sw->hops)
>  		free(p_sw->hops);
> -	if (p_sw->ports)
> -		free(p_sw->ports);

Sure. Applied. Thanks.

Sasha


From sashak at voltaire.com  Mon Nov  3 07:24:51 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 3 Nov 2008 17:24:51 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_sa.c: adding missing include
In-Reply-To: <490F13AF.3040303@dev.mellanox.co.il>
References: <490F13AF.3040303@dev.mellanox.co.il>
Message-ID: <20081103152451.GH31856@sashak.voltaire.com>

On 17:07 Mon 03 Nov     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> Adding missing include to fix compilation warning.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From dimitar.dimitrov at markit.com  Mon Nov  3 08:12:54 2008
From: dimitar.dimitrov at markit.com (Dimitar Dimitrov)
Date: Mon, 03 Nov 2008 17:12:54 +0100
Subject: [ofa-general] Error during compile of ofed-1.3.1
Message-ID: <1225728774.15893.17.camel@wks-ubuntu.ops.marketxs.com>

Installing OFED 1.3.1 on RHEL AS 4 (update 7), kernel 2.6.9-78.0.1.ELsmp,
ends with the following error:

make[1]: Entering directory
`/usr/src/kernels/2.6.9-78.0.1.EL-smp-x86_64'
mkdir -p /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/.tmp_versions
make -f scripts/Makefile.build
obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1
make -f scripts/Makefile.build
obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/drivers/infiniband
make -f scripts/Makefile.build
obj=/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/drivers/infiniband/core
  gcc
-Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/drivers/infiniband/core/.addr.o.d -nostdinc -iwithprefix include -D__KERNEL__ -include include/linux/autoconf.h  -include /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/include/linux/autoconf.h     -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/include  -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/drivers/infiniband/debug  -I/usr/local/include/scst  -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/drivers/infiniband/ulp/srpt  -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/drivers/net/cxgb3  -Iinclude     -Wall -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Os -fomit-frame-pointer -g -Wdeclaration-after-statement  -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks	 -Wno-sign-compare    -DMODULE -DKBUILD_BASENAME=addr -DKBUILD_MODNAME=ib_addr -c -o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/drivers/infiniband/core/.tmp_addr.o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/drivers/infiniband/core/addr.c
In file included from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/drivers/infiniband/core/addr.c:32:
include/linux/inetdevice.h:50: field `mr_gq_timer' has incomplete type
include/linux/inetdevice.h:51: field `mr_ifc_timer' has incomplete type
include/linux/inetdevice.h:56: confused by earlier errors, bailing out
make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/drivers/infiniband/core/addr.o] Error 1
make[3]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/drivers/infiniband/core] Error 2
make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1/drivers/infiniband] Error 2
make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3.1] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.9-78.0.1.EL-smp-x86_64'
make: *** [kernel] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.56254 (%build)


RPM build errors:
    user vlad does not exist - using root
    group vlad does not exist - using root
    user vlad does not exist - using root
    group vlad does not exist - using root
    Bad exit status from /var/tmp/rpm-tmp.56254 (%build)

Is there something that can be done about it?

-- 
Regards,
Dimitar


The content of this e-mail is confidential and may be privileged. It may be read, copied and used only by the intended recipient and may not be disclosed, copied or distributed. If you received this email in error, please contact the sender immediately by return e-mail or by telephoning +44 20 7260 2000, delete it and do not disclose its contents to any person. You should take full responsibility for checking this email for viruses. Markit reserves the right to monitor all e-mail communications through its network.
Markit and its affiliated companies make no warranty as to the accuracy or completeness of any information contained in this message and hereby exclude any liability of any kind for the information contained herein. Any opinions expressed in this message are those of the author and do not necessarily reflect the opinions of Markit.
For full details about Markit, its offerings and legal terms and conditions, please see Markit’s website at http://www.markit.com <http://www.markit.com/> .

From constantine.gavrilov at gmail.com  Mon Nov  3 08:34:36 2008
From: constantine.gavrilov at gmail.com (Constantine Gavrilov)
Date: Mon, 03 Nov 2008 18:34:36 +0200
Subject: [ofa-general] Re: patch: support long (above 14 bytes) HW addresses
	in arp_ioctl
In-Reply-To: <490EDBDD.1030104@gmail.com>
References: <490EDBDD.1030104@gmail.com>
Message-ID: <490F281C.60800@gmail.com>

The updated version of the patch uses MAX_ADDR_LEN from netdevice.h as the 
maximum length of a MAC address.

Constantine Gavrilov wrote:
> While working with OFED infiniband stack that uses 20 byte long HW 
> addresses for IP over IB, I have paid attention to the following  
> arp_ioctl problem.
>
> The ioctl uses a data structure that limits a length of HW address to 
> 14 bytes. The IP stack and the arp cache code do not have that 
> limitation. This leads to the following problems:
>
> * arp_ioctl cannot be used to set, get, or delete arp entries for 
> those adapters that have HW addresses longer than 14 bytes
> * arp_ioctl will corrupt the kernel and user memory when this ioctl is 
> used on the adapters that have HW addresses longer than 14 bytes.  
> This is because when copying the HW address, the arp_ioctl code copies 
> dev->addr_len bytes without checking that addr_len is not above 14 
> bytes. This is done both for copy_to_user() and memcpy() calls on 
> kernel data structures allocated on stack. The memcpy() call in 
> particular, will corrupt kernel stack.
>
> Attached please find the patch that fixes both problems. In addition, 
> the patch changes the maximal number of bytes for HW address that will 
> be seen in /proc/net/arp from ~10 to ~30. Without the last change, 
> output of /proc/net/arp truncates the large MAC entries, which 
> makes the arp utility useless.
>
> The patch does not change the existing ABI but extends it.  The kernel 
> structure used in arp_ioctl calls is changed to support larger 
> addresses, while the user-space structure is extended by appending 
> extra-space to the end of the structure if ATF_NEWARPCTL -- a new 
> flag  -- is set in arp_flags of existing user-space structure. This 
> allows avoiding big changes to the existing code while preserving the 
> ABI compatibility.
>

-- 
----------------------------------------
Constantine Gavrilov
Kernel Developer
Platform Group
XIV, an IBM global brand 
1 Azrieli Center, Tel-Aviv
Phone: +972-3-6074672
Fax:   +972-3-6959749
----------------------------------------


-------------- next part --------------
A non-text attachment was scrubbed...
Name: arp_ioctl.patch
Type: text/x-patch
Size: 5246 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081103/da500721/attachment.bin>

From constantine.gavrilov at gmail.com  Mon Nov  3 08:53:06 2008
From: constantine.gavrilov at gmail.com (Constantine Gavrilov)
Date: Mon, 03 Nov 2008 18:53:06 +0200
Subject: [ofa-general] Re: patch: support long (above 14 bytes) HW addresses
	in arp_ioctl
In-Reply-To: <490F281C.60800@gmail.com>
References: <490EDBDD.1030104@gmail.com> <490F281C.60800@gmail.com>
Message-ID: <490F2C72.3000008@gmail.com>


Resending with a "web-friendly" version of the attachment. Hopefully, this 
can be read in the mail archive.

Constantine Gavrilov wrote:
> Updated version of the patch uses MAX_ADDR_LEN from netdevice.h as the 
> maximal length of MAC address.
>
> Constantine Gavrilov wrote:
>> While working with OFED infiniband stack that uses 20 byte long HW 
>> addresses for IP over IB, I have paid attention to the following  
>> arp_ioctl problem.
>>
>> The ioctl uses a data structure that limits a length of HW address to 
>> 14 bytes. The IP stack and the arp cache code do not have that 
>> limitation. This leads to the following problems:
>>
>> * arp_ioctl cannot be used to set, get, or delete arp entries for 
>> those adapters that have HW addresses longer than 14 bytes
>> * arp_ioctl will corrupt the kernel and user memory when this ioctl 
>> is used on the adapters that have HW addresses longer than 14 bytes.  
>> This is because when copying the HW address, the arp_ioctl code 
>> copies dev->addr_len bytes without checking that addr_len is not 
>> above 14 bytes. This is done both for copy_to_user() and memcpy() 
>> calls on kernel data structures allocated on stack. The memcpy() call 
>> in particular, will corrupt kernel stack.
>>
>> Attached please find the patch that fixes both problems. In addition, 
>> the patch changes the maximal number of bytes for HW address that 
>> will be seen in /proc/net/arp from ~10 to ~30. Without the last 
>> change, output of /proc/net/arp truncates the large MAC entries, 
>> which makes the arp utility useless.
>>
>> The patch does not change the existing ABI but extends it.  The 
>> kernel structure used in arp_ioctl calls is changed to support larger 
>> addresses, while the user-space structure is extended by appending 
>> extra-space to the end of the structure if ATF_NEWARPCTL -- a new 
>> flag  -- is set in arp_flags of existing user-space structure. This 
>> allows avoiding big changes to the existing code while preserving the 
>> ABI compatibility.
>>
>

-- 
----------------------------------------
Constantine Gavrilov
Kernel Developer
Platform Group
XIV, an IBM global brand 
1 Azrieli Center, Tel-Aviv
Phone: +972-3-6074672
Fax:   +972-3-6959749
----------------------------------------


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: arp_ioctl.patch.txt
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081103/4cb7951e/attachment.txt>

From rdreier at cisco.com  Mon Nov  3 09:39:30 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 03 Nov 2008 09:39:30 -0800
Subject: [ofa-general] Re: patch: support long (above 14 bytes) HW addresses
	in arp_ioctl
In-Reply-To: <490EDBDD.1030104@gmail.com> (Constantine Gavrilov's message of
	"Mon, 03 Nov 2008 13:09:17 +0200")
References: <490EDBDD.1030104@gmail.com>
Message-ID: <ada63n4wvxp.fsf@cisco.com>

 > * arp_ioctl will corrupt the kernel and user memory when this ioctl is
 > used on the adapters that have HW addresses longer than 14 bytes.
 > This is because when copying the HW address, the arp_ioctl code copies
 > dev->addr_len bytes without checking that addr_len is not above 14
 > bytes. This is done both for copy_to_user() and memcpy() calls on
 > kernel data structures allocated on stack. The memcpy() call in
 > particular, will corrupt kernel stack.

It's not obvious to me after a quick glance where this kernel memory
corruption occurs, but clearly we should at least fix this bug.

 > The patch does not change the existing ABI but extends it.  The kernel
 > structure used in arp_ioctl calls is changed to support larger
 > addresses, while the user-space structure is extended by appending
 > extra-space to the end of the structure if ATF_NEWARPCTL -- a new flag
 > -- is set in arp_flags of existing user-space structure. This allows
 > avoiding big changes to the existing code while preserving the ABI
 > compatibility.

However, given that applications need to be changed to use this,
wouldn't it make more sense just to change those applications to use
rtnetlink, which already supports large hardware addresses?  ie is there
much point to extending a legacy ABI to add a feature that the preferred
modern interface already has?

 - R.


From john.russo at qlogic.com  Mon Nov  3 09:53:57 2008
From: john.russo at qlogic.com (John Russo)
Date: Mon, 3 Nov 2008 11:53:57 -0600
Subject: [ofa-general] BOF Slides for WinOF (Resend to correct list) 
Message-ID: <99863D2ED484D449811D97A4C44C9CBD96905A@EPEXCH2.qlogic.org>

Here are some slides to use for WinOF in the BOF presentation.

 
__________________________
John F. Russo
Manager, Engineering
QLogic Corporation
780 Fifth Avenue, Suite 140
King of Prussia, PA 19406
Direct: 610-233-4866
Main: 610-233-4800
Fax: 610-233-4777
Cell: 610-246-9903
Email: John.Russo at qlogic.com <mailto:John.Russo at qlogic.com> 
www.qlogic.com <http://www.qlogic.com> 

 
True success is the undeniable truth that we have proved ourselves.

-Joe Luppino-Esposito

 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OpenFabrics BOF for WinOF.ppt
Type: application/vnd.ms-powerpoint
Size: 281600 bytes
Desc: OpenFabrics BOF for WinOF.ppt
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081103/1d734718/attachment.ppt>

From dimitar.dimitrov at markit.com  Mon Nov  3 10:01:50 2008
From: dimitar.dimitrov at markit.com (Dimitar Dimitrov)
Date: Mon, 03 Nov 2008 19:01:50 +0100
Subject: [ofa-general] Re: Error during compile of ofed-1.3.1
In-Reply-To: <1225728774.15893.17.camel@wks-ubuntu.ops.marketxs.com>
References: <1225728774.15893.17.camel@wks-ubuntu.ops.marketxs.com>
Message-ID: <1225735310.15893.32.camel@wks-ubuntu.ops.marketxs.com>

After some trial and error I installed OFED-1.4-20081103-0630, which worked
fine. However, it is still not clear to me what was wrong with the official
1.3.1 release.

Dimitar

-- 
Regards,
Dimitar




From vlad at dev.mellanox.co.il  Mon Nov  3 10:19:43 2008
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Mon, 03 Nov 2008 20:19:43 +0200
Subject: [ofa-general] Re: Error during compile of ofed-1.3.1
In-Reply-To: <1225735310.15893.32.camel@wks-ubuntu.ops.marketxs.com>
References: <1225728774.15893.17.camel@wks-ubuntu.ops.marketxs.com>
	<1225735310.15893.32.camel@wks-ubuntu.ops.marketxs.com>
Message-ID: <490F40BF.1090605@dev.mellanox.co.il>

Dimitar Dimitrov wrote:
> After some trial and error installed OFED-1.4-20081103-0630 which worked
> fine. However still not clear what was wrong with the official 1.3.1
> release?
>
> Dimitar
>
>   
Hi Dimitar,
OFED-1.3.1 does not support RedHat EL4 up7.
See OFED-1.3.1/docs/OFED_release_notes.txt for the list of supported 
platforms.

Note that OFED-1.4-20081103-0630 has some IPoIB issues.
Please use OFED-1.4-20081102-0630, or wait about 30 min for the new 
daily build.

Regards,
Vladimir


From constantine.gavrilov at gmail.com  Mon Nov  3 10:56:44 2008
From: constantine.gavrilov at gmail.com (Constantine Gavrilov)
Date: Mon, 03 Nov 2008 20:56:44 +0200
Subject: [ofa-general] Re: patch: support long (above 14 bytes) HW addresses
	in arp_ioctl
In-Reply-To: <ada63n4wvxp.fsf@cisco.com>
References: <490EDBDD.1030104@gmail.com> <ada63n4wvxp.fsf@cisco.com>
Message-ID: <490F496C.2010608@gmail.com>


In arp_req_get() in net/ipv4/arp.c, there is this code:

memcpy(r->arp_ha.sa_data, neigh->ha, dev->addr_len);

dev->addr_len can be larger than the size of r->arp_ha.sa_data. Initially, 
I thought this would corrupt the kernel stack. I was wrong, since r still has 
enough space not to overflow even for the largest HW address (32 bytes). 
It would corrupt the data structure though, and that corrupted reply 
would be propagated to the user.

There is a similar situation in arp_req_set(), where a "junk" arp entry 
will be set if dev->addr_len is larger than 14 bytes. 

At the very minimum, both arp_req_set() and arp_req_get() should return 
error (-EINVAL), and not return junk or set junk. Truncated 
/proc/net/arp output should also be fixed.
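
A minimal userspace sketch of the length check suggested above. It is not
the kernel code: copy_hw_addr() is a hypothetical helper, and the only
thing it models is refusing a hardware address that does not fit in
sa_data instead of copying dev->addr_len bytes unconditionally.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (userspace model, not kernel code): copy a
 * hardware address into a struct sockaddr only if it fits in sa_data.
 * Returns 0 on success or -EINVAL if the address is too long.
 */
int copy_hw_addr(struct sockaddr *dst, const unsigned char *ha, size_t addr_len)
{
	if (addr_len > sizeof(dst->sa_data))
		return -EINVAL;		/* e.g. a 20-byte IPoIB address */

	memset(dst->sa_data, 0, sizeof(dst->sa_data));
	memcpy(dst->sa_data, ha, addr_len);
	return 0;
}
```

With a check of this kind, a 6-byte Ethernet address is copied as before,
while a 20-byte IPoIB address produces an error instead of a junk entry.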

I was not aware that rtnetlink is capable of doing things like arp  
table or interface manipulation (like netdevice ioctls). My application 
needs to be able to manipulate the arp cache for large MACs, and I do not 
mind recompiling with an added flag. I do not mind fixing the arp CLI to use 
this either (the venerable arp utility does use arp_ioctl). And there are 
many legacy solutions that use arp_ioctl() in programs and the arp utility 
in scripts. Consider porting those to InfiniBand.

Will rtnetlink work for any net_device (like netdevice ioctls do) for 
ARP and interface configuration calls, or does it require special 
support in net_device itself? Are there any possible problems with rtnetlink?

Roland Dreier wrote:
>  > * arp_ioctl will corrupt the kernel and user memory when this ioctl is
>  > used on the adapters that have HW addresses longer than 14 bytes.
>  > This is because when copying the HW address, the arp_ioctl code copies
>  > dev->addr_len bytes without checking that addr_len is not above 14
>  > bytes. This is done both for copy_to_user() and memcpy() calls on
>  > kernel data structures allocated on stack. The memcpy() call in
>  > particular, will corrupt kernel stack.
>
> It's not obvious to me after a quick glance where this kernel memory
> corruption occurs, but clearly we should at least fix this bug.
>
>  > The patch does not change the existing ABI but extends it.  The kernel
>  > structure used in arp_ioctl calls is changed to support larger
>  > addresses, while the user-space structure is extended by appending
>  > extra-space to the end of the structure if ATF_NEWARPCTL -- a new flag
>  > -- is set in arp_flags of existing user-space structure. This allows
>  > avoiding big changes to the existing code while preserving the ABI
>  > compatibility.
>
> However, given that applications need to be changed to use this,
> wouldn't it make more sense just to change those applications to use
> rtnetlink, which already supports large hardware addresses?  ie is there
> much point to extending a legacy ABI to add a feature that the preferred
> modern interface already has?
>
>  - R.
>   

-- 
----------------------------------------
Constantine Gavrilov
Kernel Developer
Platform Group
XIV, an IBM global brand 
1 Azrieli Center, Tel-Aviv
Phone: +972-3-6074672
Fax:   +972-3-6959749
----------------------------------------


From chien.tin.tung at intel.com  Mon Nov  3 12:05:24 2008
From: chien.tin.tung at intel.com (Chien Tung)
Date: Mon, 3 Nov 2008 14:05:24 -0600
Subject: [ofa-general] [PATCH] RDMA/nes: Initialize limit_maxrdreqsz to 0
Message-ID: <20081103200524.GA7140@ctung-MOBL>

From: Chien Tung <chien.tin.tung at intel.com>

RDMA/nes: Initialize limit_maxrdreqsz to 0

Initialize limit_maxrdreqsz to 0 so the workaround is off by default.

Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
--
Left out initialization from previous patch (commit 
633693660045b3e46a63ed618eb38a54339fbcc0).  Don't know how easy
it would be to fix the previous patch.

 drivers/infiniband/hw/nes/nes.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes.c b/drivers/infiniband/hw/nes/nes.c
index aa1dc41..b60572e 100644
--- a/drivers/infiniband/hw/nes/nes.c
+++ b/drivers/infiniband/hw/nes/nes.c
@@ -95,7 +95,7 @@ unsigned int wqm_quanta = 0x10000;
 module_param(wqm_quanta, int, 0644);
 MODULE_PARM_DESC(wqm_quanta, "WQM quanta");
 
-static unsigned int limit_maxrdreqsz;
+static unsigned int limit_maxrdreqsz = 0;
 module_param(limit_maxrdreqsz, bool, 0644);
 MODULE_PARM_DESC(limit_maxrdreqsz, "Limit max read request size to 256 Bytes");
 

From chien.tin.tung at intel.com  Mon Nov  3 12:05:27 2008
From: chien.tin.tung at intel.com (Chien Tung)
Date: Mon, 3 Nov 2008 14:05:27 -0600
Subject: [ofa-general] [PATCH] RDMA/nes: Check cm_node before using it
Message-ID: <20081103200527.GA7408@ctung-MOBL>

From: Chien Tung <chien.tin.tung at intel.com>

RDMA/nes: Check cm_node before using it

Moved cm_core assignment after cm_node check.

Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
--
 drivers/infiniband/hw/nes/nes_cm.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 2caf9da..31341fa 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -376,13 +376,16 @@ int schedule_nes_timer(struct nes_cm_node *cm_node, struct sk_buff *skb,
 		int close_when_complete)
 {
 	unsigned long  flags;
-	struct nes_cm_core *cm_core = cm_node->cm_core;
+	struct nes_cm_core *cm_core;
 	struct nes_timer_entry *new_send;
 	int ret = 0;
 	u32 was_timer_set;
 
 	if (!cm_node)
 		return -EINVAL;
+
+	cm_core = cm_node->cm_core;
+
 	new_send = kzalloc(sizeof(*new_send), GFP_ATOMIC);
 	if (!new_send)
 		return -1;


From rdreier at cisco.com  Mon Nov  3 15:47:07 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 03 Nov 2008 15:47:07 -0800
Subject: [ofa-general] poll CQ failed -2 with connectX
In-Reply-To: <200811031139.28122.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Mon, 3 Nov 2008 11:39:27 +0200")
References: <200810271838.48510.ricklist@microway.com>
	<200811031139.28122.jackm@dev.mellanox.co.il>
Message-ID: <adaskq8v0ck.fsf@cisco.com>

 > However, the userspace drivers used were indeed from OFED 1.3.1
 > and/or OFED 1.4, resulting in a mismatch between kernel-space and
 > userspace.
 > 
 > Specifically, ConnectX cards support XRC (Extended RC) in OFED 1.3.1
 > and OFED 1.4 (XRC was not present in OFED 1.2.5).  The 1.3.1 / 1.4
 > userspace libraries identified some of the QPs created by the OFED
 > 1.2.5 kernel modules as XRC QPs and returned an error as a result
 > (correctly indicating that these "XRC" qp's did not exist as XRC
 > qp's).

I think we need newer userspace to continue to work with old kernels;
it's a huge pain if someone needs to roll back userspace just to test
an older kernel (eg if bisecting a regression or something like that).

The simplest thing would be for libmlx4 to check if the kernel driver
reports the XRC capability, say when creating the first QP for a given
process, and treat the QPN bits appropriately depending on whether the
kernel supports XRC or not.

 - R.


From rdreier at cisco.com  Mon Nov  3 15:53:26 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 03 Nov 2008 15:53:26 -0800
Subject: [ofa-general] Re: patch: support long (above 14 bytes) HW addresses
	in arp_ioctl
In-Reply-To: <490F496C.2010608@gmail.com> (Constantine Gavrilov's message of
	"Mon, 03 Nov 2008 20:56:44 +0200")
References: <490EDBDD.1030104@gmail.com> <ada63n4wvxp.fsf@cisco.com>
	<490F496C.2010608@gmail.com>
Message-ID: <adaod0wv021.fsf@cisco.com>

[netdev added to cc list]

 > In arp_req_get() in net/ipv4/arp.c, there is code:
 > 
 > memcpy(r->arp_ha.sa_data, neigh->ha, dev->addr_len);
 > 
 > dev->addr_len can be larger than size of
 > r->arp_ha.sa_data. Initially, I thought it would corrupt kernel
 > stack. I was wrong, since r still has enough space not to overflow
 > even for the largest HW address (32 bytes). It would corrupt the data
 > structure though, and that corrupted reply would be propagated to
 > user.
 > 
 > There is a similar situation in arp_req_set(), where a "junk" arp
 > entry will be set if dev->addr_len is larger than 14 bytes. 
 > 
 > At the very minimum, both arp_req_set() and arp_req_get() should
 > return error (-EINVAL), and not return junk or set junk. Truncated
 > /proc/net/arp output should also be fixed.

The EINVAL return makes sense; I'm not sure /proc/net/arp is important
enough to fix.  I guess it depends on the impact of the fix.

 > I was not aware that rtnetlink is capable of doing things like arp
 > table or interface manipulation (like netdevice ioctls). My
 > application needs to be able to manipulate arp cache for large macs,
 > and I do not mind recompiling by adding a flag. I do not mind fixing
 > arp cli to use this either (venerable arp  does use arp_ioctl). And
 > there are many many legacy solutions that use arp_ioctl() in programs
 > and arp utility in scripts. Consider porting those to infiniband.
 > 
 > Will rtnetlink work for any net_device (like netdevice ioctls do) for
 > ARP and interface configurations calls or does it require special
 > support in net_device itself? Any possible problems with rtnetlink?

rtnetlink is the preferred modern interface between userspace and kernel
for networking information.  There is also the "iproute2" package that
provides a good command line interface that is capable of handling IPoIB
addresses.  For example:

$ ip addr show dev ib1
5: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:00:01:65 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 192.168.145.74/24 brd 192.168.145.255 scope global ib1
    inet6 fe80::202:c903:0:165/64 scope link 
       valid_lft forever preferred_lft forever

$ ip neigh
192.168.145.73 dev ib1 lladdr 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:00:01:30 STALE
172.29.224.1 dev eth0 lladdr 00:00:0c:07:ac:e0 REACHABLE

and so on.
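
For an application that wants the same information without going through
the ip command, a rough C sketch of an rtnetlink neighbour dump might look
like the following. The function names format_lladdr() and
dump_neighbours() are made up for this example, error handling is
minimal, and only IPv4 neighbours are requested. The point is that the
NDA_LLADDR attribute carries its own length, so a 20-byte IPoIB address is
handled the same way as a 6-byte Ethernet one.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/neighbour.h>

/* Format a link-layer address of any length as colon-separated hex.
 * Returns 0 on success, -1 if the output buffer is too small. */
int format_lladdr(const unsigned char *addr, size_t len, char *out, size_t outlen)
{
	size_t i;

	if (outlen < (len ? len * 3 : 1))
		return -1;
	out[0] = '\0';
	for (i = 0; i < len; i++)
		sprintf(out + i * 3, i + 1 < len ? "%02x:" : "%02x", addr[i]);
	return 0;
}

/* Request the IPv4 neighbour (ARP) table over rtnetlink and print the
 * link-layer address of each entry.  Returns 0 on success, -1 on error. */
int dump_neighbours(void)
{
	struct {
		struct nlmsghdr nh;
		struct ndmsg nd;
	} req;
	struct nlmsghdr buf[8192 / sizeof(struct nlmsghdr)];
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0)
		return -1;

	memset(&req, 0, sizeof(req));
	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(req.nd));
	req.nh.nlmsg_type = RTM_GETNEIGH;
	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	req.nd.ndm_family = AF_INET;
	if (send(fd, &req, req.nh.nlmsg_len, 0) < 0)
		goto fail;

	for (;;) {
		int len = (int)recv(fd, buf, sizeof(buf), 0);
		struct nlmsghdr *nh;

		if (len <= 0)
			goto fail;
		for (nh = buf; NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
			struct ndmsg *nd;
			struct rtattr *rta;
			int alen;

			if (nh->nlmsg_type == NLMSG_DONE) {
				close(fd);
				return 0;
			}
			if (nh->nlmsg_type != RTM_NEWNEIGH)
				goto fail;	/* includes NLMSG_ERROR */

			/* Attributes follow the aligned ndmsg header. */
			nd = NLMSG_DATA(nh);
			rta = (struct rtattr *)((char *)nd + NLMSG_ALIGN(sizeof(*nd)));
			alen = (int)nh->nlmsg_len - (int)NLMSG_LENGTH(sizeof(*nd));
			for (; RTA_OK(rta, alen); rta = RTA_NEXT(rta, alen)) {
				char text[3 * 32];	/* room for a 32-byte address */

				if (rta->rta_type == NDA_LLADDR &&
				    !format_lladdr(RTA_DATA(rta), RTA_PAYLOAD(rta),
						   text, sizeof(text)))
					printf("ifindex %d lladdr %s\n",
					       nd->ndm_ifindex, text);
			}
		}
	}
fail:
	close(fd);
	return -1;
}
```

Calling dump_neighbours() from a small main() should print one line per
neighbour entry that has a link-layer address.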

 - R.


From chu11 at llnl.gov  Mon Nov  3 16:39:51 2008
From: chu11 at llnl.gov (Al Chu)
Date: Mon, 03 Nov 2008 16:39:51 -0800
Subject: [ofa-general] [opensm patch] support dump_conf command in opensm
	console
Message-ID: <1225759191.7307.9.camel@cardanus.llnl.gov>

Hey Sasha,

When config files are rescanned and loaded, there's no way to know whether
the right configuration was actually reloaded.  A console command
to dump the current config is a useful way to verify that new
configs were loaded.

This patch assumes the fixes from my "fix qos config parsing bugs" patch are
accepted.

Al

-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-support-dump_conf-console-command.patch
Type: text/x-patch
Size: 8850 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081103/4aa7e419/attachment.bin>

From rdreier at cisco.com  Mon Nov  3 21:47:45 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 03 Nov 2008 21:47:45 -0800
Subject: [ofa-general] Re: [PATCH] RDMA/nes: Initialize limit_maxrdreqsz to 0
In-Reply-To: <20081103200524.GA7140@ctung-MOBL> (Chien Tung's message of "Mon, 
	3 Nov 2008 14:05:24 -0600")
References: <20081103200524.GA7140@ctung-MOBL>
Message-ID: <adafxm8ujni.fsf@cisco.com>

 > Left out initialization from previous patch (commit 
 > 633693660045b3e46a63ed618eb38a54339fbcc0).  Don't know how easy
 > it would be to fix the previous patch.

In general, if I haven't asked Linus to pull a given patch yet, it's
easy to go back and amend it, and if I have asked him to pull, it's too
late to change things (we just add the fix later).

However, in this case:

 > -static unsigned int limit_maxrdreqsz;
 > +static unsigned int limit_maxrdreqsz = 0;

the "fix" is bogus -- unless I'm very confused, limit_maxrdreqsz is a
static variable which is already in BSS and hence initialized to zero.
And the kernel style is to leave off the superfluous initializer.

Running your patch through checkpatch.pl would have shown the clue
"ERROR: do not initialise statics to 0 or NULL" as well.

So I think the original patch is fine.
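
To illustrate the point about static variables, here is a tiny
standalone sketch. The accessor get_limit_maxrdreqsz() is hypothetical
and exists only so the value can be observed from outside the file.

```c
/* An object with static storage duration and no initializer is
 * guaranteed by C to start at zero (it typically lives in .bss). */
static unsigned int limit_maxrdreqsz;

/* Hypothetical accessor, only here so the value can be inspected. */
unsigned int get_limit_maxrdreqsz(void)
{
	return limit_maxrdreqsz;
}
```

Adding "= 0" would not change the value seen by the accessor.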

 - R.


From rdreier at cisco.com  Mon Nov  3 21:54:10 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 03 Nov 2008 21:54:10 -0800
Subject: [ofa-general] [PATCH] RDMA/nes: Check cm_node before using it
In-Reply-To: <20081103200527.GA7408@ctung-MOBL> (Chien Tung's message of "Mon, 
	3 Nov 2008 14:05:27 -0600")
References: <20081103200527.GA7408@ctung-MOBL>
Message-ID: <adabpwwujct.fsf@cisco.com>

 > RDMA/nes: Check cm_node before using it
 > 
 > Moved cm_core assignment after cm_node check.

This patch is fine -- but I've never seen the oops the current code
would cause, and I'm guessing you haven't either.  Is there any way that
schedule_nes_timer() gets passed a NULL cm_node?
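
For reference, the ordering the patch establishes can be sketched outside
the driver as below. The structures and schedule_timer() are simplified,
hypothetical stand-ins, not the nes code: the pointer is tested first and
dereferenced only afterwards.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
struct cm_core { int timers_scheduled; };
struct cm_node { struct cm_core *cm_core; };

int schedule_timer(struct cm_node *cm_node)
{
	struct cm_core *cm_core;	/* deliberately not set from cm_node here */

	if (!cm_node)
		return -EINVAL;

	cm_core = cm_node->cm_core;	/* safe: cm_node is known to be non-NULL */
	cm_core->timers_scheduled++;
	return 0;
}
```

With the assignment placed in the declaration instead, a NULL cm_node
would be dereferenced before the check could reject it.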


 - R.


From ogerlitz at voltaire.com  Tue Nov  4 01:23:46 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 4 Nov 2008 11:23:46 +0200 (IST)
Subject: [ofa-general] [PATCH] perftest: don't attach the sender QP
Message-ID: <Pine.LNX.4.64.0811041122560.20425@zuben.voltaire.com>

don't attach the sender QP to the MGID

Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>

Index: perftest-1.2/send_bw.c
===================================================================
--- perftest-1.2.orig/send_bw.c
+++ perftest-1.2/send_bw.c
@@ -421,7 +421,7 @@ static struct pingpong_context *pp_init_
 			return NULL;
 		}

-		if ((user_parm->connection_type==UD) && (user_parm->use_mcg)) {
+		if ((user_parm->connection_type==UD) && (user_parm->use_mcg) && !user_parm->servername) {
 			union ibv_gid gid;
 			uint8_t mcg_gid[16] = MCG_GID;


From jackm at dev.mellanox.co.il  Tue Nov  4 01:26:56 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 4 Nov 2008 11:26:56 +0200
Subject: [ofa-general] poll CQ failed -2 with connectX
In-Reply-To: <adaskq8v0ck.fsf@cisco.com>
References: <200810271838.48510.ricklist@microway.com>
	<200811031139.28122.jackm@dev.mellanox.co.il>
	<adaskq8v0ck.fsf@cisco.com>
Message-ID: <200811041126.56200.jackm@dev.mellanox.co.il>

On Tuesday 04 November 2008 01:47, Roland Dreier wrote:
> The simplest thing would be for libmlx4 to check if the kernel driver
> reports the XRC capability, say when creating the first QP for a given
> process, and treat the QPN bits appropriately depending on whether the
> kernel supports XRC or not.
> 
Actually, I already have a patch which does query-device when allocating
a new user context.  Since we have a device-capability XRC flag (bit 20),
we can save that in the user context.

I submitted the patch to the list last October (2007):
http://lists.openfabrics.org/pipermail/general/2007-October/042351.html

(This XRC capability issue is another reason for having this patch --
need to save the device flags as well, and add a flags word to the user context).

When creating a CQ, we can then add a "kernel-supports-xrc" flag to the cq context,
and test for that during cq_poll_one when testing if the QPN is an XRC qpn or not.

I'll prepare a patch for libmlx4.  It won't be in time for ofed 1.4-rc4 since that
is going out already (possibly even today).

- Jack
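
For illustration only, the flow described above (cache the device
capability flags in the user context, copy an XRC-supported flag into
each CQ, test it when polling) might look roughly like the sketch below.
All names are hypothetical and are not the libmlx4 ones.  Bit 20 is
taken from the message above; the QPN marker bit is an assumption made
for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Device-capability XRC flag: bit 20, as stated in the message. */
#define SKETCH_DEV_CAP_XRC  (1u << 20)
/* Assumed marker bit that identifies an XRC QPN (hypothetical value). */
#define SKETCH_XRC_QPN_BIT  (1u << 23)

struct sketch_context {
	uint32_t device_cap_flags;	/* saved once per user context */
};

struct sketch_cq {
	bool kernel_supports_xrc;	/* copied in at CQ creation */
};

/* Store the flags returned by a (not shown) query-device call. */
static void sketch_init_context(struct sketch_context *ctx,
				uint32_t queried_cap_flags)
{
	assert(ctx != NULL);
	ctx->device_cap_flags = queried_cap_flags;
}

/* Record in the CQ whether the kernel driver reported XRC support. */
static void sketch_create_cq(struct sketch_cq *cq,
			     const struct sketch_context *ctx)
{
	assert(cq != NULL && ctx != NULL);
	cq->kernel_supports_xrc =
		(ctx->device_cap_flags & SKETCH_DEV_CAP_XRC) != 0;
}

/* Treat the QPN bit as an XRC marker only if the kernel supports XRC. */
static bool sketch_qpn_is_xrc(const struct sketch_cq *cq, uint32_t qpn)
{
	return cq->kernel_supports_xrc && (qpn & SKETCH_XRC_QPN_BIT) != 0;
}
```

Keeping the flag in the CQ means the poll path needs no extra lookup
through the context for every completion.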


From dorons at Voltaire.COM  Tue Nov  4 01:59:59 2008
From: dorons at Voltaire.COM (Doron Shoham)
Date: Tue, 04 Nov 2008 11:59:59 +0200
Subject: [ofa-general] [PATCH] change log_max_size to MB
Message-ID: <49101D1F.4040605@Voltaire.COM>

Fixes a bug: the log size limit (log_max_size) in opensm.conf is taken
in bytes, while the opensm '-L' option accepts the size in MB.

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 opensm/opensm/osm_subnet.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 0422d0f..8406232 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -1278,6 +1278,7 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts)
 		opts_unpack_uint32("log_max_size",
 				   p_key, p_val,
 				   (void *) & p_opts->log_max_size);
+		p_opts->log_max_size *= 1024 * 1024; /* convert MB to bytes */
 
 		opts_unpack_charp("partition_config_file",
 				  p_key, p_val, &p_opts->partition_config_file);
-- 
1.5.4
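
As a standalone illustration of the unit conversion the description
calls for (a value given in MB, a limit kept in bytes), here is a small
sketch.  The helper name is hypothetical and this is not OpenSM code.
It widens to 64 bits first, on the assumption that limits of 4096 MB or
more should not wrap a 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BYTES_PER_MB (1024ull * 1024ull)

/* Convert a size given in MB to bytes, widening before multiplying. */
static uint64_t sketch_log_limit_mb_to_bytes(uint32_t size_mb)
{
	uint64_t bytes = (uint64_t)size_mb * SKETCH_BYTES_PER_MB;

	/* Sanity check: a 32-bit MB count cannot wrap 64 bits. */
	assert(bytes / SKETCH_BYTES_PER_MB == size_mb);
	return bytes;
}
```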


From jackm at dev.mellanox.co.il  Tue Nov  4 02:14:38 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 4 Nov 2008 12:14:38 +0200
Subject: [ofa-general] mlx4: Allow resetting capability mask to defaults with
	SET_PORT
Message-ID: <200811041214.39085.jackm@dev.mellanox.co.il>

mlx4: Allow resetting capability mask to defaults with SET_PORT

Commit 7ff93f8b7... introduced support for different port types.
As part of that support, SET_PORT is invoked to set the port type
during driver startup.  However, as a side-effect, for IB ports
the invocation of this command also sets the port capability mask
to zero (losing the default configuration values set by FW).

This fix introduces use of the new rcm (reset capability mask) bit
in the SET_PORT command (bit 30 of first mailbox dword) which resets
the capability mask to the FW default value for that port (ignoring
the value included in the command mailbox).

The fix is to set the rcm bit when first setting the port-type to IB,
thus also restoring the capability mask to its default value (rather
than to zero).
(The fix also sets the rqk bit to reset the Qkey violations counter).

The fix requires ConnectX fw 2.5.927 or later to operate properly;
it will do no harm, however, if the driver runs over earlier FW --
the problem simply will still occur.

This patch fixes Bugzilla 1183 (which occurred because the
IsTrapSupported bit in the capability mask was zeroed).

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

diff --git a/drivers/net/mlx4/port.c b/drivers/net/mlx4/port.c
index e2fdab4..145d6e1 100644
--- a/drivers/net/mlx4/port.c
+++ b/drivers/net/mlx4/port.c
@@ -273,7 +273,8 @@ int mlx4_SET_PORT(struct mlx4_dev *dev, u8 port)
 		((u8 *) mailbox->buf)[3] = 6;
 		((__be16 *) mailbox->buf)[4] = cpu_to_be16(1 << 15);
 		((__be16 *) mailbox->buf)[6] = cpu_to_be16(1 << 15);
-	}
+	} else
+		((u8 *) mailbox->buf)[3] = 3;
 	err = mlx4_cmd(dev, mailbox->dma, port, is_eth, MLX4_CMD_SET_PORT,
 		       MLX4_CMD_TIME_CLASS_B);
 

From vlad at dev.mellanox.co.il  Tue Nov  4 02:13:56 2008
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 04 Nov 2008 12:13:56 +0200
Subject: [ofa-general] OFED Nov 3 2008 meeting summary on OFED 1.4 status
Message-ID: <49102064.7080004@dev.mellanox.co.il>

Meeting minutes on the web:
http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes/

Meeting Summary:
==============
RC4 is delayed - will be released on Thursday Nov 6.

Details:
=======
Bugs to be fixed in RC4:

1283    blocker   P1    RHEL 5 yannick.cote at qlogic.com   NEW        Intel MPI fails on QLogic HCA
1326    blocker   P1    RHEL 4 yannick.cote at qlogic.com   NEW        ipath driver fails to build on IA64 in the 10/28/08 daily build
1335    major     P3    Other  monis at voltaire.com        NEW        Bonding: packet lost during failover
1301    major     P3    RHEL 4 olgas at voltaire.com        NEW        Can not load rds module on RH4 up7
1323    blocker   P1    All    stefan.roscher at de.ibm.com REOPENED   IB/ehca: possibility of kernel panic under certain circumstances
1242    critical  P2    RHEL 4 yannick.cote at qlogic.com   NEW        kernel panic while running mpi2007 against ofed1.4 -- ib_ipath: ipath_sdma_verbs_send
1336    critical  P1    RHEL 5 bugzilla at openib.org       NEW        Can't unload the mlx4_ib module on ppc64

Regards,
Vladimir


From dimitar.dimitrov at markit.com  Tue Nov  4 02:32:24 2008
From: dimitar.dimitrov at markit.com (Dimitar Dimitrov)
Date: Tue, 04 Nov 2008 11:32:24 +0100
Subject: [ofa-general] Re: Error during compile of ofed-1.3.1
In-Reply-To: <490F40BF.1090605@dev.mellanox.co.il>
References: <1225728774.15893.17.camel@wks-ubuntu.ops.marketxs.com>
	<1225735310.15893.32.camel@wks-ubuntu.ops.marketxs.com>
	<490F40BF.1090605@dev.mellanox.co.il>
Message-ID: <1225794744.26563.6.camel@wks-ubuntu.ops.marketxs.com>

Hi Vladimir,

Thanks for your reply. I am compiling OFED-1.4-20081104-0127 right now
and hope it works ok.

I thought RedHat packages were using 1.3.1 (official release) as a base,
but now I see the source rpm package is numbered 1.3.2. So in this case
would I be safer using some of the 1.3.2 releases? I reckon I should
stick to the latest and wait patiently till the official 1.4 release
(also in a testing phase here).

Regards,
Dimitar

On Mon, 2008-11-03 at 20:19 +0200, Vladimir Sokolovsky wrote:
> Dimitar Dimitrov wrote:
> > After some trial and error installed OFED-1.4-20081103-0630 which worked
> > fine. However still not clear what was wrong with the official 1.3.1
> > release?
> >
> > Dimitar
> >
> >   
> Hi Dimitar,
> OFED-1.3.1 does not support  RedHat EL4 up7.
> See, OFED-1.3.1/docs/OFED_release_notes.txt for the list of supported 
> platforms.
> 
> Note, OFED-1.4-20081103-0630 have some IPoIB issues.
> Please use OFED-1.4-20081102-0630, or wait about 30 min for the new 
> daily build.
> 
> Regards,
> Vladimir
> 
-- 
Regards,
Dimitar




From vlad at lists.openfabrics.org  Tue Nov  4 03:20:47 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue,  4 Nov 2008 03:20:47 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081104-0200 daily build status
Message-ID: <20081104112047.A91B3E60CF9@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From hal.rosenstock at gmail.com  Tue Nov  4 04:22:22 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 4 Nov 2008 07:22:22 -0500
Subject: Re: [ofa-general] mlx4: Allow resetting capability mask to
	defaults with SET_PORT
In-Reply-To: <200811041214.39085.jackm@dev.mellanox.co.il>
References: <200811041214.39085.jackm@dev.mellanox.co.il>
Message-ID: <f0e08f230811040422r7d81a298y9ad4151a03ef7cf4@mail.gmail.com>

Jack,

On Tue, Nov 4, 2008 at 5:14 AM, Jack Morgenstein
<jackm at dev.mellanox.co.il> wrote:
> mlx4: Allow resetting capability mask to defaults with SET_PORT
>
> Commit 7ff93f8b7... introduced support for different port types.
> As part of that support, SET_PORT is invoked to set the port type
> during driver startup.  However, as a side-effect, for IB ports
> the invocation of this command also sets the port capability mask
> to zero (losing the default configuration values set by FW).
>
> This fix introduces use of the new rcm (reset capability mask) bit
> in the SET_PORT command (bit 30 of first mailbox dword) which resets
> the capability mask to the FW default value for that port (ignoring
> the value included in the command mailbox).
>
> The fix is to set the rcm bit when first setting the port-type to IB,
> thus also restoring the capability mask to its default value (rather
> than to zero).
> (The fix also sets the rqk bit to reset the Qkey violations counter).
>
> The fix requires ConnectX fw 2.5.927 or later

Is this released firmware ? If not, when is it to be released ?

-- Hal

> to operate properly;
> it will do no harm, however, if the driver runs over earlier FW --
> the problem simply will still occur.
>
> This patch fixes Bugzilla 1183 (which occurred because the
> IsTrapSupported bit in the capability mask was zeroed).
>
> Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
>
> diff --git a/drivers/net/mlx4/port.c b/drivers/net/mlx4/port.c
> index e2fdab4..145d6e1 100644
> --- a/drivers/net/mlx4/port.c
> +++ b/drivers/net/mlx4/port.c
> @@ -273,7 +273,8 @@ int mlx4_SET_PORT(struct mlx4_dev *dev, u8 port)
>                ((u8 *) mailbox->buf)[3] = 6;
>                ((__be16 *) mailbox->buf)[4] = cpu_to_be16(1 << 15);
>                ((__be16 *) mailbox->buf)[6] = cpu_to_be16(1 << 15);
> -       }
> +       } else
> +               ((u8 *) mailbox->buf)[3] = 3;
>        err = mlx4_cmd(dev, mailbox->dma, port, is_eth, MLX4_CMD_SET_PORT,
>                       MLX4_CMD_TIME_CLASS_B);
>


From jackm at dev.mellanox.co.il  Tue Nov  4 04:44:03 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 4 Nov 2008 14:44:03 +0200
Subject: [ofa-general] mlx4: Allow resetting capability mask to defaults
	with SET_PORT
In-Reply-To: <f0e08f230811040422r7d81a298y9ad4151a03ef7cf4@mail.gmail.com>
References: <200811041214.39085.jackm@dev.mellanox.co.il>
	<f0e08f230811040422r7d81a298y9ad4151a03ef7cf4@mail.gmail.com>
Message-ID: <200811041444.03355.jackm@dev.mellanox.co.il>

On Tuesday 04 November 2008 14:22, Hal Rosenstock wrote:
> Is this released firmware ? If not, when is it to be released ?

This FW has not yet been released.  The next ConnectX FW release
(which will include this change) is scheduled for the end of this month. 

- Jack


From jackm at dev.mellanox.co.il  Tue Nov  4 05:10:00 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 4 Nov 2008 15:10:00 +0200
Subject: [ofa-general] Error during compile of ofed-1.3.1
In-Reply-To: <1225728774.15893.17.camel@wks-ubuntu.ops.marketxs.com>
References: <1225728774.15893.17.camel@wks-ubuntu.ops.marketxs.com>
Message-ID: <200811041510.00177.jackm@dev.mellanox.co.il>

On Monday 03 November 2008 18:12, Dimitar Dimitrov wrote:
> Installing OFED 1.3.1 on RHEL AS 4 (update 7), kernel 2.6.9-78.0.1.ELsmp
> it ends up with the following error:
> 
OFED 1.3.1 (released in June 2008) does not support update 7 (which
was released in July 2008).

The upcoming OFED 1.4 does support RHEL AS 4 (update 7).
(release candidates are already available.  The most recent is rc3;
rc4 should be out this week)

- Jack


From kelly at tradebotsystems.com  Tue Nov  4 06:51:37 2008
From: kelly at tradebotsystems.com (Kelly Burkhart)
Date: Tue, 4 Nov 2008 08:51:37 -0600
Subject: [ofa-general] infiniband multicast (libibverbs)
Message-ID: <98B0CDCB28A5EE4CB3678CD99406644E343482@tbmail2.tradebot.com>

I'm experimenting with multicast and am having an interesting issue.
The setup is ripped mostly from ib_send_lat.c.  I have a client which
sends and a server which reads.  All sends/receives use a 2048 byte
message.

The client can send any number of messages at any message rate.  The
client spins in a tight loop while sending to reduce bursts of messages
(1000 messages/sec are spread out over the sec).  The client embeds a
sequence number in the message.

After setup, the server does this:

post 2048 recvs
for(;;) {
    ibv_poll_cq in loop, waiting for completion
    post recv
    check sequence number
}

If I specify more than about 6500 messages/sec, I skip some sequences
and receive others multiple times.  I always receive the same number of
messages the client sent.  It appears as though all of the messages come
through, but I'm missing some and reading others twice.

I suspect that there is some trick to more reliable multicast messaging
that I don't know about.  Does anyone have hints for multicasting high
message rates with a small percentage of drops or misses?

Thanks,

-K


From chien.tin.tung at intel.com  Tue Nov  4 07:12:24 2008
From: chien.tin.tung at intel.com (Tung, Chien Tin)
Date: Tue, 4 Nov 2008 08:12:24 -0700
Subject: [ofa-general] [PATCH] RDMA/nes: Check cm_node before using it
In-Reply-To: <adabpwwujct.fsf@cisco.com>
References: <20081103200527.GA7408@ctung-MOBL> <adabpwwujct.fsf@cisco.com>
Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA383030F148BD8@azsmsx501.amr.corp.intel.com>


>This patch is fine -- but I've never seen the oops the current code
>would cause, and I'm guessing you haven't either.  Is there any way that
>schedule_nes_timer() gets passed a NULL cm_node?

I checked all callers of schedule_nes_timer() and didn't see any
instances of cm_node being NULL.  We have a few cm-related patches
coming; I will roll this change into one of them.

Thanks,

Chien

From rdreier at cisco.com  Tue Nov  4 09:14:02 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Nov 2008 09:14:02 -0800
Subject: [ofa-general] infiniband multicast (libibverbs)
In-Reply-To: <98B0CDCB28A5EE4CB3678CD99406644E343482@tbmail2.tradebot.com>
	(Kelly Burkhart's message of "Tue, 4 Nov 2008 08:51:37 -0600")
References: <98B0CDCB28A5EE4CB3678CD99406644E343482@tbmail2.tradebot.com>
Message-ID: <ada3ai7v2g5.fsf@cisco.com>

 > If I specify more than about 6500 messages/sec, I skip some sequences
 > and receive others multiple times.  I always receive the same number of
 > messages the client sent.  It appears as though all of the messages come
 > through, but I'm missing some and reading others twice.

Sounds like a bug in your code -- I don't know why you would see
duplicate messages unless you are somehow processing the same receive
buffer twice or something like that.

 - R.
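
One way to rule out reading the wrong receive buffer is to give each
posted receive a wr_id equal to its buffer slot and, on each completion,
read only the slot that the completion names.  The sketch below models
just that bookkeeping in plain C, without libibverbs calls.  The names
are hypothetical, and placing the sequence number at the start of the
payload is an assumption.  On UD QPs the payload follows a 40-byte GRH
in the receive buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GRH_BYTES 40	/* UD receives start with a 40-byte GRH */
#define SKETCH_MSG_BYTES 2048	/* message size used in the test above */
#define SKETCH_NUM_SLOTS 4	/* small ring, enough for the sketch */

struct sketch_slot {
	unsigned char buf[SKETCH_GRH_BYTES + SKETCH_MSG_BYTES];
};

/*
 * Return the sequence number of the message in the slot named by the
 * completion's wr_id.  The slot should be read before it is re-posted,
 * and the re-post should reuse the same wr_id.
 */
static uint32_t sketch_seq_from_completion(const struct sketch_slot *slots,
					   uint64_t wr_id)
{
	uint32_t seq;

	assert(wr_id < SKETCH_NUM_SLOTS);
	/* Skip the GRH; the sequence number is assumed to come first. */
	memcpy(&seq, slots[wr_id].buf + SKETCH_GRH_BYTES, sizeof(seq));
	return seq;
}
```

If the reader instead walks the buffers with its own counter, any
mismatch between that counter and the slot actually completed can show
up as skipped and repeated sequence numbers while the total count still
matches.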


From weiny2 at llnl.gov  Tue Nov  4 09:57:44 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 4 Nov 2008 09:57:44 -0800
Subject: [ofa-general] [PATCH] opensm/opensm/osm_state_mgr.c: Add check for
 valid physical port before using pointer.
Message-ID: <20081104095744.35893d4a.weiny2@llnl.gov>

>From 567c3893f24f4dc25ef5f4e74ef9deeb8ae541ad Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Mon, 3 Nov 2008 14:47:50 -0800
Subject: [PATCH] opensm/opensm/osm_state_mgr.c: Add check for valid physical port before using
 pointer.

   There are times when PortInfo fails which leaves osm_node_t with invalid
   osm_physp_t pointers.  In this case do not use an invalid pointer.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 opensm/opensm/osm_state_mgr.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index ba3b6bf..841438c 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -542,6 +542,12 @@ static void __osm_state_mgr_get_node_desc(IN cl_map_item_t * const p_object,
 
 	/* get a physp to request from. */
 	p_physp = osm_node_get_any_physp_ptr(p_node);
+	if (!osm_physp_is_valid(p_physp)) {
+		OSM_LOG(sm->p_log, OSM_LOG_ERROR,
+			"__osm_state_mgr_get_node_desc: ERR 331C: "
+			"Failed to get valid physical port object\n");
+		goto exit;
+	}
 
 	mad_context.nd_context.node_guid = osm_node_get_node_guid(p_node);
 
-- 
1.5.4.5


From weiny2 at llnl.gov  Tue Nov  4 09:58:12 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 4 Nov 2008 09:58:12 -0800
Subject: [ofa-general] [PATCH] Add check for previous versions of plugins.
Message-ID: <20081104095812.2ff5920c.weiny2@llnl.gov>

>From 0db0d6667ed8baede1093a95127e2ce9c81959bd Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Mon, 3 Nov 2008 15:50:15 -0800
Subject: [PATCH] Add check for previous versions of plugins.

   If old interface plugins are available to OpenSM they will cause a crash.
   Check for this old version and error out gracefully.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 opensm/include/opensm/osm_event_plugin.h |    1 +
 opensm/opensm/osm_event_plugin.c         |   10 ++++++++++
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/opensm/include/opensm/osm_event_plugin.h b/opensm/include/opensm/osm_event_plugin.h
index b2deeba..0b80b63 100644
--- a/opensm/include/opensm/osm_event_plugin.h
+++ b/opensm/include/opensm/osm_event_plugin.h
@@ -150,6 +150,7 @@ typedef struct osm_epi_trap_event {
 #define OSM_EVENT_PLUGIN_IMPL_NAME "osm_event_plugin"
 #define OSM_EVENT_PLUGIN_INTERFACE_VER 2
 typedef struct osm_event_plugin {
+	int interface_version;
 	const char *osm_version;
 	void *(*create) (struct osm_opensm *osm);
 	void (*delete) (void *plugin_data);
diff --git a/opensm/opensm/osm_event_plugin.c b/opensm/opensm/osm_event_plugin.c
index c6999f5..86cabf0 100644
--- a/opensm/opensm/osm_event_plugin.c
+++ b/opensm/opensm/osm_event_plugin.c
@@ -96,6 +96,16 @@ osm_epi_plugin_t *osm_epi_construct(osm_opensm_t *osm, char *plugin_name)
 		goto Exit;
 	}
 
+	/* check for new interface */
+	if (rc->impl->interface_version < OSM_EVENT_PLUGIN_INTERFACE_VER) {
+		OSM_LOG(&osm->log, OSM_LOG_ERROR, "Error loading plugin"
+			"\'%s\': is the wrong interface version (%d); "
+			"OpenSM expected %d\n",
+			plugin_name, rc->impl->interface_version,
+			OSM_EVENT_PLUGIN_INTERFACE_VER);
+		goto Exit;
+	}
+
 	/* Check the version to make sure this module will work with us */
 	if (strcmp(rc->impl->osm_version, osm->osm_version)) {
 		OSM_LOG(&osm->log, OSM_LOG_ERROR, "Error loading plugin"
-- 
1.5.4.5
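
The check added by the patch can be illustrated in isolation.  This is
a minimal sketch with hypothetical names, not the OpenSM structures: the
plugin descriptor carries an interface version as its first member, and
the loader refuses anything older than the version it was built for.

```c
#include <assert.h>
#include <stddef.h>

/* Interface version the (hypothetical) host program is built against. */
#define SKETCH_PLUGIN_INTERFACE_VER 2

struct sketch_plugin_impl {
	int interface_version;		/* first member, checked first */
	const char *host_version;	/* checked only after the above */
};

/* Return 0 if the plugin may be loaded, -1 if its interface is too old. */
static int sketch_check_plugin(const struct sketch_plugin_impl *impl)
{
	assert(impl != NULL);
	if (impl->interface_version < SKETCH_PLUGIN_INTERFACE_VER)
		return -1;
	return 0;
}
```

Checking the interface version before touching any other member matters
because the layout of the remaining members is only known once the
version has been accepted.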


From rdreier at cisco.com  Tue Nov  4 10:52:52 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Nov 2008 10:52:52 -0800
Subject: [ofa-general] Re: mlx4: Allow resetting capability mask to defaults
	with SET_PORT
In-Reply-To: <200811041214.39085.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Tue, 4 Nov 2008 12:14:38 +0200")
References: <200811041214.39085.jackm@dev.mellanox.co.il>
Message-ID: <adaprlbtjaz.fsf@cisco.com>

 > The fix requires ConnectX fw 2.5.927 or later to operate properly;
 > it will do no harm, however, if the driver runs over earlier FW --
 > the problem simply will still occur.

This doesn't seem like an acceptable solution -- this means that anyone
using a new kernel with older firmware has a broken system.

Can't we just keep track of the current capability mask and make sure to
set it properly when doing the SET_PORT command?

Actually, looking at the code, it seems we really should unify the
multiple mlx4_SET_PORT implementations anyway.

 - R.


From rdreier at cisco.com  Tue Nov  4 10:57:26 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Nov 2008 10:57:26 -0800
Subject: [ofa-general] Re: [PATCH] libibverbs: Update Dotan's email in all of
	the files
In-Reply-To: <200810180435.00292.dotanba@gmail.com> (Dotan Barak's message of
	"Sat, 18 Oct 2008 04:35:00 +0200")
References: <200810180435.00292.dotanba@gmail.com>
Message-ID: <adaljvztj3d.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Tue Nov  4 11:00:57 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Nov 2008 11:00:57 -0800
Subject: [ofa-general] [PATCH] ipoib: fix crash in path_rec_completion
In-Reply-To: <490B01C6.7020302@Voltaire.COM> (Yossi Etigin's message of "Fri, 
	31 Oct 2008 15:01:58 +0200")
References: <490B01C6.7020302@Voltaire.COM>
Message-ID: <adahc6ntixi.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Tue Nov  4 11:17:21 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Nov 2008 11:17:21 -0800
Subject: [ofa-general] Re: [PATCH] mlx4/profile.c: fix warning res_name
	defined but not used
In-Reply-To: <20081023233255.GB14519@orion> (Alexander Beregalov's message of
	"Fri, 24 Oct 2008 03:32:55 +0400")
References: <20081023233255.GB14519@orion>
Message-ID: <adad4hbti66.fsf@cisco.com>

Thanks.  What if we fix this like the following instead -- change
mlx4_dbg so it always looks to the compiler like it uses all its
parameters?  This generates the same code for me, and looks cleaner in
that it actually reduces the amount of #ifdef'ed stuff.
---
 drivers/net/mlx4/mlx4.h |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index fa431fa..56a2e21 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -87,6 +87,9 @@ enum {
 
 #ifdef CONFIG_MLX4_DEBUG
 extern int mlx4_debug_level;
+#else /* CONFIG_MLX4_DEBUG */
+#define mlx4_debug_level	(0)
+#endif /* CONFIG_MLX4_DEBUG */
 
 #define mlx4_dbg(mdev, format, arg...)					\
 	do {								\
@@ -94,12 +97,6 @@ extern int mlx4_debug_level;
 			dev_printk(KERN_DEBUG, &mdev->pdev->dev, format, ## arg); \
 	} while (0)
 
-#else /* CONFIG_MLX4_DEBUG */
-
-#define mlx4_dbg(mdev, format, arg...) do { (void) mdev; } while (0)
-
-#endif /* CONFIG_MLX4_DEBUG */
-
 #define mlx4_err(mdev, format, arg...) \
 	dev_err(&mdev->pdev->dev, format, ## arg)
 #define mlx4_info(mdev, format, arg...) \
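
The same idea can be tried outside the kernel.  In the sketch below
(hypothetical names, fprintf standing in for dev_printk, GNU-style
variadic macro as in the driver) the debug level becomes the constant 0
when debugging is compiled out.  The macro still mentions all of its
arguments, so a variable used only for debug output does not trigger an
unused-variable warning, and the compiler is free to drop the dead
branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#ifdef SKETCH_DEBUG
extern int sketch_debug_level;
#else
#define sketch_debug_level	(0)	/* constant: the branch is dead code */
#endif

/* Always expands to code that references every argument. */
#define sketch_dbg(name, format, arg...)				\
	do {								\
		if (sketch_debug_level)					\
			fprintf(stderr, "%s: " format, (name), ## arg);	\
	} while (0)

/* Sum an array; res_name is used only by the debug macro. */
static int sketch_sum(const int *vals, int n)
{
	const char *res_name = "sum";
	int i, total = 0;

	assert(vals != NULL || n == 0);
	for (i = 0; i < n; i++)
		total += vals[i];
	sketch_dbg(res_name, "total=%d\n", total);
	return total;
}
```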


From dotanba at gmail.com  Tue Nov  4 11:33:31 2008
From: dotanba at gmail.com (Dotan Barak)
Date: Tue, 04 Nov 2008 21:33:31 +0200
Subject: [ofa-general] infiniband multicast (libibverbs)
In-Reply-To: <98B0CDCB28A5EE4CB3678CD99406644E343482@tbmail2.tradebot.com>
References: <98B0CDCB28A5EE4CB3678CD99406644E343482@tbmail2.tradebot.com>
Message-ID: <4910A38B.60900@gmail.com>

Kelly Burkhart wrote:
> I'm experimenting with multicast and am having an interesting issue.
> The setup is ripped mostly from ib_send_lat.c.  I have a client which
> sends and a server which reads.  All sends/recieves use a 2048 byte
> message.
>
> The client can send any number of messages at any message rate.  The
> client spins in a tight loop while sending to reduce bursts of messages
> (1000 messgages/sec are spread out over the sec).  The client embeds a
> sequence number in the message.
>
> After setup, the server does this:
>
> post 2048 recvs
> for(;;) {
>     ibv_poll_cq in loop, waiting for completion
>     post recv
>     check sequence number
> }
>
> If I specify more than about 6500 messages/sec, I skip some sequences
> and receive others multiple times.  I always receive the same number of
> messages the client sent.  It appears as though all of the messages come
> through, but I'm missing some and reading others twice.
>   
Do you use "volatile" when you access the memory buffer being pointed to?
> I suspect that there is some trick to more reliable multicast messaging
> that I don't know about.  Does anyone have hints for multicasting high
> message rates with a small percentage of drops or misses?
>   
Do you get worse results than with ib_send_bw.c?
Can you try to send unicast messages (with minimum changes) to see if 
the issue is related to multicast send?

Anyway, you should remember that multicast messages are sent over
UD QPs, which are unreliable, so messages can be dropped.

Dotan


From cameron at harr.org  Tue Nov  4 11:38:03 2008
From: cameron at harr.org (Cameron Harr)
Date: Tue, 04 Nov 2008 12:38:03 -0700
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <490B45B0.7030208@vlnb.net>
References: <48E386F6.5040502@fusionio.com>
	<48E38BAF.5000801@harr.org>		<48E6498A.3070002@mellanox.com>
	<48E65FE0.2060602@harr.org>		<48E67ACC.1020903@harr.org>
	<48E695F9.80703@harr.org>		<48E9E681.8090600@vlnb.net>
	<48EA2F42.80008@harr.org>	<e2e108260810070233q7dbcd377p16b094ea5a6b74a7@mail.gmail.com>
	<48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net>
	<48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org>
	<48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com>
	<48ECEA4D.7080504@harr.org> <48F79CA9.8090806@vlnb.net>
	<49022438.9030903@harr.org> <490B45B0.7030208@vlnb.net>
Message-ID: <4910A49B.1050004@harr.org>

Vladislav Bolkhovitin wrote:
> Cameron Harr wrote:
>> Vladislav Bolkhovitin wrote:
>>>> ** Sometimes the benchmark "zombied" (process doing no work, but 
>>>> process can't be killed) after running a certain amount of time. 
>>>> However, it wasn't repeatable in a reliable way, so I mark that 
>>>> this particular run has zombied before.
>>> That means that there is a bug somewhere. Usually such bugs are 
>>> found in few hours of code auditing (srpt driver is pretty simple) 
>>> or by using kernel debug facilities (example diff to .config 
>>> attached). I personally always prefer put my effort on fixing real 
>>> things, not inventing various workarounds, like srpt_thread in this 
>>> case.
>>>
>>> So I would:
>>>
>>>   1. Completely remove srpt thread and all related code. It doesn't do
>>> anything, which can't be done in SIRQ context (tasklet)
>>>
>>>   2. Audit the code to check if it does any action, which it 
>>> shouldn't do on SIRQ and fix it. This step isn't required, but 
>>> usually it saves a lot of time of puzzled debugging in the future.
>>>
>>>   3. Change in srpt_handle_rdma_comp() and  srpt_handle_new_iu()
>>> SCST_CONTEXT_THREAD to SCST_CONTEXT_DIRECT_ATOMIC.
>>

I'm assuming you didn't want me to implement this change this time, correct?

>> I also changed it in srpt_handle_err_comp()
>>> Then I would run the problematic tests (heavy tpc-h workload, e.g.) 
>>> on debug kernel and fix found problems.
>>>
>>> Anyway, Cameron, can you get the latest code from SCST trunk and try 
>>> with it? It was recently updated. Also please add the case with 
>>> changes from (3) above.
>> This is all with version 1.0.1 of SCST (v532).
>> In my fio test, I do runs with srpt thread=1 and then =0. When it was 
>> set to zero during the test, I got many errors printed out by FIO, 
>> and the target eventually crashed. This is the first part of a long 
>> call trace.
>>
>> NMI Watchdog detected LOCKUP on CPU 0
>> CPU 0
>> Modules linked in: ib_srpt(U) scst_vdisk(U) scst(U) fio_driver(PU) 
>> fio_port(PU) autofs4 hidp rfcomm l2cap bluetooth sunrpc ib_ipoib 
>> mlx4_ib ib_cm ib_sa ib_mad ib_core ipv6 xfrm_nalgo crypto_api 
>> nls_utf8 hfsplus dm_mirror dm_multipath dm_mod video sbs backlight 
>> i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp 
>> parport i2c_i801 shpchp i2c_core e1000e mlx4_core i5000_edac edac_mc 
>> pcspkr ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd 
>> ehci_hcd
>> Pid: 25732, comm: scsi_tgt0 Tainted: P      2.6.18-92.1.13.el5 #1
>> RIP: 0010:[<ffffffff80064bcb>]  [<ffffffff80064bcb>] 
>> .text.lock.spinlock+0x29/0x30
>> RSP: 0018:ffffffff80418a88  EFLAGS: 00000086
>> RAX: ffff810785307fd8 RBX: ffffffff884e68a0 RCX: 0000000000000000
>> RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff884e68a0
>> RBP: ffffffff884e62a0 R08: ffff810790926900 R09: ffff8107909268e8
>> R10: 0000000000000018 R11: ffffffff884fcab3 R12: 0000000000000001
>> R13: 0000000000000001 R14: 0000000000000000 R15: ffff8107f0f374c0
>> FS:  0000000000000000(0000) GS:ffffffff803a0000(0000) 
>> knlGS:0000000000000000
>> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>> CR2: 00000037bc0986d0 CR3: 0000000000201000 CR4: 00000000000006e0
>> Process scsi_tgt0 (pid: 25732, threadinfo ffff810785306000, task 
>> ffff810810852100)
>> Stack:  0000000000000000 ffffffff884c509d ffff8107909268e8 
>> ffff810790926900
>>  00000002071dd688 0000020000000220 0000000000000200 00000000da984c08
>>  0000000000000000 ffff8107909267f0 ffff810806ceee20 0000000000000001
>> Call Trace:
>>  <IRQ>  [<ffffffff884c509d>] :scst:sgv_pool_alloc+0x10c/0x5d3
>>  [<ffffffff884c1f85>] :scst:scst_alloc_space+0x5b/0x106
>>  [<ffffffff884bdc90>] :scst:scst_process_active_cmd+0x4fc/0x131c
>>  [<ffffffff884bee46>] :scst:scst_cmd_init_done+0x17f/0x3ef
>>  [<ffffffff884fb1ff>] :ib_srpt:srpt_handle_new_iu+0x281/0x4e7
>>  [<ffffffff8835ec3d>] :mlx4_ib:mlx4_ib_free_srq_wqe+0x27/0x4f
>>  [<ffffffff883591da>] :mlx4_ib:get_sw_cqe+0x12/0x30
>>  [<ffffffff88359c97>] :mlx4_ib:mlx4_ib_poll_cq+0x432/0x48f
>>  [<ffffffff884fcc43>] :ib_srpt:srpt_completion+0x190/0x250
>>  [<ffffffff8811aa5b>] :mlx4_core:mlx4_eq_int+0x3b/0x26f
>>  [<ffffffff8811ac9e>] :mlx4_core:mlx4_msi_x_interrupt+0xf/0x17
>
> According to this trace, Vu was incorrect when he wrote that 
> srpt_handle_new_iu called on tasklet context. It at least sometimes 
> called from IRQ context. Try with the attached patch. It's against the 
> latest trunk.
I tried with the latest scst and srpt as of this morning. Previously, I 
had used srpt-1.0.0. The following results are with BLOCKIO, and I'll 
have NULLIO results in a bit. You can see from here that I don't hang any
more, but the srpt thread=0 numbers are a little lower.

As before this run was done with ioengine=libaio and iodepth=16. I 
pretty much always get significantly better performance with libaio than 
with sync or other engines. Also, the iodepth setting tended to give me 
better results.
----------------------------------------------
type=randwrite  bs=512  drives=1 scst_threads=1 srptthread=1 iops=67073.48
type=randwrite  bs=4k   drives=1 scst_threads=1 srptthread=1 iops=54876.82
type=randwrite  bs=512  drives=2 scst_threads=1 srptthread=1 iops=74858.00
type=randwrite  bs=4k   drives=2 scst_threads=1 srptthread=1 iops=75357.15
type=randwrite  bs=512  drives=3 scst_threads=1 srptthread=1 iops=83257.72
type=randwrite  bs=4k   drives=3 scst_threads=1 srptthread=1 iops=82186.79
type=randwrite  bs=512  drives=1 scst_threads=2 srptthread=1 iops=59908.06
type=randwrite  bs=4k   drives=1 scst_threads=2 srptthread=1 iops=50982.91
type=randwrite  bs=512  drives=2 scst_threads=2 srptthread=1 iops=99243.07
type=randwrite  bs=4k   drives=2 scst_threads=2 srptthread=1 iops=79670.62
type=randwrite  bs=512  drives=3 scst_threads=2 srptthread=1 iops=102898.37
type=randwrite  bs=4k   drives=3 scst_threads=2 srptthread=1 iops=92248.25
type=randwrite  bs=512  drives=1 scst_threads=3 srptthread=1 iops=63086.77
type=randwrite  bs=4k   drives=1 scst_threads=3 srptthread=1 iops=53020.41
type=randwrite  bs=512  drives=2 scst_threads=3 srptthread=1 iops=95990.06
type=randwrite  bs=4k   drives=2 scst_threads=3 srptthread=1 iops=77487.26
type=randwrite  bs=512  drives=3 scst_threads=3 srptthread=1 iops=105945.85
type=randwrite  bs=4k   drives=3 scst_threads=3 srptthread=1 iops=95389.01
type=randwrite  bs=512  drives=1 scst_threads=1 srptthread=0 iops=50299.36
type=randwrite  bs=4k   drives=1 scst_threads=1 srptthread=0 iops=48070.11
type=randwrite  bs=512  drives=2 scst_threads=1 srptthread=0 iops=54017.21
type=randwrite  bs=4k   drives=2 scst_threads=1 srptthread=0 iops=50407.20
type=randwrite  bs=512  drives=3 scst_threads=1 srptthread=0 iops=55822.11
type=randwrite  bs=4k   drives=3 scst_threads=1 srptthread=0 iops=50447.82
type=randwrite  bs=512  drives=1 scst_threads=2 srptthread=0 iops=60672.48
type=randwrite  bs=4k   drives=1 scst_threads=2 srptthread=0 iops=48811.93
type=randwrite  bs=512  drives=2 scst_threads=2 srptthread=0 iops=81919.87
type=randwrite  bs=4k   drives=2 scst_threads=2 srptthread=0 iops=72912.99
type=randwrite  bs=512  drives=3 scst_threads=2 srptthread=0 iops=91036.45
type=randwrite  bs=4k   drives=3 scst_threads=2 srptthread=0 iops=88994.63
type=randwrite  bs=512  drives=1 scst_threads=3 srptthread=0 iops=58929.21
type=randwrite  bs=4k   drives=1 scst_threads=3 srptthread=0 iops=48698.90
type=randwrite  bs=512  drives=2 scst_threads=3 srptthread=0 iops=83967.58
type=randwrite  bs=4k   drives=2 scst_threads=3 srptthread=0 iops=73932.36
type=randwrite  bs=512  drives=3 scst_threads=3 srptthread=0 iops=96686.46
type=randwrite  bs=4k   drives=3 scst_threads=3 srptthread=0 iops=88689.27


From a.beregalov at gmail.com  Tue Nov  4 12:43:57 2008
From: a.beregalov at gmail.com (Alexander Beregalov)
Date: Tue, 4 Nov 2008 23:43:57 +0300
Subject: [ofa-general] Re: [PATCH] mlx4/profile.c: fix warning res_name
	defined but not used
In-Reply-To: <adad4hbti66.fsf@cisco.com>
References: <20081023233255.GB14519@orion> <adad4hbti66.fsf@cisco.com>
Message-ID: <a4423d670811041243r58c9f469wae23ecd8e93b2c29@mail.gmail.com>

2008/11/4 Roland Dreier <rdreier at cisco.com>:
> Thanks.  What if we fix this like the following instead -- change
> mlx4_dbg so it always looks to the compiler like it uses all its
> parameters?  This generates the same code for me, and looks cleaner in
> that it actually reduces the amount of #ifdef'ed stuff.
Yes, it looks better.
> ---
>  drivers/net/mlx4/mlx4.h |    9 +++------
>  1 files changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
> index fa431fa..56a2e21 100644
> --- a/drivers/net/mlx4/mlx4.h
> +++ b/drivers/net/mlx4/mlx4.h
> @@ -87,6 +87,9 @@ enum {
>
>  #ifdef CONFIG_MLX4_DEBUG
>  extern int mlx4_debug_level;
> +#else /* CONFIG_MLX4_DEBUG */
> +#define mlx4_debug_level       (0)
> +#endif /* CONFIG_MLX4_DEBUG */
>
>  #define mlx4_dbg(mdev, format, arg...)                                 \
>        do {                                                            \
> @@ -94,12 +97,6 @@ extern int mlx4_debug_level;
>                        dev_printk(KERN_DEBUG, &mdev->pdev->dev, format, ## arg); \
>        } while (0)
>
> -#else /* CONFIG_MLX4_DEBUG */
> -
> -#define mlx4_dbg(mdev, format, arg...) do { (void) mdev; } while (0)
> -
> -#endif /* CONFIG_MLX4_DEBUG */
> -
>  #define mlx4_err(mdev, format, arg...) \
>        dev_err(&mdev->pdev->dev, format, ## arg)
>  #define mlx4_info(mdev, format, arg...) \
>
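
The approach in the patch above (make the debug level a constant 0 when
debugging is configured out, so the one remaining macro still mentions
every argument) can be tried outside the kernel. What follows is only a
userspace sketch, not the mlx4 code: SKETCH_DEBUG, sketch_debug_level,
sketch_dbg, struct sketch_dev and sketch_probe are made-up names, and
fprintf stands in for dev_printk.

```c
#include <assert.h>
#include <stdio.h>

/* SKETCH_DEBUG is a hypothetical stand-in for CONFIG_MLX4_DEBUG. */
#ifdef SKETCH_DEBUG
extern int sketch_debug_level;          /* runtime-tunable when enabled */
#else
#define sketch_debug_level (0)          /* constant: the branch is dead code */
#endif

struct sketch_dev {
	const char *name;
};

/* Counts debug lines actually printed; only here to make the effect visible. */
static int sketch_lines_emitted;

/*
 * A single definition for both configurations.  The body always names
 * dev, format and the extra arguments, so the compiler treats them as
 * used even when the condition is the constant 0.
 * (## __VA_ARGS__ is a GNU extension, as is arg... in the original.)
 */
#define sketch_dbg(dev, format, ...)                                     \
	do {                                                             \
		if (sketch_debug_level) {                                \
			sketch_lines_emitted++;                          \
			fprintf(stderr, "%s: " format, (dev)->name,      \
				## __VA_ARGS__);                         \
		}                                                        \
	} while (0)

/* Returns the number of debug lines emitted so far. */
static int sketch_probe(struct sketch_dev *dev)
{
	/* Referenced only by the debug print, like res_name in the warning. */
	static const char *res_name = "qp";

	assert(dev != NULL);
	sketch_dbg(dev, "profile[%s]\n", res_name);
	return sketch_lines_emitted;
}
```

Built without SKETCH_DEBUG, sketch_probe() prints nothing and returns 0,
while res_name still counts as used, so the "defined but not used"
warning should not appear.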


From cameron at harr.org  Tue Nov  4 13:01:32 2008
From: cameron at harr.org (Cameron Harr)
Date: Tue, 04 Nov 2008 14:01:32 -0700
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <4910A49B.1050004@harr.org>
References: <48E386F6.5040502@fusionio.com>
	<48E38BAF.5000801@harr.org>		<48E6498A.3070002@mellanox.com>
	<48E65FE0.2060602@harr.org>		<48E67ACC.1020903@harr.org>
	<48E695F9.80703@harr.org>		<48E9E681.8090600@vlnb.net>
	<48EA2F42.80008@harr.org>	<e2e108260810070233q7dbcd377p16b094ea5a6b74a7@mail.gmail.com>
	<48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net>
	<48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org>
	<48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com>
	<48ECEA4D.7080504@harr.org> <48F79CA9.8090806@vlnb.net>
	<49022438.9030903@harr.org> <490B45B0.7030208@vlnb.net>
	<4910A49B.1050004@harr.org>
Message-ID: <4910B82C.6070904@harr.org>


Cameron Harr wrote:
> I tried with the latest scst and srpt as of this morning. Previously, 
> I had used srpt-1.0.0. The following results are with BLOCKIO, and 
> I'll have NULLIO results in a bit. You can see from these that I no 
> longer hang, but the srptthread=0 numbers are a little lower.
>
> As before, this run was done with ioengine=libaio and iodepth=16. I 
> pretty much always get significantly better performance with libaio 
> than with sync or other engines. Also, the iodepth setting tended to 
> give me better results.
> ----------------------------------------------
> type=randwrite  bs=512  drives=1 scst_threads=1 srptthread=1 iops=67073.48
> type=randwrite  bs=4k   drives=1 scst_threads=1 srptthread=1 iops=54876.82
> type=randwrite  bs=512  drives=2 scst_threads=1 srptthread=1 iops=74858.00
> type=randwrite  bs=4k   drives=2 scst_threads=1 srptthread=1 iops=75357.15
> type=randwrite  bs=512  drives=3 scst_threads=1 srptthread=1 iops=83257.72
> type=randwrite  bs=4k   drives=3 scst_threads=1 srptthread=1 iops=82186.79
> type=randwrite  bs=512  drives=1 scst_threads=2 srptthread=1 iops=59908.06
> type=randwrite  bs=4k   drives=1 scst_threads=2 srptthread=1 iops=50982.91
> type=randwrite  bs=512  drives=2 scst_threads=2 srptthread=1 iops=99243.07
> type=randwrite  bs=4k   drives=2 scst_threads=2 srptthread=1 iops=79670.62
> type=randwrite  bs=512  drives=3 scst_threads=2 srptthread=1 iops=102898.37
> type=randwrite  bs=4k   drives=3 scst_threads=2 srptthread=1 iops=92248.25
> type=randwrite  bs=512  drives=1 scst_threads=3 srptthread=1 iops=63086.77
> type=randwrite  bs=4k   drives=1 scst_threads=3 srptthread=1 iops=53020.41
> type=randwrite  bs=512  drives=2 scst_threads=3 srptthread=1 iops=95990.06
> type=randwrite  bs=4k   drives=2 scst_threads=3 srptthread=1 iops=77487.26
> type=randwrite  bs=512  drives=3 scst_threads=3 srptthread=1 iops=105945.85
> type=randwrite  bs=4k   drives=3 scst_threads=3 srptthread=1 iops=95389.01
> type=randwrite  bs=512  drives=1 scst_threads=1 srptthread=0 iops=50299.36
> type=randwrite  bs=4k   drives=1 scst_threads=1 srptthread=0 iops=48070.11
> type=randwrite  bs=512  drives=2 scst_threads=1 srptthread=0 iops=54017.21
> type=randwrite  bs=4k   drives=2 scst_threads=1 srptthread=0 iops=50407.20
> type=randwrite  bs=512  drives=3 scst_threads=1 srptthread=0 iops=55822.11
> type=randwrite  bs=4k   drives=3 scst_threads=1 srptthread=0 iops=50447.82
> type=randwrite  bs=512  drives=1 scst_threads=2 srptthread=0 iops=60672.48
> type=randwrite  bs=4k   drives=1 scst_threads=2 srptthread=0 iops=48811.93
> type=randwrite  bs=512  drives=2 scst_threads=2 srptthread=0 iops=81919.87
> type=randwrite  bs=4k   drives=2 scst_threads=2 srptthread=0 iops=72912.99
> type=randwrite  bs=512  drives=3 scst_threads=2 srptthread=0 iops=91036.45
> type=randwrite  bs=4k   drives=3 scst_threads=2 srptthread=0 iops=88994.63
> type=randwrite  bs=512  drives=1 scst_threads=3 srptthread=0 iops=58929.21
> type=randwrite  bs=4k   drives=1 scst_threads=3 srptthread=0 iops=48698.90
> type=randwrite  bs=512  drives=2 scst_threads=3 srptthread=0 iops=83967.58
> type=randwrite  bs=4k   drives=2 scst_threads=3 srptthread=0 iops=73932.36
> type=randwrite  bs=512  drives=3 scst_threads=3 srptthread=0 iops=96686.46
> type=randwrite  bs=4k   drives=3 scst_threads=3 srptthread=0 iops=88689.27
>

And here are the results with NULLIO, sorted by block size. Setting 
srptthread=0 actually shows some benefit here:
-------------------------------------------
type=randwrite  bs=4k   drives=1 scst_threads=1 srptthread=0 iops=140700.40
type=randwrite  bs=4k   drives=1 scst_threads=1 srptthread=1 iops=89167.67
type=randwrite  bs=4k   drives=1 scst_threads=2 srptthread=0 iops=125166.68
type=randwrite  bs=4k   drives=1 scst_threads=2 srptthread=1 iops=136699.05
type=randwrite  bs=4k   drives=1 scst_threads=3 srptthread=0 iops=127363.18
type=randwrite  bs=4k   drives=1 scst_threads=3 srptthread=1 iops=91205.03
type=randwrite  bs=4k   drives=2 scst_threads=1 srptthread=0 iops=94412.46
type=randwrite  bs=4k   drives=2 scst_threads=1 srptthread=1 iops=84354.34
type=randwrite  bs=4k   drives=2 scst_threads=2 srptthread=0 iops=155053.30
type=randwrite  bs=4k   drives=2 scst_threads=2 srptthread=1 iops=102480.27
type=randwrite  bs=4k   drives=2 scst_threads=3 srptthread=0 iops=141045.50
type=randwrite  bs=4k   drives=2 scst_threads=3 srptthread=1 iops=99681.15
type=randwrite  bs=4k   drives=3 scst_threads=1 srptthread=0 iops=173182.91
type=randwrite  bs=4k   drives=3 scst_threads=1 srptthread=1 iops=117629.27
type=randwrite  bs=4k   drives=3 scst_threads=2 srptthread=0 iops=99960.51
type=randwrite  bs=4k   drives=3 scst_threads=2 srptthread=1 iops=103412.00
type=randwrite  bs=4k   drives=3 scst_threads=3 srptthread=0 iops=120926.77
type=randwrite  bs=4k   drives=3 scst_threads=3 srptthread=1 iops=100368.32
type=randwrite  bs=512  drives=1 scst_threads=1 srptthread=0 iops=102232.77
type=randwrite  bs=512  drives=1 scst_threads=1 srptthread=1 iops=139095.94
type=randwrite  bs=512  drives=1 scst_threads=2 srptthread=0 iops=130327.29
type=randwrite  bs=512  drives=1 scst_threads=2 srptthread=1 iops=159158.20
type=randwrite  bs=512  drives=1 scst_threads=3 srptthread=0 iops=136153.84
type=randwrite  bs=512  drives=1 scst_threads=3 srptthread=1 iops=92417.19
type=randwrite  bs=512  drives=2 scst_threads=1 srptthread=0 iops=126892.60
type=randwrite  bs=512  drives=2 scst_threads=1 srptthread=1 iops=99436.74
type=randwrite  bs=512  drives=2 scst_threads=2 srptthread=0 iops=101566.13
type=randwrite  bs=512  drives=2 scst_threads=2 srptthread=1 iops=142292.97
type=randwrite  bs=512  drives=2 scst_threads=3 srptthread=0 iops=166114.78
type=randwrite  bs=512  drives=2 scst_threads=3 srptthread=1 iops=155634.89
type=randwrite  bs=512  drives=3 scst_threads=1 srptthread=0 iops=131368.01
type=randwrite  bs=512  drives=3 scst_threads=1 srptthread=1 iops=186550.24
type=randwrite  bs=512  drives=3 scst_threads=2 srptthread=0 iops=139813.79
type=randwrite  bs=512  drives=3 scst_threads=2 srptthread=1 iops=162499.08
type=randwrite  bs=512  drives=3 scst_threads=3 srptthread=0 iops=154777.28
type=randwrite  bs=512  drives=3 scst_threads=3 srptthread=1 iops=187425.87


From kelly at tradebotsystems.com  Tue Nov  4 13:24:31 2008
From: kelly at tradebotsystems.com (Kelly Burkhart)
Date: Tue, 4 Nov 2008 15:24:31 -0600
Subject: [ofa-general] infiniband multicast (libibverbs)
Message-ID: <98B0CDCB28A5EE4CB3678CD99406644E343488@tbmail2.tradebot.com>

 
> -----Original Message-----
> From: Roland Dreier [mailto:rdreier at cisco.com] 
> 
>  > If I specify more than about 6500 messages/sec, I skip some sequences
>  > and receive others multiple times.  I always receive the same number of
>  > messages the client sent.  It appears as though all of the messages come
>  > through, but I'm missing some and reading others twice.
> 
> Sounds like a bug in your code -- I don't know why you would see
> duplicate messages unless you are somehow processing the same receive
> buffer twice or something like that.

I am (or was) processing the same buffer over and over.  I lifted that
from ib_send_lat, which does the same thing.  The difference is that
send_lat waits for a reply before sending a second message, whereas I'm
sending rapidly without waiting for a reply.  The surprising thing
to me was that my recv buffer received data before I had waited on
the CQ.

I modified my code to read into a circular list of buffers which
appears to have solved the problem at the cost of more memory usage.
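
A minimal sketch of the circular-buffer idea, written as a plain C
simulation so it runs without an HCA.  Everything here is illustrative:
struct recv_ring, ring_deliver, ring_consume and RING_SLOTS are made-up
names, a 32-bit sequence number stands in for the message payload, and
ring_deliver plays the role of the adapter landing a message in the next
posted receive buffer.  In real code the "hand the slot back" step would
be a repost with ibv_post_recv.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8	/* assumed depth; must cover the largest burst */

struct recv_ring {
	uint32_t slot[RING_SLOTS];	/* one message per slot */
	size_t head;			/* messages delivered so far */
	size_t tail;			/* messages consumed so far */
};

/*
 * Simulated arrival: the message lands in the next free slot.
 * Returns 0 on success, -1 if every slot is still unread, in which
 * case the message is dropped instead of overwriting unread data.
 */
static int ring_deliver(struct recv_ring *r, uint32_t seq)
{
	assert(r != NULL);
	if (r->head - r->tail == RING_SLOTS)
		return -1;
	r->slot[r->head % RING_SLOTS] = seq;
	r->head++;
	return 0;
}

/*
 * Simulated completion handling: read the oldest unread slot, then
 * release it for reuse.  Returns 0 on success, -1 if nothing is pending.
 */
static int ring_consume(struct recv_ring *r, uint32_t *seq)
{
	assert(r != NULL && seq != NULL);
	if (r->head == r->tail)
		return -1;
	*seq = r->slot[r->tail % RING_SLOTS];
	r->tail++;
	return 0;
}
```

With a single buffer, five arrivals before the first read would leave
only the last message visible, read five times.  With the ring, each
arrival keeps its own slot until it has been consumed, which is where
the extra memory goes.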

Thanks,

-K


From rdreier at cisco.com  Tue Nov  4 13:37:00 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 04 Nov 2008 13:37:00 -0800
Subject: [ofa-general] [GIT PULL] please pull infiniband.git
Message-ID: <ada1vxrtbpf.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This will get the following changes; mostly low-level hardware changes
plus a couple of small patches that fix IPoIB crashes.

Chien Tung (2):
      RDMA/nes: Correct handling of PBL resources
      RDMA/nes: Mitigate compatibility issue regarding PCIe write credits

Ilpo Järvinen (1):
      RDMA/nes: Reindent mis-indented spinlocks

Ralph Campbell (1):
      IB/ipath: Fix RDMA write with immediate copy of last packet

Roland Dreier (3):
      RDMA/cxgb3: Fix too-big reserved field zeroing in iwch_post_zb_read()
      mlx4_core: Fix unused variable warning
      Merge branches 'cxgb3', 'ehca', 'ipath', 'ipoib', 'mlx4' and 'nes' into for-next

Stefan Roscher (1):
      IB/ehca: Remove reference to special QP in case of port activation failure

Vadim Makhervaks (1):
      RDMA/nes: Fix CQ allocation scheme for multicast receive queue apps

Yossi Etigin (3):
      IPoIB: Don't enable napi when it's already enabled
      IPoIB: Fix hang in ipoib_flush_paths()
      IPoIB: Fix crash in path_rec_completion()

 drivers/infiniband/hw/cxgb3/iwch_qp.c     |    1 -
 drivers/infiniband/hw/ehca/ehca_irq.c     |    7 ++-
 drivers/infiniband/hw/ehca/ehca_qp.c      |    5 ++
 drivers/infiniband/hw/ipath/ipath_ruc.c   |   10 ++--
 drivers/infiniband/hw/nes/nes.c           |   16 +++++++
 drivers/infiniband/hw/nes/nes_hw.h        |    1 +
 drivers/infiniband/hw/nes/nes_verbs.c     |   64 +++++++++++++++++++---------
 drivers/infiniband/ulp/ipoib/ipoib_main.c |    6 ++-
 drivers/net/mlx4/mlx4.h                   |    9 +---
 9 files changed, 82 insertions(+), 37 deletions(-)


diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
index 3e4585c..19661b2 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -745,7 +745,6 @@ int iwch_post_zb_read(struct iwch_qp *qhp)
 	wqe->read.rdmaop = T3_READ_REQ;
 	wqe->read.reserved[0] = 0;
 	wqe->read.reserved[1] = 0;
-	wqe->read.reserved[2] = 0;
 	wqe->read.rem_stag = cpu_to_be32(1);
 	wqe->read.rem_to = cpu_to_be64(1);
 	wqe->read.local_stag = cpu_to_be32(1);
diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c
index cb55be0..9e43459 100644
--- a/drivers/infiniband/hw/ehca/ehca_irq.c
+++ b/drivers/infiniband/hw/ehca/ehca_irq.c
@@ -370,6 +370,10 @@ static void parse_ec(struct ehca_shca *shca, u64 eqe)
 	switch (ec) {
 	case 0x30: /* port availability change */
 		if (EHCA_BMASK_GET(NEQE_PORT_AVAILABILITY, eqe)) {
+			/* only for autodetect mode important */
+			if (ehca_nr_ports >= 0)
+				break;
+
 			int suppress_event;
 			/* replay modify_qp for sqps */
 			spin_lock_irqsave(&sport->mod_sqp_lock, flags);
@@ -387,8 +391,7 @@ static void parse_ec(struct ehca_shca *shca, u64 eqe)
 			sport->port_state = IB_PORT_ACTIVE;
 			dispatch_port_event(shca, port, IB_EVENT_PORT_ACTIVE,
 					    "is active");
-			ehca_query_sma_attr(shca, port,
-					    &sport->saved_attr);
+			ehca_query_sma_attr(shca, port, &sport->saved_attr);
 		} else {
 			sport->port_state = IB_PORT_DOWN;
 			dispatch_port_event(shca, port, IB_EVENT_PORT_ERR,
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index 4d54b9f..9e05ee2 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -860,6 +860,11 @@ static struct ehca_qp *internal_create_qp(
 	if (qp_type == IB_QPT_GSI) {
 		h_ret = ehca_define_sqp(shca, my_qp, init_attr);
 		if (h_ret != H_SUCCESS) {
+			kfree(my_qp->mod_qp_parm);
+			my_qp->mod_qp_parm = NULL;
+			/* the QP pointer is no longer valid */
+			shca->sport[init_attr->port_num - 1].ibqp_sqp[qp_type] =
+				NULL;
 			ret = ehca2ib_return_code(h_ret);
 			goto create_qp_exit6;
 		}
diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c
index fc0f6d9..2296832 100644
--- a/drivers/infiniband/hw/ipath/ipath_ruc.c
+++ b/drivers/infiniband/hw/ipath/ipath_ruc.c
@@ -156,7 +156,7 @@ bail:
 /**
  * ipath_get_rwqe - copy the next RWQE into the QP's RWQE
  * @qp: the QP
- * @wr_id_only: update wr_id only, not SGEs
+ * @wr_id_only: update qp->r_wr_id only, not qp->r_sge
  *
  * Return 0 if no RWQE is available, otherwise return 1.
  *
@@ -173,8 +173,6 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only)
 	u32 tail;
 	int ret;
 
-	qp->r_sge.sg_list = qp->r_sg_list;
-
 	if (qp->ibqp.srq) {
 		srq = to_isrq(qp->ibqp.srq);
 		handler = srq->ibsrq.event_handler;
@@ -206,8 +204,10 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only)
 		wqe = get_rwqe_ptr(rq, tail);
 		if (++tail >= rq->size)
 			tail = 0;
-	} while (!wr_id_only && !ipath_init_sge(qp, wqe, &qp->r_len,
-						&qp->r_sge));
+		if (wr_id_only)
+			break;
+		qp->r_sge.sg_list = qp->r_sg_list;
+	} while (!ipath_init_sge(qp, wqe, &qp->r_len, &qp->r_sge));
 	qp->r_wr_id = wqe->wr_id;
 	wq->tail = tail;
 
diff --git a/drivers/infiniband/hw/nes/nes.c b/drivers/infiniband/hw/nes/nes.c
index a2b04d6..aa1dc41 100644
--- a/drivers/infiniband/hw/nes/nes.c
+++ b/drivers/infiniband/hw/nes/nes.c
@@ -95,6 +95,10 @@ unsigned int wqm_quanta = 0x10000;
 module_param(wqm_quanta, int, 0644);
 MODULE_PARM_DESC(wqm_quanta, "WQM quanta");
 
+static unsigned int limit_maxrdreqsz;
+module_param(limit_maxrdreqsz, bool, 0644);
+MODULE_PARM_DESC(limit_maxrdreqsz, "Limit max read request size to 256 Bytes");
+
 LIST_HEAD(nes_adapter_list);
 static LIST_HEAD(nes_dev_list);
 
@@ -588,6 +592,18 @@ static int __devinit nes_probe(struct pci_dev *pcidev, const struct pci_device_i
 						nesdev->nesadapter->port_count;
 	}
 
+	if ((limit_maxrdreqsz ||
+	     ((nesdev->nesadapter->phy_type[0] == NES_PHY_TYPE_GLADIUS) &&
+	      (hw_rev == NE020_REV1))) &&
+	    (pcie_get_readrq(pcidev) > 256)) {
+		if (pcie_set_readrq(pcidev, 256))
+			printk(KERN_ERR PFX "Unable to set max read request"
+				" to 256 bytes\n");
+		else
+			nes_debug(NES_DBG_INIT, "Max read request size set"
+				" to 256 bytes\n");
+	}
+
 	tasklet_init(&nesdev->dpc_tasklet, nes_dpc, (unsigned long)nesdev);
 
 	/* bring up the Control QP */
diff --git a/drivers/infiniband/hw/nes/nes_hw.h b/drivers/infiniband/hw/nes/nes_hw.h
index 610b9d8..bc0b4de 100644
--- a/drivers/infiniband/hw/nes/nes_hw.h
+++ b/drivers/infiniband/hw/nes/nes_hw.h
@@ -40,6 +40,7 @@
 #define NES_PHY_TYPE_ARGUS     4
 #define NES_PHY_TYPE_PUMA_1G   5
 #define NES_PHY_TYPE_PUMA_10G  6
+#define NES_PHY_TYPE_GLADIUS   7
 
 #define NES_MULTICAST_PF_MAX 8
 
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 932e56f..d36c9a0 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -220,14 +220,14 @@ static int nes_bind_mw(struct ib_qp *ibqp, struct ib_mw *ibmw,
 	if (nesqp->ibqp_state > IB_QPS_RTS)
 		return -EINVAL;
 
-		spin_lock_irqsave(&nesqp->lock, flags);
+	spin_lock_irqsave(&nesqp->lock, flags);
 
 	head = nesqp->hwqp.sq_head;
 	qsize = nesqp->hwqp.sq_tail;
 
 	/* Check for SQ overflow */
 	if (((head + (2 * qsize) - nesqp->hwqp.sq_tail) % qsize) == (qsize - 1)) {
-			spin_unlock_irqrestore(&nesqp->lock, flags);
+		spin_unlock_irqrestore(&nesqp->lock, flags);
 		return -EINVAL;
 	}
 
@@ -269,7 +269,7 @@ static int nes_bind_mw(struct ib_qp *ibqp, struct ib_mw *ibmw,
 	nes_write32(nesdev->regs+NES_WQE_ALLOC,
 			(1 << 24) | 0x00800000 | nesqp->hwqp.qp_id);
 
-		spin_unlock_irqrestore(&nesqp->lock, flags);
+	spin_unlock_irqrestore(&nesqp->lock, flags);
 
 	return 0;
 }
@@ -349,7 +349,7 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd,
 			if (nesfmr->nesmr.pbls_used > nesadapter->free_4kpbl) {
 				spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
 				ret = -ENOMEM;
-				goto failed_vpbl_alloc;
+				goto failed_vpbl_avail;
 			} else {
 				nesadapter->free_4kpbl -= nesfmr->nesmr.pbls_used;
 			}
@@ -357,7 +357,7 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd,
 			if (nesfmr->nesmr.pbls_used > nesadapter->free_256pbl) {
 				spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
 				ret = -ENOMEM;
-				goto failed_vpbl_alloc;
+				goto failed_vpbl_avail;
 			} else {
 				nesadapter->free_256pbl -= nesfmr->nesmr.pbls_used;
 			}
@@ -391,14 +391,14 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd,
 			goto failed_vpbl_alloc;
 		}
 
-		nesfmr->root_vpbl.leaf_vpbl = kzalloc(sizeof(*nesfmr->root_vpbl.leaf_vpbl)*1024, GFP_KERNEL);
+		nesfmr->leaf_pbl_cnt = nesfmr->nesmr.pbls_used-1;
+		nesfmr->root_vpbl.leaf_vpbl = kzalloc(sizeof(*nesfmr->root_vpbl.leaf_vpbl)*1024, GFP_ATOMIC);
 		if (!nesfmr->root_vpbl.leaf_vpbl) {
 			spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
 			ret = -ENOMEM;
 			goto failed_leaf_vpbl_alloc;
 		}
 
-		nesfmr->leaf_pbl_cnt = nesfmr->nesmr.pbls_used-1;
 		nes_debug(NES_DBG_MR, "two level pbl, root_vpbl.pbl_vbase=%p"
 				" leaf_pbl_cnt=%d root_vpbl.leaf_vpbl=%p\n",
 				nesfmr->root_vpbl.pbl_vbase, nesfmr->leaf_pbl_cnt, nesfmr->root_vpbl.leaf_vpbl);
@@ -519,6 +519,16 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd,
 				nesfmr->root_vpbl.pbl_pbase);
 
 	failed_vpbl_alloc:
+	if (nesfmr->nesmr.pbls_used != 0) {
+		spin_lock_irqsave(&nesadapter->pbl_lock, flags);
+		if (nesfmr->nesmr.pbl_4k)
+			nesadapter->free_4kpbl += nesfmr->nesmr.pbls_used;
+		else
+			nesadapter->free_256pbl += nesfmr->nesmr.pbls_used;
+		spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
+	}
+
+failed_vpbl_avail:
 	kfree(nesfmr);
 
 	failed_fmr_alloc:
@@ -534,18 +544,14 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd,
  */
 static int nes_dealloc_fmr(struct ib_fmr *ibfmr)
 {
+	unsigned long flags;
 	struct nes_mr *nesmr = to_nesmr_from_ibfmr(ibfmr);
 	struct nes_fmr *nesfmr = to_nesfmr(nesmr);
 	struct nes_vnic *nesvnic = to_nesvnic(ibfmr->device);
 	struct nes_device *nesdev = nesvnic->nesdev;
-	struct nes_mr temp_nesmr = *nesmr;
+	struct nes_adapter *nesadapter = nesdev->nesadapter;
 	int i = 0;
 
-	temp_nesmr.ibmw.device = ibfmr->device;
-	temp_nesmr.ibmw.pd = ibfmr->pd;
-	temp_nesmr.ibmw.rkey = ibfmr->rkey;
-	temp_nesmr.ibmw.uobject = NULL;
-
 	/* free the resources */
 	if (nesfmr->leaf_pbl_cnt == 0) {
 		/* single PBL case */
@@ -561,8 +567,24 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr)
 		pci_free_consistent(nesdev->pcidev, 8192, nesfmr->root_vpbl.pbl_vbase,
 				nesfmr->root_vpbl.pbl_pbase);
 	}
+	nesmr->ibmw.device = ibfmr->device;
+	nesmr->ibmw.pd = ibfmr->pd;
+	nesmr->ibmw.rkey = ibfmr->rkey;
+	nesmr->ibmw.uobject = NULL;
 
-	return nes_dealloc_mw(&temp_nesmr.ibmw);
+	if (nesfmr->nesmr.pbls_used != 0) {
+		spin_lock_irqsave(&nesadapter->pbl_lock, flags);
+		if (nesfmr->nesmr.pbl_4k) {
+			nesadapter->free_4kpbl += nesfmr->nesmr.pbls_used;
+			WARN_ON(nesadapter->free_4kpbl > nesadapter->max_4kpbl);
+		} else {
+			nesadapter->free_256pbl += nesfmr->nesmr.pbls_used;
+			WARN_ON(nesadapter->free_256pbl > nesadapter->max_256pbl);
+		}
+		spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
+	}
+
+	return nes_dealloc_mw(&nesmr->ibmw);
 }
 
 
@@ -1595,7 +1617,7 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries,
 		nes_ucontext->mcrqf = req.mcrqf;
 		if (nes_ucontext->mcrqf) {
 			if (nes_ucontext->mcrqf & 0x80000000)
-				nescq->hw_cq.cq_number = nesvnic->nic.qp_id + 12 + (nes_ucontext->mcrqf & 0xf) - 1;
+				nescq->hw_cq.cq_number = nesvnic->nic.qp_id + 28 + 2 * ((nes_ucontext->mcrqf & 0xf) - 1);
 			else if (nes_ucontext->mcrqf & 0x40000000)
 				nescq->hw_cq.cq_number = nes_ucontext->mcrqf & 0xffff;
 			else
@@ -3212,7 +3234,7 @@ static int nes_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr,
 	if (nesqp->ibqp_state > IB_QPS_RTS)
 		return -EINVAL;
 
-		spin_lock_irqsave(&nesqp->lock, flags);
+	spin_lock_irqsave(&nesqp->lock, flags);
 
 	head = nesqp->hwqp.sq_head;
 
@@ -3337,7 +3359,7 @@ static int nes_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr,
 				(counter << 24) | 0x00800000 | nesqp->hwqp.qp_id);
 	}
 
-		spin_unlock_irqrestore(&nesqp->lock, flags);
+	spin_unlock_irqrestore(&nesqp->lock, flags);
 
 	if (err)
 		*bad_wr = ib_wr;
@@ -3368,7 +3390,7 @@ static int nes_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr,
 	if (nesqp->ibqp_state > IB_QPS_RTS)
 		return -EINVAL;
 
-		spin_lock_irqsave(&nesqp->lock, flags);
+	spin_lock_irqsave(&nesqp->lock, flags);
 
 	head = nesqp->hwqp.rq_head;
 
@@ -3421,7 +3443,7 @@ static int nes_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr,
 		nes_write32(nesdev->regs+NES_WQE_ALLOC, (counter<<24) | nesqp->hwqp.qp_id);
 	}
 
-		spin_unlock_irqrestore(&nesqp->lock, flags);
+	spin_unlock_irqrestore(&nesqp->lock, flags);
 
 	if (err)
 		*bad_wr = ib_wr;
@@ -3453,7 +3475,7 @@ static int nes_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry)
 
 	nes_debug(NES_DBG_CQ, "\n");
 
-		spin_lock_irqsave(&nescq->lock, flags);
+	spin_lock_irqsave(&nescq->lock, flags);
 
 	head = nescq->hw_cq.cq_head;
 	cq_size = nescq->hw_cq.cq_size;
@@ -3562,7 +3584,7 @@ static int nes_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry)
 	nes_debug(NES_DBG_CQ, "Reporting %u completions for CQ%u.\n",
 			cqe_count, nescq->hw_cq.cq_number);
 
-		spin_unlock_irqrestore(&nescq->lock, flags);
+	spin_unlock_irqrestore(&nescq->lock, flags);
 
 	return cqe_count;
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index fddded7..85257f6 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -106,12 +106,13 @@ int ipoib_open(struct net_device *dev)
 
 	ipoib_dbg(priv, "bringing up interface\n");
 
-	napi_enable(&priv->napi);
 	set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
 
 	if (ipoib_pkey_dev_delay_open(dev))
 		return 0;
 
+	napi_enable(&priv->napi);
+
 	if (ipoib_ib_dev_open(dev)) {
 		napi_disable(&priv->napi);
 		return -EINVAL;
@@ -546,6 +547,7 @@ static int path_rec_start(struct net_device *dev,
 	if (path->query_id < 0) {
 		ipoib_warn(priv, "ib_sa_path_rec_get failed: %d\n", path->query_id);
 		path->query = NULL;
+		complete(&path->done);
 		return path->query_id;
 	}
 
@@ -662,7 +664,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 			skb_push(skb, sizeof *phdr);
 			__skb_queue_tail(&path->queue, skb);
 
-			if (path_rec_start(dev, path)) {
+			if (!path->query && path_rec_start(dev, path)) {
 				spin_unlock_irqrestore(&priv->lock, flags);
 				path_free(dev, path);
 				return;
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index fa431fa..56a2e21 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -87,6 +87,9 @@ enum {
 
 #ifdef CONFIG_MLX4_DEBUG
 extern int mlx4_debug_level;
+#else /* CONFIG_MLX4_DEBUG */
+#define mlx4_debug_level	(0)
+#endif /* CONFIG_MLX4_DEBUG */
 
 #define mlx4_dbg(mdev, format, arg...)					\
 	do {								\
@@ -94,12 +97,6 @@ extern int mlx4_debug_level;
 			dev_printk(KERN_DEBUG, &mdev->pdev->dev, format, ## arg); \
 	} while (0)
 
-#else /* CONFIG_MLX4_DEBUG */
-
-#define mlx4_dbg(mdev, format, arg...) do { (void) mdev; } while (0)
-
-#endif /* CONFIG_MLX4_DEBUG */
-
 #define mlx4_err(mdev, format, arg...) \
 	dev_err(&mdev->pdev->dev, format, ## arg)
 #define mlx4_info(mdev, format, arg...) \


From kelly at tradebotsystems.com  Tue Nov  4 13:37:36 2008
From: kelly at tradebotsystems.com (Kelly Burkhart)
Date: Tue, 4 Nov 2008 15:37:36 -0600
Subject: [ofa-general] infiniband multicast (libibverbs)
Message-ID: <98B0CDCB28A5EE4CB3678CD99406644E343489@tbmail2.tradebot.com>

 
> -----Original Message-----
> From: Dotan Barak [mailto:dotanba at gmail.com] 
> 
> Kelly Burkhart wrote:
> > If I specify more than about 6500 messages/sec, I skip some sequences
> > and receive others multiple times.  I always receive the same number of
> > messages the client sent.  It appears as though all of the messages come
> > through, but I'm missing some and reading others twice.
> >   
> Do you use the "volatile" when you access the pointed memory buffer?

I do not.  I noticed this with the post_buf and poll_buf variables
in pingpong_context, but they're not used in send_lat.  I assumed
they only applied to RDMA.

Do I need to be using volatile anywhere with UD send?


> > I suspect that there is some trick to more reliable multicast messaging
> > that I don't know about.  Does anyone have hints for multicasting high
> > message rates with a small percentage of drops or misses?
> >   
> Do you have worse results than ib_send_bw.c?
> Can you try to send unicast messages (with minimal changes) to see if 
> the issue is related to multicast send?
> 
> Anyway, you should remember that multicast messages are sent over
> UD QPs and messages can be dropped.

I solved (or hid) my problem by recv-ing into multiple buffers.  I do
realize that multicast messages can be dropped, but I want to know what
level of one-way reliability and message rate I can achieve.

Since I was receiving all messages before, I don't think my results 
were different than ib_send_bw.  My problem was not realizing that my 
buffer could be clobbered prior to me polling the cq for the work
completion.
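
For measuring one-way reliability, a small receiver-side tracker that
separates missed sequence numbers from repeated ones is probably enough.
This is only a sketch under assumptions: the sender puts a 32-bit
sequence number starting at 0 in every message, wraparound is ignored,
and struct seq_stats and seq_observe are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct seq_stats {
	uint32_t next_expected;	/* sequence number expected next */
	uint64_t received;	/* new messages accepted */
	uint64_t missed;	/* sequence numbers skipped over */
	uint64_t repeated;	/* messages older than next_expected */
};

/* Classify one received sequence number and update the counters. */
static void seq_observe(struct seq_stats *st, uint32_t seq)
{
	assert(st != NULL);
	if (seq < st->next_expected) {
		/* Already seen (or arrived late): the "read twice" symptom. */
		st->repeated++;
		return;
	}
	/* Everything between next_expected and seq never showed up. */
	st->missed += seq - st->next_expected;
	st->received++;
	st->next_expected = seq + 1;
}
```

A clobbered buffer shows up as matching missed and repeated counts with
the total message count unchanged, whereas a real drop on the UD QP
raises missed with no corresponding repeat.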

Thanks,

-K


From devesh28 at gmail.com  Tue Nov  4 21:46:36 2008
From: devesh28 at gmail.com (Devesh Sharma)
Date: Wed, 5 Nov 2008 11:16:36 +0530
Subject: [ofa-general] infiniband multicast (libibverbs)
In-Reply-To: <98B0CDCB28A5EE4CB3678CD99406644E343488@tbmail2.tradebot.com>
References: <Ack+oMHXunbAnZPAR0KK6ZipPjMrAAAIV5NA>
	<98B0CDCB28A5EE4CB3678CD99406644E343488@tbmail2.tradebot.com>
Message-ID: <309a667c0811042146m56c1a1d4od0e03a823c4ff098@mail.gmail.com>

Are you taking into account that ibv_poll_cq is not a blocking call? I
mean, you are not considering it as a blocking call and just going ahead
with the sequence number check?

On 11/5/08, Kelly Burkhart <kelly at tradebotsystems.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Roland Dreier [mailto:rdreier at cisco.com]
> >
> >  > If I specify more than about 6500 messages/sec, I skip some sequences
> >  > and receive others multiple times.  I always receive the same number of
> >  > messages the client sent.  It appears as though all of the messages come
> >  > through, but I'm missing some and reading others twice.
> >
> > Sounds like a bug in your code -- I don't know why you would see
> > duplicate messages unless you are somehow processing the same receive
> > buffer twice or something like that.
>
> I am (or was) processing the same buffer over and over.  I ripped
> from ib_send_lat which does the same thing.  The difference is
> send_lat waits for a reply before sending a second message.  I'm
> sending rapidly without waiting for a reply.  The surprising thing
> to me was that my recv buffer received data ahead of me waiting on
> the cq.
>
> I modified my code to read into a circular list of buffers which
> appears to have solved the problem at the cost of more memory usage.
>
> Thanks,
>
> -K
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From devesh28 at gmail.com  Tue Nov  4 21:49:19 2008
From: devesh28 at gmail.com (Devesh Sharma)
Date: Wed, 5 Nov 2008 11:19:19 +0530
Subject: [ofa-general] infiniband multicast (libibverbs)
In-Reply-To: <309a667c0811042146m56c1a1d4od0e03a823c4ff098@mail.gmail.com>
References: <98B0CDCB28A5EE4CB3678CD99406644E343488@tbmail2.tradebot.com>
	<309a667c0811042146m56c1a1d4od0e03a823c4ff098@mail.gmail.com>
Message-ID: <309a667c0811042149x498f4f9fhfa74330a94ce59ea@mail.gmail.com>

Correction to my post: I mean, you are not considering it as a non-blocking
call (not taking care of this behaviour) and just going ahead with the
sequence number check?

On 11/5/08, Devesh Sharma <devesh28 at gmail.com> wrote:
>
> are you taking care that ibv_poll_cq is not a blocking call, I mean you are
> not considering it as blocking call and just going ahead with the sequence
> number check?
>
> On 11/5/08, Kelly Burkhart <kelly at tradebotsystems.com> wrote:
>>
>>
>>
>> > -----Original Message-----
>> > From: Roland Dreier [mailto:rdreier at cisco.com]
>> >
>> >  > If I specify more than about 6500 messages/sec, I skip some sequences
>> >  > and receive others multiple times.  I always receive the same number of
>> >  > messages the client sent.  It appears as though all of the messages come
>> >  > through, but I'm missing some and reading others twice.
>> >
>> > Sounds like a bug in your code -- I don't know why you would see
>> > duplicate messages unless you are somehow processing the same receive
>> > buffer twice or something like that.
>>
>> I am (or was) processing the same buffer over and over.  I ripped
>> from ib_send_lat which does the same thing.  The difference is
>> send_lat waits for a reply before sending a second message.  I'm
>> sending rapidly without waiting for a reply.  The surprising thing
>> to me was that my recv buffer received data ahead of me waiting on
>> the cq.
>>
>> I modified my code to read into a circular list of buffers which
>> appears to have solved the problem at the cost of more memory usage.
>>
>> Thanks,
>>
>> -K
>
>

From vlad at lists.openfabrics.org  Wed Nov  5 03:20:53 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed,  5 Nov 2008 03:20:53 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081105-0200 daily build status
Message-ID: <20081105112053.5EAACE60CB2@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.27
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From jackm at dev.mellanox.co.il  Wed Nov  5 04:44:01 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 5 Nov 2008 14:44:01 +0200
Subject: [ofa-general] [PATCH V2] mlx4: save default port ib capabilities,
	and use when setting port type to IB.
In-Reply-To: <adaprlbtjaz.fsf@cisco.com>
References: <200811041214.39085.jackm@dev.mellanox.co.il>
	<adaprlbtjaz.fsf@cisco.com>
Message-ID: <200811051444.02306.jackm@dev.mellanox.co.il>

mlx4: save default port ib capabilities, and use when setting port type to IB.

Commit 7ff93f8b7... introduced support for different port types.
As part of that support, SET_PORT is invoked to set the port type
during driver startup.  However, as a side-effect, for IB ports
the invocation of this command also sets the port capability mask
to zero (losing the default configuration values set by FW).

To fix this, get the default ib port capabilities (via a MAD_IFC
Port Info query) during driver startup, and save them for use in
the mlx4_SET_PORT command when setting the port-type to InfiniBand.

This patch does not require a firmware modification to the
ConnectX SET_PORT command (per Roland's feedback on previous proposed fix).

This patch fixes bugzilla 1183 (which occurred because the
IsTrapSupported bit in the capability mask was zeroed).

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 468921b..90a0281 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -753,6 +753,7 @@ static int mlx4_setup_hca(struct mlx4_dev *dev)
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	int err;
 	int port;
+	__be32 ib_port_default_caps;
 
 	err = mlx4_init_uar_table(dev);
 	if (err) {
@@ -852,6 +853,13 @@ static int mlx4_setup_hca(struct mlx4_dev *dev)
 	}
 
 	for (port = 1; port <= dev->caps.num_ports; port++) {
+		ib_port_default_caps = 0;
+		err = mlx4_get_port_ib_caps(dev, port, &ib_port_default_caps);
+		if (err)
+			mlx4_warn(dev, "failed to get port %d default "
+				  "ib capabilities (%d). Continuing with "
+				  "caps = 0\n", port, err);
+		dev->caps.ib_port_def_cap[port] = ib_port_default_caps;
 		err = mlx4_SET_PORT(dev, port);
 		if (err) {
 			mlx4_err(dev, "Failed to set port %d, aborting\n",
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index fa431fa..183ab9d 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -388,5 +388,6 @@ void mlx4_init_mac_table(struct mlx4_dev *dev, struct mlx4_mac_table *table);
 void mlx4_init_vlan_table(struct mlx4_dev *dev, struct mlx4_vlan_table *table);
 
 int mlx4_SET_PORT(struct mlx4_dev *dev, u8 port);
+int mlx4_get_port_ib_caps(struct mlx4_dev *dev, u8 port, __be32 *caps);
 
 #endif /* MLX4_H */
diff --git a/drivers/net/mlx4/port.c b/drivers/net/mlx4/port.c
index e2fdab4..0a057e5 100644
--- a/drivers/net/mlx4/port.c
+++ b/drivers/net/mlx4/port.c
@@ -258,6 +258,42 @@ out:
 }
 EXPORT_SYMBOL_GPL(mlx4_unregister_vlan);
 
+int mlx4_get_port_ib_caps(struct mlx4_dev *dev, u8 port, __be32 *caps)
+{
+	struct mlx4_cmd_mailbox *inmailbox, *outmailbox;
+	u8 *inbuf, *outbuf;
+	int err;
+
+	inmailbox = mlx4_alloc_cmd_mailbox(dev);
+	if (IS_ERR(inmailbox))
+		return PTR_ERR(inmailbox);
+
+	outmailbox = mlx4_alloc_cmd_mailbox(dev);
+	if (IS_ERR(outmailbox)) {
+		mlx4_free_cmd_mailbox(dev, inmailbox);
+		return PTR_ERR(outmailbox);
+	}
+
+	inbuf = inmailbox->buf;
+	outbuf = outmailbox->buf;
+	memset(inbuf, 0, 256);
+	memset(outbuf, 0, 256);
+	inbuf[0] = 1;
+	inbuf[1] = 1;
+	inbuf[2] = 1;
+	inbuf[3] = 1;
+	*(__be16 *) (&inbuf[16]) = cpu_to_be16(0x0015);
+	*(__be32 *) (&inbuf[20]) = cpu_to_be32(port);
+
+	err = mlx4_cmd_box(dev, inmailbox->dma, outmailbox->dma, port, 3,
+			   MLX4_CMD_MAD_IFC, MLX4_CMD_TIME_CLASS_C);
+	if (!err)
+		*caps = *(__be32 *) (outbuf + 84);
+	mlx4_free_cmd_mailbox(dev, inmailbox);
+	mlx4_free_cmd_mailbox(dev, outmailbox);
+	return err;
+}
+
 int mlx4_SET_PORT(struct mlx4_dev *dev, u8 port)
 {
 	struct mlx4_cmd_mailbox *mailbox;
@@ -273,7 +309,8 @@ int mlx4_SET_PORT(struct mlx4_dev *dev, u8 port)
 		((u8 *) mailbox->buf)[3] = 6;
 		((__be16 *) mailbox->buf)[4] = cpu_to_be16(1 << 15);
 		((__be16 *) mailbox->buf)[6] = cpu_to_be16(1 << 15);
-	}
+	} else
+		((__be32 *) mailbox->buf)[1] = dev->caps.ib_port_def_cap[port];
 	err = mlx4_cmd(dev, mailbox->dma, port, is_eth, MLX4_CMD_SET_PORT,
 		       MLX4_CMD_TIME_CLASS_B);
 
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index bd9977b..371086f 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -179,6 +179,7 @@ struct mlx4_caps {
 	int			num_ports;
 	int			vl_cap[MLX4_MAX_PORTS + 1];
 	int			ib_mtu_cap[MLX4_MAX_PORTS + 1];
+	__be32			ib_port_def_cap[MLX4_MAX_PORTS + 1];
 	u64			def_mac[MLX4_MAX_PORTS + 1];
 	int			eth_mtu_cap[MLX4_MAX_PORTS + 1];
 	int			gid_table_len[MLX4_MAX_PORTS + 1];


From vlad at mellanox.co.il  Wed Nov  5 04:57:34 2008
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 5 Nov 2008 14:57:34 +0200
Subject: [ofa-general] [PATCH] IB/mlx4: Set umem field to NULL in
	mlx4_ib_alloc_fast_reg_mr to avoid oops
Message-ID: <20081105125734.GA23862@mellanox.co.il>

Set mr->umem to NULL in mlx4_ib_alloc_fast_reg_mr, to avoid invoking
ib_umem_release() during ib_dereg_mr() and getting a kernel oops.

Signed-off-by: Vladimir Sokolovsky <vlad at mellanox.co.il>
---
 drivers/infiniband/hw/mlx4/mr.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 87f5c5a..8e4d26d 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -205,6 +205,7 @@ struct ib_mr *mlx4_ib_alloc_fast_reg_mr(struct ib_pd *pd,
 		goto err_mr;
 
 	mr->ibmr.rkey = mr->ibmr.lkey = mr->mmr.key;
+	mr->umem = NULL;
 
 	return &mr->ibmr;
 
-- 
1.5.6.3


From kelly at tradebotsystems.com  Wed Nov  5 05:53:15 2008
From: kelly at tradebotsystems.com (Kelly Burkhart)
Date: Wed, 5 Nov 2008 07:53:15 -0600
Subject: [ofa-general] infiniband multicast (libibverbs)
Message-ID: <98B0CDCB28A5EE4CB3678CD99406644E34348A@tbmail2.tradebot.com>

It is non-blocking.  I spin, calling ibv_poll_cq until it returns a
non-zero value.
 

________________________________

	From: Devesh Sharma [mailto:devesh28 at gmail.com] 
	Sent: Tuesday, November 04, 2008 11:49 PM
	To: Kelly Burkhart
	Cc: Roland Dreier; general at lists.openfabrics.org
	Subject: Re: [ofa-general] infiniband multicast (libibverbs)
	
	
	Correction in my post :  I mean you are not considering it as
non-blocking call (not taking care of this behaviour) and just going
ahead with the sequence number check? 
	
	
	On 11/5/08, Devesh Sharma <devesh28 at gmail.com> wrote: 

		are you taking care that ibv_poll_cq is not a blocking
call, I mean you are not considering it as blocking call and just going
ahead with the sequence number check? 


From yevgenyp at mellanox.co.il  Wed Nov  5 06:48:36 2008
From: yevgenyp at mellanox.co.il (Yevgeny Petrilin)
Date: Wed, 05 Nov 2008 16:48:36 +0200
Subject: [ofa-general] [PATCH] mlx4_en: Pause parameters per port
Message-ID: <4911B244.30205@mellanox.co.il>

Before the change the driver reported the same pause parameters
for all the ports, even if only one of them was modified.

Signed-off-by: Yevgeny Petrilin <yevgenyp at mellanox.co.il>
---
 drivers/net/mlx4/en_netdev.c |    8 ++++----
 drivers/net/mlx4/en_params.c |   30 ++++++++++++++++--------------
 drivers/net/mlx4/mlx4_en.h   |    8 ++++----
 3 files changed, 24 insertions(+), 22 deletions(-)

diff --git a/drivers/net/mlx4/en_netdev.c b/drivers/net/mlx4/en_netdev.c
index a339afb..12d736a 100644
--- a/drivers/net/mlx4/en_netdev.c
+++ b/drivers/net/mlx4/en_netdev.c
@@ -656,10 +656,10 @@ static int mlx4_en_start_port(struct net_device *dev)
 	/* Configure port */
 	err = mlx4_SET_PORT_general(mdev->dev, priv->port,
 				    priv->rx_skb_size + ETH_FCS_LEN,
-				    mdev->profile.tx_pause,
-				    mdev->profile.tx_ppp,
-				    mdev->profile.rx_pause,
-				    mdev->profile.rx_ppp);
+				    priv->prof->tx_pause,
+				    priv->prof->tx_ppp,
+				    priv->prof->rx_pause,
+				    priv->prof->rx_ppp);
 	if (err) {
 		mlx4_err(mdev, "Failed setting port general configurations"
 			       " for port %d, with error %d\n", priv->port, err);
diff --git a/drivers/net/mlx4/en_params.c b/drivers/net/mlx4/en_params.c
index c2e69b1..95706ee 100644
--- a/drivers/net/mlx4/en_params.c
+++ b/drivers/net/mlx4/en_params.c
@@ -90,6 +90,7 @@ MLX4_EN_PARM_INT(rx_ring_size2, MLX4_EN_AUTO_CONF, "Rx ring size for port 2");
 int mlx4_en_get_profile(struct mlx4_en_dev *mdev)
 {
 	struct mlx4_en_profile *params = &mdev->profile;
+	int i;

 	params->rx_moder_cnt = min_t(int, rx_moder_cnt, MLX4_EN_AUTO_CONF);
 	params->rx_moder_time = min_t(int, rx_moder_time, MLX4_EN_AUTO_CONF);
@@ -97,11 +98,13 @@ int mlx4_en_get_profile(struct mlx4_en_dev *mdev)
 	params->rss_xor = (rss_xor != 0);
 	params->rss_mask = rss_mask & 0x1f;
 	params->num_lro = min_t(int, num_lro , MLX4_EN_MAX_LRO_DESCRIPTORS);
-	params->rx_pause = pprx;
-	params->rx_ppp = pfcrx;
-	params->tx_pause = pptx;
-	params->tx_ppp = pfctx;
-	if (params->rx_ppp || params->tx_ppp) {
+	for (i = 1; i <= MLX4_MAX_PORTS; i++) {
+		params->prof[i].rx_pause = pprx;
+		params->prof[i].rx_ppp = pfcrx;
+		params->prof[i].tx_pause = pptx;
+		params->prof[i].tx_ppp = pfctx;
+	}
+	if (pfcrx || pfctx) {
 		params->prof[1].tx_ring_num = MLX4_EN_TX_RING_NUM;
 		params->prof[2].tx_ring_num = MLX4_EN_TX_RING_NUM;
 	} else {
@@ -407,14 +410,14 @@ static int mlx4_en_set_pauseparam(struct net_device *dev,
 	struct mlx4_en_dev *mdev = priv->mdev;
 	int err;

-	mdev->profile.tx_pause = pause->tx_pause != 0;
-	mdev->profile.rx_pause = pause->rx_pause != 0;
+	priv->prof->tx_pause = pause->tx_pause != 0;
+	priv->prof->rx_pause = pause->rx_pause != 0;
 	err = mlx4_SET_PORT_general(mdev->dev, priv->port,
 				    priv->rx_skb_size + ETH_FCS_LEN,
-				    mdev->profile.tx_pause,
-				    mdev->profile.tx_ppp,
-				    mdev->profile.rx_pause,
-				    mdev->profile.rx_ppp);
+				    priv->prof->tx_pause,
+				    priv->prof->tx_ppp,
+				    priv->prof->rx_pause,
+				    priv->prof->rx_ppp);
 	if (err)
 		mlx4_err(mdev, "Failed setting pause params to\n");

@@ -425,10 +428,9 @@ static void mlx4_en_get_pauseparam(struct net_device *dev,
 				 struct ethtool_pauseparam *pause)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	struct mlx4_en_dev *mdev = priv->mdev;

-	pause->tx_pause = mdev->profile.tx_pause;
-	pause->rx_pause = mdev->profile.rx_pause;
+	pause->tx_pause = priv->prof->tx_pause;
+	pause->rx_pause = priv->prof->rx_pause;
 }

 static void mlx4_en_get_ringparam(struct net_device *dev,
diff --git a/drivers/net/mlx4/mlx4_en.h b/drivers/net/mlx4/mlx4_en.h
index 11fb17c..98ddc08 100644
--- a/drivers/net/mlx4/mlx4_en.h
+++ b/drivers/net/mlx4/mlx4_en.h
@@ -322,6 +322,10 @@ struct mlx4_en_port_profile {
 	u32 rx_ring_num;
 	u32 tx_ring_size;
 	u32 rx_ring_size;
+	u8 rx_pause;
+	u8 rx_ppp;
+	u8 tx_pause;
+	u8 tx_ppp;
 };

 struct mlx4_en_profile {
@@ -333,10 +337,6 @@ struct mlx4_en_profile {
 	int rx_moder_cnt;
 	int rx_moder_time;
 	int auto_moder;
-	u8 rx_pause;
-	u8 rx_ppp;
-	u8 tx_pause;
-	u8 tx_ppp;
 	u8 no_reset;
 	struct mlx4_en_port_profile prof[MLX4_MAX_PORTS + 1];
 };
-- 
1.5.4


From yevgenyp at mellanox.co.il  Wed Nov  5 06:53:50 2008
From: yevgenyp at mellanox.co.il (Yevgeny Petrilin)
Date: Wed, 05 Nov 2008 16:53:50 +0200
Subject: [ofa-general] [PATCH] mlx4_en: Start port error flow bug fix
Message-ID: <4911B37E.3020900@mellanox.co.il>

The start-port error flow tried to deactivate an rx ring that wasn't
activated, because it used the wrong index.

Signed-off-by: Yevgeny Petrilin <yevgenyp at mellanox.co.il>
---
 drivers/net/mlx4/en_netdev.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/mlx4/en_netdev.c b/drivers/net/mlx4/en_netdev.c
index 12d736a..96e709d 100644
--- a/drivers/net/mlx4/en_netdev.c
+++ b/drivers/net/mlx4/en_netdev.c
@@ -706,7 +706,7 @@ tx_err:
 	mlx4_en_release_rss_steer(priv);
 rx_err:
 	for (i = 0; i < priv->rx_ring_num; i++)
-		mlx4_en_deactivate_rx_ring(priv, &priv->rx_ring[rx_index]);
+		mlx4_en_deactivate_rx_ring(priv, &priv->rx_ring[i]);
 cq_err:
 	while (rx_index--)
 		mlx4_en_deactivate_cq(priv, &priv->rx_cq[rx_index]);
-- 
1.5.4


From rdreier at cisco.com  Wed Nov  5 09:58:58 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Nov 2008 09:58:58 -0800
Subject: [ofa-general] Re: [PATCH] mlx4_en: Pause parameters per port
In-Reply-To: <4911B244.30205@mellanox.co.il> (Yevgeny Petrilin's message of
	"Wed, 05 Nov 2008 16:48:36 +0200")
References: <4911B244.30205@mellanox.co.il>
Message-ID: <adamygerr4t.fsf@cisco.com>

Jeff, please go ahead and merge both of these mlx4_en patches.

Yevgeny, I think it would be helpful for you to say (after the --- line,
so the git tools strip it automatically) whether I should merge the
patch or if it's something for Jeff, just to make things smoother and
clearer.

 - R.


From rdreier at cisco.com  Wed Nov  5 10:57:22 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Nov 2008 10:57:22 -0800
Subject: [ofa-general] Re: [PATCH] IB/mlx4: Set umem field to NULL in
	mlx4_ib_alloc_fast_reg_mr to avoid oops
In-Reply-To: <20081105125734.GA23862@mellanox.co.il> (Vladimir Sokolovsky's
	message of "Wed, 5 Nov 2008 14:57:34 +0200")
References: <20081105125734.GA23862@mellanox.co.il>
Message-ID: <adaabcerofh.fsf@cisco.com>

thanks, applied


From kelly at tradebotsystems.com  Wed Nov  5 11:16:56 2008
From: kelly at tradebotsystems.com (Kelly Burkhart)
Date: Wed, 5 Nov 2008 13:16:56 -0600
Subject: [ofa-general] Managing work completions (libibverbs)
Message-ID: <98B0CDCB28A5EE4CB3678CD99406644E343497@tbmail2.tradebot.com>

I'm now trying to work out the best approach for managing completions
without spinning on ibv_poll_cq.

I think what I want to do is create a completion channel and operate
similarly to the last example in the ibv_ack_cq_events man page.

The man page states that ibv_ack_cq_events is mandatory, however, the
examples in perftest don't ack when in event mode.  Is this a bug in
the perftest programs or a bug in the man page?

Is it possible to use epoll to block on struct ibv_comp_channel::fd, then
use ibv_poll_cq to grab completions when epoll wakes up the process?  Then
(I hope) it would be unnecessary to call ibv_get/ack_cq_event(s).  Or is
it necessary to call these functions in place of ibv_poll_cq when a
completion channel is used?

Again, thanks for your advice,

-K


From rdreier at cisco.com  Wed Nov  5 11:35:57 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 05 Nov 2008 11:35:57 -0800
Subject: [ofa-general] Managing work completions (libibverbs)
In-Reply-To: <98B0CDCB28A5EE4CB3678CD99406644E343497@tbmail2.tradebot.com>
	(Kelly Burkhart's message of "Wed, 5 Nov 2008 13:16:56 -0600")
References: <98B0CDCB28A5EE4CB3678CD99406644E343497@tbmail2.tradebot.com>
Message-ID: <ada63n2rmn6.fsf@cisco.com>

 > The man page states that ibv_ack_cq_events is mandatory, however, the
 > examples in perftest don't ack when in event mode.  Is this a bug in
 > the perftest programs or a bug in the man page?

I guess it would be a bug in the perftest programs, but the only need to
call ibv_ack_cq_events() is when destroying a CQ -- ibv_destroy_cq()
will wait until all CQ events are ACKed before returning.

 > Is it possible use epoll to block on struct ibv_comp_channel::fd then
 > use ibv_poll_cq to grab completions when epoll wakes up the process?
 > Then (I hope) it would be unnecessary to call
 > ibv_get/ack_cq_event(s).  Or is it necessary to call these functions
 > in place of ibv_poll_cq when a completion channel is used?

You can use epoll to get comp channel events, but you'll need to collect
the event with ibv_get_cq_event() to rearm things.  epoll tells you when
the fd becomes readable, but you'll need to actually read all the
events queued on the fd before waiting again.  The overhead of
ibv_get_cq_event() should not be too high compared to the overhead of
sleeping and getting woken up again by an interrupt, and you can always
amortize ibv_ack_cq_events() by just keeping a counter of the number of
events you read and only calling ibv_ack_cq_events() occasionally.
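
The counter idea above can be sketched as follows.  This is a model of
the bookkeeping only, assuming the caller passes the returned counts to
ibv_ack_cq_events(); the ack_batch names and the threshold value are
hypothetical:

```c
#include <assert.h>

struct ack_batch {
    unsigned pending;   /* events collected but not yet acked */
    unsigned threshold; /* hypothetical batch size, must be > 0 */
};

/*
 * Record one event collected with ibv_get_cq_event().  Returns the
 * number of events to ack now (to be passed to ibv_ack_cq_events()),
 * or 0 if the batch is not full yet.
 */
static unsigned ack_batch_note(struct ack_batch *b)
{
    unsigned n;

    assert(b->threshold > 0);
    if (++b->pending < b->threshold)
        return 0;
    n = b->pending;
    b->pending = 0;
    return n;
}

/* Return whatever is still unacked, e.g. just before destroying the CQ. */
static unsigned ack_batch_flush(struct ack_batch *b)
{
    unsigned n = b->pending;

    b->pending = 0;
    return n;
}
```

An event loop along these lines would call ibv_get_cq_event() after
epoll reports the channel fd readable, call ibv_req_notify_cq() to
request the next notification, drain the CQ with ibv_poll_cq(), and then
call ack_batch_note().  ack_batch_flush() belongs just before
ibv_destroy_cq(), which waits until every event has been acked.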

 - R.


From kelly at tradebotsystems.com  Wed Nov  5 13:04:00 2008
From: kelly at tradebotsystems.com (Kelly Burkhart)
Date: Wed, 5 Nov 2008 15:04:00 -0600
Subject: [ofa-general] Managing work completions (libibverbs)
Message-ID: <98B0CDCB28A5EE4CB3678CD99406644E343499@tbmail2.tradebot.com>

 
> -----Original Message-----
> From: Roland Dreier [mailto:rdreier at cisco.com] 
> 
> You can use epoll to get comp channel events, but you'll need to collect
> the event with ibv_get_cq_event() to rearm things.  epoll tells you when
> the fd becomes readable, but you'll need to actually read all the
> events queued on the fd before waiting again.  The overhead of
> ibv_get_cq_event() should not be too high compared to the overhead of
> sleeping and getting woken up again by an interrupt, and you can always
> amortize ibv_ack_cq_events() by just keeping a counter of the number of
> events you read and only calling ibv_ack_cq_events() occasionally.


Digging through the code to see what resource I hog if I don't ack
frequently enough:  It appears that ibv_ack_cq_events only increments
an integer in the CQ (and doesn't free or return some resource).  So I
could just count gets and ack them all immediately prior to
destroying the CQ.

Why be so picky about matching acks with gets?

-K


From roland.list at gmail.com  Wed Nov  5 13:17:10 2008
From: roland.list at gmail.com (Roland Dreier)
Date: Wed, 5 Nov 2008 13:17:10 -0800
Subject: [ofa-general] Managing work completions (libibverbs)
In-Reply-To: <98B0CDCB28A5EE4CB3678CD99406644E343499@tbmail2.tradebot.com>
References: <98B0CDCB28A5EE4CB3678CD99406644E343499@tbmail2.tradebot.com>
Message-ID: <f8ca0a150811051317j3c274c8cnd40092a4db4a599f@mail.gmail.com>

Yes, exactly: by keeping your own count, you avoid the pthread lock
overhead in ack_events.

The acking of events is required to avoid a race where a consumer gets
an event for a CQ after destroying that CQ.

 - R.

On 11/5/08, Kelly Burkhart <kelly at tradebotsystems.com> wrote:
>
>
>> -----Original Message-----
>> From: Roland Dreier [mailto:rdreier at cisco.com]
>>
>> You can use epoll to get comp channel events, but you'll need to collect
>> the event with ibv_get_cq_event() to rearm things.  epoll tells you when
>> the fd becomes readable, but you'll need to actually read all the
>> events queued on the fd before waiting again.  The overhead of
>> ibv_get_cq_event() should not be too high compared to the overhead of
>> sleeping and getting woken up again by an interrupt, and you can always
>> amortize ibv_ack_cq_events() by just keeping a counter of the number of
>> events you read and only calling ibv_ack_cq_events() occasionally.
>
>
> Digging through the code to see what resource I hog if I don't ack
> frequently enough:  It appears that ibv_ack_cq_events only increments
> an integer in the CQ (and doesn't free or return some resource).  So I
> could just count gets and ack them all immediately prior to
> destructing the CQ.
>
> Why be so picky about matching acks with gets?
>
> -K

-- 
Sent from my mobile device


From John.Marshall at ec.gc.ca  Wed Nov  5 14:34:47 2008
From: John.Marshall at ec.gc.ca (John Marshall)
Date: Wed, 05 Nov 2008 22:34:47 +0000
Subject: [ofa-general] OOM problem with ib_ipoib?
In-Reply-To: <f8ca0a150810291111m22de5052u6de4497af9459d1f@mail.gmail.com>
References: <48FF6DFA.9080409@ec.gc.ca> <48FFA62D.3030305@ec.gc.ca>	
	<490083D0.5000807@ec.gc.ca> <aday708mmoa.fsf@cisco.com>	
	<490876DF.2020705@ec.gc.ca>
	<f8ca0a150810291111m22de5052u6de4497af9459d1f@mail.gmail.com>
Message-ID: <49121F87.6010204@ec.gc.ca>

Roland Dreier wrote:
>> The curious thing is that the OOM occurs even when the ib interfaces
>> are _not even UP_, although the ib_ipoib module is loaded. So, I cannot
>> see how it can be an allocation issue in such a case related to usage. Am I
>> missing something here?
>>     
>
> The IPoIB CM code allocates receive buffers even before the interface is brought
> up.  Maybe the wrong thing to do, but that's how the code is now at least.
>
>   
>> As well, shouldn't the OS handle this transparently via the pdflush which
>> will write out the data and free up memory? Or does the pdflush not
>> distinguish between total memory and low memory so that a problem
>> occurs (yet the OOM happens even when the interfaces are not UP!)?
>>     
>
> You may really have no free lowmem... keep in mind that the linux mm really
> does not behave well with 32G of RAM and a 32-bit kernel.  It's fundamentally
> an insane config and so no one tunes for it.
>   
Progress!

1) I have done further tests and am comfortable that the OOMs do not happen
on the x86-64 platform.

2) In more tests using the same equipment, but again with bigmem and given
your pointer on lowmem, I have found that if I tweak the system with a
sysctl setting of:
    vm.lowmem_reserve_ratio=128 128 32
things seem to work well. I do this on _both_ the server and the client
sides (lowmem issues also pop up on the client side when using nfs).

Thanks,
John


From John.Marshall at ec.gc.ca  Wed Nov  5 14:47:37 2008
From: John.Marshall at ec.gc.ca (John Marshall)
Date: Wed, 05 Nov 2008 22:47:37 +0000
Subject: [ofa-general] nfs/rdma slow with uncached data
Message-ID: <49122289.5030107@ec.gc.ca>

Hi,

I have done some nfs/rdma tests and found impressive transfer
rates, but only when the file data is in cache. On the other hand,
using straight nfs over ipoib I am able to get decent transfer
rates on first read (not cached) when I tweak:
    echo 128 > /proc/fs/nfsd/pool_threads
    echo 1024 > /sys/block/sd?/queue/nr_requests
    echo 16384 > /sys/block/sd?/queue/read_ahead_kb (or using blockdev with 8192)

My question: Does the nfs/rdma setup bypass the readahead
mechanism in the kernel?

If it does, this may
1) account for the major difference described above, and
2) explain why transfers are _very_ fast only on the second go-around:
     because the data is in cache (assuming it fits in the cache)

For both cases, I am using a 2.6.26 bigmem kernel with the necessary
tweaks.

Thanks,
John


From akepner at sgi.com  Wed Nov  5 17:23:07 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 5 Nov 2008 17:23:07 -0800
Subject: [ofa-general] [PATCH] ipoib: null tx/rx_ring skb pointers on free
Message-ID: <20081106012307.GP31163@sgi.com>


Way back in:

http://lists.openfabrics.org/pipermail/general/2008-May/050196.html

I described an IPoIB-related panic we were seeing on large 
clusters. The signature was a backtrace like this:

	skb_over_panic
	:ib_ipoib:ipoib_ib_handle_rx_wc
	:ib_ipoib:ipoib_poll
	net_rx_action
	.....

The bug is difficult to reproduce, but we finally got a crashdump, 
and the problem appears to be that stale skb pointers on the tx_ring 
were left pointing to skbs that had been since reused, so that the 
skb's data region was now unexpectedly short, etc. 
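
The failure mode can be illustrated with a small user-space model (not
the driver code): each ring slot owns at most one buffer, and clearing
the pointer when the buffer is freed means a later pass over the ring
cannot free or dereference it a second time.  struct tx_req here is a
hypothetical stand-in with only the skb field, TX_RING_SIZE is an
arbitrary value, and free() stands in for dev_kfree_skb_any():

```c
#include <assert.h>
#include <stdlib.h>

#define TX_RING_SIZE 4 /* hypothetical ring size */

/* Stand-in for the driver's per-slot state; only the pointer matters here. */
struct tx_req {
    void *skb;
};

/* Free the buffer owned by a slot and clear the now-stale pointer. */
static void tx_req_release(struct tx_req *req)
{
    free(req->skb);  /* dev_kfree_skb_any() in the driver */
    req->skb = NULL; /* nothing left for later code to misuse */
}

/*
 * Teardown pass: release every slot that still owns a buffer.
 * Returns the number of buffers released.
 */
static int tx_ring_drain(struct tx_req *ring, int n)
{
    int released = 0;

    assert(ring != NULL);
    for (int i = 0; i < n; i++) {
        if (ring[i].skb) {
            tx_req_release(&ring[i]);
            released++;
        }
    }
    return released;
}
```

Without the NULL assignment, the drain pass in this model would free a
buffer that a completion handler had already released.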

Recently LLNL reported something similar:

http://lists.openfabrics.org/pipermail/general/2008-October/054824.html

A patch similar to the following seems to fix things up. 

Ira, Al, if this looks OK, can you please sign off on it?

Signed-off-by: Arthur Kepner <akepner at sgi.com>

--- 

 ipoib_cm.c |    5 +++++
 ipoib_ib.c |    4 ++++
 2 files changed, 9 insertions(+)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 7b14c2c..8f8650b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -200,6 +200,7 @@ static void ipoib_cm_free_rx_ring(struct net_device *dev,
 			ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
 					      rx_ring[i].mapping);
 			dev_kfree_skb_any(rx_ring[i].skb);
+			rx_ring[i].skb = NULL;
 		}
 
 	vfree(rx_ring);
@@ -736,6 +737,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 	if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
 		++dev->stats.tx_errors;
 		dev_kfree_skb_any(skb);
+		tx_req->skb = NULL;
 		return;
 	}
 
@@ -747,6 +749,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 		++dev->stats.tx_errors;
 		ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
 		dev_kfree_skb_any(skb);
+		tx_req->skb = NULL;
 	} else {
 		dev->trans_start = jiffies;
 		++tx->tx_head;
@@ -785,6 +788,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 	dev->stats.tx_bytes += tx_req->skb->len;
 
 	dev_kfree_skb_any(tx_req->skb);
+	tx_req->skb = NULL;
 
 	netif_tx_lock(dev);
 
@@ -1179,6 +1183,7 @@ timeout:
 		ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len,
 				    DMA_TO_DEVICE);
 		dev_kfree_skb_any(tx_req->skb);
+		tx_req->skb = NULL;
 		++p->tx_tail;
 		netif_tx_lock_bh(p->dev);
 		if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 28eb6f0..f7e3497 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -383,6 +383,7 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 	dev->stats.tx_bytes += tx_req->skb->len;
 
 	dev_kfree_skb_any(tx_req->skb);
+	tx_req->skb = NULL;
 
 	++priv->tx_tail;
 	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
@@ -572,6 +573,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 	if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) {
 		++dev->stats.tx_errors;
 		dev_kfree_skb_any(skb);
+		tx_req->skb = NULL;
 		return;
 	}
 
@@ -594,6 +596,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 		--priv->tx_outstanding;
 		ipoib_dma_unmap_tx(priv->ca, tx_req);
 		dev_kfree_skb_any(skb);
+		tx_req->skb = NULL;
 		if (netif_queue_stopped(dev))
 			netif_wake_queue(dev);
 	} else {
@@ -833,6 +836,7 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 							(ipoib_sendq_size - 1)];
 				ipoib_dma_unmap_tx(priv->ca, tx_req);
 				dev_kfree_skb_any(tx_req->skb);
+				tx_req->skb = NULL;
 				++priv->tx_tail;
 				--priv->tx_outstanding;
 			}


From chu11 at llnl.gov  Wed Nov  5 17:46:03 2008
From: chu11 at llnl.gov (Al Chu)
Date: Wed, 05 Nov 2008 17:46:03 -0800
Subject: [ofa-general] Re: [PATCH] ipoib: null tx/rx_ring skb pointers on
	free
In-Reply-To: <20081106012307.GP31163@sgi.com>
References: <20081106012307.GP31163@sgi.com>
Message-ID: <1225935964.13371.5.camel@cardanus.llnl.gov>

Hey Arthur,

On Wed, 2008-11-05 at 17:23 -0800, akepner at sgi.com wrote:
> Way back in:
> 
> http://lists.openfabrics.org/pipermail/general/2008-May/050196.html
> 
> I described an IPoIB-related panic we were seeing on large 
> clusters. The signature was a backtrace like this:
> 
> 	skb_over_panic
> 	:ib_ipoib:ipoib_ib_handle_rx_wc
> 	:ib_ipoib:ipoib_poll
> 	net_rx_action
> 	.....
> 
> The bug is difficult to reproduce, but we finally got a crashdump, 
> and the problem appears to be that stale skb pointers on the tx_ring 
> were left pointing to skbs that had been since reused, so that the 
> skb's data region was now unexpectedly short, etc. 
> 
> Recently LLNL reported something similar:
> 
> http://lists.openfabrics.org/pipermail/general/2008-October/054824.html
> 
> A patch similar to the following seems to fix things up. 
> 
> Ira, Al, if this looks OK, can you please sign off on it?

Looks good to me.

Al

> Signed-off-by: Arthur Kepner <akepner at sgi.com>
> 
> --- 
> 
>  ipoib_cm.c |    5 +++++
>  ipoib_ib.c |    4 ++++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> index 7b14c2c..8f8650b 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> @@ -200,6 +200,7 @@ static void ipoib_cm_free_rx_ring(struct net_device *dev,
>  			ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
>  					      rx_ring[i].mapping);
>  			dev_kfree_skb_any(rx_ring[i].skb);
> +			rx_ring[i].skb = NULL;
>  		}
>  
>  	vfree(rx_ring);
> @@ -736,6 +737,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
>  	if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
>  		++dev->stats.tx_errors;
>  		dev_kfree_skb_any(skb);
> +		tx_req->skb = NULL;
>  		return;
>  	}
>  
> @@ -747,6 +749,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
>  		++dev->stats.tx_errors;
>  		ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
>  		dev_kfree_skb_any(skb);
> +		tx_req->skb = NULL;
>  	} else {
>  		dev->trans_start = jiffies;
>  		++tx->tx_head;
> @@ -785,6 +788,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
>  	dev->stats.tx_bytes += tx_req->skb->len;
>  
>  	dev_kfree_skb_any(tx_req->skb);
> +	tx_req->skb = NULL;
>  
>  	netif_tx_lock(dev);
>  
> @@ -1179,6 +1183,7 @@ timeout:
>  		ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len,
>  				    DMA_TO_DEVICE);
>  		dev_kfree_skb_any(tx_req->skb);
> +		tx_req->skb = NULL;
>  		++p->tx_tail;
>  		netif_tx_lock_bh(p->dev);
>  		if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> index 28eb6f0..f7e3497 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> @@ -383,6 +383,7 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
>  	dev->stats.tx_bytes += tx_req->skb->len;
>  
>  	dev_kfree_skb_any(tx_req->skb);
> +	tx_req->skb = NULL;
>  
>  	++priv->tx_tail;
>  	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
> @@ -572,6 +573,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
>  	if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) {
>  		++dev->stats.tx_errors;
>  		dev_kfree_skb_any(skb);
> +		tx_req->skb = NULL;
>  		return;
>  	}
>  
> @@ -594,6 +596,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
>  		--priv->tx_outstanding;
>  		ipoib_dma_unmap_tx(priv->ca, tx_req);
>  		dev_kfree_skb_any(skb);
> +		tx_req->skb = NULL;
>  		if (netif_queue_stopped(dev))
>  			netif_wake_queue(dev);
>  	} else {
> @@ -833,6 +836,7 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
>  							(ipoib_sendq_size - 1)];
>  				ipoib_dma_unmap_tx(priv->ca, tx_req);
>  				dev_kfree_skb_any(tx_req->skb);
> +				tx_req->skb = NULL;
>  				++priv->tx_tail;
>  				--priv->tx_outstanding;
>  			}
> 
-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory


From weiny2 at llnl.gov  Wed Nov  5 17:53:52 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 5 Nov 2008 17:53:52 -0800
Subject: [ofa-general] Re: [PATCH] ipoib: null tx/rx_ring skb pointers on
	free
In-Reply-To: <20081106012307.GP31163@sgi.com>
References: <20081106012307.GP31163@sgi.com>
Message-ID: <20081105175352.476ac69e.weiny2@llnl.gov>

On Wed, 5 Nov 2008 17:23:07 -0800
akepner at sgi.com wrote:

> 
> Way back in:
> 
> http://lists.openfabrics.org/pipermail/general/2008-May/050196.html
> 
> I described an IPoIB-related panic we were seeing on large 
> clusters. The signature was a backtrace like this:
> 
> 	skb_over_panic
> 	:ib_ipoib:ipoib_ib_handle_rx_wc
> 	:ib_ipoib:ipoib_poll
> 	net_rx_action
> 	.....
> 
> The bug is difficult to reproduce, but we finally got a crashdump, 
> and the problem appears to be that stale skb pointers on the tx_ring 
> were left pointing to skbs that had been since reused, so that the 
> skb's data region was now unexpectedly short, etc. 
> 
> Recently LLNL reported something similar:
> 
> http://lists.openfabrics.org/pipermail/general/2008-October/054824.html
> 
> A patch similar to the following seems to fix things up. 
> 
> Ira, Al, if this looks OK, can you please sign off on it?

Yep, looks good.

Ira

> 
> Signed-off-by: Arthur Kepner <akepner at sgi.com>
> 
> --- 
> 
>  ipoib_cm.c |    5 +++++
>  ipoib_ib.c |    4 ++++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> index 7b14c2c..8f8650b 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> @@ -200,6 +200,7 @@ static void ipoib_cm_free_rx_ring(struct net_device *dev,
>  			ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
>  					      rx_ring[i].mapping);
>  			dev_kfree_skb_any(rx_ring[i].skb);
> +			rx_ring[i].skb = NULL;
>  		}
>  
>  	vfree(rx_ring);
> @@ -736,6 +737,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
>  	if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
>  		++dev->stats.tx_errors;
>  		dev_kfree_skb_any(skb);
> +		tx_req->skb = NULL;
>  		return;
>  	}
>  
> @@ -747,6 +749,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
>  		++dev->stats.tx_errors;
>  		ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
>  		dev_kfree_skb_any(skb);
> +		tx_req->skb = NULL;
>  	} else {
>  		dev->trans_start = jiffies;
>  		++tx->tx_head;
> @@ -785,6 +788,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
>  	dev->stats.tx_bytes += tx_req->skb->len;
>  
>  	dev_kfree_skb_any(tx_req->skb);
> +	tx_req->skb = NULL;
>  
>  	netif_tx_lock(dev);
>  
> @@ -1179,6 +1183,7 @@ timeout:
>  		ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len,
>  				    DMA_TO_DEVICE);
>  		dev_kfree_skb_any(tx_req->skb);
> +		tx_req->skb = NULL;
>  		++p->tx_tail;
>  		netif_tx_lock_bh(p->dev);
>  		if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> index 28eb6f0..f7e3497 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> @@ -383,6 +383,7 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
>  	dev->stats.tx_bytes += tx_req->skb->len;
>  
>  	dev_kfree_skb_any(tx_req->skb);
> +	tx_req->skb = NULL;
>  
>  	++priv->tx_tail;
>  	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
> @@ -572,6 +573,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
>  	if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) {
>  		++dev->stats.tx_errors;
>  		dev_kfree_skb_any(skb);
> +		tx_req->skb = NULL;
>  		return;
>  	}
>  
> @@ -594,6 +596,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
>  		--priv->tx_outstanding;
>  		ipoib_dma_unmap_tx(priv->ca, tx_req);
>  		dev_kfree_skb_any(skb);
> +		tx_req->skb = NULL;
>  		if (netif_queue_stopped(dev))
>  			netif_wake_queue(dev);
>  	} else {
> @@ -833,6 +836,7 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
>  							(ipoib_sendq_size - 1)];
>  				ipoib_dma_unmap_tx(priv->ca, tx_req);
>  				dev_kfree_skb_any(tx_req->skb);
> +				tx_req->skb = NULL;
>  				++priv->tx_tail;
>  				--priv->tx_outstanding;
>  			}
> 


From jgarzik at pobox.com  Wed Nov  5 21:43:31 2008
From: jgarzik at pobox.com (Jeff Garzik)
Date: Thu, 06 Nov 2008 00:43:31 -0500
Subject: [ofa-general] Re: [PATCH] mlx4_en: Start port error flow bug fix
In-Reply-To: <4911B37E.3020900@mellanox.co.il>
References: <4911B37E.3020900@mellanox.co.il>
Message-ID: <49128403.6000205@pobox.com>

Yevgeny Petrilin wrote:
> Tried to deactivate rx ring that wasn't activated,
> used wrong index.
> 
> Signed-off-by: Yevgeny Petrilin <yevgenyp at mellanox.co.il>
> ---
>  drivers/net/mlx4/en_netdev.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/mlx4/en_netdev.c b/drivers/net/mlx4/en_netdev.c
> index 12d736a..96e709d 100644
> --- a/drivers/net/mlx4/en_netdev.c
> +++ b/drivers/net/mlx4/en_netdev.c
> @@ -706,7 +706,7 @@ tx_err:
>  	mlx4_en_release_rss_steer(priv);
>  rx_err:
>  	for (i = 0; i < priv->rx_ring_num; i++)
> -		mlx4_en_deactivate_rx_ring(priv, &priv->rx_ring[rx_index]);
> +		mlx4_en_deactivate_rx_ring(priv, &priv->rx_ring[i]);
>  cq_err:

applied
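
The effect of the one-line change can be seen in a small userspace model.
This is only a sketch: struct ring, activate_until() and the unwind_*()
helpers are hypothetical stand-ins, not mlx4_en code, and the loop bound
simply mirrors the quoted hunk.

```c
#include <assert.h>

struct ring {
	int active;            /* 1 once the ring has been activated */
	int deactivate_calls;  /* how often deactivate() touched this ring */
};

/* Mark rings 0..fail_at-1 active and the rest inactive, modelling a
 * start that failed at ring fail_at. Returns fail_at as "rx_index". */
int activate_until(struct ring *rings, int n, int fail_at)
{
	assert(fail_at >= 0 && fail_at < n);
	for (int i = 0; i < n; i++) {
		rings[i].active = (i < fail_at);
		rings[i].deactivate_calls = 0;
	}
	return fail_at;
}

void deactivate(struct ring *r)
{
	r->deactivate_calls++;
	r->active = 0;
}

/* Before the fix: the loop counts with i but indexes with rx_index. */
void unwind_buggy(struct ring *rings, int n, int rx_index)
{
	for (int i = 0; i < n; i++)
		deactivate(&rings[rx_index]);
}

/* After the fix: each iteration deactivates its own ring. */
void unwind_fixed(struct ring *rings, int n)
{
	for (int i = 0; i < n; i++)
		deactivate(&rings[i]);
}

int count_active(const struct ring *rings, int n)
{
	int active = 0;

	for (int i = 0; i < n; i++)
		active += rings[i].active;
	return active;
}
```

With four rings and a failure at ring 2, unwind_buggy() calls deactivate()
four times on ring 2, which was never activated, and leaves rings 0 and 1
active; unwind_fixed() visits each ring once.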


From jgarzik at pobox.com  Wed Nov  5 21:45:22 2008
From: jgarzik at pobox.com (Jeff Garzik)
Date: Thu, 06 Nov 2008 00:45:22 -0500
Subject: [ofa-general] Re: [PATCH] mlx4_en: Pause parameters per port
In-Reply-To: <4911B244.30205@mellanox.co.il>
References: <4911B244.30205@mellanox.co.il>
Message-ID: <49128472.7080607@pobox.com>

Yevgeny Petrilin wrote:
> Before the change the driver reported the same pause parameters
> for all the ports, even if only one of them was modified.
> 
> Signed-off-by: Yevgeny Petrilin <yevgenyp at mellanox.co.il>
> ---
>  drivers/net/mlx4/en_netdev.c |    8 ++++----
>  drivers/net/mlx4/en_params.c |   30 ++++++++++++++++--------------
>  drivers/net/mlx4/mlx4_en.h   |    8 ++++----
>  3 files changed, 24 insertions(+), 22 deletions(-)

Is this a regression fix?  It doesn't look like one to me, so I am 
planning to hold this for 2.6.29 (davem/net-next-2.6.git), unless there 
are problems with this plan?

	Jeff


From yevgenyp at mellanox.co.il  Wed Nov  5 22:53:40 2008
From: yevgenyp at mellanox.co.il (Yevgeny Petrilin)
Date: Thu, 06 Nov 2008 08:53:40 +0200
Subject: [ofa-general] Re: Re: [PATCH] mlx4_en: Pause parameters per port
In-Reply-To: <49128472.7080607@pobox.com>
References: <4911B244.30205@mellanox.co.il> <49128472.7080607@pobox.com>
Message-ID: <49129474.1030607@mellanox.co.il>

Jeff Garzik wrote:
> Is this a regression fix?  It doesn't look like one to me, so I am
> planning to hold this for 2.6.29 (davem/net-next-2.6.git), unless there
> are problems with this plan?
> 

This is a regression fix. When pause parameters are set for one port, they change only for that port,
but both ports report the same parameters. This means that the second port reports the wrong pause parameters.

Yevgeny
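
A minimal sketch of the difference, assuming the reported values were kept
in a single device-wide structure before the change. The patch body is not
shown in this thread, so the layout and every name here (struct pause,
dev_shared, dev_per_port, NUM_PORTS) are hypothetical, not the mlx4_en
definitions.

```c
#include <assert.h>

#define NUM_PORTS 2   /* hypothetical port count */

struct pause { int rx; int tx; };

/* Layout A: one profile shared by every port. */
struct dev_shared { struct pause prof; };

/* Layout B: one profile per port. */
struct dev_per_port { struct pause prof[NUM_PORTS]; };

void shared_set(struct dev_shared *d, int port, struct pause p)
{
	assert(port >= 0 && port < NUM_PORTS);
	d->prof = p;              /* overwrites what every port reports */
}

struct pause shared_get(const struct dev_shared *d, int port)
{
	assert(port >= 0 && port < NUM_PORTS);
	return d->prof;           /* same answer regardless of port */
}

void per_port_set(struct dev_per_port *d, int port, struct pause p)
{
	assert(port >= 0 && port < NUM_PORTS);
	d->prof[port] = p;        /* only this port's report changes */
}

struct pause per_port_get(const struct dev_per_port *d, int port)
{
	assert(port >= 0 && port < NUM_PORTS);
	return d->prof[port];
}
```

With layout A, turning pause off on port 0 makes port 1 report it as off
too, although nothing was changed on port 1. With layout B each port
reports what was last set for it.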


From eli at dev.mellanox.co.il  Thu Nov  6 00:40:32 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Thu, 6 Nov 2008 10:40:32 +0200
Subject: [ofa-general] [PATCH] ipoib: null tx/rx_ring skb pointers on free
In-Reply-To: <20081106012307.GP31163@sgi.com>
References: <20081106012307.GP31163@sgi.com>
Message-ID: <20081106084031.GA25354@mtls03>

On Wed, Nov 05, 2008 at 05:23:07PM -0800, akepner at sgi.com wrote:

Hi Arthur,
looking at the patch, I don't understand why it should fix the problem
you're seeing. I suspect we may be hiding the problem.
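
To make the question concrete, here is a minimal userspace sketch of the
null-after-free pattern. It is not the ipoib code: struct buf and struct
slot are hypothetical stand-ins for struct sk_buff and a tx_ring entry,
RING_SIZE is arbitrary, and the NULL check in complete() is an addition of
this sketch that the patch itself does not contain.

```c
#include <assert.h>
#include <stdlib.h>

#define RING_SIZE 4                 /* hypothetical; power of two */

struct buf  { int len; };           /* stand-in for struct sk_buff */
struct slot { struct buf *skb; };   /* stand-in for a tx_ring entry */

struct slot ring[RING_SIZE];

/* Post a buffer into the slot selected by wr_id. */
void post(unsigned int wr_id, int len)
{
	struct slot *s = &ring[wr_id & (RING_SIZE - 1)];

	s->skb = malloc(sizeof(*s->skb));
	assert(s->skb != NULL);
	s->skb->len = len;
}

/*
 * Handle a completion for wr_id. Returns 0 if a buffer was freed and
 * -1 if the slot was already empty (duplicate or unexpected completion).
 */
int complete(unsigned int wr_id)
{
	struct slot *s = &ring[wr_id & (RING_SIZE - 1)];

	if (s->skb == NULL)
		return -1;      /* visible only because the slot was nulled */

	free(s->skb);
	s->skb = NULL;          /* the step the patch adds after each free */
	return 0;
}
```

After post(0, 100), the first complete(0) returns 0 and the second returns
-1. Nulling the slot turns a second use into something that can be noticed,
but it says nothing about why a second completion would arrive at all.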

> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> index 7b14c2c..8f8650b 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> @@ -200,6 +200,7 @@ static void ipoib_cm_free_rx_ring(struct net_device *dev,
>  			ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
>  					      rx_ring[i].mapping);
>  			dev_kfree_skb_any(rx_ring[i].skb);
> +			rx_ring[i].skb = NULL;
>  		}
>  
>  	vfree(rx_ring);

This is not needed since the ring is being freed.

> @@ -736,6 +737,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
>  	if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
>  		++dev->stats.tx_errors;
>  		dev_kfree_skb_any(skb);
> +		tx_req->skb = NULL;
>  		return;
>  	}

Here we will never get completion so why do we need this?

>  
> @@ -747,6 +749,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
>  		++dev->stats.tx_errors;
>  		ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
>  		dev_kfree_skb_any(skb);
> +		tx_req->skb = NULL;

Also here, we don't get a completion.

>  	} else {
>  		dev->trans_start = jiffies;
>  		++tx->tx_head;
> @@ -785,6 +788,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
>  	dev->stats.tx_bytes += tx_req->skb->len;
>  
>  	dev_kfree_skb_any(tx_req->skb);
> +	tx_req->skb = NULL;
And here we already got the completion so we shouldn't expect another
free of the SKB.
>  
>  	netif_tx_lock(dev);
>  
> @@ -1179,6 +1183,7 @@ timeout:
>  		ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len,
>  				    DMA_TO_DEVICE);
>  		dev_kfree_skb_any(tx_req->skb);
> +		tx_req->skb = NULL;
and here we're freeing the ring
>  		++p->tx_tail;
>  		netif_tx_lock_bh(p->dev);
>  		if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> index 28eb6f0..f7e3497 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> @@ -383,6 +383,7 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
>  	dev->stats.tx_bytes += tx_req->skb->len;
>  
>  	dev_kfree_skb_any(tx_req->skb);
> +	tx_req->skb = NULL;
>  
>  	++priv->tx_tail;
>  	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
> @@ -572,6 +573,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
>  	if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) {
>  		++dev->stats.tx_errors;
>  		dev_kfree_skb_any(skb);
> +		tx_req->skb = NULL;
>  		return;
>  	}
>  
> @@ -594,6 +596,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
>  		--priv->tx_outstanding;
>  		ipoib_dma_unmap_tx(priv->ca, tx_req);
>  		dev_kfree_skb_any(skb);
> +		tx_req->skb = NULL;
>  		if (netif_queue_stopped(dev))
>  			netif_wake_queue(dev);
>  	} else {
> @@ -833,6 +836,7 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
>  							(ipoib_sendq_size - 1)];
>  				ipoib_dma_unmap_tx(priv->ca, tx_req);
>  				dev_kfree_skb_any(tx_req->skb);
> +				tx_req->skb = NULL;
>  				++priv->tx_tail;
>  				--priv->tx_outstanding;
>  			}


From devesh28 at gmail.com  Thu Nov  6 01:09:01 2008
From: devesh28 at gmail.com (Devesh Sharma)
Date: Thu, 6 Nov 2008 14:39:01 +0530
Subject: Re: [ofa-general] infiniband multicast (libibverbs)
In-Reply-To: <98B0CDCB28A5EE4CB3678CD99406644E34348A@tbmail2.tradebot.com>
References: <98B0CDCB28A5EE4CB3678CD99406644E34348A@tbmail2.tradebot.com>
Message-ID: <309a667c0811060109g7b75d7dan28ee098c34c6c18d@mail.gmail.com>

OK, try doing the sequence number check after a slight delay, say 100 ns.
Is it possible that DMA latencies are coming into the picture? Roland or
Dotan can comment on this!
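
For reference, the spin-until-non-zero loop described in the quoted reply
can be sketched as follows. This is an illustration under assumptions, not
the poster's code: poll_fn mirrors the shape of ibv_poll_cq() (a count of
completions, 0 when the queue is empty, negative on error), while struct wc,
struct mock_cq, mock_poll() and the max_spins bound are hypothetical
additions so that the sketch runs without InfiniBand hardware.

```c
#include <assert.h>
#include <stddef.h>

struct wc { unsigned long wr_id; int status; };  /* stand-in for struct ibv_wc */

/* Same shape as ibv_poll_cq(): >0 completions, 0 if empty, <0 on error. */
typedef int (*poll_fn)(void *cq, int num_entries, struct wc *wc);

/*
 * Poll without blocking until the poll function returns non-zero or
 * max_spins attempts have been made. Returns the non-zero poll result,
 * or 0 if the queue stayed empty for max_spins polls.
 */
int spin_poll(poll_fn poll, void *cq, struct wc *wc, long max_spins)
{
	assert(poll != NULL && wc != NULL);
	for (long i = 0; i < max_spins; i++) {
		int n = poll(cq, 1, wc);

		if (n != 0)
			return n;
	}
	return 0;
}

/* Mock queue: empty for empty_polls calls, then one completion per call. */
struct mock_cq { int empty_polls; unsigned long next_wr_id; };

int mock_poll(void *cq, int num_entries, struct wc *wc)
{
	struct mock_cq *m = cq;

	(void)num_entries;
	if (m->empty_polls > 0) {
		m->empty_polls--;
		return 0;
	}
	wc->wr_id = m->next_wr_id++;
	wc->status = 0;
	return 1;
}
```

A caller would inspect wc (status, then any sequence number carried in the
payload) only after spin_poll() returns a positive value; a return of 0
means no completion has been reported yet.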

On 11/5/08, Kelly Burkhart <kelly at tradebotsystems.com> wrote:
>
>  It is non-blocking.  I spin, calling ibv_poll_cq until it returns a
> non-zero.
>
>
>  ------------------------------
> *From:* Devesh Sharma [mailto:devesh28 at gmail.com]
> *Sent:* Tuesday, November 04, 2008 11:49 PM
> *To:* Kelly Burkhart
> *Cc:* Roland Dreier; general at lists.openfabrics.org
> *Subject:* Re: [ofa-general] infiniband multicast (libibverbs)
>
>
> Correction in my post :  I mean you are not considering it as non-blocking
> call (not taking care of this behaviour) and just going ahead with the
> sequence number check?
>
> On 11/5/08, Devesh Sharma <devesh28 at gmail.com> wrote:
>>
>> are you taking care that ibv_poll_cq is not a blocking call, I mean you
>> are not considering it as blocking call and just going ahead with the
>> sequence number check?
>
>

From dorons at Voltaire.COM  Thu Nov  6 01:21:29 2008
From: dorons at Voltaire.COM (Doron Shoham)
Date: Thu, 06 Nov 2008 11:21:29 +0200
Subject: [ofa-general] [PATCH 0/2] add and install default configuration
	files
Message-ID: <4912B719.3040907@Voltaire.COM>

The following patches will add default configuration files
and install them by opensm rpm.


From dorons at Voltaire.COM  Thu Nov  6 01:24:26 2008
From: dorons at Voltaire.COM (Doron Shoham)
Date: Thu, 06 Nov 2008 11:24:26 +0200
Subject: [ofa-general] [PATCH 1/2] add default configuration files
In-Reply-To: <4912B719.3040907@Voltaire.COM>
References: <4912B719.3040907@Voltaire.COM>
Message-ID: <4912B7CA.9080508@Voltaire.COM>

add default configuration files:
opensm.conf
partitions.conf
qos-policy.conf
root-nodes.conf

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 opensm/scripts/opensm.conf     |  331 ++++++++++++++++++++++++++++++++++++++++
 opensm/scripts/partitions.conf |  100 ++++++++++++
 opensm/scripts/qos-policy.conf |    2 +
 opensm/scripts/root-nodes.conf |    3 +
 4 files changed, 436 insertions(+), 0 deletions(-)
 create mode 100644 opensm/scripts/opensm.conf
 create mode 100644 opensm/scripts/partitions.conf
 create mode 100644 opensm/scripts/qos-policy.conf
 create mode 100644 opensm/scripts/root-nodes.conf
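
A note on reading the timeout values in the opensm.conf that follows:
packet_life_time, head_of_queue_lifetime, leaf_head_of_queue_lifetime and
subnet_timeout are given as exponent codes, with the actual time being
4.096usec * 2^<code>. A small sketch of the decoding, where the function
name and the 5-bit range check are assumptions of this example and not
part of OpenSM:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode an exponent-style timeout code into nanoseconds:
 * 4.096 usec * 2^code, where 4.096 usec is exactly 4096 ns.
 */
uint64_t timeout_code_to_ns(unsigned int code)
{
	assert(code <= 31);   /* assumed 5-bit field; keeps the shift in range */
	return UINT64_C(4096) << code;
}
```

With the defaults in this file, packet_life_time 0x12 and subnet_timeout 18
both decode to 1073741824 ns (about 1.07 s), and
leaf_head_of_queue_lifetime 0x10 decodes to 268435456 ns (about 0.27 s).
The comments in the file state that 0x14 disables the packet and
head-of-queue mechanisms, which this sketch does not model.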

diff --git a/opensm/scripts/opensm.conf b/opensm/scripts/opensm.conf
new file mode 100644
index 0000000..89e4145
--- /dev/null
+++ b/opensm/scripts/opensm.conf
@@ -0,0 +1,331 @@
+#
+# DEVICE ATTRIBUTES OPTIONS
+#
+# The port GUID on which the OpenSM is running
+guid 0x0000000000000000
+
+# M_Key value sent to all ports qualifying all Set(PortInfo)
+m_key 0x0000000000000000
+
+# The lease period used for the M_Key on this subnet in [sec]
+m_key_lease_period 0
+
+# SM_Key value of the SM used for SM authentication
+sm_key 0x0000000000000001
+
+# SM_Key value to qualify rcv SA queries as 'trusted'
+sa_key 0x0000000000000001
+
+# Note that for both values above (sm_key and sa_key)
+# OpenSM version 3.2.1 and below used the default value '1'
+# in a host byte order, it is fixed now but you may need to
+# change the values to interoperate with old OpenSM running
+# on a little endian machine.
+
+# Subnet prefix used on this subnet
+subnet_prefix 0xfe80000000000000
+
+# The LMC value used on this subnet
+lmc 0
+
+# lmc_esp0 determines whether LMC value used on subnet is used for
+# enhanced switch port 0. If TRUE, LMC value for subnet is used for
+# ESP0. Otherwise, LMC value for ESP0s is 0.
+lmc_esp0 FALSE
+
+# The code of maximal time a packet can live in a switch
+# The actual time is 4.096usec * 2^<packet_life_time>
+# The value 0x14 disables this mechanism
+packet_life_time 0x12
+
+# The number of sequential packets dropped that cause the port
+# to enter the VLStalled state. The result of setting this value to
+# zero is undefined.
+vl_stall_count 0x07
+
+# The number of sequential packets dropped that cause the port
+# to enter the VLStalled state. This value is for switch ports
+# driving a CA or router port. The result of setting this value
+# to zero is undefined.
+leaf_vl_stall_count 0x07
+
+# The code of maximal time a packet can wait at the head of
+# transmission queue.
+# The actual time is 4.096usec * 2^<head_of_queue_lifetime>
+# The value 0x14 disables this mechanism
+head_of_queue_lifetime 0x12
+
+# The maximal time a packet can wait at the head of queue on
+# switch port connected to a CA or router port
+leaf_head_of_queue_lifetime 0x10
+
+# Limit the maximal operational VLs
+max_op_vls 5
+
+# Force PortInfo:LinkSpeedEnabled on switch ports
+# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port
+# Otherwise, use value for PortInfo:LinkSpeedEnabled on switch port
+# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 "PortInfo")
+#    1: 2.5 Gbps
+#    3: 2.5 or 5.0 Gbps
+#    5: 2.5 or 10.0 Gbps
+#    7: 2.5 or 5.0 or 10.0 Gbps
+#    2,4,6,8-14 Reserved
+#    Default 15: set to PortInfo:LinkSpeedSupported
+force_link_speed 15
+
+# The subnet_timeout code that will be set for all the ports
+# The actual timeout is 4.096usec * 2^<subnet_timeout>
+subnet_timeout 18
+
+# Threshold of local phy errors for sending Trap 129
+local_phy_errors_threshold 0x08
+
+# Threshold of credit overrun errors for sending Trap 130
+overrun_errors_threshold 0x08
+
+#
+# PARTITIONING OPTIONS
+#
+# Partition configuration file to be used
+partition_config_file /etc/opensm/partitions.conf
+
+# Disable partition enforcement by switches
+no_partition_enforcement FALSE
+
+#
+# SWEEP OPTIONS
+#
+# The number of seconds between subnet sweeps (0 disables it)
+sweep_interval 10
+
+# If TRUE cause all lids to be reassigned
+reassign_lids FALSE
+
+# If TRUE forces every sweep to be a heavy sweep
+force_heavy_sweep FALSE
+
+# If TRUE every trap will cause a heavy sweep.
+# NOTE: successive identical traps (>10) are suppressed
+sweep_on_trap TRUE
+
+#
+# ROUTING OPTIONS
+#
+# If TRUE count switches as link subscriptions
+port_profile_switch_nodes FALSE
+
+# Name of file with port guids to be ignored by port profiling
+port_prof_ignore_file (null)
+
+# Routing engine
+# Multiple routing engines can be specified separated by
+# commas so that specific ordering of routing algorithms will
+# be tried if earlier routing engines fail.
+# Supported engines: minhop, updn, file, ftree, lash, dor
+routing_engine minhop
+
+# Connect roots (use FALSE if unsure)
+connect_roots FALSE
+
+# Use unicast routing cache (use FALSE if unsure)
+use_ucast_cache FALSE
+
+# Lid matrix dump file name
+lid_matrix_dump_file (null)
+
+# LFTs file name
+lfts_file (null)
+
+# The file holding the root node guids (for fat-tree or Up/Down)
+# One guid in each line
+root_guid_file (null)
+
+# The file holding the fat-tree compute node guids
+# One guid in each line
+cn_guid_file (null)
+
+# The file holding the node ids which will be used by Up/Down algorithm instead
+# of GUIDs (one guid and id in each line)
+ids_guid_file (null)
+
+# The file holding guid routing order guids (for MinHop and Up/Down)
+guid_routing_order_file (null)
+
+# SA database file name
+sa_db_file (null)
+
+#
+# HANDOVER - MULTIPLE SMs OPTIONS
+#
+# SM priority used for deciding who is the master
+# Range goes from 0 (lowest priority) to 15 (highest).
+sm_priority 14
+
+# If TRUE other SMs on the subnet should be ignored
+ignore_other_sm FALSE
+
+# Timeout in [msec] between two polls of active master SM
+sminfo_polling_timeout 10000
+
+# Number of failing polls of remote SM that declares it dead
+polling_retry_number 4
+
+# If TRUE honor the guid2lid file when coming out of standby
+# state, if such file exists and is valid
+honor_guid2lid_file FALSE
+
+#
+# TIMING AND THREADING OPTIONS
+#
+# Maximum number of SMPs sent in parallel
+max_wire_smps 4
+
+# The maximum time in [msec] allowed for a transaction to complete
+transaction_timeout 200
+
+# Maximal time in [msec] a message can stay in the incoming message queue.
+# If there is more than one message in the queue and the last message
+# stayed in the queue more than this value, any SA request will be
+# immediately returned with a BUSY status.
+max_msg_fifo_timeout 10000
+
+# Use a single thread for handling SA queries
+single_thread FALSE
+
+#
+# MISC OPTIONS
+#
+# Daemon mode
+daemon FALSE
+
+# SM Inactive
+sm_inactive FALSE
+
+# Babbling Port Policy
+babbling_port_policy FALSE
+
+#
+# Performance Manager Options
+#
+# perfmgr enable
+perfmgr FALSE
+
+# perfmgr redirection enable
+perfmgr_redir TRUE
+
+# sweep time in seconds
+perfmgr_sweep_time_s 180
+
+# Max outstanding queries
+perfmgr_max_outstanding_queries 500
+
+#
+# Event DB Options
+#
+# Dump file to dump the events to
+event_db_dump_file (null)
+
+#
+# Event Plugin Options
+#
+event_plugin_name (null)
+
+#
+# Node name map for mapping node's to more descriptive node descriptions
+# (man ibnetdiscover for more information)
+#
+node_name_map_name (null)
+
+#
+# DEBUG FEATURES
+#
+# The log flags used
+log_flags 0x03
+
+# Force flush of the log file after each log message
+force_log_flush FALSE
+
+# Log file to be used
+log_file /var/log/opensm.log
+
+# Limit the size(MB) of the log file. If overrun, log is restarted
+log_max_size 4096
+
+# If TRUE will accumulate the log over multiple OpenSM sessions
+accum_log_file TRUE
+
+# The directory to hold the file OpenSM dumps
+dump_files_dir /var/log/
+
+# If TRUE enables new high risk options and hardware specific quirks
+enable_quirks FALSE
+
+# If TRUE disables client reregistration
+no_clients_rereg FALSE
+
+# If TRUE OpenSM should disable multicast support and
+# no multicast routing is performed if TRUE
+disable_multicast FALSE
+
+# If TRUE opensm will exit on fatal initialization issues
+exit_on_fatal TRUE
+
+# console [off|local]
+console off
+
+# Telnet port for console (default 10000)
+console_port 10000
+
+#
+# QoS OPTIONS
+#
+# Enable QoS setup
+qos FALSE
+
+# QoS policy file to be used
+qos_policy_file /etc/opensm/qos-policy.conf
+
+# QoS default options
+qos_max_vls 15
+qos_high_limit 0
+qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
+qos_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
+qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+# QoS CA options
+qos_ca_max_vls 15
+qos_ca_high_limit 0
+qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
+qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
+qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+# QoS Switch Port 0 options
+qos_sw0_max_vls 15
+qos_sw0_high_limit 0
+qos_sw0_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
+qos_sw0_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
+qos_sw0_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+# QoS Switch external ports options
+qos_swe_max_vls 15
+qos_swe_high_limit 0
+qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
+qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
+qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+# QoS Router ports options
+qos_rtr_max_vls 15
+qos_rtr_high_limit 0
+qos_rtr_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
+qos_rtr_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
+qos_rtr_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+# Prefix routes file name
+prefix_routes_file /etc/opensm/prefix-routes.conf
+
+#
+# IPv6 Solicited Node Multicast (SNM) Options
+#
+consolidate_ipv6_snm_req FALSE
+
diff --git a/opensm/scripts/partitions.conf b/opensm/scripts/partitions.conf
new file mode 100644
index 0000000..868a26a
--- /dev/null
+++ b/opensm/scripts/partitions.conf
@@ -0,0 +1,100 @@
+# Default partition configuration file for OpenSM
+# 
+# The default name of the OpenSM partitions configuration file is /etc/opensm/partitions.conf. The default may be changed by using the --Pconfig (-P)
+# option with OpenSM.
+# 
+# The default partition will be created by OpenSM unconditionally even when partition configuration file does not exist or cannot be accessed.
+# 
+# The default partition has P_Key value 0x7fff. OpenSM's port will have full membership in default partition. All other end ports will have
+# partial membership.
+# 
+# File Format
+# 
+# Comments:
+# 
+# Content following a '#' character is a comment and is ignored by the parser.
+# 
+# General file format:
+# 
+# <Partition Definition>:<PortGUIDs list> ;
+# 
+# Partition Definition:
+# 
+# [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
+# 
+# PartitionName - string, will be used with logging. When omitted
+# 		empty string will be used.
+# PKey          - P_Key value for this partition. Only low 15 bits will
+# 		be used. When omitted will be autogenerated.
+# flag          - used to indicate IPoIB capability of this partition.
+# defmember=full|limited - specifies default membership for port guid
+# 		list. Default is limited.
+# 
+# Currently recognized flags are:
+# 
+# ipoib       - indicates that this partition may be used for IPoIB, as
+# 	      result IPoIB capable MC group will be created.
+# rate=<val>  - specifies rate for this IPoIB MC group
+# 	      (default is 3 (10 Gbps))
+# mtu=<val>   - specifies MTU for this IPoIB MC group
+# 	      (default is 4 (2048))
+# sl=<val>    - specifies SL for this IPoIB MC group
+# 	      (default is 0)
+# scope=<val> - specifies scope for this IPoIB MC group
+# 	      (default is 2 (link local)).  Multiple scope settings
+# 	      are permitted for a partition.
+# 
+# Note that values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048).
+# 
+# PortGUIDs list:
+# 
+# PortGUID         - GUID of partition member EndPort. Hexadecimal
+# 		   numbers should start from 0x, decimal numbers
+# 		   are accepted too.
+# full or limited  - indicates full or limited membership for this
+# 		   port.  When omitted (or unrecognized) limited
+# 		   membership is assumed.
+# 
+# There are two useful keywords for PortGUID definition:
+# 
+# - 'ALL' means all end ports in this subnet.
+# - 'SELF' means subnet manager's port.
+# 
+# Empty list means no ports in this partition.
+# 
+# Notes:
+# 
+# White space is permitted between delimiters ('=', ',',':',';').
+# 
+# The line can be wrapped after ':' followed after Partition Definition and between.
+# 
+# PartitionName does not need to be unique, PKey does need to be unique. If PKey is repeated then those partition configurations will be merged
+# and first PartitionName will be used (see also next note).
+# 
+# It is possible to split partition configuration in more than one definition, but then PKey should be explicitly specified (otherwise different
+# PKey values will be generated for those definitions).
+# 
+# Examples:
+# 
+# Default=0x7fff : ALL, SELF=full ;
+# 
+# NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
+# 
+# YetAnotherOne = 0x300 : SELF=full ;
+# YetAnotherOne = 0x300 : ALL=limited ;
+# 
+# ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
+# # 0x123453, 0x123454 will be limited
+# ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
+# # 0x123456, 0x123457 will be limited
+# ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full;
+# ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
+# ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
+# 
+# 
+# Note:
+# 
+# The following rule is equivalent to how OpenSM used to run prior to the partition manager:
+# 
+ Default=0x7fff,ipoib:ALL=full;
+# 
diff --git a/opensm/scripts/qos-policy.conf b/opensm/scripts/qos-policy.conf
new file mode 100644
index 0000000..42a88c0
--- /dev/null
+++ b/opensm/scripts/qos-policy.conf
@@ -0,0 +1,2 @@
+# Default Quality of Service policy configuration file
+# For further details see /usr/share/doc/opensm-<version>/QoS_management_in_OpenSM.txt
diff --git a/opensm/scripts/root-nodes.conf b/opensm/scripts/root-nodes.conf
new file mode 100644
index 0000000..d84d732
--- /dev/null
+++ b/opensm/scripts/root-nodes.conf
@@ -0,0 +1,3 @@
+# Default root node GUIDs configuration file for OpenSM
+# List of GUIDs in hex, one per line
+# 0x8f10002322134567
-- 
1.5.3.8
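
A note on the P_Key wording in the partitions.conf comments above: the file
says only the low 15 bits of PKey are used, and that ports are full or
limited members. In InfiniBand the P_Key is a 16-bit value whose top bit is
the membership bit (set for full, clear for limited). The following is a
minimal C sketch of that layout, not OpenSM code; the macro and helper names
are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PKEY_BASE_MASK  0x7fffu  /* low 15 bits: the partition key proper */
#define PKEY_FULL_BIT   0x8000u  /* top bit: set for full membership */

/* Hypothetical helper: keep only the 15 bits that partitions.conf uses. */
static inline uint16_t pkey_base(uint16_t pkey)
{
	return (uint16_t)(pkey & PKEY_BASE_MASK);
}

/* Hypothetical helper: the P_Key a full or limited member would carry. */
static inline uint16_t pkey_for_member(uint16_t pkey, int full)
{
	return (uint16_t)(pkey_base(pkey) | (full ? PKEY_FULL_BIT : 0u));
}

/* Small self-check using values from the examples in the file. */
static inline void pkey_self_check(void)
{
	assert(pkey_for_member(0x7fff, 1) == 0xffff); /* Default, SELF=full */
	assert(pkey_for_member(0x7fff, 0) == 0x7fff); /* Default, limited */
	assert(pkey_base(0x8300) == 0x0300);          /* top bit is dropped */
}
```

With this layout, "Default=0x7fff : ALL, SELF=full" would give the subnet
manager's port 0xffff and every other end port 0x7fff.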


From dorons at Voltaire.COM  Thu Nov  6 01:25:19 2008
From: dorons at Voltaire.COM (Doron Shoham)
Date: Thu, 06 Nov 2008 11:25:19 +0200
Subject: [ofa-general] [PATCH 2/2] install the configuration files by the rpm
In-Reply-To: <4912B719.3040907@Voltaire.COM>
References: <4912B719.3040907@Voltaire.COM>
Message-ID: <4912B7FF.5030900@Voltaire.COM>

install the configuration files by the rpm

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 opensm/opensm.spec.in |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in
index f8cecf1..b5d5b2c 100644
--- a/opensm/opensm.spec.in
+++ b/opensm/opensm.spec.in
@@ -98,6 +98,10 @@ mkdir -p $etc/{init.d,logrotate.d} $etc/@OPENSM_CONFIG_SUB_DIR@
 install -m 755 scripts/${REDHAT}opensm.init $etc/init.d/opensmd
 install -D -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm
 install -m 755 scripts/sldd.sh $RPM_BUILD_ROOT%{_sbindir}/sldd.sh
+install -m 644 scripts/opensm.conf $etc/opensm/
+install -m 644 scripts/partitions.conf $etc/opensm/
+install -m 644 scripts/qos-policy.conf $etc/opensm/
+install -m 644 scripts/root-nodes.conf $etc/opensm/
 
 %clean
 rm -rf $RPM_BUILD_ROOT
@@ -130,6 +134,7 @@ fi
 %config(noreplace) %{_sysconfdir}/logrotate.d/opensm
 %dir /var/cache/opensm
 %dir %{_sysconfdir}/@OPENSM_CONFIG_SUB_DIR@
+%{_sysconfdir}/opensm/*
 
 %files libs
 %defattr(-,root,root,-)
-- 
1.5.3.8


From dorons at Voltaire.COM  Thu Nov  6 01:46:36 2008
From: dorons at Voltaire.COM (Doron Shoham)
Date: Thu, 06 Nov 2008 11:46:36 +0200
Subject: [ofa-general] [PATCH 0/2] update and install
	QoS_management_in_OpenSM.txt
Message-ID: <4912BCFC.8030407@Voltaire.COM>

The following patches will fix the default configuration files path
in QoS_management_in_OpenSM.txt and install the file via the rpm.


From dorons at Voltaire.COM  Thu Nov  6 01:48:44 2008
From: dorons at Voltaire.COM (Doron Shoham)
Date: Thu, 06 Nov 2008 11:48:44 +0200
Subject: [ofa-general] [PATCH 1/2] fix default configuration files path
In-Reply-To: <4912BCFC.8030407@Voltaire.COM>
References: <4912BCFC.8030407@Voltaire.COM>
Message-ID: <4912BD7C.1030603@Voltaire.COM>

fix default configuration files path in QoS_management_in_OpenSM.txt file
from /usr/local/etc/opensm/ to /etc/opensm/

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 opensm/doc/QoS_management_in_OpenSM.txt |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/opensm/doc/QoS_management_in_OpenSM.txt b/opensm/doc/QoS_management_in_OpenSM.txt
index ba1b4b1..1a48b1a 100644
--- a/opensm/doc/QoS_management_in_OpenSM.txt
+++ b/opensm/doc/QoS_management_in_OpenSM.txt
@@ -20,7 +20,7 @@
 
 When QoS in OpenSM is enabled (-Q or --qos), OpenSM looks for QoS Policy file.
 The default name of OpenSM QoS policy file is
-/usr/local/etc/opensm/qos-policy.conf. The default may be changed by using -Y
+/etc/opensm/qos-policy.conf. The default may be changed by using -Y
 or --qos_policy_file option with OpenSM.
 
 During fabric initialization and at every heavy sweep OpenSM parses the QoS
@@ -67,7 +67,7 @@ This section describes how to set up SL2VL and VL Arbitration tables on
 various nodes in the fabric.
 However, this is not supported in OpenSM currently.
 SL2VL and VLArb tables should be configured in the OpenSM options file
-(default location - /usr/local/etc/opensm/opensm.conf).
+(default location - /etc/opensm/opensm.conf).
 
 III) QoS Levels (denoted by qos-levels).
 Each QoS Level defines Service Level (SL) and a few optional fields:
@@ -205,7 +205,7 @@ policy file and their syntax:
         # Arbitration tables on various nodes in the fabric.
         # However, this is not supported in OpenSM currently - the section is
         # parsed and ignored. SL2VL and VLArb tables should be configured in the
-        # OpenSM options file (by default - /usr/local/etc/opensm/opensm.conf).
+        # OpenSM options file (by default - /etc/opensm/opensm.conf).
     end-qos-setup
 
     qos-levels
-- 
1.5.3.8


From dorons at Voltaire.COM  Thu Nov  6 01:49:31 2008
From: dorons at Voltaire.COM (Doron Shoham)
Date: Thu, 06 Nov 2008 11:49:31 +0200
Subject: [ofa-general] [PATCH 2/2] install QoS_management_in_OpenSM.txt
In-Reply-To: <4912BCFC.8030407@Voltaire.COM>
References: <4912BCFC.8030407@Voltaire.COM>
Message-ID: <4912BDAB.5040704@Voltaire.COM>

install QoS_management_in_OpenSM.txt via the rpm

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 opensm/opensm.spec.in |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in
index f8cecf1..da07a73 100644
--- a/opensm/opensm.spec.in
+++ b/opensm/opensm.spec.in
@@ -124,7 +124,7 @@ fi
 %{_sbindir}/opensm
 %{_sbindir}/osmtest
 %{_mandir}/man8/*
-%doc AUTHORS COPYING README doc/performance-manager-HOWTO.txt
+%doc AUTHORS COPYING README doc/performance-manager-HOWTO.txt doc/QoS_management_in_OpenSM.txt
 %{_sysconfdir}/init.d/opensmd
 %{_sbindir}/sldd.sh
 %config(noreplace) %{_sysconfdir}/logrotate.d/opensm
-- 
1.5.3.8


From jackm at dev.mellanox.co.il  Thu Nov  6 01:54:01 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 6 Nov 2008 11:54:01 +0200
Subject: [ofa-general] ib_mthca catastrophic error detected
In-Reply-To: <490763D0.5020002@ucla.edu>
References: <4906645D.6010101@ucla.edu> <4907054E.9080205@mellanox.co.il>
	<490763D0.5020002@ucla.edu>
Message-ID: <200811061154.02260.jackm@dev.mellanox.co.il>

On Tuesday 28 October 2008 21:11, Scott A. Friedman wrote:
> Hi
> 
> This cluster has OFED 1.2.5.4 running on it. The ib_mthca kernel module 
> reports the following on startup:
> 
> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
> 
> The cards in all (22) of the nodes we have seen this error on are as 
> follows:
> 
> hca_id: mthca0
>          fw_ver:                         1.2.0
>          vendor_id:                      0x02c9
>          vendor_part_id:                 25204
>          hw_ver:                         0xA0
>          board_id:                       MT_03B0140001
>          phys_port_cnt:                  1
> 
> It appears that when this happens the driver restarts (loads?) itself 
> however the job running at the time of the error is, of course, killed.
> 
> Scott

Scott,
We are trying to reproduce this here.  It would help if you could supply
the following info:

Host model for hosts which are experiencing the failure:
 
Console output from the following linux commands:
  cat /etc/*rel*
  cat /etc/lilo.conf , or:  cat /boot/grub/menu.lst (if you are using grub)
  uname -a
  cat /proc/cpuinfo
  cat /proc/meminfo

Also, what sort of job was running when the failure occurred:
-- which MPI are you using?
-- do you have a test example which we can run here to reproduce the problem?

Thanks in advance for your help!

Jack Morgenstein
Senior Software Development Engineer
Mellanox


From vlad at lists.openfabrics.org  Thu Nov  6 03:19:51 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu,  6 Nov 2008 03:19:51 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081106-0200 daily build status
Message-ID: <20081106111951.40BE3E60D8C@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From dorons at Voltaire.COM  Thu Nov  6 03:59:44 2008
From: dorons at Voltaire.COM (Doron Shoham)
Date: Thu, 06 Nov 2008 13:59:44 +0200
Subject: [ofa-general] [PATCH] export osm_log_max in MB
In-Reply-To: <49101D1F.4040605@Voltaire.COM>
References: <49101D1F.4040605@Voltaire.COM>
Message-ID: <4912DC30.40309@Voltaire.COM>

export the osm_log_max in MB when using 'opensm -c <conf>'

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 opensm/opensm/osm_subnet.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 0422d0f..c130c0d 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -1668,7 +1668,7 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts)
 		"force_log_flush %s\n\n"
 		"# Log file to be used\n"
 		"log_file %s\n\n"
-		"# Limit the size of the log file. If overrun, log is restarted\n"
+		"# Limit the size of the log file in MB. If overrun, log is restarted\n"
 		"log_max_size %lu\n\n"
 		"# If TRUE will accumulate the log over multiple OpenSM sessions\n"
 		"accum_log_file %s\n\n"
@@ -1694,7 +1694,7 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts)
 		p_opts->log_flags,
 		p_opts->force_log_flush ? "TRUE" : "FALSE",
 		p_opts->log_file,
-		p_opts->log_max_size,
+		p_opts->log_max_size/1024/1024,
 		p_opts->accum_log_file ? "TRUE" : "FALSE",
 		p_opts->dump_files_dir,
 		p_opts->enable_quirks ? "TRUE" : "FALSE",
-- 
1.5.3.8
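
For reference, the conversion this patch relies on, as a small C sketch. It
assumes (as the /1024/1024 in the patch implies) that the limit is held in
bytes internally and written to the config file in MB. The helper names are
illustrative only and do not exist in OpenSM.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: value as written to the config file (MB). */
static inline uint64_t log_size_bytes_to_mb(uint64_t bytes)
{
	return bytes / 1024 / 1024;	/* integer division, rounds down */
}

/* Hypothetical helper: value as read back from the config file. */
static inline uint64_t log_size_mb_to_bytes(uint64_t mb)
{
	return mb * 1024 * 1024;
}

/* Round trip for the 4096 MB value used in the sample opensm.conf. */
static inline void log_size_self_check(void)
{
	assert(log_size_mb_to_bytes(4096) == 4294967296ULL);
	assert(log_size_bytes_to_mb(log_size_mb_to_bytes(4096)) == 4096);
	assert(log_size_bytes_to_mb(1048575) == 0);	/* just under 1 MB */
}
```

The sketch uses uint64_t on purpose: 4096 MB is 2^32 bytes, which does not
fit in a 32-bit unsigned long. Whether that matters for OpenSM on 32-bit
builds is not something this thread establishes.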


From ogerlitz at voltaire.com  Thu Nov  6 04:57:56 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 6 Nov 2008 14:57:56 +0200 (IST)
Subject: [ofa-general] [PATCH] opensm: fix iser service-id used for SL
	assignment
Message-ID: <Pine.LNX.4.64.0811061456540.3153@zuben.voltaire.com>

RFC3720 says:

The well-known user TCP port number for iSCSI connections assigned by IANA is 3260
and this is the default iSCSI port. Implementations needing a system TCP port number
may use port 860, the port assigned by IANA as the iSCSI system port; however in
order to use port 860, it MUST be explicitly specified - implementations MUST NOT
default to use of port 860, as 3260 is the only allowed default.

Hence the SID used by iser is 0x0000000001060CBC and not 0x000000000106035C

Signed-off-by: Or Gerlitz  <ogerlitz at voltaire.com>
Signed-off-by: Eli Dorfman <elid at voltaire.com>

Index: opensm-3.2.2/doc/QoS_management_in_OpenSM.txt
===================================================================
--- opensm-3.2.2.orig/doc/QoS_management_in_OpenSM.txt
+++ opensm-3.2.2/doc/QoS_management_in_OpenSM.txt
@@ -378,12 +378,12 @@ equivalent:

 6.4  iSER
 Similar to RDS, iSER query is matched by Service ID, where the the Service ID
-is also 0x000000000106PPPP. Default port number for iSER is 0x035C, which makes
-a default Service-ID 0x000000000106035C. The following two match rules are
+is also 0x000000000106PPPP. Default port number for iSER is 0x0CBC, which makes
+a default Service-ID 0x0000000001060CBC. The following two match rules are
 equivalent:

     iser                               : <SL>
-    any, service-id 0x000000000106035C : <SL>
+    any, service-id 0x0000000001060CBC : <SL>

 6.5  SRP
 Service ID for SRP varies from storage vendor to vendor, thus SRP query is
Index: opensm-3.2.2/include/opensm/osm_qos_policy.h
===================================================================
--- opensm-3.2.2.orig/include/opensm/osm_qos_policy.h
+++ opensm-3.2.2/include/opensm/osm_qos_policy.h
@@ -58,7 +58,7 @@
 #define OSM_QOS_POLICY_ULP_RDS_SERVICE_ID   0x0000000001060000ULL
 #define OSM_QOS_POLICY_ULP_RDS_PORT         0x48CA
 #define OSM_QOS_POLICY_ULP_ISER_SERVICE_ID  0x0000000001060000ULL
-#define OSM_QOS_POLICY_ULP_ISER_PORT        0x035C
+#define OSM_QOS_POLICY_ULP_ISER_PORT        0x0CBC

 #define OSM_QOS_POLICY_NODE_TYPE_CA        (((uint8_t)1)<<IB_NODE_TYPE_CA)
 #define OSM_QOS_POLICY_NODE_TYPE_SWITCH    (((uint8_t)1)<<IB_NODE_TYPE_SWITCH)
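
The arithmetic behind the new value, as a C sketch: the Service ID follows
the 0x000000000106PPPP pattern from the doc text above, with the TCP port in
the low 16 bits. The base constant is taken from osm_qos_policy.h as quoted
in the patch; the helper name is hypothetical and not an OpenSM function.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as OSM_QOS_POLICY_ULP_ISER_SERVICE_ID in the patch. */
#define ULP_SERVICE_ID_BASE 0x0000000001060000ULL

/* Hypothetical helper: put a 16-bit TCP port into the PPPP field. */
static inline uint64_t ulp_service_id(uint16_t port)
{
	return ULP_SERVICE_ID_BASE | (uint64_t)port;
}

/* Checks against the values quoted in this patch. */
static inline void ulp_service_id_self_check(void)
{
	assert(ulp_service_id(3260) == 0x0000000001060CBCULL); /* iSCSI default */
	assert(ulp_service_id(860) == 0x000000000106035CULL);  /* old value */
	assert(ulp_service_id(0x48CA) == 0x00000000010648CAULL); /* RDS port */
}
```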


From hal.rosenstock at gmail.com  Thu Nov  6 05:29:50 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 6 Nov 2008 08:29:50 -0500
Subject: Re: [ofa-general] [PATCH] opensm: fix iser service-id used
	for SL assignment
In-Reply-To: <Pine.LNX.4.64.0811061456540.3153@zuben.voltaire.com>
References: <Pine.LNX.4.64.0811061456540.3153@zuben.voltaire.com>
Message-ID: <f0e08f230811060529y205d989buae2f2b4713d53c3a@mail.gmail.com>

On Thu, Nov 6, 2008 at 7:57 AM, Or Gerlitz <ogerlitz at voltaire.com> wrote:
> RFC3720 says:
>
> The well-known user TCP port number for iSCSI connections assigned by IANA is 3260
> and this is the default iSCSI port. Implementations needing a system TCP port number
> may use port 860, the port assigned by IANA as the iSCSI system port; however in
> order to use port 860, it MUST be explicitly specified - implementations MUST NOT
> default to use of port 860, as 3260 is the only allowed default.
>
> Hence the SID used by iser is 0x0000000001060CBC and not 0x000000000106035C
>
> Signed-off-by: Or Gerlitz  <ogerlitz at voltaire.com>
> Signed-off-by: Eli Dorfman <elid at voltaire.com>
>
> Index: opensm-3.2.2/doc/QoS_management_in_OpenSM.txt
> ===================================================================
> --- opensm-3.2.2.orig/doc/QoS_management_in_OpenSM.txt
> +++ opensm-3.2.2/doc/QoS_management_in_OpenSM.txt
> @@ -378,12 +378,12 @@ equivalent:
>
>  6.4  iSER
>  Similar to RDS, iSER query is matched by Service ID, where the the Service ID
> -is also 0x000000000106PPPP. Default port number for iSER is 0x035C, which makes
> -a default Service-ID 0x000000000106035C. The following two match rules are
> +is also 0x000000000106PPPP. Default port number for iSER is 0x0CBC, which makes
> +a default Service-ID 0x0000000001060CBC.

Should the prestandard port number be mentioned here
for backward compatibility?

>The following two match rules are
>  equivalent:
>
>     iser                               : <SL>
> -    any, service-id 0x000000000106035C : <SL>
> +    any, service-id 0x0000000001060CBC : <SL>
>
>  6.5  SRP
>  Service ID for SRP varies from storage vendor to vendor, thus SRP query is
> Index: opensm-3.2.2/include/opensm/osm_qos_policy.h
> ===================================================================
> --- opensm-3.2.2.orig/include/opensm/osm_qos_policy.h
> +++ opensm-3.2.2/include/opensm/osm_qos_policy.h
> @@ -58,7 +58,7 @@
>  #define OSM_QOS_POLICY_ULP_RDS_SERVICE_ID   0x0000000001060000ULL
>  #define OSM_QOS_POLICY_ULP_RDS_PORT         0x48CA
>  #define OSM_QOS_POLICY_ULP_ISER_SERVICE_ID  0x0000000001060000ULL
> -#define OSM_QOS_POLICY_ULP_ISER_PORT        0x035C
> +#define OSM_QOS_POLICY_ULP_ISER_PORT        0x0CBC
>
>  #define OSM_QOS_POLICY_NODE_TYPE_CA        (((uint8_t)1)<<IB_NODE_TYPE_CA)
>  #define OSM_QOS_POLICY_NODE_TYPE_SWITCH    (((uint8_t)1)<<IB_NODE_TYPE_SWITCH)
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From hal.rosenstock at gmail.com  Thu Nov  6 05:31:34 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 6 Nov 2008 08:31:34 -0500
Subject: [ofa-general] [PATCH] export osm_log_max in MB
In-Reply-To: <4912DC30.40309@Voltaire.COM>
References: <49101D1F.4040605@Voltaire.COM> <4912DC30.40309@Voltaire.COM>
Message-ID: <f0e08f230811060531j6be51fffv1ce2f45c6573e9b0@mail.gmail.com>

On Thu, Nov 6, 2008 at 6:59 AM, Doron Shoham <dorons at voltaire.com> wrote:
> export the osm_log_max in MB when using 'opensm -c <conf>
>
> Signed-off-by: Doron Shoham <dorons at voltaire.com>
> ---
>  opensm/opensm/osm_subnet.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index 0422d0f..c130c0d 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -1668,7 +1668,7 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts)
>                "force_log_flush %s\n\n"
>                "# Log file to be used\n"
>                "log_file %s\n\n"
> -               "# Limit the size of the log file. If overrun, log is restarted\n"
> +               "# Limit the size of the log file in MB. If overrun, log is restarted\n"
>                "log_max_size %lu\n\n"
>                "# If TRUE will accumulate the log over multiple OpenSM sessions\n"
>                "accum_log_file %s\n\n"
> @@ -1694,7 +1694,7 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts)
>                p_opts->log_flags,
>                p_opts->force_log_flush ? "TRUE" : "FALSE",
>                p_opts->log_file,
> -               p_opts->log_max_size,
> +               p_opts->log_max_size/1024/1024,
>                p_opts->accum_log_file ? "TRUE" : "FALSE",
>                p_opts->dump_files_dir,
>                p_opts->enable_quirks ? "TRUE" : "FALSE",

Should your patch for adding opensm.conf to scripts be updated to v2?

-- Hal

> --
> 1.5.3.8
>
>


From hal.rosenstock at gmail.com  Thu Nov  6 05:33:03 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 6 Nov 2008 08:33:03 -0500
Subject: [ofa-general] [PATCH 1/2] fix default configuration files path
In-Reply-To: <4912BD7C.1030603@Voltaire.COM>
References: <4912BCFC.8030407@Voltaire.COM> <4912BD7C.1030603@Voltaire.COM>
Message-ID: <f0e08f230811060533s238a9f08n462615c268c38cc2@mail.gmail.com>

On Thu, Nov 6, 2008 at 4:48 AM, Doron Shoham <dorons at voltaire.com> wrote:
> fix default configuration files path in QoS_management_in_OpenSM.txt file
> from /usr/local/etc/opensm/ to /etc/opensm/
>
> Signed-off-by: Doron Shoham <dorons at voltaire.com>
> ---
>  opensm/doc/QoS_management_in_OpenSM.txt |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/opensm/doc/QoS_management_in_OpenSM.txt b/opensm/doc/QoS_management_in_OpenSM.txt
> index ba1b4b1..1a48b1a 100644
> --- a/opensm/doc/QoS_management_in_OpenSM.txt
> +++ b/opensm/doc/QoS_management_in_OpenSM.txt
> @@ -20,7 +20,7 @@
>
>  When QoS in OpenSM is enabled (-Q or --qos), OpenSM looks for QoS Policy file.
>  The default name of OpenSM QoS policy file is
> -/usr/local/etc/opensm/qos-policy.conf. The default may be changed by using -Y
> +/etc/opensm/qos-policy.conf. The default may be changed by using -Y
>  or --qos_policy_file option with OpenSM.
>
>  During fabric initialization and at every heavy sweep OpenSM parses the QoS
> @@ -67,7 +67,7 @@ This section describes how to set up SL2VL and VL Arbitration tables on
>  various nodes in the fabric.
>  However, this is not supported in OpenSM currently.
>  SL2VL and VLArb tables should be configured in the OpenSM options file
> -(default location - /usr/local/etc/opensm/opensm.conf).
> +(default location - /etc/opensm/opensm.conf).

If this needs changing, aren't similar changes needed in the
opensm man page?

-- Hal

>  III) QoS Levels (denoted by qos-levels).
>  Each QoS Level defines Service Level (SL) and a few optional fields:
> @@ -205,7 +205,7 @@ policy file and their syntax:
>         # Arbitration tables on various nodes in the fabric.
>         # However, this is not supported in OpenSM currently - the section is
>         # parsed and ignored. SL2VL and VLArb tables should be configured in the
> -        # OpenSM options file (by default - /usr/local/etc/opensm/opensm.conf).
> +        # OpenSM options file (by default - /etc/opensm/opensm.conf).
>     end-qos-setup
>
>     qos-levels
> --
> 1.5.3.8
>


From ogerlitz at voltaire.com  Thu Nov  6 05:33:30 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 06 Nov 2008 15:33:30 +0200
Subject: [ofa-general] [PATCH] opensm: fix iser service-id used for SL
	assignment
In-Reply-To: <f0e08f230811060529y205d989buae2f2b4713d53c3a@mail.gmail.com>
References: <Pine.LNX.4.64.0811061456540.3153@zuben.voltaire.com>
	<f0e08f230811060529y205d989buae2f2b4713d53c3a@mail.gmail.com>
Message-ID: <4912F22A.4040802@voltaire.com>

Hal Rosenstock wrote:
>> --- opensm-3.2.2.orig/doc/QoS_management_in_OpenSM.txt
>> +++ opensm-3.2.2/doc/QoS_management_in_OpenSM.txt
>> @@ -378,12 +378,12 @@ equivalent:
>>
>>  6.4  iSER
>>  Similar to RDS, iSER query is matched by Service ID, where the the Service ID
>> -is also 0x000000000106PPPP. Default port number for iSER is 0x035C, which makes
>> -a default Service-ID 0x000000000106035C. The following two match rules are
>> +is also 0x000000000106PPPP. Default port number for iSER is 0x0CBC, which makes
>> +a default Service-ID 0x0000000001060CBC.
> Should some mention of the prestandard port number be mentioned here for backward compatibility ?
I don't think so, as all the iser targets I'm aware of use port 3260,
but if you want to, feel free to patch the patch.

Or.


From ogerlitz at voltaire.com  Thu Nov  6 05:38:04 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 6 Nov 2008 15:38:04 +0200 (IST)
Subject: [ofa-general] Re: [PATCH] opensm: fix iser service-id used for SL
	assignment
In-Reply-To: <Pine.LNX.4.64.0811061456540.3153@zuben.voltaire.com>
References: <Pine.LNX.4.64.0811061456540.3153@zuben.voltaire.com>
Message-ID: <Pine.LNX.4.64.0811061458410.3153@zuben.voltaire.com>

On Thu, 6 Nov 2008, Or Gerlitz wrote:
> The well-known user TCP port number for iSCSI connections assigned by IANA is 3260
> and this is the default iSCSI port. Implementations needing a system TCP port number
> may use port 860, the port assigned by IANA as the iSCSI system port; however in
> order to use port 860, it MUST be explicitly specified - implementations MUST NOT
> default to use of port 860, as 3260 is the only allowed default.
>
> Hence the SID used by iser is 0x0000000001060CBC and not 0x000000000106035C

> Index: opensm-3.2.2/include/opensm/osm_qos_policy.h
> ===================================================================
> --- opensm-3.2.2.orig/include/opensm/osm_qos_policy.h
> +++ opensm-3.2.2/include/opensm/osm_qos_policy.h
> @@ -58,7 +58,7 @@
>  #define OSM_QOS_POLICY_ULP_RDS_SERVICE_ID   0x0000000001060000ULL
>  #define OSM_QOS_POLICY_ULP_RDS_PORT         0x48CA
>  #define OSM_QOS_POLICY_ULP_ISER_SERVICE_ID  0x0000000001060000ULL
> -#define OSM_QOS_POLICY_ULP_ISER_PORT        0x035C
> +#define OSM_QOS_POLICY_ULP_ISER_PORT        0x0CBC

BTW - while doing this fix, I noticed that the port assumed by opensm for RDS is 18634
(0x48CA), which is the one used in the rds code deployed in ofed 1.3.x, whereas the rds
code base deployed in ofed 1.4.y uses port 18635.

Andy, Rick, can you guys revert to 18634 to make things simpler wrt RDS/QoS configuration?

Or.


From dorons at Voltaire.COM  Thu Nov  6 05:54:43 2008
From: dorons at Voltaire.COM (Doron Shoham)
Date: Thu, 06 Nov 2008 15:54:43 +0200
Subject: [ofa-general] [PATCH 1/2] fix default configuration files path
In-Reply-To: <f0e08f230811060533s238a9f08n462615c268c38cc2@mail.gmail.com>
References: <4912BCFC.8030407@Voltaire.COM> <4912BD7C.1030603@Voltaire.COM>
	<f0e08f230811060533s238a9f08n462615c268c38cc2@mail.gmail.com>
Message-ID: <4912F723.3090203@Voltaire.COM>

Hal Rosenstock wrote:
> On Thu, Nov 6, 2008 at 4:48 AM, Doron Shoham <dorons at voltaire.com> wrote:
>> fix default configuration files path in QoS_management_in_OpenSM.txt file
>> from /usr/local/etc/opensm/ to /etc/opensm/
>>
>> Signed-off-by: Doron Shoham <dorons at voltaire.com>
>> ---
>>  opensm/doc/QoS_management_in_OpenSM.txt |    6 +++---
>>  1 files changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/opensm/doc/QoS_management_in_OpenSM.txt b/opensm/doc/QoS_management_in_OpenSM.txt
>> index ba1b4b1..1a48b1a 100644
>> --- a/opensm/doc/QoS_management_in_OpenSM.txt
>> +++ b/opensm/doc/QoS_management_in_OpenSM.txt
>> @@ -20,7 +20,7 @@
>>
>>  When QoS in OpenSM is enabled (-Q or --qos), OpenSM looks for QoS Policy file.
>>  The default name of OpenSM QoS policy file is
>> -/usr/local/etc/opensm/qos-policy.conf. The default may be changed by using -Y
>> +/etc/opensm/qos-policy.conf. The default may be changed by using -Y
>>  or --qos_policy_file option with OpenSM.
>>
>>  During fabric initialization and at every heavy sweep OpenSM parses the QoS
>> @@ -67,7 +67,7 @@ This section describes how to set up SL2VL and VL Arbitration tables on
>>  various nodes in the fabric.
>>  However, this is not supported in OpenSM currently.
>>  SL2VL and VLArb tables should be configured in the OpenSM options file
>> -(default location - /usr/local/etc/opensm/opensm.conf).
>> +(default location - /etc/opensm/opensm.conf).
> 
> If this needs changing, aren't there similar changes needed in the
> opensm man page ?
> 
> -- Hal
> 

No,
in the man page:
/etc/opensm/qos-policy.conf
	default QOS policy config file


>>  III) QoS Levels (denoted by qos-levels).
>>  Each QoS Level defines Service Level (SL) and a few optional fields:
>> @@ -205,7 +205,7 @@ policy file and their syntax:
>>         # Arbitration tables on various nodes in the fabric.
>>         # However, this is not supported in OpenSM currently - the section is
>>         # parsed and ignored. SL2VL and VLArb tables should be configured in the
>> -        # OpenSM options file (by default - /usr/local/etc/opensm/opensm.conf).
>> +        # OpenSM options file (by default - /etc/opensm/opensm.conf).
>>     end-qos-setup
>>
>>     qos-levels
>> --
>> 1.5.3.8
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>>


From dorons at Voltaire.COM  Thu Nov  6 05:57:11 2008
From: dorons at Voltaire.COM (Doron Shoham)
Date: Thu, 06 Nov 2008 15:57:11 +0200
Subject: [ofa-general] [PATCH] export osm_log_max in MB
In-Reply-To: <f0e08f230811060531j6be51fffv1ce2f45c6573e9b0@mail.gmail.com>
References: <49101D1F.4040605@Voltaire.COM> <4912DC30.40309@Voltaire.COM>
	<f0e08f230811060531j6be51fffv1ce2f45c6573e9b0@mail.gmail.com>
Message-ID: <4912F7B7.1000109@Voltaire.COM>

Hal Rosenstock wrote:
> On Thu, Nov 6, 2008 at 6:59 AM, Doron Shoham <dorons at voltaire.com> wrote:
>> export the osm_log_max in MB when using 'opensm -c <conf>
>>
>> Signed-off-by: Doron Shoham <dorons at voltaire.com>
>> ---
>>  opensm/opensm/osm_subnet.c |    4 ++--
>>  1 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
>> index 0422d0f..c130c0d 100644
>> --- a/opensm/opensm/osm_subnet.c
>> +++ b/opensm/opensm/osm_subnet.c
>> @@ -1668,7 +1668,7 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts)
>>                "force_log_flush %s\n\n"
>>                "# Log file to be used\n"
>>                "log_file %s\n\n"
>> -               "# Limit the size of the log file. If overrun, log is restarted\n"
>> +               "# Limit the size of the log file in MB. If overrun, log is restarted\n"
>>                "log_max_size %lu\n\n"
>>                "# If TRUE will accumulate the log over multiple OpenSM sessions\n"
>>                "accum_log_file %s\n\n"
>> @@ -1694,7 +1694,7 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts)
>>                p_opts->log_flags,
>>                p_opts->force_log_flush ? "TRUE" : "FALSE",
>>                p_opts->log_file,
>> -               p_opts->log_max_size,
>> +               p_opts->log_max_size/1024/1024,
>>                p_opts->accum_log_file ? "TRUE" : "FALSE",
>>                p_opts->dump_files_dir,
>>                p_opts->enable_quirks ? "TRUE" : "FALSE",
> 
> Should your patch for adding opensm.conf to scripts be updated to v2 ?
> 
> -- Hal
> 

Can you please explain?

Thanks,
Doron

>> --
>> 1.5.3.8
>>
>>


From hal.rosenstock at gmail.com  Thu Nov  6 06:03:03 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 6 Nov 2008 09:03:03 -0500
Subject: [ofa-general] [PATCH] export osm_log_max in MB
In-Reply-To: <4912F7B7.1000109@Voltaire.COM>
References: <49101D1F.4040605@Voltaire.COM> <4912DC30.40309@Voltaire.COM>
	<f0e08f230811060531j6be51fffv1ce2f45c6573e9b0@mail.gmail.com>
	<4912F7B7.1000109@Voltaire.COM>
Message-ID: <f0e08f230811060603j6a1a98bdqe77f7026245c0af4@mail.gmail.com>

On Thu, Nov 6, 2008 at 8:57 AM, Doron Shoham <dorons at voltaire.com> wrote:
> Hal Rosenstock wrote:
>> On Thu, Nov 6, 2008 at 6:59 AM, Doron Shoham <dorons at voltaire.com> wrote:
>>> export the osm_log_max in MB when using 'opensm -c <conf>
>>>
>>> Signed-off-by: Doron Shoham <dorons at voltaire.com>
>>> ---
>>>  opensm/opensm/osm_subnet.c |    4 ++--
>>>  1 files changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
>>> index 0422d0f..c130c0d 100644
>>> --- a/opensm/opensm/osm_subnet.c
>>> +++ b/opensm/opensm/osm_subnet.c
>>> @@ -1668,7 +1668,7 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts)
>>>                "force_log_flush %s\n\n"
>>>                "# Log file to be used\n"
>>>                "log_file %s\n\n"
>>> -               "# Limit the size of the log file. If overrun, log is restarted\n"
>>> +               "# Limit the size of the log file in MB. If overrun, log is restarted\n"
>>>                "log_max_size %lu\n\n"
>>>                "# If TRUE will accumulate the log over multiple OpenSM sessions\n"
>>>                "accum_log_file %s\n\n"
>>> @@ -1694,7 +1694,7 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts)
>>>                p_opts->log_flags,
>>>                p_opts->force_log_flush ? "TRUE" : "FALSE",
>>>                p_opts->log_file,
>>> -               p_opts->log_max_size,
>>> +               p_opts->log_max_size/1024/1024,
>>>                p_opts->accum_log_file ? "TRUE" : "FALSE",
>>>                p_opts->dump_files_dir,
>>>                p_opts->enable_quirks ? "TRUE" : "FALSE",
>>
>> Should your patch for adding opensm.conf to scripts be updated to v2 ?
>>
>> -- Hal
>>
>
> Can you please explain?

Doesn't this change these lines (a comment and the value of
log_max_size) in the opensm.conf file which you are proposing to be
added into scripts ?

-- Hal

>
> Thanks,
> Doron
>
>>> --
>>> 1.5.3.8
>>>
>>>
>
>


From dorons at voltaire.com  Thu Nov  6 06:22:03 2008
From: dorons at voltaire.com (Doron Shoham)
Date: Thu, 06 Nov 2008 16:22:03 +0200
Subject: [ofa-general] [PATCH] export osm_log_max in MB
In-Reply-To: <f0e08f230811060603j6a1a98bdqe77f7026245c0af4@mail.gmail.com>
References: <49101D1F.4040605@Voltaire.COM> <4912DC30.40309@Voltaire.COM>	
	<f0e08f230811060531j6be51fffv1ce2f45c6573e9b0@mail.gmail.com>	
	<4912F7B7.1000109@Voltaire.COM>
	<f0e08f230811060603j6a1a98bdqe77f7026245c0af4@mail.gmail.com>
Message-ID: <4912FD8B.9070304@voltaire.com>

Hal Rosenstock wrote:
> On Thu, Nov 6, 2008 at 8:57 AM, Doron Shoham <dorons at voltaire.com> wrote:
>> Hal Rosenstock wrote:
>>> On Thu, Nov 6, 2008 at 6:59 AM, Doron Shoham <dorons at voltaire.com> wrote:
>>>> export the osm_log_max in MB when using 'opensm -c <conf>
>>>>
>>>> Signed-off-by: Doron Shoham <dorons at voltaire.com>
>>>> ---
>>>>  opensm/opensm/osm_subnet.c |    4 ++--
>>>>  1 files changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
>>>> index 0422d0f..c130c0d 100644
>>>> --- a/opensm/opensm/osm_subnet.c
>>>> +++ b/opensm/opensm/osm_subnet.c
>>>> @@ -1668,7 +1668,7 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts)
>>>>                "force_log_flush %s\n\n"
>>>>                "# Log file to be used\n"
>>>>                "log_file %s\n\n"
>>>> -               "# Limit the size of the log file. If overrun, log is restarted\n"
>>>> +               "# Limit the size of the log file in MB. If overrun, log is restarted\n"
>>>>                "log_max_size %lu\n\n"
>>>>                "# If TRUE will accumulate the log over multiple OpenSM sessions\n"
>>>>                "accum_log_file %s\n\n"
>>>> @@ -1694,7 +1694,7 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts)
>>>>                p_opts->log_flags,
>>>>                p_opts->force_log_flush ? "TRUE" : "FALSE",
>>>>                p_opts->log_file,
>>>> -               p_opts->log_max_size,
>>>> +               p_opts->log_max_size/1024/1024,
>>>>                p_opts->accum_log_file ? "TRUE" : "FALSE",
>>>>                p_opts->dump_files_dir,
>>>>                p_opts->enable_quirks ? "TRUE" : "FALSE",
>>> Should your patch for adding opensm.conf to scripts be updated to v2 ?
>>>
>>> -- Hal
>>>
>> Can you please explain?
> 
> Doesn't this change these lines (a comment and the value of
> log_max_size) in the opensm.conf file which you are proposing to be
> added into scripts ?
> 
> -- Hal
> 

The first patch converts the log_size from opensm.conf to MB.
The second one converts in the opposite direction when opensm dumps
its configuration.
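
The round trip between the two patches can be sketched as below. This is
only an illustration, assuming the value in opensm.conf is in MB and the
option is held internally in bytes; the helper names are hypothetical and
are not OpenSM functions. Only the division matches the patch quoted above.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; not part of OpenSM. */
#define BYTES_PER_MB (1024UL * 1024UL)

/* Reading the config: the file holds megabytes, the option holds bytes. */
static unsigned long log_size_mb_to_bytes(unsigned long mb)
{
	/* Refuse values that would overflow unsigned long. */
	assert(mb <= (unsigned long)-1 / BYTES_PER_MB);
	return mb * BYTES_PER_MB;
}

/* Dumping the config: same division as in the patch (rounds down). */
static unsigned long log_size_bytes_to_mb(unsigned long bytes)
{
	return bytes / 1024 / 1024;
}
```

With both directions in place, a value read from the file and then dumped
again comes back unchanged, e.g. 8 -> 8388608 -> 8.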


>> Thanks,
>> Doron
>>
>>>> --
>>>> 1.5.3.8
>>>>
>>>>
>>


From kelly at tradebotsystems.com  Thu Nov  6 06:24:57 2008
From: kelly at tradebotsystems.com (Kelly Burkhart)
Date: Thu, 6 Nov 2008 08:24:57 -0600
Subject: [ofa-general] infiniband multicast (libibverbs)
Message-ID: <98B0CDCB28A5EE4CB3678CD99406644E34349F@tbmail2.tradebot.com>

I believe that the problem was that prior to receiving any messages, I
was posting many recvs using the same buffer.  I was expecting
that the buffer wouldn't be filled until I polled the cq for the
completion.  Instead, it appears that my buffer was being filled and
then overwritten as fast as messages came in.  So when I polled for
the completion of the fifth message, the buffer might already contain
the tenth.

To resolve the issue, I created a larger memory region and treated it
as a circular buffer.  When I posted my recvs in advance, each WR pointed to
a different portion of the MR.  Now I should only have problems if I
can't process messages fast enough and my buffer wraps.
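
A minimal sketch of the slot arithmetic, assuming equal-sized slots; the
struct and function names are made up for illustration and are not from
my actual code or from libibverbs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one registered region split into equal slots. */
struct recv_ring {
	uint8_t *base;     /* start of the registered memory region */
	size_t slot_size;  /* bytes reserved for each receive */
	size_t num_slots;  /* receives that may be outstanding at once */
};

/* Address the n-th posted receive should point at; wraps after num_slots. */
static uint8_t *recv_ring_slot(const struct recv_ring *r, uint64_t n)
{
	assert(r->num_slots > 0);
	return r->base + (size_t)(n % r->num_slots) * r->slot_size;
}
```

The returned address would go into the addr field of the struct ibv_sge
for that work request.  For UD receives, slot_size has to leave room for
the 40-byte GRH in front of the payload.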

Thanks,

-K

________________________________

	From: Devesh Sharma [mailto:devesh28 at gmail.com] 
	Sent: Thursday, November 06, 2008 3:09 AM
	To: Kelly Burkhart
	Cc: Roland Dreier; general at lists.openfabrics.org
	Subject: Re: [ofa-general] infiniband multicast (libibverbs)
	
	
	ok, try to do the sequence number check after a slight delay, say
after a 100ns delay. Is it possible that DMA latencies are coming into
the picture? Roland or Dotan can comment on this!
	
	
	On 11/5/08, Kelly Burkhart <kelly at tradebotsystems.com> wrote: 

		It is non-blocking.  I spin, calling ibv_poll_cq until
it returns a non-zero.


From dorons at Voltaire.COM  Thu Nov  6 06:26:43 2008
From: dorons at Voltaire.COM (Doron Shoham)
Date: Thu, 06 Nov 2008 16:26:43 +0200
Subject: [ofa-general] [PATCH] limit log records number and size
Message-ID: <4912FEA3.3090409@Voltaire.COM>

limit log records number and size

Signed-off-by: Doron Shoham <dorons at voltaire.com>
---
 opensm/scripts/opensm.logrotate |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/opensm/scripts/opensm.logrotate b/opensm/scripts/opensm.logrotate
index e16e227..e0f4125 100644
--- a/opensm/scripts/opensm.logrotate
+++ b/opensm/scripts/opensm.logrotate
@@ -4,4 +4,6 @@
     copytruncate
     weekly
     compress
+    rotate 10
+    size 100M
 }
-- 
1.5.3.8


From hal.rosenstock at gmail.com  Thu Nov  6 06:29:09 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 6 Nov 2008 09:29:09 -0500
Subject: [ofa-general] [PATCH] export osm_log_max in MB
In-Reply-To: <4912FD8B.9070304@voltaire.com>
References: <49101D1F.4040605@Voltaire.COM> <4912DC30.40309@Voltaire.COM>
	<f0e08f230811060531j6be51fffv1ce2f45c6573e9b0@mail.gmail.com>
	<4912F7B7.1000109@Voltaire.COM>
	<f0e08f230811060603j6a1a98bdqe77f7026245c0af4@mail.gmail.com>
	<4912FD8B.9070304@voltaire.com>
Message-ID: <f0e08f230811060629g5a0fab76r2a4d96c421efd5c@mail.gmail.com>

On Thu, Nov 6, 2008 at 9:22 AM, Doron Shoham <dorons at voltaire.com> wrote:
> Hal Rosenstock wrote:
>>
>> On Thu, Nov 6, 2008 at 8:57 AM, Doron Shoham <dorons at voltaire.com> wrote:
>>>
>>> Hal Rosenstock wrote:
>>>>
>>>> On Thu, Nov 6, 2008 at 6:59 AM, Doron Shoham <dorons at voltaire.com>
>>>> wrote:
>>>>>
>>>>> export the osm_log_max in MB when using 'opensm -c <conf>
>>>>>
>>>>> Signed-off-by: Doron Shoham <dorons at voltaire.com>
>>>>> ---
>>>>>  opensm/opensm/osm_subnet.c |    4 ++--
>>>>>  1 files changed, 2 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
>>>>> index 0422d0f..c130c0d 100644
>>>>> --- a/opensm/opensm/osm_subnet.c
>>>>> +++ b/opensm/opensm/osm_subnet.c
>>>>> @@ -1668,7 +1668,7 @@ int osm_subn_write_conf_file(char *file_name, IN
>>>>> osm_subn_opt_t *const p_opts)
>>>>>               "force_log_flush %s\n\n"
>>>>>               "# Log file to be used\n"
>>>>>               "log_file %s\n\n"
>>>>> -               "# Limit the size of the log file. If overrun, log is
>>>>> restarted\n"
>>>>> +               "# Limit the size of the log file in MB. If overrun,
>>>>> log is restarted\n"
>>>>>               "log_max_size %lu\n\n"
>>>>>               "# If TRUE will accumulate the log over multiple OpenSM
>>>>> sessions\n"
>>>>>               "accum_log_file %s\n\n"
>>>>> @@ -1694,7 +1694,7 @@ int osm_subn_write_conf_file(char *file_name, IN
>>>>> osm_subn_opt_t *const p_opts)
>>>>>               p_opts->log_flags,
>>>>>               p_opts->force_log_flush ? "TRUE" : "FALSE",
>>>>>               p_opts->log_file,
>>>>> -               p_opts->log_max_size,
>>>>> +               p_opts->log_max_size/1024/1024,
>>>>>               p_opts->accum_log_file ? "TRUE" : "FALSE",
>>>>>               p_opts->dump_files_dir,
>>>>>               p_opts->enable_quirks ? "TRUE" : "FALSE",
>>>>
>>>> Should your patch for adding opensm.conf to scripts be updated to
>>>> v2 ?
>>>>
>>>> -- Hal
>>>>
>>> Can you please explain?
>>
>> Doesn't this change these lines (a comment and the value of
>> log_max_size) in the opensm.conf file which you are proposing to be
>> added into scripts ?

Understood. It's a nit but I was referring to "[PATCH 1/2] add default
configuration files" where in opensm.conf there is:

+# Limit the size(MB) of the log file. If overrun, log is restarted
+log_max_size 4096

-- Hal

>>
>> -- Hal
>>
>
> The first patch converts the log_size from opensm.conf to MB.
> The second one converts in the opposite direction when opensm dump
> its configuration.
>
>
>>> Thanks,
>>> Doron
>>>
>>>>> --
>>>>> 1.5.3.8
>>>>>
>>>>>
>>>
>
>


From jackm at dev.mellanox.co.il  Thu Nov  6 07:12:50 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 6 Nov 2008 17:12:50 +0200
Subject: [ofa-general] [PATCH] ipoib: null tx/rx_ring skb pointers on free
In-Reply-To: <20081106012307.GP31163@sgi.com>
References: <20081106012307.GP31163@sgi.com>
Message-ID: <200811061712.50605.jackm@dev.mellanox.co.il>

On Thursday 06 November 2008 03:23, akepner at sgi.com wrote:
> I described an IPoIB-related panic we were seeing on large 
> clusters. The signature was a backtrace like this:
> 
>         skb_over_panic
>         :ib_ipoib:ipoib_ib_handle_rx_wc
>         :ib_ipoib:ipoib_poll
>         net_rx_action
>         .....
> 
> The bug is difficult to reproduce, but we finally got a crashdump, 
> and the problem appears to be that stale skb pointers on the tx_ring 
> were left pointing to skbs that had been since reused, so that the 
> skb's data region was now unexpectedly short, etc. 
> 
How does ipoib_ib_handle_rx_wc() involve the tx_ring? This is receive processing.

- Jack


From akepner at sgi.com  Thu Nov  6 08:04:28 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 6 Nov 2008 08:04:28 -0800
Subject: [ofa-general] [PATCH] ipoib: null tx/rx_ring skb pointers on free
In-Reply-To: <20081106084031.GA25354@mtls03>
References: <20081106012307.GP31163@sgi.com> <20081106084031.GA25354@mtls03>
Message-ID: <20081106160428.GR31163@sgi.com>

On Thu, Nov 06, 2008 at 10:40:32AM +0200, Eli Cohen wrote:
> On Wed, Nov 05, 2008 at 05:23:07PM -0800, akepner at sgi.com wrote:
> ...
> looking at the patch I don't understand why it should fix the problem
> you're seeing. I suspect we may be hiding the problem.
> 

I think that may be correct. 

For the stale skb pointers to be reused by the ipoib driver, it 
looks like we'd need to get 'unexpected' completions. 

-- 
Arthur


From akepner at sgi.com  Thu Nov  6 08:40:05 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 6 Nov 2008 08:40:05 -0800
Subject: [ofa-general] [PATCH] ipoib: null tx/rx_ring skb pointers on free
In-Reply-To: <200811061712.50605.jackm@dev.mellanox.co.il>
References: <20081106012307.GP31163@sgi.com>
	<200811061712.50605.jackm@dev.mellanox.co.il>
Message-ID: <20081106164005.GS31163@sgi.com>

On Thu, Nov 06, 2008 at 05:12:50PM +0200, Jack Morgenstein wrote:
> On Thursday 06 November 2008 03:23, akepner at sgi.com wrote:
> > I described an IPoIB-related panic we were seeing on large 
> > clusters. The signature was a backtrace like this:
> > 
> >         skb_over_panic
> >         :ib_ipoib:ipoib_ib_handle_rx_wc
> >         :ib_ipoib:ipoib_poll
> >         net_rx_action
> >         .....
> > 
> > The bug is difficult to reproduce, but we finally got a crashdump, 
> > and the problem appears to be that stale skb pointers on the tx_ring 
> > were left pointing to skbs that had been since reused, so that the 
> > skb's data region was now unexpectedly short, etc. 
> > 
> How does ipoib_ib_handle_rx_wc() involve the tx_ring? This is 
> receive processing.
> 

What I surmise may be happening is something like this:

- tx skb is freed, but a stale pointer remains on tx_ring
- the same skb is reallocated, and added to the rx_ring
- now we get an 'unexpected' tx completion, and use the stale 
  skb pointer on the tx_ring to again free the skb (this step 
  seems to invoke a f/w bug)
- another driver, say an ethernet driver, reallocates the skb, 
  reducing the extent of the data region (leading to the 
  skb_over_panic once it's processed by ipoib_ib_handle_rx_wc)


This bug leaves the tx and rx rings corrupted in many ways, 
including:

- different rx_ring members refer to the same skb
- different skbs on the rx_ring have identical data, head, end, tail ptrs
- skbs on the rx_ring have sizes inconsistent with what the ipoib 
  driver allocates (which causes the skb_over_panic, of course)
- rx skbs have 'dev' pointers to ethernet devices 
- dma mappings in rx_ring aren't consistent with what's in skb
- some skbs are simultaneously on the tx and rx rings
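
The idea of nulling the ring pointer on free can be modelled in user space
roughly as follows. This is a simplified sketch, not the driver code: the
types, the ring size and the function name are stand-ins chosen for
illustration.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4  /* illustrative; not the driver's ring size */

struct buf { int in_use; };                    /* stand-in for struct sk_buff */
struct ring { struct buf *slot[RING_SIZE]; };  /* stand-in for tx_ring */

/*
 * Handle a completion for entry i: release the buffer and clear the slot.
 * Returns 0 for a normal completion, or -1 if the slot was already empty,
 * i.e. the completion was 'unexpected'.
 */
static int ring_complete(struct ring *r, unsigned int i)
{
	struct buf *b = r->slot[i % RING_SIZE];

	if (b == NULL)
		return -1;  /* nothing to free: report it instead of freeing twice */
	b->in_use = 0;  /* stand-in for freeing the skb */
	r->slot[i % RING_SIZE] = NULL;
	return 0;
}
```

With the slot cleared, a second completion for the same entry is detected
rather than freeing a buffer that may already belong to someone else.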

-- 
Arthur


From weiny2 at llnl.gov  Thu Nov  6 09:11:13 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 6 Nov 2008 09:11:13 -0800
Subject: [ofa-general] [PATCH] ipoib: null tx/rx_ring skb pointers on free
In-Reply-To: <20081106160428.GR31163@sgi.com>
References: <20081106012307.GP31163@sgi.com> <20081106084031.GA25354@mtls03>
	<20081106160428.GR31163@sgi.com>
Message-ID: <20081106091113.66bcff92.weiny2@llnl.gov>

On Thu, 6 Nov 2008 08:04:28 -0800
akepner at sgi.com wrote:

> On Thu, Nov 06, 2008 at 10:40:32AM +0200, Eli Cohen wrote:
> > On Wed, Nov 05, 2008 at 05:23:07PM -0800, akepner at sgi.com wrote:
> > ...
> > looking at the patch I don't understand why it should fix the problem
> > you're seeing. I suspect we may be hiding the problem.
> > 
> 
> I think that may be correct. 
> 
> For the stale skb pointers to be reused by the ipoib driver, it 
> looks like we'd need to get 'unexpected' completions. 
> 

If this is the case we could use a debug patch which Al developed here which
simply flags the skb as "invalid" but leaves the pointer.  Then we could use
that flag to determine when these "unexpected" completions are occurring.

I can get the patch from Al if you would like.

Ira


From chu11 at llnl.gov  Thu Nov  6 09:23:56 2008
From: chu11 at llnl.gov (Al Chu)
Date: Thu, 06 Nov 2008 09:23:56 -0800
Subject: [ofa-general] [PATCH] ipoib: null tx/rx_ring skb pointers on free
In-Reply-To: <20081106160428.GR31163@sgi.com>
References: <20081106012307.GP31163@sgi.com> <20081106084031.GA25354@mtls03>
	<20081106160428.GR31163@sgi.com>
Message-ID: <1225992236.13371.19.camel@cardanus.llnl.gov>

On Thu, 2008-11-06 at 08:04 -0800, akepner at sgi.com wrote:
> On Thu, Nov 06, 2008 at 10:40:32AM +0200, Eli Cohen wrote:
> > On Wed, Nov 05, 2008 at 05:23:07PM -0800, akepner at sgi.com wrote:
> > ...
> > looking at the patch I don't understand why it should fix the problem
> > you're seeing. I suspect we may be hiding the problem.
> > 
> 
> I think that may be correct. 
>
> For the stale skb pointers to be reused by the ipoib driver, it 
> looks like we'd need to get 'unexpected' completions. 

I implemented the attached cheapo-debug-patch and installed it on one of
our clusters.  We hit the error condition (the "Oh crap" error message)
several times right before the same crashes.  So I think Arthur's patch
fixes something, although there may be a more core underlying issue yet
to be solved.

Al

P.S.  I should note that when debugging this, I was looking at a
different stack trace than Arthur and Ira, but believed it to be the
same core issue.

-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: verify_skb_reset.patch
Type: text/x-patch
Size: 3196 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081106/b093e4d5/attachment.bin>

From chu11 at llnl.gov  Thu Nov  6 09:31:47 2008
From: chu11 at llnl.gov (Al Chu)
Date: Thu, 06 Nov 2008 09:31:47 -0800
Subject: [ofa-general] [PATCH] ipoib: null tx/rx_ring skb pointers on free
In-Reply-To: <20081106091113.66bcff92.weiny2@llnl.gov>
References: <20081106012307.GP31163@sgi.com> <20081106084031.GA25354@mtls03>
	<20081106160428.GR31163@sgi.com>
	<20081106091113.66bcff92.weiny2@llnl.gov>
Message-ID: <1225992707.13371.22.camel@cardanus.llnl.gov>

On Thu, 2008-11-06 at 09:11 -0800, Ira Weiny wrote:
> On Thu, 6 Nov 2008 08:04:28 -0800
> akepner at sgi.com wrote:
> 
> > On Thu, Nov 06, 2008 at 10:40:32AM +0200, Eli Cohen wrote:
> > > On Wed, Nov 05, 2008 at 05:23:07PM -0800, akepner at sgi.com wrote:
> > > ...
> > > looking at the patch I don't understand why it should fix the problem
> > > you're seeing. I suspect we may be hiding the problem.
> > > 
> > 
> > I think that may be correct. 
> > 
> > For the stale skb pointers to be reused by the ipoib driver, it 
> > looks like we'd need to get 'unexpected' completions. 
> > 
> 
> If this is the case we could use a debug patch which Al developed here which
> simply flags the skb as "invalid" but leaves the pointer.  Then we could use
> that flag to determine when these "unexpected" completions are occurring.

FYI, this is the patch I just posted.
Al


> I can get the patch from Al if you would like.
> 
> Ira
> 
-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory


From friedman at ucla.edu  Thu Nov  6 10:34:57 2008
From: friedman at ucla.edu (Scott A. Friedman)
Date: Thu, 06 Nov 2008 10:34:57 -0800
Subject: [ofa-general] ib_mthca catastrophic error detected
In-Reply-To: <200811061154.02260.jackm@dev.mellanox.co.il>
References: <4906645D.6010101@ucla.edu> <4907054E.9080205@mellanox.co.il>
	<490763D0.5020002@ucla.edu>
	<200811061154.02260.jackm@dev.mellanox.co.il>
Message-ID: <491338D1.8050205@ucla.edu>

Hi

We have been working with Matthew Finlay <Matt at mellanox.com> on this 
recently - you/we might pull all of this together. We are able to make 
any of our sdr cards have a catastrophic error - and are unable to do 
the same with our ddr cards. Matt has suggested that there is a firmware 
fix possibly?

Anyway, to answer your questions:

The hosts are Sun X2200M, but we have swapped a few around with some 
hosts we have from Aspen systems and the problem remains. I suppose the 
similarity is that they are all nForce based.

The MPI used was the latest OpenMPI - I will find the version, but I do 
not think it matters whether we are using OpenMPI or MVAPICH.

The job itself does not seem to matter either. The situation is after a 
node comes up it takes a very long time for the card to become ACTIVE. 
It seems to oscillate between ACTIVE and INIT. We have waited several 
minutes sometimes but can never be sure of when it will settle down. The 
queue certainly doesn't know and a job submitted to such a node will die 
as the cards will have a catastrophic error.
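
One way a node check could tell "settled" from "still flapping" is to
require a run of consecutive ACTIVE samples before accepting jobs. The
sketch below is only an illustration of that idea, not something we run
today; the enum, the function name and the sampling scheme are all
assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative subset of port states; not taken from any library header. */
enum port_state { PORT_INIT, PORT_ACTIVE };

/*
 * Given n sampled port states, return the index of the sample at which the
 * port has been ACTIVE for `need` consecutive samples, or -1 if it never
 * stays ACTIVE that long.
 */
static long first_stable_active(const enum port_state *s, size_t n, size_t need)
{
	size_t i, run = 0;

	assert(need > 0);
	for (i = 0; i < n; i++) {
		run = (s[i] == PORT_ACTIVE) ? run + 1 : 0;
		if (run == need)
			return (long)i;
	}
	return -1;
}
```

How many samples to require, and how far apart, would have to be tuned
against how long the flapping actually lasts on these nodes.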

Scott


 > Console output from the following linux commands:
 >   cat /etc/*rel*


Not a good idea...maybe this

#cat /etc/redhat-release
CentOS release 5 (Final)

 >   cat /etc/lilo.conf , or:  cat /boot/grub/menu.lst (if you are using 
grub)

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /boot/, eg.
#          root (hd0,0)
#          kernel /vmlinuz-version ro root=/dev/hda3
#          initrd /initrd-version.img
#boot=/dev/hda
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title CentOS (2.6.18-92.1.6.el5)
  root (hd0,0)
  kernel /vmlinuz-2.6.18-92.1.6.el5 ro root=LABEL=/ rhgb quiet
  initrd /initrd-2.6.18-92.1.6.el5.img


 >   uname -a

Linux n141 2.6.18-92.1.6.el5 #1 SMP Wed Jun 25 13:45:47 EDT 2008 x86_64 
x86_64 x86_64 GNU/Linux


 >   cat /proc/cpuinfo
 >   cat /proc/meminfo

processor : 0
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 0
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4424.75
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 1
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 1
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4426.22
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 2
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 2
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4421.37
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 3
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 3
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4421.65
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 4
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 0
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.36
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 5
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 1
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.71
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 6
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 2
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.17
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 7
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 3
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.17
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]


MemTotal:      8182568 kB
MemFree:       4535892 kB
Buffers:        318232 kB
Cached:        1583772 kB
SwapCached:          0 kB
Active:        2714400 kB
Inactive:       730260 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      8182568 kB
LowFree:       4535892 kB
SwapTotal:     8289532 kB
SwapFree:      8289380 kB
Dirty:             340 kB
Writeback:           0 kB
AnonPages:     1542636 kB
Mapped:          14588 kB
Slab:           139788 kB
PageTables:       7208 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  12380816 kB
Committed_AS:  1679420 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      4600 kB
VmallocChunk: 34359733707 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB


Jack Morgenstein wrote:
> On Tuesday 28 October 2008 21:11, Scott A. Friedman wrote:
>> Hi
>>
>> This cluster has OFED 1.2.5.4 running on it. The ib_mthca kernel module 
>> reports the following on startup:
>>
>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
>>
>> The cards in all (22) of the nodes we have seen this error on are as 
>> follows:
>>
>> hca_id: mthca0
>>          fw_ver:                         1.2.0
>>          vendor_id:                      0x02c9
>>          vendor_part_id:                 25204
>>          hw_ver:                         0xA0
>>          board_id:                       MT_03B0140001
>>          phys_port_cnt:                  1
>>
>> It appears that when this happens the driver restarts (loads?) itself 
>> however the job running at the time of the error is, of course, killed.
>>
>> Scott
> 
> Scott,
> We are trying to reproduce this here.  It would help if you could supply
> the following info:
> 
> Host model for hosts which are experiencing the failure:
>  
> Console output from the following linux commands:
>   cat /etc/*rel*
>   cat /etc/lilo.conf , or:  cat /boot/grub/menu.lst (if you are using grub)
>   uname -a
>   cat /proc/cpuinfo
>   cat /proc/meminfo
> 
> Also, what sort of job was running when the failure occurred:
> -- which MPI are you using?
> -- do you have a test example which we can run here to reproduce the problem?
> 
> Thanks in advance for your help!
> 
> Jack Morgenstein
> Senior Software Development Engineer
> Mellanox


From andy.grover at oracle.com  Thu Nov  6 10:58:24 2008
From: andy.grover at oracle.com (Andy Grover)
Date: Thu, 06 Nov 2008 10:58:24 -0800
Subject: [ofa-general] Re: [PATCH] opensm: fix iser service-id used for SL
	assignment
In-Reply-To: <Pine.LNX.4.64.0811061458410.3153@zuben.voltaire.com>
References: <Pine.LNX.4.64.0811061456540.3153@zuben.voltaire.com>
	<Pine.LNX.4.64.0811061458410.3153@zuben.voltaire.com>
Message-ID: <49133E50.7090508@oracle.com>

Or Gerlitz wrote:
> BTW - while doing this fix, I noted that the port assumed by opensm
> for RDS is 18634 (0x48CA), which is the one used in the rds code
> deployed in ofed 1.3.x, whereas the rds code base deployed in ofed
> 1.4.y uses port 18635.
> 
> Andy, Rick, can you guys revert to 18634 to make things simpler wrt
> RDS/QoS configuration?

It appears this is a fix for multiple rds transports each trying to bind
to that port with INADDR_ANY, see commit f0af6566. I think the correct
fix is to use a single port but have transports listen on their specific
interfaces only.

I think this is too big a fix for 1.4.0 so I will simply disable TCP
transport there (leaving just IB transport, thus no problem) and move
the port back to 18634. For 1.4.1 we will have multiple transports again
and will need to fix this by not using INADDR_ANY, as described above.

Regards -- Andy
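
A minimal userspace sketch of the "single port, specific interface" idea
described above, assuming plain IPv4 TCP sockets as an analogy (the RDS
kernel code itself is not shown in this thread). The helper name
listen_on_addr() and the RDS_PORT macro are hypothetical; only the port
number 18634 (0x48CA) comes from the discussion:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define RDS_PORT 18634 /* 0x48CA, the port named in this thread */

/*
 * Hypothetical helper: open a TCP listener bound to one local IPv4
 * address instead of INADDR_ANY. Returns the socket fd, or -1 on error.
 */
int listen_on_addr(const char *local_ip, unsigned short port)
{
	struct sockaddr_in sin;
	int fd, one = 1;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	/* Reject anything that is not a dotted-quad IPv4 address. */
	if (inet_pton(AF_INET, local_ip, &sin.sin_addr) != 1)
		return -1;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;
	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    listen(fd, 16) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A listener created this way claims the port on one address only, so a
second transport could in principle bind the same port on a different
address, which a wildcard (INADDR_ANY) bind would normally prevent.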


From swise at opengridcomputing.com  Thu Nov  6 11:06:45 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 06 Nov 2008 13:06:45 -0600
Subject: [ofa-general] Re: [ewg] OFED Nov 3 2008 meeting summary on OFED 1.4
	status
In-Reply-To: <49102064.7080004@dev.mellanox.co.il>
References: <49102064.7080004@dev.mellanox.co.il>
Message-ID: <49134045.2080705@opengridcomputing.com>

Hey Vlad,

I opened a few critical bugs against cxgb3 for rhel4.x backport issues.  
We're trying to resolve them asap.

When is the cutoff for making rc4?

Thanks,


Steve.


Vladimir Sokolovsky wrote:
> Meeting minutes on the web:
> http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes/
>
> Meeting Summary:
> ==============
> RC4 is delayed - will be released on Thursday Nov 6.
>
> Details:
> =======
> Bugs to be fixed in RC4:
>
> 1283  blocker   P1  RHEL 5  yannick.cote at qlogic.com    NEW
>       Intel MPI fails on Qlogc HCA
> 1326  blocker   P1  RHEL 4  yannick.cote at qlogic.com    NEW
>       ipath driver fails to build on IA64 in the 10/28/08 daily build
> 1335  major     P3  Other   monis at voltaire.com         NEW
>       Bonding: packet lost during failover
> 1301  major     P3  RHEL 4  olgas at voltaire.com         NEW
>       Can not load rds module on RH4 up7
> 1323  blocker   P1  All     stefan.roscher at de.ibm.com  REOPENED
>       IB/ehca: possibillity of kernel panic under certain circumstances
> 1242  critical  P2  RHEL 4  yannick.cote at qlogic.com    NEW
>       kernel panic while running mpi2007 against ofed1.4 -- ib_ipath:
>       ipath_sdma_verbs_send
> 1336  critical  P1  RHEL 5  bugzilla at openib.org        NEW
>       Can't to unloading the mlx4_ib module on ppc64
>
> Regards,
> Vladimir
>
>
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


From jon at opengridcomputing.com  Thu Nov  6 12:23:22 2008
From: jon at opengridcomputing.com (Jon Mason)
Date: Thu, 6 Nov 2008 14:23:22 -0600
Subject: [ofa-general] Re: [PATCH] opensm: fix iser service-id used for SL
	assignment
In-Reply-To: <49133E50.7090508@oracle.com>
References: <Pine.LNX.4.64.0811061456540.3153@zuben.voltaire.com>
	<Pine.LNX.4.64.0811061458410.3153@zuben.voltaire.com>
	<49133E50.7090508@oracle.com>
Message-ID: <20081106202322.GE15978@opengridcomputing.com>

On Thu, Nov 06, 2008 at 10:58:24AM -0800, Andy Grover wrote:
> Or Gerlitz wrote:
> > BTW - while doing this fix, I noted that the port assumed by opensm
> > for RDS is 18634 (0x48CA), which is the one used in the rds code
> > deployed in ofed 1.3.x, whereas the rds code base deployed in ofed
> > 1.4.y uses port 18635.
> > 
> > Andy, Rick, can you guys revert to 18634 to make things simpler wrt
> > RDS/QoS configuration?
> 
> It appears this is a fix for multiple rds transports each trying to bind
> to that port with INADDR_ANY, see commit f0af6566. I think the correct
> fix is to use a single port but have transports listen on their specific
> interfaces only.
> 
> I think this is too big a fix for 1.4.0 so I will simply disable TCP
> transport there (leaving just IB transport, thus no problem) and move
> the port back to 18634. For 1.4.1 we will have multiple transports again
> and will need to fix this by not using INADDR_ANY, as described above.

There needs to be a separate port for all the interfaces.  IIRC, each
RDS transport type is listening on a specific port for incoming
connections.  With each one squatting, the other ones will receive
incoming connections.  So for the existing iWARP setup in RDS, they
must be separate.


If they are migrated to a specific physical port or IP address/port
tuple, then this is not an issue.  Also, there should be a standard port
to listen on (and not squat on an ephemeral port, as this can cause
problems).

Thanks,
Jon

> 
> Regards -- Andy


From or.gerlitz at gmail.com  Thu Nov  6 13:25:46 2008
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Thu, 6 Nov 2008 23:25:46 +0200
Subject: [ofa-general] Re: [PATCH] opensm: fix iser service-id used for SL
	assignment
In-Reply-To: <49133E50.7090508@oracle.com>
References: <Pine.LNX.4.64.0811061456540.3153@zuben.voltaire.com>
	<Pine.LNX.4.64.0811061458410.3153@zuben.voltaire.com>
	<49133E50.7090508@oracle.com>
Message-ID: <15ddcffd0811061325m2dcad0e9i789fa873d76aaa7f@mail.gmail.com>

On Thu, Nov 6, 2008 at 8:58 PM, Andy Grover <andy.grover at oracle.com> wrote:
>> can you guys revert to 18634 to make things simpler wrt RDS/QoS configuration?

> It appears this is a fix for multiple rds transports each trying to bind
> to that port with INADDR_ANY, see commit f0af6566. I think the correct
> fix is to use a single port but have transports listen on their specific
> interfaces only.

Andy,

commit f0af6566 came to handle the case where RDS iWARP and RDS TCP
listeners co-exist on the same node.

In the general case, there would be no special interface or IP
address for iWARP; the same IP is used for both TCP connections served
by the OS stack and iWARP connections served by the NIC TOE stack.
This creates the "TOE port space problem": when the NIC gets a TCP
connection request for port X, it has no clue whether it needs to be
served by the TOE stack or the OS stack, so an RDS iWARP connection
request to port 18635 could be routed to an iperf server that was
spawned to listen on that port. A possible solution suggested by the
iWARP guys was to have the TCP port space shared between TCP and RDMA
listeners. Currently the Linux kernel netdev maintainers are not
willing to accept such a patch, and the current suggestion was applied
in ofed 1.4; see cma_0100_unified_tcp_ports.patch under
kernel_patches/fixes.

> I think this is too big a fix for 1.4.0 so I will simply disable TCP
> transport there (leaving just IB transport, thus no problem) and move
> the port back to 18634. For 1.4.1 we will have multiple transports again
> and will need to fix this by not using INADDR_ANY, as described above.

Yes, let's have the IB transport use port 18634. As I explained above,
the INADDR_ANY usage is not the problem.

Or.


From andy.grover at oracle.com  Thu Nov  6 13:41:01 2008
From: andy.grover at oracle.com (Andy Grover)
Date: Thu, 06 Nov 2008 13:41:01 -0800
Subject: [ofa-general] Re: [PATCH] opensm: fix iser service-id used for
	SL assignment
In-Reply-To: <15ddcffd0811061325m2dcad0e9i789fa873d76aaa7f@mail.gmail.com>
References: <Pine.LNX.4.64.0811061456540.3153@zuben.voltaire.com>	
	<Pine.LNX.4.64.0811061458410.3153@zuben.voltaire.com>	
	<49133E50.7090508@oracle.com>
	<15ddcffd0811061325m2dcad0e9i789fa873d76aaa7f@mail.gmail.com>
Message-ID: <4913646D.6080308@oracle.com>

Hi Vlad, please pull the below trees. They remove the unused RDS TCP
transport in 1.3 and 1.4. I've verified they do not break the build:

OFED 1.3:

Andy Grover (1):
      RDS: Remove TCP transport

www.openfabrics.org:/pub/scm/~agrover/ofed_1_3/linux-2.6.git
code-drop/20081106

OFED 1.4:

Andy Grover (2):
      RDS: Remove TCP transport
      RDS: Change listen port back to 18634

www.openfabrics.org:/pub/scm/~agrover/ofed_1_4/linux-2.6.git
code-drop/20081106

Thanks -- Andy


From swise at opengridcomputing.com  Thu Nov  6 15:06:42 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 06 Nov 2008 17:06:42 -0600
Subject: [ofa-general] [PATCH 2.6.28] RDMA/cxgb3: deadlock in iw_cxgb3 can
	cause hang when configuring interface.
Message-ID: <20081106230642.28808.66765.stgit@dell3.ogc.int>

From: Steve Wise <swise at opengridcomputing.com>

When the iw_cxgb3 module's cxgb3_client "add" func gets called by the
cxgb3 module, the iwarp driver ends up calling the ethtool ops get_drvinfo
function in cxgb3 to get the fw version and other info.  Currently the
iwarp driver grabs the rtnl lock around this down call to serialize.
As of 2.6.27 or so, things changed such that the rtnl lock is held around
the call to the netdev driver open function.  Also the cxgb3_client "add"
function doesn't get called if the device is down.  

So, if you load cxgb3, then load iw_cxgb3, then ifconfig up the device,
the iw_cxgb3 add func gets called with the rtnl_lock held.   If you
load cxgb3, ifconfig up the device, then load iw_cxgb3, the add func
gets called without the rtnl_lock held.  The former causes the deadlock,
the latter does not.

In addition, there are iw_cxgb3 sysfs handlers that also can call
down into cxgb3 to gather the fw and hw versions.  These can be called
concurrently on different processors and at any time.  Thus we need to
push this serialization down in the cxgb3 driver get_drvinfo func.

The fix is to remove rtnl lock usage, and use a per-device lock in cxgb3.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_provider.c |    6 ------
 drivers/net/cxgb3/cxgb3_main.c              |    2 ++
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index ecff980..160ef48 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -1102,9 +1102,7 @@ static u64 fw_vers_string_to_u64(struct iwch_dev *iwch_dev)
 	char *cp, *next;
 	unsigned fw_maj, fw_min, fw_mic;
 
-	rtnl_lock();
 	lldev->ethtool_ops->get_drvinfo(lldev, &info);
-	rtnl_unlock();
 
 	next = info.fw_version + 1;
 	cp = strsep(&next, ".");
@@ -1192,9 +1190,7 @@ static ssize_t show_fw_ver(struct device *dev, struct device_attribute *attr, ch
 	struct net_device *lldev = iwch_dev->rdev.t3cdev_p->lldev;
 
 	PDBG("%s dev 0x%p\n", __func__, dev);
-	rtnl_lock();
 	lldev->ethtool_ops->get_drvinfo(lldev, &info);
-	rtnl_unlock();
 	return sprintf(buf, "%s\n", info.fw_version);
 }
 
@@ -1207,9 +1203,7 @@ static ssize_t show_hca(struct device *dev, struct device_attribute *attr,
 	struct net_device *lldev = iwch_dev->rdev.t3cdev_p->lldev;
 
 	PDBG("%s dev 0x%p\n", __func__, dev);
-	rtnl_lock();
 	lldev->ethtool_ops->get_drvinfo(lldev, &info);
-	rtnl_unlock();
 	return sprintf(buf, "%s\n", info.driver);
 }
 
diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
index 1ace41a..5e663cc 100644
--- a/drivers/net/cxgb3/cxgb3_main.c
+++ b/drivers/net/cxgb3/cxgb3_main.c
@@ -1307,8 +1307,10 @@ static void get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info)
 	u32 fw_vers = 0;
 	u32 tp_vers = 0;
 
+	spin_lock(&adapter->stats_lock);
 	t3_get_fw_version(adapter, &fw_vers);
 	t3_get_tp_version(adapter, &tp_vers);
+	spin_unlock(&adapter->stats_lock);
 
 	strcpy(info->driver, DRV_NAME);
 	strcpy(info->version, DRV_VERSION);
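
The locking change above can be illustrated with a small userspace
analogy, assuming POSIX mutexes in place of the kernel's rtnl lock and
the adapter spinlock. All names here (fake_adapter, fake_rtnl,
fake_get_fw_version, fake_add) are hypothetical and are not part of
cxgb3 or iw_cxgb3:

```c
#include <pthread.h>

/* Hypothetical stand-in for a device that carries its own lock. */
struct fake_adapter {
	pthread_mutex_t lock;	/* per-device lock, taken by the callee */
	unsigned int fw_vers;
};

/* Global lock standing in for rtnl; the "add" path never takes it. */
pthread_mutex_t fake_rtnl = PTHREAD_MUTEX_INITIALIZER;

/* The callee serializes itself, so callers need no outer lock. */
unsigned int fake_get_fw_version(struct fake_adapter *a)
{
	unsigned int v;

	pthread_mutex_lock(&a->lock);
	v = a->fw_vers;
	pthread_mutex_unlock(&a->lock);
	return v;
}

/*
 * Models the client "add" callback. Because it does not touch
 * fake_rtnl, it is safe whether or not the caller already holds it.
 */
unsigned int fake_add(struct fake_adapter *a)
{
	return fake_get_fw_version(a);
}
```

Had fake_add() tried to take fake_rtnl itself, a caller that already
held that (non-recursive) lock would block forever, which is the shape
of the hang described in the commit message.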


From divy at chelsio.com  Thu Nov  6 15:27:21 2008
From: divy at chelsio.com (Divy Le Ray)
Date: Thu, 06 Nov 2008 15:27:21 -0800
Subject: [ofa-general] Re: [PATCH 2.6.28] RDMA/cxgb3: deadlock in iw_cxgb3
 can cause hang when configuring interface.
In-Reply-To: <20081106230642.28808.66765.stgit@dell3.ogc.int>
References: <20081106230642.28808.66765.stgit@dell3.ogc.int>
Message-ID: <49137D59.9070306@chelsio.com>

Steve Wise wrote:
> From: Steve Wise <swise at opengridcomputing.com>
>
> When the iw_cxgb3 module's cxgb3_client "add" func gets called by the
> cxgb3 module, the iwarp driver ends up calling the ethtool ops get_drvinfo
> function in cxgb3 to get the fw version and other info.  Currently the
> iwarp driver grabs the rtnl lock around this down call to serialize.
> As of 2.6.27 or so, things changed such that the rtnl lock is held around
> the call to the netdev driver open function.  Also the cxgb3_client "add"
> function doesn't get called if the device is down.  
>
> So, if you load cxgb3, then load iw_cxgb3, then ifconfig up the device,
> the iw_cxgb3 add func gets called with the rtnl_lock held.   If you
> load cxgb3, ifconfig up the device, then load iw_cxgb3, the add func
> gets called without the rtnl_lock held.  The former causes the deadlock,
> the latter does not.
>
> In addition, there are iw_cxgb3 sysfs handlers that also can call
> down into cxgb3 to gather the fw and hw versions.  These can be called
> concurrently on different processors and at any time.  Thus we need to
> push this serialization down in the cxgb3 driver get_drvinfo func.
>
> The fix is to remove rtnl lock usage, and use a per-device lock in cxgb3.
>
> Signed-off-by: Steve Wise <swise at opengridcomputing.com>
>   

Acked-by: Divy Le Ray <divy at chelsio.com>

> ---
>
>  drivers/infiniband/hw/cxgb3/iwch_provider.c |    6 ------
>  drivers/net/cxgb3/cxgb3_main.c              |    2 ++
>  2 files changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
> index ecff980..160ef48 100644
> --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
> +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
> @@ -1102,9 +1102,7 @@ static u64 fw_vers_string_to_u64(struct iwch_dev *iwch_dev)
>  	char *cp, *next;
>  	unsigned fw_maj, fw_min, fw_mic;
>  
> -	rtnl_lock();
>  	lldev->ethtool_ops->get_drvinfo(lldev, &info);
> -	rtnl_unlock();
>  
>  	next = info.fw_version + 1;
>  	cp = strsep(&next, ".");
> @@ -1192,9 +1190,7 @@ static ssize_t show_fw_ver(struct device *dev, struct device_attribute *attr, ch
>  	struct net_device *lldev = iwch_dev->rdev.t3cdev_p->lldev;
>  
>  	PDBG("%s dev 0x%p\n", __func__, dev);
> -	rtnl_lock();
>  	lldev->ethtool_ops->get_drvinfo(lldev, &info);
> -	rtnl_unlock();
>  	return sprintf(buf, "%s\n", info.fw_version);
>  }
>  
> @@ -1207,9 +1203,7 @@ static ssize_t show_hca(struct device *dev, struct device_attribute *attr,
>  	struct net_device *lldev = iwch_dev->rdev.t3cdev_p->lldev;
>  
>  	PDBG("%s dev 0x%p\n", __func__, dev);
> -	rtnl_lock();
>  	lldev->ethtool_ops->get_drvinfo(lldev, &info);
> -	rtnl_unlock();
>  	return sprintf(buf, "%s\n", info.driver);
>  }
>  
> diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
> index 1ace41a..5e663cc 100644
> --- a/drivers/net/cxgb3/cxgb3_main.c
> +++ b/drivers/net/cxgb3/cxgb3_main.c
> @@ -1307,8 +1307,10 @@ static void get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info)
>  	u32 fw_vers = 0;
>  	u32 tp_vers = 0;
>  
> +	spin_lock(&adapter->stats_lock);
>  	t3_get_fw_version(adapter, &fw_vers);
>  	t3_get_tp_version(adapter, &tp_vers);
> +	spin_unlock(&adapter->stats_lock);
>  
>  	strcpy(info->driver, DRV_NAME);
>  	strcpy(info->version, DRV_VERSION);
>   


From panda at cse.ohio-state.edu  Thu Nov  6 23:02:30 2008
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri, 7 Nov 2008 02:02:30 -0500 (EST)
Subject: [ofa-general] Announcing the release of MVAPICH2 1.2
Message-ID: <Pine.GSO.4.40.0811070159480.14936-100000@xi.cse.ohio-state.edu>

The MVAPICH team is pleased to announce the availability of
MVAPICH2-1.2 with the following NEW features:

- Scalable and robust daemon-less job startup
   - Enhanced and robust mpirun_rsh framework (non-MPD-based) to
     provide scalable job launching on multi-thousand core clusters
   - Available for OpenFabrics (IB and iWARP) and uDAPL interfaces
     (including Solaris)
   - Support for Totalview debugger

- Checkpoint-restart with intra-node shared memory support
   - Allows best performance and scalability with fault-tolerance
     support

- Enhancement to software installation
   - Full autoconf-based configuration
   - Automatically detects system architecture and adapter types
     and optimizes MVAPICH2 for any particular installation
   - An application (mpiname) for querying the MVAPICH2
     library version and configuration information

- Enhanced processor affinity using PLPA for multi-core architectures
   - Allows user-defined flexible processor affinity

- Enhanced scalability for RDMA-based direct one-sided communication
  with less communication resource
   - Available for OpenFabrics (IB and iWARP) interfaces

- Shared memory optimized algorithm for MPI_Bcast operation

- Optimized and tuned MPI_Alltoall

- Based on MPICH2 1.0.7

More details on all features and supported platforms can be obtained
by visiting the following URL:

http://mvapich.cse.ohio-state.edu/overview/mvapich2/features.shtml

MVAPICH2 1.2 is being made available with OFED 1.4. It is also tested
with OFED 1.3. It continues to deliver excellent performance.  Sample
performance numbers include:

  OpenFabrics/Gen2 on EM64T quad-core with PCIe-Gen2 and ConnectX-QDR:
      Two-sided operations:
        - 1.25 microsec one-way latency (4 bytes)
        - 2573 MB/sec unidirectional bandwidth
        - 5037 MB/sec bidirectional bandwidth

      One-sided operations:
        - 2.73 microsec Put latency (4 bytes)
        - 2576 MB/sec unidirectional Put bandwidth
        - 4921 MB/sec bidirectional Put bandwidth

Performance numbers for several other platforms, system configurations
and operations can be viewed by visiting the `Performance' section of the
project's web page.

To download the MVAPICH2 1.2 package and access the anonymous SVN,
please visit the following URL:

http://mvapich.cse.ohio-state.edu/

All feedback, including bug reports, hints for performance tuning,
patches and enhancements, is welcome. Please post it to the
mvapich-discuss mailing list.

Thanks,

The MVAPICH Team


From o.w.saastad at usit.uio.no  Fri Nov  7 00:01:23 2008
From: o.w.saastad at usit.uio.no (Ole Widar Saastad)
Date: Fri, 07 Nov 2008 09:01:23 +0100
Subject: [ofa-general] Problems running many MPI concurrent prosesses
Message-ID: <1226044883.11237.3.camel@pyren.uio.no>

I have experienced problems running many MPI processes concurrently.
Some of the MPI processes run fine (the first started) while the others
hang or have very very slow progress.

I have dual socket quad core SUN 2200 nodes and Mellanox cards.
See below.

I have tried the OFED 1.2.5 stack and the OFED 1.4rc3 stack.


Any suggestions about settings or increases of buffers, tokens, etc. are
welcome.

An example :
Barrier benchmark :
Barrier size  9 iterations 32768 [8 procs - Resolution 0.95us]
9 nodes 12186.93 us

A barrier using 9 nodes should not take 12 milliseconds.
One barrier normally takes 11.20 microseconds using 9 nodes.


Some background information :

Stack: OFED 1.4rc3
Card : InfiniBand: Mellanox Technologies: Unknown device 634a (rev a0)


Best regards,
Ole W. Saastad


-- 
Ole W. Saastad, dr. scient.
Scientific Computing Group, USIT, University of Oslo
http://hpc.uio.no


From vlad at lists.openfabrics.org  Fri Nov  7 03:25:21 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri,  7 Nov 2008 03:25:21 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081107-0200 daily build status
Message-ID: <20081107112521.5E4CBE60DEA@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From pradeeps at linux.vnet.ibm.com  Fri Nov  7 08:47:03 2008
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Fri, 07 Nov 2008 08:47:03 -0800
Subject: [ofa-general] [PATCH] ipoib: null tx/rx_ring skb pointers on free
In-Reply-To: <20081106164005.GS31163@sgi.com>
References: <20081106012307.GP31163@sgi.com>
	<200811061712.50605.jackm@dev.mellanox.co.il>
	<20081106164005.GS31163@sgi.com>
Message-ID: <49147107.2090600@linux.vnet.ibm.com>

akepner at sgi.com wrote:
> On Thu, Nov 06, 2008 at 05:12:50PM +0200, Jack Morgenstein wrote:
>> On Thursday 06 November 2008 03:23, akepner at sgi.com wrote:
>>> I described an IPoIB-related panic we were seeing on large 
>>> clusters. The signature was a backtrace like this:
>>>
>>>         skb_over_panic
>>>         :ib_ipoib:ipoib_ib_handle_rx_wc
>>>         :ib_ipoib:ipoib_poll
>>>         net_rx_action
>>>         .....
>>>
>>> The bug is difficult to reproduce, but we finally got a crashdump, 
>>> and the problem appears to be that stale skb pointers on the tx_ring 
>>> were left pointing to skbs that had been since reused, so that the 
>>> skb's data region was now unexpectedly short, etc. 
>>>
>> How does ipoib_ib_handle_rx_wc() involve the tx_ring? This is 
>> receive processing.
>>
> 
> What I surmise may be happening is something like this:
> 
> - tx skb is freed, but a stale pointer remains on tx_ring
> - the same skb is reallocated, and added to the rx_ring
> - now we get an 'unexpected' tx completion, and use the stale 
>   skb pointer on the tx_ring to again free the skb (this step 
>   seems to invoke a f/w bug)
> - another driver, say an ethernet driver, reallocates the skb, 
>   reducing the extent of the data region (leading to the 
>   skb_over_panic once it's processed by ipoib_ib_handle_rx_wc)
> 
> 
> This bug leaves the tx and rx rings corrupted in many ways, 
> including:
> 
> - different rx_ring members refer to the same skb
> - different skbs on the rx_ring have identical data, head, end, tail ptrs
> - skbs on the rx_ring have sizes inconsistent with what the ipoib 
>   driver allocates (which causes the skb_over_panic, of course)
> - rx skbs have 'dev' pointers to ethernet devices 
> - dma mappings in rx_ring aren't consistent with what's in skb
> - some skbs are simultaneously on the tx and rx rings

If I am not mistaken, we saw a problem with similar characteristics
more than two years ago on IBM platforms: the same issue of the rx_ring
reusing tx_ring skbs and so on, which would show up only under stress.
This was with UD mode (before CM came into the picture), and it turned
out to be a driver issue. Could that be the same here?

Pradeep
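
The "null the ring pointer on free" idea from the patch subject can be
sketched in userspace as follows, assuming a ring of malloc'ed buffers
in place of skbs. The names fake_ring, RING_SIZE and ring_complete()
are hypothetical and do not come from the ipoib driver:

```c
#include <stddef.h>
#include <stdlib.h>

#define RING_SIZE 4	/* hypothetical ring size, a power of two */

struct fake_ring {
	void *buf[RING_SIZE];
};

/*
 * Complete the slot selected by wr_id: free the buffer and clear the
 * slot. Returns 0 on success, or -1 if the slot was already empty,
 * i.e. a duplicate or unexpected completion that must not free
 * anything.
 */
int ring_complete(struct fake_ring *r, unsigned int wr_id)
{
	unsigned int i = wr_id & (RING_SIZE - 1);

	if (r->buf[i] == NULL)
		return -1;
	free(r->buf[i]);
	r->buf[i] = NULL;	/* no stale pointer is left behind */
	return 0;
}
```

With the slot cleared, a second completion for the same entry is
detected and rejected instead of freeing a buffer that may since have
been handed to another owner.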


From fenkes at de.ibm.com  Fri Nov  7 08:42:51 2008
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Fri, 7 Nov 2008 17:42:51 +0100
Subject: [ofa-general] [PATCH] IB/ehca: Fix suppression of port activation
	events
In-Reply-To: <48499C11.7030504@gmail.com>
References: <200806061835.43802.fenkes@de.ibm.com> <48499C11.7030504@gmail.com>
Message-ID: <200811071742.51867.fenkes@de.ibm.com>

A previous fix introduced a regression where port activation events were
dropped unconditionally if port autodetection was not enabled. Fixed.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---

Roland -- this patch is made against your for-linus branch. Please review
and apply if you think it's okay. Hope it's not too late for the next kernel.

Joachim

 drivers/infiniband/hw/ehca/ehca_irq.c |   45 +++++++++++++++++++-------------
 1 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c
index 9e43459..757035e 100644
--- a/drivers/infiniband/hw/ehca/ehca_irq.c
+++ b/drivers/infiniband/hw/ehca/ehca_irq.c
@@ -359,34 +359,43 @@ static void notify_port_conf_change(struct ehca_shca *shca, int port_num)
 	*old_attr = new_attr;
 }
 
+/* replay modify_qp for sqps -- return 0 if all is well, 1 if AQP1 destroyed */
+static int replay_modify_qp(struct ehca_sport *sport)
+{
+	int aqp1_destroyed;
+	unsigned long flags;
+
+	spin_lock_irqsave(&sport->mod_sqp_lock, flags);
+
+	aqp1_destroyed = !sport->ibqp_sqp[IB_QPT_GSI];
+
+	if (sport->ibqp_sqp[IB_QPT_SMI])
+		ehca_recover_sqp(sport->ibqp_sqp[IB_QPT_SMI]);
+	if (!aqp1_destroyed)
+		ehca_recover_sqp(sport->ibqp_sqp[IB_QPT_GSI]);
+
+	spin_unlock_irqrestore(&sport->mod_sqp_lock, flags);
+
+	return aqp1_destroyed;
+}
+
 static void parse_ec(struct ehca_shca *shca, u64 eqe)
 {
 	u8 ec   = EHCA_BMASK_GET(NEQE_EVENT_CODE, eqe);
 	u8 port = EHCA_BMASK_GET(NEQE_PORT_NUMBER, eqe);
 	u8 spec_event;
 	struct ehca_sport *sport = &shca->sport[port - 1];
-	unsigned long flags;
 
 	switch (ec) {
 	case 0x30: /* port availability change */
 		if (EHCA_BMASK_GET(NEQE_PORT_AVAILABILITY, eqe)) {
-			/* only for autodetect mode important */
-			if (ehca_nr_ports >= 0)
-				break;
-
-			int suppress_event;
-			/* replay modify_qp for sqps */
-			spin_lock_irqsave(&sport->mod_sqp_lock, flags);
-			suppress_event = !sport->ibqp_sqp[IB_QPT_GSI];
-			if (sport->ibqp_sqp[IB_QPT_SMI])
-				ehca_recover_sqp(sport->ibqp_sqp[IB_QPT_SMI]);
-			if (!suppress_event)
-				ehca_recover_sqp(sport->ibqp_sqp[IB_QPT_GSI]);
-			spin_unlock_irqrestore(&sport->mod_sqp_lock, flags);
-
-			/* AQP1 was destroyed, ignore this event */
-			if (suppress_event)
-				break;
+			/* only replay modify_qp calls in autodetect mode;
+			 * if AQP1 was destroyed, the port is already down
+			 * again and we can drop the event.
+			 */
+			if (ehca_nr_ports < 0)
+				if (replay_modify_qp(sport))
+					break;
 
 			sport->port_state = IB_PORT_ACTIVE;
 			dispatch_port_event(shca, port, IB_EVENT_PORT_ACTIVE,
-- 
1.5.5
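
The control-flow change in parse_ec() above can be modelled as two pure
functions, assuming (as in the patch) that ehca_nr_ports < 0 means port
autodetection is enabled. The function names below are hypothetical
and the sketch leaves out the actual QP recovery and locking:

```c
#include <stdbool.h>

/*
 * Model of the patched branch: aqp1_destroyed stands for the return
 * value of replay_modify_qp(), which is only consulted in autodetect
 * mode (nr_ports < 0).
 */
bool dispatch_port_active(int nr_ports, bool aqp1_destroyed)
{
	if (nr_ports < 0 && aqp1_destroyed)
		return false;	/* port already down again: drop event */
	return true;		/* otherwise report the port as active */
}

/* Model of the regressed code, which bailed out when nr_ports >= 0. */
bool dispatch_port_active_old(int nr_ports, bool aqp1_destroyed)
{
	if (nr_ports >= 0)
		return false;	/* event dropped unconditionally */
	return !aqp1_destroyed;
}
```

The two models agree in autodetect mode and differ only when
autodetection is off, which is the case the commit message describes
as the regression.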


From vladsk at gmail.com  Fri Nov  7 12:21:13 2008
From: vladsk at gmail.com (Vladimir Sokolovsky)
Date: Fri, 07 Nov 2008 22:21:13 +0200
Subject: Re: [rds-devel] [ofa-general] Re: [PATCH] opensm: fix iser
	service-id used for SL assignment
In-Reply-To: <4913646D.6080308@oracle.com>
References: <Pine.LNX.4.64.0811061456540.3153@zuben.voltaire.com>		<Pine.LNX.4.64.0811061458410.3153@zuben.voltaire.com>		<49133E50.7090508@oracle.com>	<15ddcffd0811061325m2dcad0e9i789fa873d76aaa7f@mail.gmail.com>
	<4913646D.6080308@oracle.com>
Message-ID: <4914A339.2080303@gmail.com>

Andy Grover wrote:
> Hi Vlad, please pull the below trees. They remove the unused RDS TCP
> transport in 1.3 and 1.4. I've verified they do not break the build:
>
> OFED 1.3:
>
> Andy Grover (1):
>       RDS: Remove TCP transport
>
> www.openfabrics.org:/pub/scm/~agrover/ofed_1_3/linux-2.6.git
> code-drop/20081106
>
> OFED 1.4:
>
> Andy Grover (2):
>       RDS: Remove TCP transport
>       RDS: Change listen port back to 18634
>
> www.openfabrics.org:/pub/scm/~agrover/ofed_1_4/linux-2.6.git
> code-drop/20081106
>
> Thanks -- Andy
>   

Done,

Regards,
Vladimir


From arlin.r.davis at intel.com  Fri Nov  7 15:34:04 2008
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Fri, 7 Nov 2008 15:34:04 -0800
Subject: [ofa-general] [ANNOUNCE] compat-dapl-1.2.12 and dapl-2.0.15 Release 
Message-ID: <E3280858FA94444CA49D2BA02341C9830AFBC18C@orsmsx506.amr.corp.intel.com>

 New DAPL releases now available from OFA download page:

 http://www.openfabrics.org/downloads/dapl/

 md5sum: 098c3efdf812f291449de0253c35d2b9 compat-dapl-1.2.12.tar.gz
 md5sum: 8bcf281049f7ff282202639d4bc523f8 dapl-2.0.15.tar.gz

 Summary of changes since last release:

 v1,v2 - allow override of /etc/dat.conf via sysconfdir option
 v1,v2 - fix dapltest transaction test to avoid cleanup before rdma complete
 v1 - add ipath, ehca socket cm provider entries for v1.2, sync with v2.0

 Vlad, please pick up the new packages and install the following for OFED 1.4 rc4:

 compat-dapl-1.2.12-1
 compat-dapl-devel-1.2.12-1
 dapl-2.0.15-1
 dapl-utils-2.0.15-1
 dapl-devel-2.0.15-1
 dapl-debuginfo-2.0.15-1

 Thanks,

 -arlin


From frederic.ciesielski at hp.com  Sat Nov  8 00:13:58 2008
From: frederic.ciesielski at hp.com (Ciesielski, Frederic (EMEA HPC&OSLO CC))
Date: Sat, 8 Nov 2008 08:13:58 +0000
Subject: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?
Message-ID: <7391130E01ED404FBD7A3C86731EEB7D20EC0F8737@GVW1087EXB.americas.hpqcorp.net>

Is there any chance that the new NFS-RDMA features coming with OFED 1.4 will work with standard, current distributions such as RHEL5 and SLES10?
Has anybody tested this, or would anyone claim that it is supposed to work?

I mean without building a 2.6.27 or equivalent kernel on top of it, keeping almost full support from the vendors.

Enhanced kernel modules may not be sufficient to work around the limitations of old kernels...



From vlad at lists.openfabrics.org  Sat Nov  8 03:18:26 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat,  8 Nov 2008 03:18:26 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081108-0200 daily build status
Message-ID: <20081108111826.F236AE60B1D@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From Jeffrey.C.Becker at nasa.gov  Sat Nov  8 13:35:20 2008
From: Jeffrey.C.Becker at nasa.gov (Jeff Becker)
Date: Sat, 08 Nov 2008 13:35:20 -0800
Subject: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions
 ?
In-Reply-To: <7391130E01ED404FBD7A3C86731EEB7D20EC0F8737@GVW1087EXB.americas.hpqcorp.net>
References: <7391130E01ED404FBD7A3C86731EEB7D20EC0F8737@GVW1087EXB.americas.hpqcorp.net>
Message-ID: <49160618.3050409@nasa.gov>

Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
> Is there any chance that the new NFS-RDMA features coming with OFED 
> 1.4 work with standard and current distributions, like RHEL5, SLES10 ?
Not yet, but I'm working on it. I intend for NFSRDMA to work on 2.6.27 
and 2.6.26 for OFED 1.4. The RHEL5 and SLES10 backports will likely be 
done for OFED 1.4.1. Thanks.

-jeff

> Did anybody test this, or would pretend it is supposed to work ?
>  
> I mean without building a 2.6.27 or equivalent kernel on top of it, 
> keeping almost full support from the vendors.
>  
> Enhanced kernel modules may not be sufficient to work around the 
> limitations of old kernels...


From ogerlitz at voltaire.com  Sun Nov  9 01:53:28 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 09 Nov 2008 11:53:28 +0200
Subject: [ofa-general] [PATCH] ipoib: null tx/rx_ring skb pointers on free
In-Reply-To: <49147107.2090600@linux.vnet.ibm.com>
References: <20081106012307.GP31163@sgi.com>	<200811061712.50605.jackm@dev.mellanox.co.il>	<20081106164005.GS31163@sgi.com>
	<49147107.2090600@linux.vnet.ibm.com>
Message-ID: <4916B318.50503@voltaire.com>

Pradeep Satyanarayana wrote:
> If I am not mistaken we saw a problem that showed similar characteristics more than two years ago on IBM platforms. The same issue of rx_ring reusing tx_ring skbs and so on and would show up only under stress. This was with UD mode (before CM came into the picture) and it turned out to be a driver issue. 
Can you send a pointer to the relevant thread / commit that solved this issue?

Or.


From vlad at lists.openfabrics.org  Sun Nov  9 03:23:18 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun,  9 Nov 2008 03:23:18 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081109-0200 daily build status
Message-ID: <20081109112318.9F6EFE60E7A@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From sashak at voltaire.com  Sun Nov  9 05:56:46 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 9 Nov 2008 15:56:46 +0200
Subject: [ofa-general] Re: [PATCH] limit log records number and size
In-Reply-To: <4912FEA3.3090409@Voltaire.COM>
References: <4912FEA3.3090409@Voltaire.COM>
Message-ID: <20081109135646.GE29807@sashak.voltaire.com>

Hi Doron,

On 16:26 Thu 06 Nov     , Doron Shoham wrote:
> limit log records number and size
> 
> Signed-off-by: Doron Shoham <dorons at voltaire.com>
> ---
>  opensm/scripts/opensm.logrotate |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/opensm/scripts/opensm.logrotate b/opensm/scripts/opensm.logrotate
> index e16e227..e0f4125 100644
> --- a/opensm/scripts/opensm.logrotate
> +++ b/opensm/scripts/opensm.logrotate
> @@ -4,4 +4,6 @@
>      copytruncate
>      weekly
>      compress
> +    rotate 10
> +    size 100M

Why should it be limited this way (and not another)? Isn't it better to
follow the default site policy?

Sasha


From sashak at voltaire.com  Sun Nov  9 09:25:18 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 9 Nov 2008 19:25:18 +0200
Subject: [ofa-general] Re: [opensm patch] support dump_conf command in opensm
	console
In-Reply-To: <1225759191.7307.9.camel@cardanus.llnl.gov>
References: <1225759191.7307.9.camel@cardanus.llnl.gov>
Message-ID: <20081109172518.GG30588@sashak.voltaire.com>

Hi Al,

On 16:39 Mon 03 Nov     , Al Chu wrote:
> Hey Sasha,
> 
> When config files are rescanned and loaded, there's no way to know if
> the right configuration was actually reloaded or not.  A console command
> to dump the current config is a useful way to verify the loading of new
> configs or not.
> 
> This patch assumes the fixes from my "fix qos config parsing bugs" is
> accepted.

I haven't gone over it yet, sorry about the delay.

> 
> Al
> 
> -- 
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory

> From 249607e47ec7ef1b92f9578cece90460418d12b8 Mon Sep 17 00:00:00 2001
> From: Albert Chu <chu11 at llnl.gov>
> Date: Mon, 3 Nov 2008 16:22:29 -0800
> Subject: [PATCH] support dump_conf console command
> 
> 
> Signed-off-by: Albert Chu <chu11 at llnl.gov>
> ---
>  opensm/opensm/osm_console.c |  158 +++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 158 insertions(+), 0 deletions(-)
> 
> diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
> index d9bbbc2..8422655 100644
> --- a/opensm/opensm/osm_console.c
> +++ b/opensm/opensm/osm_console.c
> @@ -53,6 +53,10 @@
>  #include <complib/cl_passivelock.h>
>  #include <opensm/osm_perfmgr.h>
>  
> +#define NULL_STR "(null)"
> +
> +#define BOOLEAN_STR(__b) ((__b) ? "TRUE" : "FALSE")
> +
>  struct command {
>  	char *name;
>  	void (*help_function) (FILE * out, int detail);
> @@ -189,6 +193,14 @@ static void help_lidbalance(FILE * out, int detail)
>  	}
>  }
>  
> +static void help_dump_conf(FILE *out, int detail)
> +{
> +	fprintf(out, "dump_conf\n");
> +	if (detail) {
> +		fprintf(out, "dump current opensm configuration\n");
> +	}
> +}
> +
>  #ifdef ENABLE_OSM_PERF_MGR
>  static void help_perfmgr(FILE * out, int detail)
>  {
> @@ -1136,6 +1148,151 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
>  }
>  #endif				/* ENABLE_OSM_PERF_MGR */
>  
> +static void dump_qos_options(osm_qos_options_t * opt,
> +			     osm_qos_options_t * dflt, 
> +			     char *prefix,
> +			     FILE * out)
> +{
> +	fprintf(out, "%s_max_vls : %u\n",
> +		prefix, opt->max_vls ? opt->max_vls : dflt->max_vls);
> +	fprintf(out, "%s_high_limit : %u\n",
> +		prefix, opt->high_limit >= 0 ? (unsigned)opt->high_limit : (unsigned)dflt->high_limit);
> +	fprintf(out, "%s_vlarb_high : %s\n",
> +		prefix, opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high);
> +	fprintf(out, "%s_vlarb_low : %s\n",
> +		prefix, opt->vlarb_low ? opt->vlarb_low : dflt->vlarb_low);
> +	fprintf(out, "%s_sl2vl : %s\n",
> +		prefix, opt->sl2vl ? opt->sl2vl : dflt->sl2vl);
> +}
> +
> +static void dump_conf_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> +{

Why not use the osm_subn_write_conf_file() function (wrapped by
dump_conf_parse())? I think we need to keep the config dumping code
consolidated.

Sasha

> +	osm_subn_opt_t * opt = &p_osm->subn.opt;
> +
> +	fprintf(out, "config_file : %s\n", 
> +		opt->config_file ? opt->config_file : NULL_STR);
> +	fprintf(out, "guid : 0x%016" PRIx64 "\n", opt->guid);
> +	fprintf(out, "m_key : 0x%016" PRIx64 "\n", opt->m_key);
> +	fprintf(out, "sm_key : 0x%016" PRIx64 "\n", opt->sm_key);
> +	fprintf(out, "sa_key : 0x%016" PRIx64 "\n", opt->sa_key);
> +	fprintf(out, "subnet_prefix : 0x%016" PRIx64 "\n", opt->subnet_prefix);
> +	fprintf(out, "m_key_lease_period : %u\n", opt->m_key_lease_period);
> +	fprintf(out, "sweep_interval : %u\n", opt->sweep_interval);
> +	fprintf(out, "max_wire_smps : %u\n", opt->max_wire_smps);
> +	fprintf(out, "transaction_timeout : %u\n", opt->transaction_timeout);
> +	fprintf(out, "sm_priority : %u\n", opt->sm_priority);
> +	fprintf(out, "lmc : %u\n", opt->lmc);
> +	fprintf(out, "lmc_esp0 : %s\n", 
> +		BOOLEAN_STR(opt->lmc_esp0));
> +	fprintf(out, "max_op_vls : %u\n", opt->max_op_vls);
> +	fprintf(out, "force_link_speed : %u\n", opt->force_link_speed);
> +	fprintf(out, "reassign_lids : %s\n", 
> +		BOOLEAN_STR(opt->reassign_lids));
> +	fprintf(out, "ignore_other_sm : %s\n", 
> +		BOOLEAN_STR(opt->ignore_other_sm));
> +	fprintf(out, "single_thread : %s\n", 
> +		BOOLEAN_STR(opt->single_thread));
> +	fprintf(out, "disable_multicast : %s\n", 
> +		BOOLEAN_STR(opt->disable_multicast));
> +	fprintf(out, "force_log_flush : %s\n", 
> +		BOOLEAN_STR(opt->force_log_flush));
> +	fprintf(out, "subnet_timeout : %u\n", opt->subnet_timeout);
> +	fprintf(out, "packet_life_time : %u\n", opt->packet_life_time);
> +	fprintf(out, "vl_stall_count : %u\n", opt->vl_stall_count);
> +	fprintf(out, "leaf_vl_stall_count : %u\n", opt->leaf_vl_stall_count);
> +	fprintf(out, "head_of_queue_lifetime : %u\n", opt->head_of_queue_lifetime);
> +	fprintf(out, "leaf_head_of_queue_lifetime : %u\n", opt->leaf_head_of_queue_lifetime);
> +	fprintf(out, "local_phy_errors_threshold : %u\n", opt->local_phy_errors_threshold);
> +	fprintf(out, "overrun_errors_threshold : %u\n", opt->overrun_errors_threshold);
> +	fprintf(out, "sminfo_polling_timeout : %u\n", opt->sminfo_polling_timeout);
> +	fprintf(out, "polling_retry_number : %u\n", opt->polling_retry_number);
> +	fprintf(out, "max_msg_fifo_timeout : %u\n", opt->max_msg_fifo_timeout);
> +	fprintf(out, "force_heavy_sweep : %s\n", 
> +		BOOLEAN_STR(opt->force_heavy_sweep));
> +	fprintf(out, "log_flags : 0x%02x\n", opt->log_flags);
> +	fprintf(out, "dump_files_dir : %s\n", 
> +		opt->dump_files_dir ? opt->dump_files_dir : NULL_STR);
> +	fprintf(out, "log_file : %s\n", 
> +		opt->log_file ? opt->log_file : NULL_STR);
> +	fprintf(out, "log_max_size : %lu\n", opt->log_max_size);
> +	fprintf(out, "partition_config_file : %s\n", 
> +		opt->partition_config_file ? opt->partition_config_file : NULL_STR);
> +	fprintf(out, "no_partition_enforcement : %s\n", 
> +		BOOLEAN_STR(opt->no_partition_enforcement));
> +	fprintf(out, "qos : %s\n", 
> +		BOOLEAN_STR(opt->qos));
> +	fprintf(out, "qos_policy_file : %s\n", 
> +		opt->qos_policy_file ? opt->qos_policy_file : NULL_STR);
> +	fprintf(out, "accum_log_file: %s\n", 
> +		BOOLEAN_STR(opt->accum_log_file));
> +	fprintf(out, "console : %s\n", 
> +		opt->console ? opt->console : NULL_STR);
> +	fprintf(out, "console_port : %u\n", opt->console_port);
> +	fprintf(out, "port_prof_ignore_file : %s\n", 
> +		opt->port_prof_ignore_file ? opt->port_prof_ignore_file : NULL_STR);
> +	fprintf(out, "port_profile_switch_nodes : %s\n", 
> +		BOOLEAN_STR(opt->port_profile_switch_nodes));
> +	fprintf(out, "sweep_on_trap : %s\n", 
> +		BOOLEAN_STR(opt->sweep_on_trap));
> +	fprintf(out, "routing_engine_names : %s\n", 
> +		opt->routing_engine_names ? opt->routing_engine_names : NULL_STR);
> +	fprintf(out, "use_ucast_cache : %s\n", 
> +		BOOLEAN_STR(opt->use_ucast_cache));
> +	fprintf(out, "connect_roots : %s\n", 
> +		BOOLEAN_STR(opt->connect_roots));
> +	fprintf(out, "lid_matrix_dump_file : %s\n", 
> +		opt->lid_matrix_dump_file ? opt->lid_matrix_dump_file : NULL_STR);
> +	fprintf(out, "lfts_file : %s\n", 
> +		opt->lfts_file ? opt->lfts_file : NULL_STR);
> +	fprintf(out, "root_guid_file : %s\n", 
> +		opt->root_guid_file ? opt->root_guid_file : NULL_STR);
> +	fprintf(out, "cn_guid_file : %s\n", 
> +		opt->cn_guid_file ? opt->cn_guid_file : NULL_STR);
> +	fprintf(out, "ids_guid_file : %s\n", 
> +		opt->ids_guid_file ? opt->ids_guid_file : NULL_STR);
> +	fprintf(out, "guid_routing_order_file : %s\n", 
> +		opt->guid_routing_order_file ? opt->guid_routing_order_file : NULL_STR);
> +	fprintf(out, "sa_db_file : %s\n", 
> +		opt->sa_db_file ? opt->sa_db_file : NULL_STR);
> +	fprintf(out, "exit_on_fatal : %s\n", 
> +		BOOLEAN_STR(opt->exit_on_fatal));
> +	fprintf(out, "honor_guid2lid_file : %s\n", 
> +		BOOLEAN_STR(opt->honor_guid2lid_file));
> +	fprintf(out, "daemon : %s\n", 
> +		BOOLEAN_STR(opt->daemon));
> +	fprintf(out, "sm_inactive : %s\n", 
> +		BOOLEAN_STR(opt->sm_inactive));
> +	fprintf(out, "babbling_port_policy : %s\n", 
> +		BOOLEAN_STR(opt->babbling_port_policy));
> +	dump_qos_options(&opt->qos_options, &opt->qos_options, "qos", out);
> +	dump_qos_options(&opt->qos_ca_options, &opt->qos_options, "qos_ca", out);
> +	dump_qos_options(&opt->qos_sw0_options, &opt->qos_options, "qos_sw0", out);
> +	dump_qos_options(&opt->qos_swe_options, &opt->qos_options, "qos_swe", out);
> +	dump_qos_options(&opt->qos_rtr_options, &opt->qos_options, "qos_rtr", out);
> +	fprintf(out, "enable_quirks : %s\n", 
> +		BOOLEAN_STR(opt->enable_quirks));
> +	fprintf(out, "no_clients_rereg : %s\n", 
> +		BOOLEAN_STR(opt->no_clients_rereg));
> +#ifdef ENABLE_OSM_PERF_MGR
> +	fprintf(out, "perfmgr : %s\n", 
> +		BOOLEAN_STR(opt->perfmgr));
> +	fprintf(out, "perfmgr_redir : %s\n", 
> +		BOOLEAN_STR(opt->perfmgr_redir));
> +	fprintf(out, "perfmgr_sweep_time_s : %u\n", opt->perfmgr_sweep_time_s);
> +	fprintf(out, "perfmgr_max_outstanding_queries : %u\n", opt->perfmgr_max_outstanding_queries);
> +	fprintf(out, "event_db_dump_file : %s\n", 
> +		opt->event_db_dump_file ? opt->event_db_dump_file : NULL_STR);
> +#endif
> +	fprintf(out, "event_plugin_name : %s\n", 
> +		opt->event_plugin_name ? opt->event_plugin_name : NULL_STR);
> +	fprintf(out, "node_name_map_name : %s\n", 
> +		opt->node_name_map_name ? opt->node_name_map_name : NULL_STR);
> +	fprintf(out, "prefix_routes_file : %s\n", 
> +		opt->prefix_routes_file ? opt->prefix_routes_file : NULL_STR);
> +	fprintf(out, "consolidate_ipv6_snm_req : %s\n", 
> +		BOOLEAN_STR(opt->consolidate_ipv6_snm_req));
> +}
> +
>  static void quit_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
>  {
>  	osm_console_exit(&p_osm->console, &p_osm->log);
> @@ -1166,6 +1323,7 @@ static const struct command console_cmds[] = {
>  	{"portstatus", &help_portstatus, &portstatus_parse},
>  	{"switchbalance", &help_switchbalance, &switchbalance_parse},
>  	{"lidbalance", &help_lidbalance, &lidbalance_parse},
> +	{"dump_conf", &help_dump_conf, &dump_conf_parse},
>  	{"version", &help_version, &version_parse},
>  #ifdef ENABLE_OSM_PERF_MGR
>  	{"perfmgr", &help_perfmgr, &perfmgr_parse},
> -- 
> 1.5.4.5
> 
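
The consolidation suggested in the reply above (one routine that prints
the options, shared by the config file writer and the console command)
could look roughly like the sketch below. Everything in it is
hypothetical: struct demo_opts, dump_opts(), write_conf_file() and
dump_conf_cmd() are stand-ins for illustration, not the real
osm_subn_opt_t or osm_subn_write_conf_file().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, cut-down stand-in for osm_subn_opt_t. */
struct demo_opts {
	const char *log_file;		/* may be NULL */
	unsigned sweep_interval;	/* seconds */
	int qos;			/* boolean */
};

/*
 * The one routine that knows how to print every option, in the
 * "name value" form used by opensm.conf. Returns the number of
 * characters written, or -1 on an output error.
 */
static int dump_opts(FILE *out, const struct demo_opts *opt)
{
	int n, total = 0;

	assert(out != NULL && opt != NULL);

	n = fprintf(out, "log_file %s\n",
		    opt->log_file ? opt->log_file : "(null)");
	if (n < 0)
		return -1;
	total += n;

	n = fprintf(out, "sweep_interval %u\n", opt->sweep_interval);
	if (n < 0)
		return -1;
	total += n;

	n = fprintf(out, "qos %s\n", opt->qos ? "TRUE" : "FALSE");
	if (n < 0)
		return -1;
	total += n;

	return total;
}

/* Config file writer: opens the file and delegates to dump_opts(). */
int write_conf_file(const char *path, const struct demo_opts *opt)
{
	FILE *f = fopen(path, "w");
	int rc;

	if (!f)
		return -1;
	rc = dump_opts(f, opt);
	if (fclose(f) != 0 || rc < 0)
		return -1;
	return 0;
}

/* Console command: same output, written to the console stream. */
void dump_conf_cmd(FILE *console, const struct demo_opts *opt)
{
	(void)dump_opts(console, opt);
}
```

With this split, a newly added option only has to be taught to
dump_opts() once, and the console output cannot drift away from what
ends up in the config file.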


From sashak at voltaire.com  Sun Nov  9 09:47:33 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 9 Nov 2008 19:47:33 +0200
Subject: [ofa-general] Re: [PATCH] Add check for previous versions of
	plugins.
In-Reply-To: <20081104095812.2ff5920c.weiny2@llnl.gov>
References: <20081104095812.2ff5920c.weiny2@llnl.gov>
Message-ID: <20081109174733.GA30265@sashak.voltaire.com>

Hi Ira,

On 09:58 Tue 04 Nov     , Ira Weiny wrote:
> From 0db0d6667ed8baede1093a95127e2ce9c81959bd Mon Sep 17 00:00:00 2001
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Mon, 3 Nov 2008 15:50:15 -0800
> Subject: [PATCH] Add check for previous versions of plugins.
> 
>    If old interface plugins are available to OpenSM they will cause a crash.
>    Check for this old version and error out gracefully.
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> ---
>  opensm/include/opensm/osm_event_plugin.h |    1 +
>  opensm/opensm/osm_event_plugin.c         |   10 ++++++++++
>  2 files changed, 11 insertions(+), 0 deletions(-)
> 
> diff --git a/opensm/include/opensm/osm_event_plugin.h b/opensm/include/opensm/osm_event_plugin.h
> index b2deeba..0b80b63 100644
> --- a/opensm/include/opensm/osm_event_plugin.h
> +++ b/opensm/include/opensm/osm_event_plugin.h
> @@ -150,6 +150,7 @@ typedef struct osm_epi_trap_event {
>  #define OSM_EVENT_PLUGIN_IMPL_NAME "osm_event_plugin"
>  #define OSM_EVENT_PLUGIN_INTERFACE_VER 2
>  typedef struct osm_event_plugin {
> +	int interface_version;
>  	const char *osm_version;
>  	void *(*create) (struct osm_opensm *osm);
>  	void (*delete) (void *plugin_data);

The problem, IMHO, is that this changes the current interface and will
require changing all plugins (not just rebuilding them; in fact a rebuild
will hide any interface change issues and will not fail).

What about a check like this:


diff --git a/opensm/opensm/osm_event_plugin.c b/opensm/opensm/osm_event_plugin.c
index c6999f5..f332a24 100644
--- a/opensm/opensm/osm_event_plugin.c
+++ b/opensm/opensm/osm_event_plugin.c
@@ -66,6 +66,7 @@
 osm_epi_plugin_t *osm_epi_construct(osm_opensm_t *osm, char *plugin_name)
 {
 	char lib_name[OSM_PATH_MAX];
+	struct old_if { unsigned ver; } *old_impl;
 	osm_epi_plugin_t *rc = NULL;
 
 	if (!plugin_name || !*plugin_name)
@@ -96,6 +97,17 @@ osm_epi_plugin_t *osm_epi_construct(osm_opensm_t *osm, char *plugin_name)
 		goto Exit;
 	}
 
+	/* make sure that an old interface plugin is not used */
+	old_impl = (struct old_if *) rc->impl;
+	if (old_impl->ver < OSM_EVENT_PLUGIN_INTERFACE_VER) {
+		OSM_LOG(&osm->log, OSM_LOG_ERROR, "Error loading plugin "
+			"\'%s\': it has the wrong interface version (%u); "
+			"OpenSM expected %u. Please rebuild.\n",
+			plugin_name, old_impl->ver,
+			OSM_EVENT_PLUGIN_INTERFACE_VER);
+		goto Exit;
+	}
+
 	/* Check the version to make sure this module will work with us */
 	if (strcmp(rc->impl->osm_version, osm->osm_version)) {
 		OSM_LOG(&osm->log, OSM_LOG_ERROR, "Error loading plugin"

Sasha
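
The idea common to both patches (refuse a plugin whose interface version
does not match before calling into it) can be shown in isolation. The
sketch below is an illustration under stated assumptions, not the OpenSM
loader: struct demo_plugin, DEMO_INTERFACE_VER and check_plugin() are
hypothetical names, and the plugin descriptor is passed in directly
instead of being looked up in a loaded shared object.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_INTERFACE_VER 2	/* hypothetical host interface version */

/* Hypothetical plugin descriptor; the version is the first member. */
struct demo_plugin {
	int interface_version;
	const char *host_version;	/* host release it was built for */
	void *(*create)(void *host);
};

enum demo_check {
	DEMO_OK = 0,
	DEMO_BAD_INTERFACE,
	DEMO_BAD_HOST_VERSION
};

/* Validate the descriptor before any of its callbacks is used. */
static enum demo_check check_plugin(const struct demo_plugin *impl,
				    const char *host_version)
{
	assert(impl != NULL && host_version != NULL);

	if (impl->interface_version != DEMO_INTERFACE_VER) {
		fprintf(stderr, "plugin has interface version %d; "
			"expected %d. Please rebuild.\n",
			impl->interface_version, DEMO_INTERFACE_VER);
		return DEMO_BAD_INTERFACE;
	}

	if (!impl->host_version ||
	    strcmp(impl->host_version, host_version) != 0) {
		fprintf(stderr, "plugin was built for a different host "
			"version\n");
		return DEMO_BAD_HOST_VERSION;
	}

	return DEMO_OK;
}
```

A check of this form only helps if every plugin, old and new, keeps the
version field at the same offset, which is exactly the compatibility
concern raised above.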


From sashak at voltaire.com  Sun Nov  9 10:13:16 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 9 Nov 2008 20:13:16 +0200
Subject: [ofa-general] Re: [PATCH 1/2] add default configuration files
In-Reply-To: <4912B7CA.9080508@Voltaire.COM>
References: <4912B719.3040907@Voltaire.COM> <4912B7CA.9080508@Voltaire.COM>
Message-ID: <20081109181316.GA30682@sashak.voltaire.com>

Hi Doron,

On 11:24 Thu 06 Nov     , Doron Shoham wrote:
> add default configuration files:
> opensm.conf
> partitions.conf
> qos-policy.conf
> root-nodes.conf
> 
> Signed-off-by: Doron Shoham <dorons at voltaire.com>
> ---
>  opensm/scripts/opensm.conf     |  331 ++++++++++++++++++++++++++++++++++++++++

Normally this file is autogenerated. And I don't see any good reason to
put generated files under source control.

>  opensm/scripts/partitions.conf |  100 ++++++++++++

The existence of a partition file changes the default behavior of PM in
OpenSM, so you will need to put some reasonable configuration there. OTOH
you already have it in OpenSM (when running without a file), so why bother?

>  opensm/scripts/qos-policy.conf |    2 +
>  opensm/scripts/root-nodes.conf |    3 +

Those are empty.

>  4 files changed, 436 insertions(+), 0 deletions(-)
>  create mode 100644 opensm/scripts/opensm.conf
>  create mode 100644 opensm/scripts/partitions.conf
>  create mode 100644 opensm/scripts/qos-policy.conf
>  create mode 100644 opensm/scripts/root-nodes.conf
> 
> diff --git a/opensm/scripts/opensm.conf b/opensm/scripts/opensm.conf
> new file mode 100644
> index 0000000..89e4145
> --- /dev/null
> +++ b/opensm/scripts/opensm.conf
> @@ -0,0 +1,331 @@
> +#
> +# DEVICE ATTRIBUTES OPTIONS
> +#
> +# The port GUID on which the OpenSM is running
> +guid 0x0000000000000000
> +
> +# M_Key value sent to all ports qualifying all Set(PortInfo)
> +m_key 0x0000000000000000
> +
> +# The lease period used for the M_Key on this subnet in [sec]
> +m_key_lease_period 0
> +
> +# SM_Key value of the SM used for SM authentication
> +sm_key 0x0000000000000001
> +
> +# SM_Key value to qualify rcv SA queries as 'trusted'
> +sa_key 0x0000000000000001
> +
> +# Note that for both values above (sm_key and sa_key)
> +# OpenSM version 3.2.1 and below used the default value '1'
> +# in a host byte order, it is fixed now but you may need to
> +# change the values to interoperate with old OpenSM running
> +# on a little endian machine.
> +
> +# Subnet prefix used on this subnet
> +subnet_prefix 0xfe80000000000000
> +
> +# The LMC value used on this subnet
> +lmc 0
> +
> +# lmc_esp0 determines whether LMC value used on subnet is used for
> +# enhanced switch port 0. If TRUE, LMC value for subnet is used for
> +# ESP0. Otherwise, LMC value for ESP0s is 0.
> +lmc_esp0 FALSE
> +
> +# The code of maximal time a packet can live in a switch
> +# The actual time is 4.096usec * 2^<packet_life_time>
> +# The value 0x14 disables this mechanism
> +packet_life_time 0x12
> +
> +# The number of sequential packets dropped that cause the port
> +# to enter the VLStalled state. The result of setting this value to
> +# zero is undefined.
> +vl_stall_count 0x07
> +
> +# The number of sequential packets dropped that cause the port
> +# to enter the VLStalled state. This value is for switch ports
> +# driving a CA or router port. The result of setting this value
> +# to zero is undefined.
> +leaf_vl_stall_count 0x07
> +
> +# The code of maximal time a packet can wait at the head of
> +# transmission queue.
> +# The actual time is 4.096usec * 2^<head_of_queue_lifetime>
> +# The value 0x14 disables this mechanism
> +head_of_queue_lifetime 0x12
> +
> +# The maximal time a packet can wait at the head of queue on
> +# switch port connected to a CA or router port
> +leaf_head_of_queue_lifetime 0x10
> +
> +# Limit the maximal operational VLs
> +max_op_vls 5
> +
> +# Force PortInfo:LinkSpeedEnabled on switch ports
> +# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port
> +# Otherwise, use value for PortInfo:LinkSpeedEnabled on switch port
> +# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 "PortInfo")
> +#    1: 2.5 Gbps
> +#    3: 2.5 or 5.0 Gbps
> +#    5: 2.5 or 10.0 Gbps
> +#    7: 2.5 or 5.0 or 10.0 Gbps
> +#    2,4,6,8-14 Reserved
> +#    Default 15: set to PortInfo:LinkSpeedSupported
> +force_link_speed 15
> +
> +# The subnet_timeout code that will be set for all the ports
> +# The actual timeout is 4.096usec * 2^<subnet_timeout>
> +subnet_timeout 18
> +
> +# Threshold of local phy errors for sending Trap 129
> +local_phy_errors_threshold 0x08
> +
> +# Threshold of credit overrun errors for sending Trap 130
> +overrun_errors_threshold 0x08
> +
> +#
> +# PARTITIONING OPTIONS
> +#
> +# Partition configuration file to be used
> +partition_config_file /etc/opensm/partitions.conf
> +
> +# Disable partition enforcement by switches
> +no_partition_enforcement FALSE
> +
> +#
> +# SWEEP OPTIONS
> +#
> +# The number of seconds between subnet sweeps (0 disables it)
> +sweep_interval 10
> +
> +# If TRUE cause all lids to be reassigned
> +reassign_lids FALSE
> +
> +# If TRUE forces every sweep to be a heavy sweep
> +force_heavy_sweep FALSE
> +
> +# If TRUE every trap will cause a heavy sweep.
> +# NOTE: successive identical traps (>10) are suppressed
> +sweep_on_trap TRUE
> +
> +#
> +# ROUTING OPTIONS
> +#
> +# If TRUE count switches as link subscriptions
> +port_profile_switch_nodes FALSE
> +
> +# Name of file with port guids to be ignored by port profiling
> +port_prof_ignore_file (null)
> +
> +# Routing engine
> +# Multiple routing engines can be specified separated by
> +# commas so that specific ordering of routing algorithms will
> +# be tried if earlier routing engines fail.
> +# Supported engines: minhop, updn, file, ftree, lash, dor
> +routing_engine minhop
> +
> +# Connect roots (use FALSE if unsure)
> +connect_roots FALSE
> +
> +# Use unicast routing cache (use FALSE if unsure)
> +use_ucast_cache FALSE
> +
> +# Lid matrix dump file name
> +lid_matrix_dump_file (null)
> +
> +# LFTs file name
> +lfts_file (null)
> +
> +# The file holding the root node guids (for fat-tree or Up/Down)
> +# One guid in each line
> +root_guid_file (null)
> +
> +# The file holding the fat-tree compute node guids
> +# One guid in each line
> +cn_guid_file (null)
> +
> +# The file holding the node ids which will be used by Up/Down algorithm instead
> +# of GUIDs (one guid and id in each line)
> +ids_guid_file (null)
> +
> +# The file holding guid routing order guids (for MinHop and Up/Down)
> +guid_routing_order_file (null)
> +
> +# SA database file name
> +sa_db_file (null)
> +
> +#
> +# HANDOVER - MULTIPLE SMs OPTIONS
> +#
> +# SM priority used for deciding who is the master
> +# Range goes from 0 (lowest priority) to 15 (highest).
> +sm_priority 14

An SM priority of 14 doesn't look like a good idea for a default value (we
are not starting "priority wars" with other SMs :)).

Sasha

> +
> +# If TRUE other SMs on the subnet should be ignored
> +ignore_other_sm FALSE
> +
> +# Timeout in [msec] between two polls of active master SM
> +sminfo_polling_timeout 10000
> +
> +# Number of failing polls of remote SM that declares it dead
> +polling_retry_number 4
> +
> +# If TRUE honor the guid2lid file when coming out of standby
> +# state, if such file exists and is valid
> +honor_guid2lid_file FALSE
> +
> +#
> +# TIMING AND THREADING OPTIONS
> +#
> +# Maximum number of SMPs sent in parallel
> +max_wire_smps 4
> +
> +# The maximum time in [msec] allowed for a transaction to complete
> +transaction_timeout 200
> +
> +# Maximal time in [msec] a message can stay in the incoming message queue.
> +# If there is more than one message in the queue and the last message
> +# stayed in the queue more than this value, any SA request will be
> +# immediately returned with a BUSY status.
> +max_msg_fifo_timeout 10000
> +
> +# Use a single thread for handling SA queries
> +single_thread FALSE
> +
> +#
> +# MISC OPTIONS
> +#
> +# Daemon mode
> +daemon FALSE
> +
> +# SM Inactive
> +sm_inactive FALSE
> +
> +# Babbling Port Policy
> +babbling_port_policy FALSE
> +
> +#
> +# Performance Manager Options
> +#
> +# perfmgr enable
> +perfmgr FALSE
> +
> +# perfmgr redirection enable
> +perfmgr_redir TRUE
> +
> +# sweep time in seconds
> +perfmgr_sweep_time_s 180
> +
> +# Max outstanding queries
> +perfmgr_max_outstanding_queries 500
> +
> +#
> +# Event DB Options
> +#
> +# Dump file to dump the events to
> +event_db_dump_file (null)
> +
> +#
> +# Event Plugin Options
> +#
> +event_plugin_name (null)
> +
> +#
> +# Node name map for mapping node's to more descriptive node descriptions
> +# (man ibnetdiscover for more information)
> +#
> +node_name_map_name (null)
> +
> +#
> +# DEBUG FEATURES
> +#
> +# The log flags used
> +log_flags 0x03
> +
> +# Force flush of the log file after each log message
> +force_log_flush FALSE
> +
> +# Log file to be used
> +log_file /var/log/opensm.log
> +
> +# Limit the size(MB) of the log file. If overrun, log is restarted
> +log_max_size 4096
> +
> +# If TRUE will accumulate the log over multiple OpenSM sessions
> +accum_log_file TRUE
> +
> +# The directory to hold the file OpenSM dumps
> +dump_files_dir /var/log/
> +
> +# If TRUE enables new high risk options and hardware specific quirks
> +enable_quirks FALSE
> +
> +# If TRUE disables client reregistration
> +no_clients_rereg FALSE
> +
> +# If TRUE OpenSM should disable multicast support and
> +# no multicast routing is performed if TRUE
> +disable_multicast FALSE
> +
> +# If TRUE opensm will exit on fatal initialization issues
> +exit_on_fatal TRUE
> +
> +# console [off|local]
> +console off
> +
> +# Telnet port for console (default 10000)
> +console_port 10000
> +
> +#
> +# QoS OPTIONS
> +#
> +# Enable QoS setup
> +qos FALSE
> +
> +# QoS policy file to be used
> +qos_policy_file /etc/opensm/qos-policy.conf
> +
> +# QoS default options
> +qos_max_vls 15
> +qos_high_limit 0
> +qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> +qos_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
> +qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> +
> +# QoS CA options
> +qos_ca_max_vls 15
> +qos_ca_high_limit 0
> +qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> +qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
> +qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> +
> +# QoS Switch Port 0 options
> +qos_sw0_max_vls 15
> +qos_sw0_high_limit 0
> +qos_sw0_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> +qos_sw0_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
> +qos_sw0_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> +
> +# QoS Switch external ports options
> +qos_swe_max_vls 15
> +qos_swe_high_limit 0
> +qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> +qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
> +qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> +
> +# QoS Router ports options
> +qos_rtr_max_vls 15
> +qos_rtr_high_limit 0
> +qos_rtr_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> +qos_rtr_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
> +qos_rtr_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> +
> +# Prefix routes file name
> +prefix_routes_file /etc/opensm/prefix-routes.conf
> +
> +#
> +# IPv6 Solicited Node Multicast (SNM) Options
> +#
> +consolidate_ipv6_snm_req FALSE
> +
> diff --git a/opensm/scripts/partitions.conf b/opensm/scripts/partitions.conf
> new file mode 100644
> index 0000000..868a26a
> --- /dev/null
> +++ b/opensm/scripts/partitions.conf
> @@ -0,0 +1,100 @@
> +# Default partition configuration file for OpenSM
> +# 
> +# The  default  name  of  OpenSM  partitions configuration file is /etc/opensm/partitions.conf. The default may be changed by using --Pconfig (-P)
> +# option with OpenSM.
> +# 
> +# The default partition will be created by OpenSM unconditionally even when partition configuration file does not exist or cannot be accessed.
> +# 
> +# The default partition has P_Key value 0x7fff. OpenSM's port will have full membership in default partition. All other end ports will have
> +# partial membership.
> +# 
> +# File Format
> +# 
> +# Comments:
> +# 
> +# Line content followed after '#' character is comment and ignored by parser.
> +# 
> +# General file format:
> +# 
> +# <Partition Definition>:<PortGUIDs list> ;
> +# 
> +# Partition Definition:
> +# 
> +# [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
> +# 
> +# PartitionName - string, will be used with logging. When omitted
> +# 		empty string will be used.
> +# PKey          - P_Key value for this partition. Only low 15 bits will
> +# 		be used. When omitted will be autogenerated.
> +# flag          - used to indicate IPoIB capability of this partition.
> +# defmember=full|limited - specifies default membership for port guid
> +# 		list. Default is limited.
> +# 
> +# Currently recognized flags are:
> +# 
> +# ipoib       - indicates that this partition may be used for IPoIB, as
> +# 	      result IPoIB capable MC group will be created.
> +# rate=<val>  - specifies rate for this IPoIB MC group
> +# 	      (default is 3 (10GBps))
> +# mtu=<val>   - specifies MTU for this IPoIB MC group
> +# 	      (default is 4 (2048))
> +# sl=<val>    - specifies SL for this IPoIB MC group
> +# 	      (default is 0)
> +# scope=<val> - specifies scope for this IPoIB MC group
> +# 	      (default is 2 (link local)).  Multiple scope settings
> +# 	      are permitted for a partition.
> +# 
> +# Note that values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048).
> +# 
> +# PortGUIDs list:
> +# 
> +# PortGUID         - GUID of partition member EndPort. Hexadecimal
> +# 		   numbers should start from 0x, decimal numbers
> +# 		   are accepted too.
> +# full or limited  - indicates full or limited membership for this
> +# 		   port.  When omitted (or unrecognized) limited
> +# 		   membership is assumed.
> +# 
> +# There are two useful keywords for PortGUID definition:
> +# 
> +# - 'ALL' means all end ports in this subnet.
> +# - 'SELF' means subnet manager's port.
> +# 
> +# Empty list means no ports in this partition.
> +# 
> +# Notes:
> +# 
> +# White space is permitted between delimiters ('=', ',',':',';').
> +# 
> +# The line can be wrapped after ':' followed after Partition Definition and between.
> +# 
> +# PartitionName  does  not need to be unique, PKey does need to be unique.  If PKey is repeated then those partition configurations will be merged
> +# and first PartitionName will be used (see also next note).
> +# 
> +# It is possible to split partition configuration in more than one definition, but then PKey should be explicitly specified  (otherwise  different
> +# PKey values will be generated for those definitions).
> +# 
> +# Examples:
> +# 
> +# Default=0x7fff : ALL, SELF=full ;
> +# 
> +# NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
> +# 
> +# YetAnotherOne = 0x300 : SELF=full ;
> +# YetAnotherOne = 0x300 : ALL=limited ;
> +# 
> +# ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
> +# # 0x123453, 0x123454 will be limited
> +# ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
> +# # 0x123456, 0x123457 will be limited
> +# ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full;
> +# ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
> +# ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
> +# 
> +# 
> +# Note:
> +# 
> +# The following rule is equivalent to how OpenSM used to run prior to the partition manager:
> +# 
> + Default=0x7fff,ipoib:ALL=full;
> +# 
> diff --git a/opensm/scripts/qos-policy.conf b/opensm/scripts/qos-policy.conf
> new file mode 100644
> index 0000000..42a88c0
> --- /dev/null
> +++ b/opensm/scripts/qos-policy.conf
> @@ -0,0 +1,2 @@
> +# Default Quality of Service policy configuration file
> +# For further details see /usr/share/doc/opensm-<version>/QoS_management_in_OpenSM.txt
> diff --git a/opensm/scripts/root-nodes.conf b/opensm/scripts/root-nodes.conf
> new file mode 100644
> index 0000000..d84d732
> --- /dev/null
> +++ b/opensm/scripts/root-nodes.conf
> @@ -0,0 +1,3 @@
> +# Default root node GUIDs configuration file for OpenSM
> +# List of GUIDs in hex, one per line
> +# 0x8f10002322134567
> -- 
> 1.5.3.8
> 
> 
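
A side note on the partitions.conf comments quoted above ("Only low 15 bits
will be used", default P_Key 0x7fff, full vs. limited membership): the sketch
below shows how a 16-bit P_Key splits into a 15-bit base value and a
membership bit. The macro and function names are made up for illustration and
are not OpenSM's; the only facts assumed are that the low 15 bits identify the
partition and that the top bit marks full membership.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only (not OpenSM identifiers). */
#define PKEY_BASE_MASK   0x7fffu	/* low 15 bits: partition number */
#define PKEY_FULL_MEMBER 0x8000u	/* top bit: 1 = full, 0 = limited */

/* Strip the membership bit and return the partition number. */
static uint16_t pkey_base(uint16_t pkey)
{
	return (uint16_t)(pkey & PKEY_BASE_MASK);
}

/* Build the on-the-wire P_Key for a port with the given membership. */
static uint16_t pkey_make(uint16_t base, int full)
{
	return (uint16_t)((base & PKEY_BASE_MASK) |
			  (full ? PKEY_FULL_MEMBER : 0u));
}

/* Non-zero if the P_Key carries full membership. */
static int pkey_is_full(uint16_t pkey)
{
	return (pkey & PKEY_FULL_MEMBER) != 0;
}
```

Under these assumptions a full member of the default partition carries 0xffff,
while a limited member carries 0x7fff; both map back to base 0x7fff.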


From sashak at voltaire.com  Sun Nov  9 10:30:35 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 9 Nov 2008 20:30:35 +0200
Subject: [ofa-general] Re: [PATCH 2/2] install QoS_management_in_OpenSM.txt
In-Reply-To: <4912BDAB.5040704@Voltaire.COM>
References: <4912BCFC.8030407@Voltaire.COM> <4912BDAB.5040704@Voltaire.COM>
Message-ID: <20081109183035.GB30682@sashak.voltaire.com>

On 11:49 Thu 06 Nov     , Doron Shoham wrote:
> install QoS_management_in_OpenSM.txt via the rpm
> 
> Signed-off-by: Doron Shoham <dorons at voltaire.com>

Applied. Thanks.

Sasha


From vlad at lists.openfabrics.org  Mon Nov 10 03:16:57 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 10 Nov 2008 03:16:57 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081110-0200 daily build status
Message-ID: <20081110111657.83D12E60C87@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From kliteyn at dev.mellanox.co.il  Mon Nov 10 06:19:17 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 10 Nov 2008 16:19:17 +0200
Subject: [ofa-general] [PATCH] opensm/osm_pkey.c: cosmetics in some log
	message
Message-ID: <491842E5.6040203@dev.mellanox.co.il>

Hi Sasha,

Just some cosmetics in a log message.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_pkey.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_pkey.c b/opensm/opensm/osm_pkey.c
index 3adc8d7..e09faa8 100644
--- a/opensm/opensm/osm_pkey.c
+++ b/opensm/opensm/osm_pkey.c
@@ -475,7 +475,7 @@ osm_physp_has_pkey(IN osm_log_t * p_log,
 	OSM_LOG_ENTER(p_log);

 	OSM_LOG(p_log, OSM_LOG_DEBUG,
-		"Search for PKey: 0x%4x\n", cl_ntoh16(pkey));
+		"Search for PKey: 0x%04x\n", cl_ntoh16(pkey));

 	/* if the pkey given is an invalid pkey - return TRUE. */
 	if (ib_pkey_is_invalid(pkey)) {
-- 
1.5.1.4
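
For anyone wondering what the format change above buys: "%4x" pads to a width
of four with spaces, while "%04x" pads with zeros, so short P_Key values no
longer print with a gap after the "0x". A minimal sketch follows; the helper
names are hypothetical and not part of OpenSM, and snprintf() stands in for
the OSM_LOG() call.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical helpers, not part of OpenSM: format a host-order P_Key
 * the way the log message did before and after the patch. */
static int fmt_pkey_old(char *buf, size_t len, uint16_t pkey)
{
	/* "%4x": minimum width 4, padded with spaces */
	return snprintf(buf, len, "0x%4x", (unsigned)pkey);
}

static int fmt_pkey_new(char *buf, size_t len, uint16_t pkey)
{
	/* "%04x": minimum width 4, padded with zeros */
	return snprintf(buf, len, "0x%04x", (unsigned)pkey);
}
```

For pkey 0x1f the first helper produces "0x  1f" and the second "0x001f".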


From kliteyn at dev.mellanox.co.il  Mon Nov 10 06:25:09 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 10 Nov 2008 16:25:09 +0200
Subject: [ofa-general] [PATCH] opensm/ib_types.h: rename
	IB_MC_REC_STATE_SEND_ONLY_MEMBER
Message-ID: <49184445.10007@dev.mellanox.co.il>

Sasha,

The multicast Send Only bit is defined in the spec as "SendOnlyNonMember",
to denote that the port is not considered a member for purposes of group
creation/deletion.

Renaming IB_MC_REC_STATE_SEND_ONLY_MEMBER to IB_MC_REC_STATE_SEND_ONLY_NON_MEMBER.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/include/iba/ib_types.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index 6412ea9..0f9d110 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -7085,7 +7085,7 @@ ib_member_set_join_state(IN OUT ib_member_rec_t * p_mc_rec,
  */
 #define IB_MC_REC_STATE_FULL_MEMBER 0x01
 #define IB_MC_REC_STATE_NON_MEMBER 0x02
-#define IB_MC_REC_STATE_SEND_ONLY_MEMBER 0x04
+#define IB_MC_REC_STATE_SEND_ONLY_NON_MEMBER 0x04

 /*
  *	Generic MAD notice types
-- 
1.5.1.4


From kliteyn at dev.mellanox.co.il  Mon Nov 10 06:36:54 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 10 Nov 2008 16:36:54 +0200
Subject: [ofa-general] [PATCH] opensm/osm_multicast.c: bug with
	joining/leaving mcast group
Message-ID: <49184706.9070103@dev.mellanox.co.il>

Hi Sasha,

I think there's a bug in the osm_mgrp_add_port/osm_mgrp_remove_port
functions. If some mcast group member has JoinState 0x1 (full member),
and a new join from the same port is then received with JoinState
0x2 (non-member), OpenSM will reduce the number of full members
of this group, which might eventually cause group deletion.
A similar problem (only in the logically opposite direction) happens
when a port tries to partially leave an mcast group.

This patch should fix it.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_multicast.c |   33 +++++++++++----------------------
 1 files changed, 11 insertions(+), 22 deletions(-)

diff --git a/opensm/opensm/osm_multicast.c b/opensm/opensm/osm_multicast.c
index d62d585..350fd22 100644
--- a/opensm/opensm/osm_multicast.c
+++ b/opensm/opensm/osm_multicast.c
@@ -172,17 +172,11 @@ osm_mcm_port_t *osm_mgrp_add_port(IN osm_subn_t *subn, osm_log_t *log,
 		p_mgrp->last_change_id++;
 	}

-	if ((join_state ^ prev_join_state) & IB_JOIN_STATE_FULL) {
-		if (join_state & IB_JOIN_STATE_FULL) {
-			if (++p_mgrp->full_members == 1) {
-				mgrp_send_notice(subn, log, p_mgrp, 66);
-				p_mgrp->to_be_deleted = 0;
-			}
-		} else if (--p_mgrp->full_members == 0) {
-			mgrp_send_notice(subn, log, p_mgrp, 67);
-			if (!p_mgrp->well_known)
-				p_mgrp->to_be_deleted = 1;
-		}
+	if ((join_state & IB_JOIN_STATE_FULL) &&
+	    !(prev_join_state & IB_JOIN_STATE_FULL) &&
+	    (++p_mgrp->full_members == 1)) {
+		mgrp_send_notice(subn, log, p_mgrp, 66);
+		p_mgrp->to_be_deleted = 0;
 	}

 	return (p_mcm_port);
@@ -224,17 +218,12 @@ int osm_mgrp_remove_port(osm_subn_t *subn, osm_log_t *log, osm_mgrp_t *mgrp,

 	/* no more full members so the group will be deleted after re-route
 	   but only if it is not a well known group */
-	if ((port_join_state ^ new_join_state) & IB_JOIN_STATE_FULL) {
-		if (port_join_state & IB_JOIN_STATE_FULL) {
-			if (--mgrp->full_members == 0) {
-				mgrp_send_notice(subn, log, mgrp, 67);
-				if (!mgrp->well_known)
-					mgrp->to_be_deleted = 1;
-			}
-		} else if (++mgrp->full_members == 1) {
-			mgrp_send_notice(subn, log, mgrp, 66);
-			mgrp->to_be_deleted = 0;
-		}
+	if ((port_join_state & IB_JOIN_STATE_FULL) &&
+	    !(new_join_state & IB_JOIN_STATE_FULL) &&
+	    (--mgrp->full_members == 0)) {
+		mgrp_send_notice(subn, log, mgrp, 67);
+		if (!mgrp->well_known)
+			mgrp->to_be_deleted = 1;
 	}

 	return ret;
-- 
1.5.1.4
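
To make the failure mode concrete, here is a stripped-down sketch of the
full-member accounting on the join path, before and after the patch. It is
not the OpenSM code: the struct and function names are invented, the notice
sending and the well_known check are left out, and the caller is assumed to
pass in the port's previous JoinState together with the newly requested one.

```c
/* Join state bits as used in the description above. */
#define JOIN_FULL 0x1u	/* full member */
#define JOIN_NON  0x2u	/* non-member */

/* Hypothetical, reduced stand-in for the multicast group object. */
struct mgrp_sketch {
	unsigned full_members;
	int to_be_deleted;
};

/* Old check: any difference in the FULL bit between the previous and the
 * requested state moves the counter, so a non-member join from a port that
 * is already a full member decrements it. */
static void add_port_old(struct mgrp_sketch *g, unsigned prev, unsigned join)
{
	if ((join ^ prev) & JOIN_FULL) {
		if (join & JOIN_FULL) {
			if (++g->full_members == 1)
				g->to_be_deleted = 0;
		} else if (--g->full_members == 0) {
			g->to_be_deleted = 1;
		}
	}
}

/* New check: a join can only add a full member, never remove one. */
static void add_port_new(struct mgrp_sketch *g, unsigned prev, unsigned join)
{
	if ((join & JOIN_FULL) && !(prev & JOIN_FULL) &&
	    (++g->full_members == 1))
		g->to_be_deleted = 0;
}
```

Feeding both versions a full join followed by a non-member join from the same
port leaves the old version with zero full members and the group marked for
deletion, while the new version keeps the count at one.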


From tziporet at mellanox.co.il  Mon Nov 10 06:57:59 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 10 Nov 2008 16:57:59 +0200
Subject: [ofa-general] Agenda for OFED meeting today - Nov 10
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EADE93B3E@mtlexch01.mtl.com>

This is the agenda for OFED meeting today on OFED release status:

1. Decide on RC4 release - I suggest doing it tomorrow
2. Decide on GA release:
    my suggestion - RC5 in a week (Monday, Nov 17)
    GA - Nov 24 (we cannot delay any further that week because of the
    Thanksgiving holiday)
    We can try for Friday, Nov 21
3. Release notes - all owners must update the release notes
4. Bugs review:

ID    Sev  Owner                         Status  Summary
1323  blo  stefan.roscher at de.ibm.com     REOP    IB/ehca: possibillity of kernel panic under certain circu...
1370  blo  vlad at mellanox.co.il           NEW     Ping over IPoIB I/F fails after ifconfig down and up
1364  cri  swise at opengridcomputing.com  NEW     system hang on rmmod cxgb3 in rhel4.7
1365  cri  swise at opengridcomputing.com  NEW     Panic on loading iw_cxgb3 in RHEL 4.6
1366  cri  swise at opengridcomputing.com  NEW     Panic during boot-up after an OFED install in RHEL 4.5
1242  cri  yannick.cote at qlogic.com       NEW     kernel panic while running mpi2007 against ofed1.4 -- ib_...
1289  maj  amirv at mellanox.co.il          NEW     Ib and ipoib doesnt respond while running multiple tests ...
1349  maj  amirv at mellanox.co.il          NEW     Kernel panic on sdp
1336  maj  vlad at mellanox.co.il           NEW     Can't to unloading the mlx4_ib module on ppc64
1358  maj  vlad at mellanox.co.il           ASSI    fmr_test causes eth0 transmit timeout - should be fixed
1359  maj  vlad at mellanox.co.il           NEW     Kernel panic while running Ltp - ongoing

Tziporet & Vlad


From frederic.ciesielski at hp.com  Mon Nov 10 08:27:50 2008
From: frederic.ciesielski at hp.com (Ciesielski, Frederic (EMEA HPC&OSLO CC))
Date: Mon, 10 Nov 2008 16:27:50 +0000
Subject: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?
In-Reply-To: <49160618.3050409@nasa.gov>
References: <7391130E01ED404FBD7A3C86731EEB7D20EC0F8737@GVW1087EXB.americas.hpqcorp.net>
	<49160618.3050409@nasa.gov>
Message-ID: <7391130E01ED404FBD7A3C86731EEB7D20ECAB457D@GVW1087EXB.americas.hpqcorp.net>

That's great, thanks.

I ran some tests with the 2.6.27 kernel as server and client, and basically it works fine.

I have not yet found any situation where NFS-RDMA outperforms NFS/IPoIB, at least when comparing apples to apples (same clients, same server, same protocol, and not just writing to/reading from the caches), and it even seems to have severe performance issues when reading files larger than the memory size of the client and the server.
Hopefully this will improve once more users are able to give valuable feedback...

Fred.

-----Original Message-----
From: Jeff Becker [mailto:Jeffrey.C.Becker at nasa.gov]
Sent: Saturday, 08 November, 2008 22:35
To: Ciesielski, Frederic (EMEA HPC&OSLO CC)
Cc: general at lists.openfabrics.org
Subject: Re: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?

Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
> Is there any chance that the new NFS-RDMA features coming with OFED
> 1.4 work with standard and current distributions, like RHEL5, SLES10 ?
Not yet, but I'm working on it. I intend for NFSRDMA to work on 2.6.27 and 2.6.26 for OFED 1.4. The RHEL5 and SLES10 backports will likely be done for OFED 1.4.1. Thanks.

-jeff

> Did anybody test this, or would pretend it is supposed to work ?
>
> I mean without building a 2.6.27 or equivalent kernel on top of it,
> keeping almost full support from the vendors.
>
> Enhanced kernel modules may not be sufficient to work around the
> limitations of old kernels...


From tom at opengridcomputing.com  Mon Nov 10 09:07:14 2008
From: tom at opengridcomputing.com (Tom Tucker)
Date: Mon, 10 Nov 2008 11:07:14 -0600
Subject: [Fwd: RE: [ofa-general] NFS-RDMA (OFED1.4) with standard
	distributions ?]
In-Reply-To: <491867E1.4000101@nasa.gov>
References: <491867E1.4000101@nasa.gov>
Message-ID: <49186A42.8040303@opengridcomputing.com>

Jeff:

Unfortunately, the NFSRDMA transport cannot make your disks go faster. 
If the storage subsystem is incapable of keeping up with IPoIB, then it 
won't be able to keep up with NFSRDMA either.

To compare NFSRDMA and IPoIB performance absent a very fast storage 
subsystem you'll need to keep the file sizes small enough such that they 
fit within the server cache.

Tom


Jeff Becker wrote:
> Hi. Just passing this on in case you missed it. Do you have any advice
> on what knobs to tweak to get better performance (than NFS/IPoIB)? Thanks.
> 
> -jeff
> 
> -------- Original Message --------
> Subject: 	RE: [ofa-general] NFS-RDMA (OFED1.4) with standard
> distributions ?
> Date: 	Mon, 10 Nov 2008 16:27:50 +0000
> From: 	Ciesielski, Frederic (EMEA HPC&OSLO CC) <frederic.ciesielski at hp.com>
> To: 	Jeff Becker <Jeffrey.C.Becker at nasa.gov>
> CC: 	general at lists.openfabrics.org <general at lists.openfabrics.org>
> References:
> <7391130E01ED404FBD7A3C86731EEB7D20EC0F8737 at GVW1087EXB.americas.hpqcorp.net>
> <49160618.3050409 at nasa.gov>
> 
> 
> 
> That's great, thanks.
> 
> I ran some tests with the 2.6.27 kernel as server and client, and basically it works fine.
> 
> I could not find yet any situation where NFS-RDMA would outperform NFS/IPoIB, at least when you compare apples to apples (same clients, same server, same protocol, and not just write to/read from the caches), and it even seems to have severe performance issues for reading with files larger than the memory size of the client and the server.
> Hopefully this will improve when more users will be able to give valuable feedback...
> 
> Fred.
> 
> -----Original Message-----
> From: Jeff Becker [mailto:Jeffrey.C.Becker at nasa.gov]
> Sent: Saturday, 08 November, 2008 22:35
> To: Ciesielski, Frederic (EMEA HPC&OSLO CC)
> Cc: general at lists.openfabrics.org
> Subject: Re: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?
> 
> Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
>> Is there any chance that the new NFS-RDMA features coming with OFED
>> 1.4 work with standard and current distributions, like RHEL5, SLES10 ?
> Not yet, but I'm working on it. I intend for NFSRDMA to work on 2.6.27 and 2.6.26 for OFED 1.4. The RHEL5 and SLES10 backports will likely be done for OFED 1.4.1. Thanks.
> 
> -jeff
> 
>> Did anybody test this, or would pretend it is supposed to work ?
>>
>> I mean without building a 2.6.27 or equivalent kernel on top of it,
>> keeping almost full support from the vendors.
>>
>> Enhanced kernel modules may not be sufficient to work around the
>> limitations of old kernels...
> 


From swise at opengridcomputing.com  Mon Nov 10 09:42:53 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 10 Nov 2008 11:42:53 -0600
Subject: [ofa-general] Re: [ewg] Agenda for OFED meeting today - Nov 10
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EADE93B3E@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EADE93B3E@mtlexch01.mtl.com>
Message-ID: <4918729D.7090906@opengridcomputing.com>


> 1364 	cri 	swise at opengridcomputing.com 	NEW 		system
> hang on rmmod cxgb3 in rhel4.7
> 1365 	cri 	swise at opengridcomputing.com 	NEW 		Panic on
> loading iw_cxgb3 in RHEL 4.6
> 1366 	cri 	swise at opengridcomputing.com 	NEW 		Panic
> during boot-up after an OFED install in RHEL 4.5
>   

Sorry I missed the call (yet again). 

1364 is under investigation, should have a fix today.
1365 closed.  Didn't see the problem in latest daily build
1366 will need a fix and hopefully I'll have something today/tomorrow.  
This isn't related to just RH4.5, but rather to new chelsio boards that 
aren't supported in ofed-1.4.

These can all wait for -rc5 if you don't want to hold up rc4.

Thanx,

Steve.


From chu11 at llnl.gov  Mon Nov 10 09:42:42 2008
From: chu11 at llnl.gov (Al Chu)
Date: Mon, 10 Nov 2008 09:42:42 -0800
Subject: [ofa-general] Re: [opensm patch] support dump_conf command in opensm
	console
In-Reply-To: <20081109172518.GG30588@sashak.voltaire.com>
References: <1225759191.7307.9.camel@cardanus.llnl.gov>
	<20081109172518.GG30588@sashak.voltaire.com>
Message-ID: <1226338962.13603.21.camel@cardanus.llnl.gov>

Hey Sasha,

On Sun, 2008-11-09 at 19:25 +0200, Sasha Khapyorsky wrote:
> Hi Al,
> 
> On 16:39 Mon 03 Nov     , Al Chu wrote:
> > Hey Sasha,
> > 
> > When config files are rescanned and loaded, there's no way to know if
> > the right configuration was actually reloaded or not.  A console command
> > to dump the current config is a useful way to verify the loading of new
> > configs or not.
> > 
> > This patch assumes the fixes from my "fix qos config parsing bugs" is
> > accepted.
> 
> Didn't pass over it, sorry about delay.
> 
> > 
> > Al
> > 
> > -- 
> > Albert Chu
> > chu11 at llnl.gov
> > Computer Scientist
> > High Performance Systems Division
> > Lawrence Livermore National Laboratory
> 
> > From 249607e47ec7ef1b92f9578cece90460418d12b8 Mon Sep 17 00:00:00 2001
> > From: Albert Chu <chu11 at llnl.gov>
> > Date: Mon, 3 Nov 2008 16:22:29 -0800
> > Subject: [PATCH] support dump_conf console command
> > 
> > 
> > Signed-off-by: Albert Chu <chu11 at llnl.gov>
> > ---
> >  opensm/opensm/osm_console.c |  158 +++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 158 insertions(+), 0 deletions(-)
> > 
> > diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
> > index d9bbbc2..8422655 100644
> > --- a/opensm/opensm/osm_console.c
> > +++ b/opensm/opensm/osm_console.c
> > @@ -53,6 +53,10 @@
> >  #include <complib/cl_passivelock.h>
> >  #include <opensm/osm_perfmgr.h>
> >  
> > +#define NULL_STR "(null)"
> > +
> > +#define BOOLEAN_STR(__b) ((__b) ? "TRUE" : "FALSE")
> > +
> >  struct command {
> >  	char *name;
> >  	void (*help_function) (FILE * out, int detail);
> > @@ -189,6 +193,14 @@ static void help_lidbalance(FILE * out, int detail)
> >  	}
> >  }
> >  
> > +static void help_dump_conf(FILE *out, int detail)
> > +{
> > +	fprintf(out, "dump_conf\n");
> > +	if (detail) {
> > +		fprintf(out, "dump current opensm configuration\n");
> > +	}
> > +}
> > +
> >  #ifdef ENABLE_OSM_PERF_MGR
> >  static void help_perfmgr(FILE * out, int detail)
> >  {
> > @@ -1136,6 +1148,151 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> >  }
> >  #endif				/* ENABLE_OSM_PERF_MGR */
> >  
> > +static void dump_qos_options(osm_qos_options_t * opt,
> > +			     osm_qos_options_t * dflt, 
> > +			     char *prefix,
> > +			     FILE * out)
> > +{
> > +	fprintf(out, "%s_max_vls : %u\n",
> > +		prefix, opt->max_vls ? opt->max_vls : dflt->max_vls);
> > +	fprintf(out, "%s_high_limit : %u\n",
> > +		prefix, opt->high_limit >= 0 ? (unsigned)opt->high_limit : (unsigned)dflt->high_limit);
> > +	fprintf(out, "%s_vlarb_high : %s\n",
> > +		prefix, opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high);
> > +	fprintf(out, "%s_vlarb_low : %s\n",
> > +		prefix, opt->vlarb_low ? opt->vlarb_low : dflt->vlarb_low);
> > +	fprintf(out, "%s_sl2vl : %s\n",
> > +		prefix, opt->sl2vl ? opt->sl2vl : dflt->sl2vl);
> > +}
> > +
> > +static void dump_conf_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> > +{
> 
> Why not use the osm_subn_write_conf_file() function (wrapped by
> dump_conf_parse())? I think we need to have config dumping code
> consolidated.

I had thought of that, but I didn't want all of the instructions and all
the extra lines of output.  But I guess it's not that big of a deal in
the end.  I'll send a new patch.

Al
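
For reference, the pattern the patch repeats for every option boils down to
two small cases: string options that may be unset, and booleans. Passing a
NULL pointer to "%s" is undefined behaviour in C (glibc happens to print
"(null)", but that is not guaranteed), hence the explicit NULL_STR fallback.
A minimal sketch, where NULL_STR and BOOLEAN_STR are taken from the patch and
the two helper functions are hypothetical:

```c
#include <stdio.h>

#define NULL_STR "(null)"
#define BOOLEAN_STR(b) ((b) ? "TRUE" : "FALSE")

/* Hypothetical helper: print a string option, substituting NULL_STR when
 * the option is unset so that "%s" never receives a NULL pointer. */
static int dump_str_opt(FILE *out, const char *name, const char *value)
{
	return fprintf(out, "%s : %s\n", name, value ? value : NULL_STR);
}

/* Hypothetical helper: print a boolean option as TRUE or FALSE. */
static int dump_bool_opt(FILE *out, const char *name, int value)
{
	return fprintf(out, "%s : %s\n", name, BOOLEAN_STR(value));
}
```

A call such as dump_str_opt(out, "log_file", opt->log_file) would then replace
each open-coded ternary, which is roughly the kind of consolidation a shared
dumping routine would provide anyway.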

> Sasha
> 
> > +	osm_subn_opt_t * opt = &p_osm->subn.opt;
> > +
> > +	fprintf(out, "config_file : %s\n", 
> > +		opt->config_file ? opt->config_file : NULL_STR);
> > +	fprintf(out, "guid : 0x%016" PRIx64 "\n", opt->guid);
> > +	fprintf(out, "m_key : 0x%016" PRIx64 "\n", opt->m_key);
> > +	fprintf(out, "sm_key : 0x%016" PRIx64 "\n", opt->sm_key);
> > +	fprintf(out, "sa_key : 0x%016" PRIx64 "\n", opt->sa_key);
> > +	fprintf(out, "subnet_prefix : 0x%016" PRIx64 "\n", opt->subnet_prefix);
> > +	fprintf(out, "m_key_lease_period : %u\n", opt->m_key_lease_period);
> > +	fprintf(out, "sweep_interval : %u\n", opt->sweep_interval);
> > +	fprintf(out, "max_wire_smps : %u\n", opt->max_wire_smps);
> > +	fprintf(out, "transaction_timeout : %u\n", opt->transaction_timeout);
> > +	fprintf(out, "sm_priority : %u\n", opt->sm_priority);
> > +	fprintf(out, "lmc : %u\n", opt->lmc);
> > +	fprintf(out, "lmc_esp0 : %s\n", 
> > +		BOOLEAN_STR(opt->lmc_esp0));
> > +	fprintf(out, "max_op_vls : %u\n", opt->max_op_vls);
> > +	fprintf(out, "force_link_speed : %u\n", opt->force_link_speed);
> > +	fprintf(out, "reassign_lids : %s\n", 
> > +		BOOLEAN_STR(opt->reassign_lids));
> > +	fprintf(out, "ignore_other_sm : %s\n", 
> > +		BOOLEAN_STR(opt->ignore_other_sm));
> > +	fprintf(out, "single_thread : %s\n", 
> > +		BOOLEAN_STR(opt->single_thread));
> > +	fprintf(out, "disable_multicast : %s\n", 
> > +		BOOLEAN_STR(opt->disable_multicast));
> > +	fprintf(out, "force_log_flush : %s\n", 
> > +		BOOLEAN_STR(opt->force_log_flush));
> > +	fprintf(out, "subnet_timeout : %u\n", opt->subnet_timeout);
> > +	fprintf(out, "packet_life_time : %u\n", opt->packet_life_time);
> > +	fprintf(out, "vl_stall_count : %u\n", opt->vl_stall_count);
> > +	fprintf(out, "leaf_vl_stall_count : %u\n", opt->leaf_vl_stall_count);
> > +	fprintf(out, "head_of_queue_lifetime : %u\n", opt->head_of_queue_lifetime);
> > +	fprintf(out, "leaf_head_of_queue_lifetime : %u\n", opt->leaf_head_of_queue_lifetime);
> > +	fprintf(out, "local_phy_errors_threshold : %u\n", opt->local_phy_errors_threshold);
> > +	fprintf(out, "overrun_errors_threshold : %u\n", opt->overrun_errors_threshold);
> > +	fprintf(out, "sminfo_polling_timeout : %u\n", opt->sminfo_polling_timeout);
> > +	fprintf(out, "polling_retry_number : %u\n", opt->polling_retry_number);
> > +	fprintf(out, "max_msg_fifo_timeout : %u\n", opt->max_msg_fifo_timeout);
> > +	fprintf(out, "force_heavy_sweep : %s\n", 
> > +		BOOLEAN_STR(opt->force_heavy_sweep));
> > +	fprintf(out, "log_flags : 0x%02x\n", opt->log_flags);
> > +	fprintf(out, "dump_files_dir : %s\n", 
> > +		opt->dump_files_dir ? opt->dump_files_dir : NULL_STR);
> > +	fprintf(out, "log_file : %s\n", 
> > +		opt->log_file ? opt->log_file : NULL_STR);
> > +	fprintf(out, "log_max_size : %lu\n", opt->log_max_size);
> > +	fprintf(out, "partition_config_file : %s\n", 
> > +		opt->partition_config_file ? opt->partition_config_file : NULL_STR);
> > +	fprintf(out, "no_partition_enforcement : %s\n", 
> > +		BOOLEAN_STR(opt->no_partition_enforcement));
> > +	fprintf(out, "qos : %s\n", 
> > +		BOOLEAN_STR(opt->qos));
> > +	fprintf(out, "qos_policy_file : %s\n", 
> > +		opt->qos_policy_file ? opt->qos_policy_file : NULL_STR);
> > +	fprintf(out, "accum_log_file: %s\n", 
> > +		BOOLEAN_STR(opt->accum_log_file));
> > +	fprintf(out, "console : %s\n", 
> > +		opt->console ? opt->console : NULL_STR);
> > +	fprintf(out, "console_port : %u\n", opt->console_port);
> > +	fprintf(out, "port_prof_ignore_file : %s\n", 
> > +		opt->port_prof_ignore_file ? opt->port_prof_ignore_file : NULL_STR);
> > +	fprintf(out, "port_profile_switch_nodes : %s\n", 
> > +		BOOLEAN_STR(opt->port_profile_switch_nodes));
> > +	fprintf(out, "sweep_on_trap : %s\n", 
> > +		BOOLEAN_STR(opt->sweep_on_trap));
> > +	fprintf(out, "routing_engine_names : %s\n", 
> > +		opt->routing_engine_names ? opt->routing_engine_names : NULL_STR);
> > +	fprintf(out, "use_ucast_cache : %s\n", 
> > +		BOOLEAN_STR(opt->use_ucast_cache));
> > +	fprintf(out, "connect_roots : %s\n", 
> > +		BOOLEAN_STR(opt->connect_roots));
> > +	fprintf(out, "lid_matrix_dump_file : %s\n", 
> > +		opt->lid_matrix_dump_file ? opt->lid_matrix_dump_file : NULL_STR);
> > +	fprintf(out, "lfts_file : %s\n", 
> > +		opt->lfts_file ? opt->lfts_file : NULL_STR);
> > +	fprintf(out, "root_guid_file : %s\n", 
> > +		opt->root_guid_file ? opt->root_guid_file : NULL_STR);
> > +	fprintf(out, "cn_guid_file : %s\n", 
> > +		opt->cn_guid_file ? opt->cn_guid_file : NULL_STR);
> > +	fprintf(out, "ids_guid_file : %s\n", 
> > +		opt->ids_guid_file ? opt->ids_guid_file : NULL_STR);
> > +	fprintf(out, "guid_routing_order_file : %s\n", 
> > +		opt->guid_routing_order_file ? opt->guid_routing_order_file : NULL_STR);
> > +	fprintf(out, "sa_db_file : %s\n", 
> > +		opt->sa_db_file ? opt->sa_db_file : NULL_STR);
> > +	fprintf(out, "exit_on_fatal : %s\n", 
> > +		BOOLEAN_STR(opt->exit_on_fatal));
> > +	fprintf(out, "honor_guid2lid_file : %s\n", 
> > +		BOOLEAN_STR(opt->honor_guid2lid_file));
> > +	fprintf(out, "daemon : %s\n", 
> > +		BOOLEAN_STR(opt->daemon));
> > +	fprintf(out, "sm_inactive : %s\n", 
> > +		BOOLEAN_STR(opt->sm_inactive));
> > +	fprintf(out, "babbling_port_policy : %s\n", 
> > +		BOOLEAN_STR(opt->babbling_port_policy));
> > +	dump_qos_options(&opt->qos_options, &opt->qos_options, "qos", out);
> > +	dump_qos_options(&opt->qos_ca_options, &opt->qos_options, "qos_ca", out);
> > +	dump_qos_options(&opt->qos_sw0_options, &opt->qos_options, "qos_sw0", out);
> > +	dump_qos_options(&opt->qos_swe_options, &opt->qos_options, "qos_swe", out);
> > +	dump_qos_options(&opt->qos_rtr_options, &opt->qos_options, "qos_rtr", out);
> > +	fprintf(out, "enable_quirks : %s\n", 
> > +		BOOLEAN_STR(opt->enable_quirks));
> > +	fprintf(out, "no_clients_rereg : %s\n", 
> > +		BOOLEAN_STR(opt->no_clients_rereg));
> > +#ifdef ENABLE_OSM_PERF_MGR
> > +	fprintf(out, "perfmgr : %s\n", 
> > +		BOOLEAN_STR(opt->perfmgr));
> > +	fprintf(out, "perfmgr_redir : %s\n", 
> > +		BOOLEAN_STR(opt->perfmgr_redir));
> > +	fprintf(out, "perfmgr_sweep_time_s : %u\n", opt->perfmgr_sweep_time_s);
> > +	fprintf(out, "perfmgr_max_outstanding_queries : %u\n", opt->perfmgr_max_outstanding_queries);
> > +	fprintf(out, "event_db_dump_file : %s\n", 
> > +		opt->event_db_dump_file ? opt->event_db_dump_file : NULL_STR);
> > +#endif
> > +	fprintf(out, "event_plugin_name : %s\n", 
> > +		opt->event_plugin_name ? opt->event_plugin_name : NULL_STR);
> > +	fprintf(out, "node_name_map_name : %s\n", 
> > +		opt->node_name_map_name ? opt->node_name_map_name : NULL_STR);
> > +	fprintf(out, "prefix_routes_file : %s\n", 
> > +		opt->prefix_routes_file ? opt->prefix_routes_file : NULL_STR);
> > +	fprintf(out, "consolidate_ipv6_snm_req : %s\n", 
> > +		BOOLEAN_STR(opt->consolidate_ipv6_snm_req));
> > +}
> > +
> >  static void quit_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> >  {
> >  	osm_console_exit(&p_osm->console, &p_osm->log);
> > @@ -1166,6 +1323,7 @@ static const struct command console_cmds[] = {
> >  	{"portstatus", &help_portstatus, &portstatus_parse},
> >  	{"switchbalance", &help_switchbalance, &switchbalance_parse},
> >  	{"lidbalance", &help_lidbalance, &lidbalance_parse},
> > +	{"dump_conf", &help_dump_conf, &dump_conf_parse},
> >  	{"version", &help_version, &version_parse},
> >  #ifdef ENABLE_OSM_PERF_MGR
> >  	{"perfmgr", &help_perfmgr, &perfmgr_parse},
> > -- 
> > 1.5.4.5
> > 
> 
-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory


From chu11 at llnl.gov  Mon Nov 10 09:42:42 2008
From: chu11 at llnl.gov (Al Chu)
Date: Mon, 10 Nov 2008 09:42:42 -0800
Subject: [ofa-general] Re: [opensm patch] support dump_conf command in opensm
	console
In-Reply-To: <20081109172518.GG30588@sashak.voltaire.com>
References: <1225759191.7307.9.camel@cardanus.llnl.gov>
	<20081109172518.GG30588@sashak.voltaire.com>
Message-ID: <1226338962.13603.21.camel@cardanus.llnl.gov>

Hey Sasha,

On Sun, 2008-11-09 at 19:25 +0200, Sasha Khapyorsky wrote:
> Hi Al,
> 
> On 16:39 Mon 03 Nov     , Al Chu wrote:
> > Hey Sasha,
> > 
> > When config files are rescanned and loaded, there's no way to know
> > whether the right configuration was actually reloaded.  A console
> > command that dumps the current config is a useful way to verify that
> > new configs were loaded.
> > 
> > This patch assumes the fixes from my "fix qos config parsing bugs"
> > patch are accepted.
> 
> I haven't gone over it yet, sorry about the delay.
> 
> > 
> > Al
> > 
> > -- 
> > Albert Chu
> > chu11 at llnl.gov
> > Computer Scientist
> > High Performance Systems Division
> > Lawrence Livermore National Laboratory
> 
> > From 249607e47ec7ef1b92f9578cece90460418d12b8 Mon Sep 17 00:00:00 2001
> > From: Albert Chu <chu11 at llnl.gov>
> > Date: Mon, 3 Nov 2008 16:22:29 -0800
> > Subject: [PATCH] support dump_conf console command
> > 
> > 
> > Signed-off-by: Albert Chu <chu11 at llnl.gov>
> > ---
> >  opensm/opensm/osm_console.c |  158 +++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 158 insertions(+), 0 deletions(-)
> > 
> > diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
> > index d9bbbc2..8422655 100644
> > --- a/opensm/opensm/osm_console.c
> > +++ b/opensm/opensm/osm_console.c
> > @@ -53,6 +53,10 @@
> >  #include <complib/cl_passivelock.h>
> >  #include <opensm/osm_perfmgr.h>
> >  
> > +#define NULL_STR "(null)"
> > +
> > +#define BOOLEAN_STR(__b) ((__b) ? "TRUE" : "FALSE")
> > +
> >  struct command {
> >  	char *name;
> >  	void (*help_function) (FILE * out, int detail);
> > @@ -189,6 +193,14 @@ static void help_lidbalance(FILE * out, int detail)
> >  	}
> >  }
> >  
> > +static void help_dump_conf(FILE *out, int detail)
> > +{
> > +	fprintf(out, "dump_conf\n");
> > +	if (detail) {
> > +		fprintf(out, "dump current opensm configuration\n");
> > +	}
> > +}
> > +
> >  #ifdef ENABLE_OSM_PERF_MGR
> >  static void help_perfmgr(FILE * out, int detail)
> >  {
> > @@ -1136,6 +1148,151 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> >  }
> >  #endif				/* ENABLE_OSM_PERF_MGR */
> >  
> > +static void dump_qos_options(osm_qos_options_t * opt,
> > +			     osm_qos_options_t * dflt, 
> > +			     char *prefix,
> > +			     FILE * out)
> > +{
> > +	fprintf(out, "%s_max_vls : %u\n",
> > +		prefix, opt->max_vls ? opt->max_vls : dflt->max_vls);
> > +	fprintf(out, "%s_high_limit : %u\n",
> > +		prefix, opt->high_limit >= 0 ? (unsigned)opt->high_limit : (unsigned)dflt->high_limit);
> > +	fprintf(out, "%s_vlarb_high : %s\n",
> > +		prefix, opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high);
> > +	fprintf(out, "%s_vlarb_low : %s\n",
> > +		prefix, opt->vlarb_low ? opt->vlarb_low : dflt->vlarb_low);
> > +	fprintf(out, "%s_sl2vl : %s\n",
> > +		prefix, opt->sl2vl ? opt->sl2vl : dflt->sl2vl);
> > +}
> > +
> > +static void dump_conf_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> > +{
> 
> Why not use the osm_subn_write_conf_file() function (wrapped by
> dump_conf_parse())? I think we need to keep the config dumping code
> consolidated.

I had thought of that, but I didn't want all of the instructions and all
the extra lines of output.  But I guess it's not that big of a deal in
the end.  I'll send a new patch.
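
The consolidation Sasha asks for could look roughly like the sketch below:
one writer that takes a FILE *, with both the file dump and the console
command delegating to it. This is only an illustration under stated
assumptions. struct demo_opts and every demo_* function are hypothetical
names made up for this sketch; they are not the real osm_subn_opt_t or
osm_subn_write_conf_file() interfaces, and the output format is invented.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, trimmed-down option set; the real osm_subn_opt_t is larger. */
struct demo_opts {
	unsigned sweep_interval;
	int qos;
	const char *log_file;
};

/* Single writer: every dump path goes through this one function. */
static int demo_write_conf(FILE *out, const struct demo_opts *opt)
{
	if (!out || !opt)
		return -1;
	fprintf(out, "sweep_interval %u\n", opt->sweep_interval);
	fprintf(out, "qos %s\n", opt->qos ? "TRUE" : "FALSE");
	fprintf(out, "log_file %s\n", opt->log_file ? opt->log_file : "(null)");
	return ferror(out) ? -1 : 0;
}

/* File path: open the file, delegate to the writer, close. */
static int demo_write_conf_file(const char *path, const struct demo_opts *opt)
{
	FILE *f = fopen(path, "w");
	int rc;

	if (!f)
		return -1;
	rc = demo_write_conf(f, opt);
	if (fclose(f) != 0)
		rc = -1;
	return rc;
}

/* Console command: same writer, pointed at the console stream. */
static void demo_dump_conf_cmd(const struct demo_opts *opt, FILE *out)
{
	assert(out != NULL);
	(void)demo_write_conf(out, opt);
}
```

With this shape, a new option only has to be added to the writer once, and
the console output cannot drift from what is written to the config file.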

Al

> Sasha
> 
> > +	osm_subn_opt_t * opt = &p_osm->subn.opt;
> > +
> > +	fprintf(out, "config_file : %s\n", 
> > +		opt->config_file ? opt->config_file : NULL_STR);
> > +	fprintf(out, "guid : 0x%016" PRIx64 "\n", opt->guid);
> > +	fprintf(out, "m_key : 0x%016" PRIx64 "\n", opt->m_key);
> > +	fprintf(out, "sm_key : 0x%016" PRIx64 "\n", opt->sm_key);
> > +	fprintf(out, "sa_key : 0x%016" PRIx64 "\n", opt->sa_key);
> > +	fprintf(out, "subnet_prefix : 0x%016" PRIx64 "\n", opt->subnet_prefix);
> > +	fprintf(out, "m_key_lease_period : %u\n", opt->m_key_lease_period);
> > +	fprintf(out, "sweep_interval : %u\n", opt->sweep_interval);
> > +	fprintf(out, "max_wire_smps : %u\n", opt->max_wire_smps);
> > +	fprintf(out, "transaction_timeout : %u\n", opt->transaction_timeout);
> > +	fprintf(out, "sm_priority : %u\n", opt->sm_priority);
> > +	fprintf(out, "lmc : %u\n", opt->lmc);
> > +	fprintf(out, "lmc_esp0 : %s\n", 
> > +		BOOLEAN_STR(opt->lmc_esp0));
> > +	fprintf(out, "max_op_vls : %u\n", opt->max_op_vls);
> > +	fprintf(out, "force_link_speed : %u\n", opt->force_link_speed);
> > +	fprintf(out, "reassign_lids : %s\n", 
> > +		BOOLEAN_STR(opt->reassign_lids));
> > +	fprintf(out, "ignore_other_sm : %s\n", 
> > +		BOOLEAN_STR(opt->ignore_other_sm));
> > +	fprintf(out, "single_thread : %s\n", 
> > +		BOOLEAN_STR(opt->single_thread));
> > +	fprintf(out, "disable_multicast : %s\n", 
> > +		BOOLEAN_STR(opt->disable_multicast));
> > +	fprintf(out, "force_log_flush : %s\n", 
> > +		BOOLEAN_STR(opt->force_log_flush));
> > +	fprintf(out, "subnet_timeout : %u\n", opt->subnet_timeout);
> > +	fprintf(out, "packet_life_time : %u\n", opt->packet_life_time);
> > +	fprintf(out, "vl_stall_count : %u\n", opt->vl_stall_count);
> > +	fprintf(out, "leaf_vl_stall_count : %u\n", opt->leaf_vl_stall_count);
> > +	fprintf(out, "head_of_queue_lifetime : %u\n", opt->head_of_queue_lifetime);
> > +	fprintf(out, "leaf_head_of_queue_lifetime : %u\n", opt->leaf_head_of_queue_lifetime);
> > +	fprintf(out, "local_phy_errors_threshold : %u\n", opt->local_phy_errors_threshold);
> > +	fprintf(out, "overrun_errors_threshold : %u\n", opt->overrun_errors_threshold);
> > +	fprintf(out, "sminfo_polling_timeout : %u\n", opt->sminfo_polling_timeout);
> > +	fprintf(out, "polling_retry_number : %u\n", opt->polling_retry_number);
> > +	fprintf(out, "max_msg_fifo_timeout : %u\n", opt->max_msg_fifo_timeout);
> > +	fprintf(out, "force_heavy_sweep : %s\n", 
> > +		BOOLEAN_STR(opt->force_heavy_sweep));
> > +	fprintf(out, "log_flags : 0x%02x\n", opt->log_flags);
> > +	fprintf(out, "dump_files_dir : %s\n", 
> > +		opt->dump_files_dir ? opt->dump_files_dir : NULL_STR);
> > +	fprintf(out, "log_file : %s\n", 
> > +		opt->log_file ? opt->log_file : NULL_STR);
> > +	fprintf(out, "log_max_size : %lu\n", opt->log_max_size);
> > +	fprintf(out, "partition_config_file : %s\n", 
> > +		opt->partition_config_file ? opt->partition_config_file : NULL_STR);
> > +	fprintf(out, "no_partition_enforcement : %s\n", 
> > +		BOOLEAN_STR(opt->no_partition_enforcement));
> > +	fprintf(out, "qos : %s\n", 
> > +		BOOLEAN_STR(opt->qos));
> > +	fprintf(out, "qos_policy_file : %s\n", 
> > +		opt->qos_policy_file ? opt->qos_policy_file : NULL_STR);
> > +	fprintf(out, "accum_log_file: %s\n", 
> > +		BOOLEAN_STR(opt->accum_log_file));
> > +	fprintf(out, "console : %s\n", 
> > +		opt->console ? opt->console : NULL_STR);
> > +	fprintf(out, "console_port : %u\n", opt->console_port);
> > +	fprintf(out, "port_prof_ignore_file : %s\n", 
> > +		opt->port_prof_ignore_file ? opt->port_prof_ignore_file : NULL_STR);
> > +	fprintf(out, "port_profile_switch_nodes : %s\n", 
> > +		BOOLEAN_STR(opt->port_profile_switch_nodes));
> > +	fprintf(out, "sweep_on_trap : %s\n", 
> > +		BOOLEAN_STR(opt->sweep_on_trap));
> > +	fprintf(out, "routing_engine_names : %s\n", 
> > +		opt->routing_engine_names ? opt->routing_engine_names : NULL_STR);
> > +	fprintf(out, "use_ucast_cache : %s\n", 
> > +		BOOLEAN_STR(opt->use_ucast_cache));
> > +	fprintf(out, "connect_roots : %s\n", 
> > +		BOOLEAN_STR(opt->connect_roots));
> > +	fprintf(out, "lid_matrix_dump_file : %s\n", 
> > +		opt->lid_matrix_dump_file ? opt->lid_matrix_dump_file : NULL_STR);
> > +	fprintf(out, "lfts_file : %s\n", 
> > +		opt->lfts_file ? opt->lfts_file : NULL_STR);
> > +	fprintf(out, "root_guid_file : %s\n", 
> > +		opt->root_guid_file ? opt->root_guid_file : NULL_STR);
> > +	fprintf(out, "cn_guid_file : %s\n", 
> > +		opt->cn_guid_file ? opt->cn_guid_file : NULL_STR);
> > +	fprintf(out, "ids_guid_file : %s\n", 
> > +		opt->ids_guid_file ? opt->ids_guid_file : NULL_STR);
> > +	fprintf(out, "guid_routing_order_file : %s\n", 
> > +		opt->guid_routing_order_file ? opt->guid_routing_order_file : NULL_STR);
> > +	fprintf(out, "sa_db_file : %s\n", 
> > +		opt->sa_db_file ? opt->sa_db_file : NULL_STR);
> > +	fprintf(out, "exit_on_fatal : %s\n", 
> > +		BOOLEAN_STR(opt->exit_on_fatal));
> > +	fprintf(out, "honor_guid2lid_file : %s\n", 
> > +		BOOLEAN_STR(opt->honor_guid2lid_file));
> > +	fprintf(out, "daemon : %s\n", 
> > +		BOOLEAN_STR(opt->daemon));
> > +	fprintf(out, "sm_inactive : %s\n", 
> > +		BOOLEAN_STR(opt->sm_inactive));
> > +	fprintf(out, "babbling_port_policy : %s\n", 
> > +		BOOLEAN_STR(opt->babbling_port_policy));
> > +	dump_qos_options(&opt->qos_options, &opt->qos_options, "qos", out);
> > +	dump_qos_options(&opt->qos_ca_options, &opt->qos_options, "qos_ca", out);
> > +	dump_qos_options(&opt->qos_sw0_options, &opt->qos_options, "qos_sw0", out);
> > +	dump_qos_options(&opt->qos_swe_options, &opt->qos_options, "qos_swe", out);
> > +	dump_qos_options(&opt->qos_rtr_options, &opt->qos_options, "qos_rtr", out);
> > +	fprintf(out, "enable_quirks : %s\n", 
> > +		BOOLEAN_STR(opt->enable_quirks));
> > +	fprintf(out, "no_clients_rereg : %s\n", 
> > +		BOOLEAN_STR(opt->no_clients_rereg));
> > +#ifdef ENABLE_OSM_PERF_MGR
> > +	fprintf(out, "perfmgr : %s\n", 
> > +		BOOLEAN_STR(opt->perfmgr));
> > +	fprintf(out, "perfmgr_redir : %s\n", 
> > +		BOOLEAN_STR(opt->perfmgr_redir));
> > +	fprintf(out, "perfmgr_sweep_time_s : %u\n", opt->perfmgr_sweep_time_s);
> > +	fprintf(out, "perfmgr_max_outstanding_queries : %u\n", opt->perfmgr_max_outstanding_queries);
> > +	fprintf(out, "event_db_dump_file : %s\n", 
> > +		opt->event_db_dump_file ? opt->event_db_dump_file : NULL_STR);
> > +#endif
> > +	fprintf(out, "event_plugin_name : %s\n", 
> > +		opt->event_plugin_name ? opt->event_plugin_name : NULL_STR);
> > +	fprintf(out, "node_name_map_name : %s\n", 
> > +		opt->node_name_map_name ? opt->node_name_map_name : NULL_STR);
> > +	fprintf(out, "prefix_routes_file : %s\n", 
> > +		opt->prefix_routes_file ? opt->prefix_routes_file : NULL_STR);
> > +	fprintf(out, "consolidate_ipv6_snm_req : %s\n", 
> > +		BOOLEAN_STR(opt->consolidate_ipv6_snm_req));
> > +}
> > +
> >  static void quit_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> >  {
> >  	osm_console_exit(&p_osm->console, &p_osm->log);
> > @@ -1166,6 +1323,7 @@ static const struct command console_cmds[] = {
> >  	{"portstatus", &help_portstatus, &portstatus_parse},
> >  	{"switchbalance", &help_switchbalance, &switchbalance_parse},
> >  	{"lidbalance", &help_lidbalance, &lidbalance_parse},
> > +	{"dump_conf", &help_dump_conf, &dump_conf_parse},
> >  	{"version", &help_version, &version_parse},
> >  #ifdef ENABLE_OSM_PERF_MGR
> >  	{"perfmgr", &help_perfmgr, &perfmgr_parse},
> > -- 
> > 1.5.4.5
> > 
> 
-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory


From meier3 at llnl.gov  Mon Nov 10 10:26:17 2008
From: meier3 at llnl.gov (Timothy A. Meier)
Date: Mon, 10 Nov 2008 10:26:17 -0800
Subject: [ofa-general] [PATCH] opensm: osm_opensm.c added a method to remove
	plugins
Message-ID: <49187CC9.6010600@llnl.gov>

Sasha,

During development, I am constantly bringing the SM up and down, so this helps make sure things
shut down gracefully.

Should have no impact, if people are not using plugins... yet.

>From e0434e676d0b3dd63a323218d207f029da9e27a4 Mon Sep 17 00:00:00 2001
From: Tim Meier <meier3 at llnl.gov>
Date: Mon, 10 Nov 2008 09:48:55 -0800
Subject: [PATCH] opensm:  osm_opensm.c added a method to remove plugins

Upon shutdown, iterates through the plugins and releases
resources and removes them via their destroy() method.

Signed-off-by: Tim Meier <meier3 at llnl.gov>
---
 opensm/opensm/osm_opensm.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c
index 7deea6d..7286782 100644
--- a/opensm/opensm/osm_opensm.c
+++ b/opensm/opensm/osm_opensm.c
@@ -238,6 +238,19 @@ static void destroy_routing_engines(osm_opensm_t *osm)
        }
 }

+/**********************************************************************
+ **********************************************************************/
+static void destroy_plugins(osm_opensm_t *osm)
+{
+       osm_epi_plugin_t *p;
+       // remove from the list, and destroy it
+       while (!cl_is_qlist_empty(&osm->plugin_list)){
+               p = (osm_epi_plugin_t *)cl_qlist_remove_head(&osm->plugin_list);
+               // plugin is responsible for freeing its own resources
+               osm_epi_destroy(p);
+       }
+}
+
 void osm_opensm_destroy(IN osm_opensm_t * const p_osm)
 {
        /* in case of shutdown through exit proc - no ^C */
@@ -275,6 +288,7 @@ void osm_opensm_destroy(IN osm_opensm_t * const p_osm)
        osm_sa_db_file_dump(p_osm);

        /* do the destruction in reverse order as init */
+       destroy_plugins(p_osm);
        destroy_routing_engines(p_osm);
        osm_sa_destroy(&p_osm->sa);
        osm_sm_destroy(&p_osm->sm);
--
1.5.4.5
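
For readers without the complib sources at hand, the drain-and-destroy
pattern used by destroy_plugins() can be sketched with a plain singly
linked list. This is an assumption-laden illustration: struct demo_plugin,
struct demo_plugin_list and the demo_* functions are hypothetical stand-ins,
not the cl_qlist or osm_epi_* interfaces used in the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical plugin record and list; OpenSM uses cl_qlist instead. */
struct demo_plugin {
	struct demo_plugin *next;
	void (*destroy)(void *data);	/* plugin-supplied teardown hook */
	void *data;
};

struct demo_plugin_list {
	struct demo_plugin *head;
};

/* Push a plugin on the front of the list; returns 0 on success. */
static int demo_plugin_add(struct demo_plugin_list *l,
			   void (*destroy)(void *), void *data)
{
	struct demo_plugin *p = malloc(sizeof(*p));

	if (!p)
		return -1;
	p->destroy = destroy;
	p->data = data;
	p->next = l->head;
	l->head = p;
	return 0;
}

/*
 * Drain the list: unlink the head first, then let the plugin release its
 * own resources, then free the list node. Returns how many were destroyed.
 */
static unsigned demo_destroy_plugins(struct demo_plugin_list *l)
{
	unsigned n = 0;

	assert(l != NULL);
	while (l->head) {
		struct demo_plugin *p = l->head;

		l->head = p->next;
		if (p->destroy)
			p->destroy(p->data);
		free(p);
		n++;
	}
	return n;
}
```

Unlinking each entry before calling its destroy hook means the list never
holds a pointer to a plugin that is already being torn down.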

-- 
Timothy A. Meier
Computer Scientist
ICCD/High Performance Computing
925.422.3341
meier3 at llnl.gov

From sashak at voltaire.com  Mon Nov 10 11:11:08 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 10 Nov 2008 21:11:08 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_multicast.c: bug with
	joining/leaving mcast group
In-Reply-To: <49184706.9070103@dev.mellanox.co.il>
References: <49184706.9070103@dev.mellanox.co.il>
Message-ID: <20081110191108.GD313@sashak.voltaire.com>

Hi Yevgeny,

On 16:36 Mon 10 Nov     , Yevgeny Kliteynik wrote:
> 
> I think there's a bug in the osm_mgrp_add/remove_port functions.
> If some mcast group member has JoinState 0x1 (full member),
> and then a new join from the same port is received with JoinState
> 0x2 (non-member), OpenSM will reduce the number of full members
> of this group, which eventually might cause group deletion.

Right, isn't this how things should work? When a full member updates its
state to non-member, the number of full members is reduced, and when the
last full member leaves, the MC group is deleted (o15-0.2-1.9).

Sasha

> A similar problem (only in the logically opposite direction) happens
> when a port tries to partially leave a mcast group.
> 
> This patch should fix it.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  opensm/opensm/osm_multicast.c |   33 +++++++++++----------------------
>  1 files changed, 11 insertions(+), 22 deletions(-)
> 
> diff --git a/opensm/opensm/osm_multicast.c b/opensm/opensm/osm_multicast.c
> index d62d585..350fd22 100644
> --- a/opensm/opensm/osm_multicast.c
> +++ b/opensm/opensm/osm_multicast.c
> @@ -172,17 +172,11 @@ osm_mcm_port_t *osm_mgrp_add_port(IN osm_subn_t *subn, osm_log_t *log,
>  		p_mgrp->last_change_id++;
>  	}
> 
> -	if ((join_state ^ prev_join_state) & IB_JOIN_STATE_FULL) {
> -		if (join_state & IB_JOIN_STATE_FULL) {
> -			if (++p_mgrp->full_members == 1) {
> -				mgrp_send_notice(subn, log, p_mgrp, 66);
> -				p_mgrp->to_be_deleted = 0;
> -			}
> -		} else if (--p_mgrp->full_members == 0) {
> -			mgrp_send_notice(subn, log, p_mgrp, 67);
> -			if (!p_mgrp->well_known)
> -				p_mgrp->to_be_deleted = 1;
> -		}
> +	if ((join_state & IB_JOIN_STATE_FULL) &&
> +	    !(prev_join_state & IB_JOIN_STATE_FULL) &&
> +	    (++p_mgrp->full_members == 1)) {
> +		mgrp_send_notice(subn, log, p_mgrp, 66);
> +		p_mgrp->to_be_deleted = 0;
>  	}
> 
>  	return (p_mcm_port);
> @@ -224,17 +218,12 @@ int osm_mgrp_remove_port(osm_subn_t *subn, osm_log_t *log, osm_mgrp_t *mgrp,
> 
>  	/* no more full members so the group will be deleted after re-route
>  	   but only if it is not a well known group */
> -	if ((port_join_state ^ new_join_state) & IB_JOIN_STATE_FULL) {
> -		if (port_join_state & IB_JOIN_STATE_FULL) {
> -			if (--mgrp->full_members == 0) {
> -				mgrp_send_notice(subn, log, mgrp, 67);
> -				if (!mgrp->well_known)
> -					mgrp->to_be_deleted = 1;
> -			}
> -		} else if (++mgrp->full_members == 1) {
> -			mgrp_send_notice(subn, log, mgrp, 66);
> -			mgrp->to_be_deleted = 0;
> -		}
> +	if ((port_join_state & IB_JOIN_STATE_FULL) &&
> +	    !(new_join_state & IB_JOIN_STATE_FULL) &&
> +	    (--mgrp->full_members == 0)) {
> +		mgrp_send_notice(subn, log, mgrp, 67);
> +		if (!mgrp->well_known)
> +			mgrp->to_be_deleted = 1;
>  	}
> 
>  	return ret;
> -- 
> 1.5.1.4
> 


From sashak at voltaire.com  Mon Nov 10 11:11:40 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 10 Nov 2008 21:11:40 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_pkey.c: cosmetics in some log
	message
In-Reply-To: <491842E5.6040203@dev.mellanox.co.il>
References: <491842E5.6040203@dev.mellanox.co.il>
Message-ID: <20081110191140.GE313@sashak.voltaire.com>

On 16:19 Mon 10 Nov     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> Just some cosmetics in a log message.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Mon Nov 10 11:12:01 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 10 Nov 2008 21:12:01 +0200
Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: rename
	IB_MC_REC_STATE_SEND_ONLY_MEMBER
In-Reply-To: <49184445.10007@dev.mellanox.co.il>
References: <49184445.10007@dev.mellanox.co.il>
Message-ID: <20081110191201.GF313@sashak.voltaire.com>

On 16:25 Mon 10 Nov     , Yevgeny Kliteynik wrote:
> Sasha,
> 
> The multicast Send Only bit is defined in spec as "SendOnlyNonMember",
> to denote that the port is not considered a member for purposes of group
> creation/deletion.
> 
> Renaming IB_MC_REC_STATE_SEND_ONLY_MEMBER to IB_MC_REC_STATE_SEND_ONLY_NON_MEMBER.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From kliteyn at dev.mellanox.co.il  Mon Nov 10 11:18:19 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 10 Nov 2008 21:18:19 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_multicast.c: bug with
 joining/leaving mcast group
In-Reply-To: <20081110191108.GD313@sashak.voltaire.com>
References: <49184706.9070103@dev.mellanox.co.il>
	<20081110191108.GD313@sashak.voltaire.com>
Message-ID: <491888FB.5020107@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 16:36 Mon 10 Nov     , Yevgeny Kliteynik wrote:
>> I think there's a bug in the osm_mgrp_add/remove_port functions.
>> If some mcast group member has JoinState 0x1 (full member),
>> and then a new join from the same port is received with JoinState
>> 0x2 (non-member), OpenSM will reduce the number of full members
>> of this group, which eventually might cause group deletion.
> 
> Right, isn't this how things should work? When a full member updates its
> state to non-member, the number of full members is reduced, and when the
> last full member leaves, the MC group is deleted (o15-0.2-1.9).

I thought so too, but it turns out that it's wrong:

o15-0.1.11: If SA supports UD multicast, then if an endport joins a
multicast group as specified in o15-0.1.10:, SA shall replace the
endport’s current MCMemberRecord:JoinState component with the logical
OR of the MCMemberRecord:JoinState component with the endport’s current
MCMemberRecord:JoinState component if the endport had joined this
multicast group before.

So the full member doesn't update its state to non-member, but rather
adds an additional bit (the non-member bit) to the JoinState.
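
The OR rule quoted above, together with the partial-leave case from the
patch description, can be sketched as follows. This is a minimal
illustration and not the OpenSM code: JOIN_STATE_FULL, JOIN_STATE_NON,
struct group, join_merge() and leave_merge() are hypothetical names chosen
for this sketch, and the partial-leave handling is an assumption that
mirrors the intent of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for this sketch; not the OpenSM definitions. */
#define JOIN_STATE_FULL 0x1u
#define JOIN_STATE_NON  0x2u

struct group {
	unsigned full_members;	/* ports whose JoinState has the full bit set */
};

/*
 * Merge a join request into a port's stored JoinState by logical OR and
 * bump the full-member count only when the full bit goes from 0 to 1.
 * Returns the new stored JoinState.
 */
static uint8_t join_merge(struct group *g, uint8_t prev, uint8_t requested)
{
	uint8_t merged = prev | requested;

	if ((merged & JOIN_STATE_FULL) && !(prev & JOIN_STATE_FULL))
		g->full_members++;
	return merged;
}

/*
 * Clear the requested bits on a (possibly partial) leave and drop the
 * full-member count only when the full bit actually goes from 1 to 0.
 */
static uint8_t leave_merge(struct group *g, uint8_t prev, uint8_t requested)
{
	uint8_t remaining = prev & (uint8_t)~requested;

	if ((prev & JOIN_STATE_FULL) && !(remaining & JOIN_STATE_FULL)) {
		assert(g->full_members > 0);
		g->full_members--;
	}
	return remaining;
}
```

In this sketch a port that joined as a full member (0x1) and then joins
again as non-member (0x2) ends up with JoinState 0x3, and the full-member
count stays unchanged.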

-- Yevgeny

> Sasha
> 
>> A similar problem (only in the logically opposite direction) happens
>> when a port tries to partially leave a mcast group.
>>
>> This patch should fix it.
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
>>  opensm/opensm/osm_multicast.c |   33 +++++++++++----------------------
>>  1 files changed, 11 insertions(+), 22 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_multicast.c b/opensm/opensm/osm_multicast.c
>> index d62d585..350fd22 100644
>> --- a/opensm/opensm/osm_multicast.c
>> +++ b/opensm/opensm/osm_multicast.c
>> @@ -172,17 +172,11 @@ osm_mcm_port_t *osm_mgrp_add_port(IN osm_subn_t *subn, osm_log_t *log,
>>  		p_mgrp->last_change_id++;
>>  	}
>>
>> -	if ((join_state ^ prev_join_state) & IB_JOIN_STATE_FULL) {
>> -		if (join_state & IB_JOIN_STATE_FULL) {
>> -			if (++p_mgrp->full_members == 1) {
>> -				mgrp_send_notice(subn, log, p_mgrp, 66);
>> -				p_mgrp->to_be_deleted = 0;
>> -			}
>> -		} else if (--p_mgrp->full_members == 0) {
>> -			mgrp_send_notice(subn, log, p_mgrp, 67);
>> -			if (!p_mgrp->well_known)
>> -				p_mgrp->to_be_deleted = 1;
>> -		}
>> +	if ((join_state & IB_JOIN_STATE_FULL) &&
>> +	    !(prev_join_state & IB_JOIN_STATE_FULL) &&
>> +	    (++p_mgrp->full_members == 1)) {
>> +		mgrp_send_notice(subn, log, p_mgrp, 66);
>> +		p_mgrp->to_be_deleted = 0;
>>  	}
>>
>>  	return (p_mcm_port);
>> @@ -224,17 +218,12 @@ int osm_mgrp_remove_port(osm_subn_t *subn, osm_log_t *log, osm_mgrp_t *mgrp,
>>
>>  	/* no more full members so the group will be deleted after re-route
>>  	   but only if it is not a well known group */
>> -	if ((port_join_state ^ new_join_state) & IB_JOIN_STATE_FULL) {
>> -		if (port_join_state & IB_JOIN_STATE_FULL) {
>> -			if (--mgrp->full_members == 0) {
>> -				mgrp_send_notice(subn, log, mgrp, 67);
>> -				if (!mgrp->well_known)
>> -					mgrp->to_be_deleted = 1;
>> -			}
>> -		} else if (++mgrp->full_members == 1) {
>> -			mgrp_send_notice(subn, log, mgrp, 66);
>> -			mgrp->to_be_deleted = 0;
>> -		}
>> +	if ((port_join_state & IB_JOIN_STATE_FULL) &&
>> +	    !(new_join_state & IB_JOIN_STATE_FULL) &&
>> +	    (--mgrp->full_members == 0)) {
>> +		mgrp_send_notice(subn, log, mgrp, 67);
>> +		if (!mgrp->well_known)
>> +			mgrp->to_be_deleted = 1;
>>  	}
>>
>>  	return ret;
>> -- 
>> 1.5.1.4
>>
> 


From sashak at voltaire.com  Mon Nov 10 11:20:26 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 10 Nov 2008 21:20:26 +0200
Subject: [ofa-general] Re: [PATCH] opensm: osm_opensm.c added a method to
	remove plugins
In-Reply-To: <49187CC9.6010600@llnl.gov>
References: <49187CC9.6010600@llnl.gov>
Message-ID: <20081110192026.GH313@sashak.voltaire.com>

On 10:26 Mon 10 Nov     , Timothy A. Meier wrote:
> Sasha,
> 
> During development, I am constantly bringing the SM up and down, so this helps make sure things
> shut down gracefully.
> 
> Should have no impact, if people are not using plugins... yet.
> 
> From e0434e676d0b3dd63a323218d207f029da9e27a4 Mon Sep 17 00:00:00 2001
> From: Tim Meier <meier3 at llnl.gov>
> Date: Mon, 10 Nov 2008 09:48:55 -0800
> Subject: [PATCH] opensm:  osm_opensm.c added a method to remove plugins
> 
> Upon shutdown, iterates through the plugins and releases
> resources and removes them via their destroy() method.
> 
> Signed-off-by: Tim Meier <meier3 at llnl.gov>

Applied. Thanks.

Sasha


From boris at mellanox.com  Mon Nov 10 11:24:36 2008
From: boris at mellanox.com (Boris Shpolyansky)
Date: Mon, 10 Nov 2008 11:24:36 -0800
Subject: [ofa-general] ib_mthca catastrophic error detected
References: <4906645D.6010101@ucla.edu>
	<4907054E.9080205@mellanox.co.il><490763D0.5020002@ucla.edu><200811061154.02260.jackm@dev.mellanox.co.il>
	<491338D1.8050205@ucla.edu>
Message-ID: <1E3DCD1C63492545881FACB6063A57C1030CF354@mtiexch01.mti.com>


Scott,

Do you use any form of Boot-over-IB in this cluster?
If so - what version/flavor of it?

Thanks,
Boris Shpolyansky
Sr. Member of Technical Staff
Applications
Mellanox Technologies Inc.
2900 Stender Way
Santa Clara, CA 95054
Tel.: (408) 916 0014
Fax: (408) 970 3403
Cell: (408) 834 9365
www.mellanox.com

-----Original Message-----
From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Scott A.
Friedman
Sent: Thursday, November 06, 2008 10:35 AM
To: Jack Morgenstein
Cc: Matthew Finlay; general at lists.openfabrics.org
Subject: Re: [ofa-general] ib_mthca catastrophic error detected

Hi

We have been working with Matthew Finlay <Matt at mellanox.com> on this
recently, so you/we might pull all of this together. We are able to make
any of our SDR cards hit a catastrophic error, and are unable to do
the same with our DDR cards. Matt has suggested that there is possibly a
firmware fix?

Anyway, to answer your questions:

The hosts are Sun X2200M, but we have swapped a few around with some 
hosts we have from Aspen systems and the problem remains. I suppose the 
similarity is that they are all nForce based.

The MPI used was the latest OpenMPI - I will find the version, but I do 
not think it matters whether we are using OpenMPI or MVAPICH.

The job itself does not seem to matter either. The situation is that after
a node comes up, it takes a very long time for the card to become ACTIVE.
It seems to oscillate between ACTIVE and INIT. We have sometimes waited
several minutes but can never be sure when it will settle down. The
queue certainly doesn't know, and a job submitted to such a node will die
because the cards will have a catastrophic error.

Scott


 > Console output from the following linux commands:
 >   cat /etc/*rel*


Not a good idea...maybe this

#cat /etc/redhat-release
CentOS release 5 (Final)

 >   cat /etc/lilo.conf , or:  cat /boot/grub/menu.lst (if you are using grub)

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /boot/, eg.
#          root (hd0,0)
#          kernel /vmlinuz-version ro root=/dev/hda3
#          initrd /initrd-version.img
#boot=/dev/hda
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title CentOS (2.6.18-92.1.6.el5)
  root (hd0,0)
  kernel /vmlinuz-2.6.18-92.1.6.el5 ro root=LABEL=/ rhgb quiet
  initrd /initrd-2.6.18-92.1.6.el5.img


 >   uname -a

Linux n141 2.6.18-92.1.6.el5 #1 SMP Wed Jun 25 13:45:47 EDT 2008 x86_64 
x86_64 x86_64 GNU/Linux


 >   cat /proc/cpuinfo
 >   cat /proc/meminfo

processor : 0
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 0
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4424.75
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 1
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 1
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4426.22
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 2
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 2
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4421.37
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 3
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 0
siblings : 4
core id  : 3
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4421.65
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 4
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 0
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.36
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 5
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 1
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.71
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 6
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 2
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.17
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

processor : 7
vendor_id : AuthenticAMD
cpu family   : 16
model  : 2
model name   : Quad-Core AMD Opteron(tm) Processor 2354
stepping : 3
cpu MHz  : 2200.000
cache size   : 512 KB
physical id  : 1
siblings : 4
core id  : 3
cpu cores : 4
fpu  : yes
fpu_exception : yes
cpuid level  : 5
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
3dnowprefetch osvw
bogomips : 4422.17
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]


MemTotal:      8182568 kB
MemFree:       4535892 kB
Buffers:        318232 kB
Cached:        1583772 kB
SwapCached:          0 kB
Active:        2714400 kB
Inactive:       730260 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      8182568 kB
LowFree:       4535892 kB
SwapTotal:     8289532 kB
SwapFree:      8289380 kB
Dirty:             340 kB
Writeback:           0 kB
AnonPages:     1542636 kB
Mapped:          14588 kB
Slab:           139788 kB
PageTables:       7208 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  12380816 kB
Committed_AS:  1679420 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      4600 kB
VmallocChunk: 34359733707 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB


Jack Morgenstein wrote:
> On Tuesday 28 October 2008 21:11, Scott A. Friedman wrote:
>> Hi
>>
>> This cluster has OFED 1.2.5.4 running on it. The ib_mthca kernel module
>> reports the following on startup:
>>
>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
>>
>> The cards in all (22) of the nodes we have seen this error on are as 
>> follows:
>>
>> hca_id: mthca0
>>          fw_ver:                         1.2.0
>>          vendor_id:                      0x02c9
>>          vendor_part_id:                 25204
>>          hw_ver:                         0xA0
>>          board_id:                       MT_03B0140001
>>          phys_port_cnt:                  1
>>
>> It appears that when this happens the driver restarts (loads?) itself
>> however the job running at the time of the error is, of course, killed.
>>
>> Scott
> 
> Scott,
> We are trying to reproduce this here.  It would help if you could supply
> the following info:
> 
> Host model for hosts which are experiencing the failure:
>  
> Console output from the following linux commands:
>   cat /etc/*rel*
>   cat /etc/lilo.conf , or:  cat /boot/grub/menu.lst (if you are using grub)
>   uname -a
>   cat /proc/cpuinfo
>   cat /proc/meminfo
> 
> Also, what sort of job was running when the failure occurred:
> -- which MPI are you using?
> -- do you have a test example which we can run here to reproduce the problem?
> 
> Thanks in advance for your help!
> 
> Jack Morgenstein
> Senior Software Development Engineer
> Mellanox
_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general


From hal.rosenstock at gmail.com  Mon Nov 10 11:42:12 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 10 Nov 2008 14:42:12 -0500
Subject: Re: [ofa-general] Re: [PATCH] opensm/osm_multicast.c: bug
	with joining/leaving mcast group
In-Reply-To: <491888FB.5020107@dev.mellanox.co.il>
References: <49184706.9070103@dev.mellanox.co.il>
	<20081110191108.GD313@sashak.voltaire.com>
	<491888FB.5020107@dev.mellanox.co.il>
Message-ID: <f0e08f230811101142w421cf489p71fb52264b81d585@mail.gmail.com>

On Mon, Nov 10, 2008 at 2:18 PM, Yevgeny Kliteynik
<kliteyn at dev.mellanox.co.il> wrote:
> Hi Sasha,
>
> Sasha Khapyorsky wrote:
>>
>> Hi Yevgeny,
>>
>> On 16:36 Mon 10 Nov     , Yevgeny Kliteynik wrote:
>>>
>>> I think there's a bug in the osm_mgrp_add/remove_port functions.
>>> If some mcast group member has JoinState 0x1 (full member),
>>> and then new join from the same port received with JoinState
>>> 0x2 (non member), OpenSM will reduce number of full members
>>> of this group, which eventually might cause group deletion.
>>
>> Right, isn't this how things should work? When a full member updates its
>> state to non member, the number of full members is reduced, and when the
>> last full member leaves, the MC group is deleted (o15-0.2-1.9).
>
> I thought so too,

It's true; what you are seeing is the addition of send only non member
(to full member) and not eliminating full member.

> but turns out that it's wrong:
>
> o15-0.1.11: If SA supports UD multicast, then if an endport joins a
> multicast group as specified in o15-0.1.10:, SA shall replace the
> endport's current MCMemberRecord:JoinState component with the logical
> OR of the MCMemberRecord:JoinState component with the endport's current
> MCMemberRecord:JoinState component if the endport had joined this
> multicast group before.
>
> So the full member doesn't update its state to non-member, but rather
> adds additional bit to the JoinState (the non-member).

Right, a port can simultaneously be full member, non member, and send
only non member.
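
As a rough illustration of the o15-0.1.11 rule quoted above, here is a minimal
C sketch of a join that ORs the requested JoinState bits into the port's
current state and counts a full member only when the full-member bit goes
from clear to set. The struct and function names (demo_mgrp, demo_join) are
hypothetical and are not the OpenSM data structures; the 0x4 value for the
send only non member bit is taken from the IBA MCMemberRecord layout and is
not stated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* JoinState bits: 0x1 and 0x2 are named in the thread; 0x4 is assumed
 * from the IBA MCMemberRecord definition. */
#define JOIN_STATE_FULL          0x1u
#define JOIN_STATE_NON_MEMBER    0x2u
#define JOIN_STATE_SEND_ONLY_NON 0x4u
#define JOIN_STATE_MASK          0x7u

/* Hypothetical, simplified group bookkeeping (not osm_mgrp_t). */
struct demo_mgrp {
	unsigned full_members;
};

/* Apply a join request: the new state is the logical OR of the current
 * and requested bits. The group's full-member count grows only when the
 * full-member bit was clear before and is set afterwards.
 * Returns the port's new JoinState. */
static uint8_t demo_join(struct demo_mgrp *grp, uint8_t cur, uint8_t req)
{
	uint8_t next;

	assert(grp != NULL);
	next = (uint8_t)((cur | req) & JOIN_STATE_MASK);
	if ((next & JOIN_STATE_FULL) && !(cur & JOIN_STATE_FULL))
		grp->full_members++;
	return next;
}
```

With this shape, a port that is already a full member (0x1) and joins again
as non member (0x2) ends up with JoinState 0x3, and the group's full-member
count is left untouched.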

-- Hal

>
> -- Yevgeny
>
>> Sasha
>>
>>> Similar problem (only in logically opposite direction) happens
>>> when port tries to partially leave mcast group.
>>>
>>> This patch should fix it.
>>>
>>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>>> ---
>>>  opensm/opensm/osm_multicast.c |   33 +++++++++++----------------------
>>>  1 files changed, 11 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/opensm/opensm/osm_multicast.c
>>> b/opensm/opensm/osm_multicast.c
>>> index d62d585..350fd22 100644
>>> --- a/opensm/opensm/osm_multicast.c
>>> +++ b/opensm/opensm/osm_multicast.c
>>> @@ -172,17 +172,11 @@ osm_mcm_port_t *osm_mgrp_add_port(IN osm_subn_t
>>> *subn, osm_log_t *log,
>>>                p_mgrp->last_change_id++;
>>>        }
>>>
>>> -       if ((join_state ^ prev_join_state) & IB_JOIN_STATE_FULL) {
>>> -               if (join_state & IB_JOIN_STATE_FULL) {
>>> -                       if (++p_mgrp->full_members == 1) {
>>> -                               mgrp_send_notice(subn, log, p_mgrp, 66);
>>> -                               p_mgrp->to_be_deleted = 0;
>>> -                       }
>>> -               } else if (--p_mgrp->full_members == 0) {
>>> -                       mgrp_send_notice(subn, log, p_mgrp, 67);
>>> -                       if (!p_mgrp->well_known)
>>> -                               p_mgrp->to_be_deleted = 1;
>>> -               }
>>> +       if ((join_state & IB_JOIN_STATE_FULL) &&
>>> +           !(prev_join_state & IB_JOIN_STATE_FULL) &&
>>> +           (++p_mgrp->full_members == 1)) {
>>> +               mgrp_send_notice(subn, log, p_mgrp, 66);
>>> +               p_mgrp->to_be_deleted = 0;
>>>        }
>>>
>>>        return (p_mcm_port);
>>> @@ -224,17 +218,12 @@ int osm_mgrp_remove_port(osm_subn_t *subn,
>>> osm_log_t *log, osm_mgrp_t *mgrp,
>>>
>>>        /* no more full members so the group will be deleted after
>>> re-route
>>>           but only if it is not a well known group */
>>> -       if ((port_join_state ^ new_join_state) & IB_JOIN_STATE_FULL) {
>>> -               if (port_join_state & IB_JOIN_STATE_FULL) {
>>> -                       if (--mgrp->full_members == 0) {
>>> -                               mgrp_send_notice(subn, log, mgrp, 67);
>>> -                               if (!mgrp->well_known)
>>> -                                       mgrp->to_be_deleted = 1;
>>> -                       }
>>> -               } else if (++mgrp->full_members == 1) {
>>> -                       mgrp_send_notice(subn, log, mgrp, 66);
>>> -                       mgrp->to_be_deleted = 0;
>>> -               }
>>> +       if ((port_join_state & IB_JOIN_STATE_FULL) &&
>>> +           !(new_join_state & IB_JOIN_STATE_FULL) &&
>>> +           (--mgrp->full_members == 0)) {
>>> +               mgrp_send_notice(subn, log, mgrp, 67);
>>> +               if (!mgrp->well_known)
>>> +                       mgrp->to_be_deleted = 1;
>>>        }
>>>
>>>        return ret;
>>> --
>>> 1.5.1.4
>>>
>>
>


From sashak at voltaire.com  Mon Nov 10 11:43:34 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 10 Nov 2008 21:43:34 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_multicast.c: bug with
	joining/leaving mcast group
In-Reply-To: <491888FB.5020107@dev.mellanox.co.il>
References: <49184706.9070103@dev.mellanox.co.il>
	<20081110191108.GD313@sashak.voltaire.com>
	<491888FB.5020107@dev.mellanox.co.il>
Message-ID: <20081110194334.GJ313@sashak.voltaire.com>

On 21:18 Mon 10 Nov     , Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> Sasha Khapyorsky wrote:
>> Hi Yevgeny,
>> On 16:36 Mon 10 Nov     , Yevgeny Kliteynik wrote:
>>> I think there's a bug in the osm_mgrp_add/remove_port functions.
>>> If some mcast group member has JoinState 0x1 (full member),
>>> and then new join from the same port received with JoinState
>>> 0x2 (non member), OpenSM will reduce number of full members
>>> of this group, which eventually might cause group deletion.
>> Right, isn't this how things should work? When a full member updates its
>> state to non member, the number of full members is reduced, and when the
>> last full member leaves, the MC group is deleted (o15-0.2-1.9).
>
> I thought so too, but turns out that it's wrong:
>
> o15-0.1.11: If SA supports UD multicast, then if an endport joins a
> multicast group as specified in o15-0.1.10:, SA shall replace the
> endport's current MCMemberRecord:JoinState component with the logical
> OR of the MCMemberRecord:JoinState component with the endport's current
> MCMemberRecord:JoinState component if the endport had joined this
> multicast group before.
>
> So the full member doesn't update its state to non-member, but rather
> adds additional bit to the JoinState (the non-member).

Ok. I see now.

Applied. Thanks.
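
For the opposite direction (a partial leave), a minimal C sketch of the same
idea follows. It assumes that a leave clears the requested bits from the
port's current JoinState; that assumption, and the names demo_leave_grp and
demo_leave, are illustrative only and are not taken from the OpenSM sources.
The full-member count drops only when the full-member bit goes from set to
clear, and a group that is not well known is marked for deletion when the
count reaches zero, which mirrors the condition in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JOIN_STATE_FULL 0x1u

/* Hypothetical, simplified group bookkeeping (not osm_mgrp_t). */
struct demo_leave_grp {
	unsigned full_members;
	int well_known;
	int to_be_deleted;
};

/* Apply a (possibly partial) leave: clear the requested bits from the
 * current state. Only a set -> clear transition of the full-member bit
 * decrements the count. The caller must ensure full_members > 0 whenever
 * cur has the full-member bit set. Returns the port's new JoinState. */
static uint8_t demo_leave(struct demo_leave_grp *grp, uint8_t cur, uint8_t req)
{
	uint8_t next;

	assert(grp != NULL);
	next = (uint8_t)(cur & ~req);
	if ((cur & JOIN_STATE_FULL) && !(next & JOIN_STATE_FULL) &&
	    --grp->full_members == 0 && !grp->well_known)
		grp->to_be_deleted = 1;
	return next;
}
```

A port with JoinState 0x3 that leaves only as non member (0x2) keeps its
full-member bit, so the group is not scheduled for deletion until that bit
is cleared as well.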

Sasha


From friedman at ucla.edu  Mon Nov 10 11:45:07 2008
From: friedman at ucla.edu (Scott A. Friedman)
Date: Mon, 10 Nov 2008 11:45:07 -0800
Subject: [ofa-general] ib_mthca catastrophic error detected
In-Reply-To: <1E3DCD1C63492545881FACB6063A57C1030CF354@mtiexch01.mti.com>
References: <4906645D.6010101@ucla.edu>
	<4907054E.9080205@mellanox.co.il><490763D0.5020002@ucla.edu><200811061154.02260.jackm@dev.mellanox.co.il>
	<491338D1.8050205@ucla.edu>
	<1E3DCD1C63492545881FACB6063A57C1030CF354@mtiexch01.mti.com>
Message-ID: <49188F43.3050907@ucla.edu>

Hi

No, no boot over IB - in fact there is no IPoIB configured on this 
cluster at all.

The firmware Matt sent seems to have fixed the problem as we have been 
unable to reproduce since we flashed some test nodes. We are in the 
process of flashing the remaining 100 or so nodes that have SDR cards as 
jobs finish.

Scott

Boris Shpolyansky wrote:
> Scott,
> 
> Do you use any form of Boot-over-IB in this cluster?
> If so - what version/flavor of it?
> 
> Thanks,
> Boris Shpolyansky
> Sr. Member of Technical Staff
> Applications
> Mellanox Technologies Inc.
> 2900 Stender Way
> Santa Clara, CA 95054
> Tel.: (408) 916 0014
> Fax: (408) 970 3403
> Cell: (408) 834 9365
> www.mellanox.com
> 
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Scott A.
> Friedman
> Sent: Thursday, November 06, 2008 10:35 AM
> To: Jack Morgenstein
> Cc: Matthew Finlay; general at lists.openfabrics.org
> Subject: Re: [ofa-general] ib_mthca catastrophic error detected
> 
> Hi
> 
> We have been working with Matthew Finlay <Matt at mellanox.com> on this 
> recently - you/we might pull all of this together. We are able to make 
> any of our sdr cards have a catastrophic error - and are unable to do 
> the same with our ddr cards. Matt has suggested that there is a firmware
> fix possibly?
> 
> Anyway, to answer your questions:
> 
> The hosts are Sun X2200M, but we have swapped a few around with some 
> hosts we have from Aspen systems and the problem remains. I suppose the 
> similarity is that they are all nForce based.
> 
> The MPI used was the latest OpenMPI - I will find the version, but I do 
> not think it matters whether we are using OpenMPI or MVAPICH.
> 
> The job itself does not seem to matter either. The situation is that after a
> node comes up, it takes a very long time for the card to become ACTIVE.
> It seems to oscillate between ACTIVE and INIT. We have waited several
> minutes sometimes but can never be sure of when it will settle down. The
> queue certainly doesn't know, and a job submitted to such a node will die
> as the cards will have a catastrophic error.
> 
> Scott
> 
> 
>  > Console output from the following linux commands:
>  >   cat /etc/*rel*
> 
> 
> Not a good idea...maybe this
> 
> #cat /etc/redhat-release
> CentOS release 5 (Final)
> 
>  >   cat /etc/lilo.conf , or:  cat /boot/grub/menu.lst (if you are using grub)
> 
> # grub.conf generated by anaconda
> #
> # Note that you do not have to rerun grub after making changes to this file
> # NOTICE:  You have a /boot partition.  This means that
> #          all kernel and initrd paths are relative to /boot/, eg.
> #          root (hd0,0)
> #          kernel /vmlinuz-version ro root=/dev/hda3
> #          initrd /initrd-version.img
> #boot=/dev/hda
> default=0
> timeout=5
> splashimage=(hd0,0)/grub/splash.xpm.gz
> hiddenmenu
> title CentOS (2.6.18-92.1.6.el5)
>   root (hd0,0)
>   kernel /vmlinuz-2.6.18-92.1.6.el5 ro root=LABEL=/ rhgb quiet
>   initrd /initrd-2.6.18-92.1.6.el5.img
> 
> 
>  >   uname -a
> 
> Linux n141 2.6.18-92.1.6.el5 #1 SMP Wed Jun 25 13:45:47 EDT 2008 x86_64 
> x86_64 x86_64 GNU/Linux
> 
> 
>  >   cat /proc/cpuinfo
>  >   cat /proc/meminfo
> 
> processor : 0
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 0
> siblings : 4
> core id  : 0
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4424.75
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 1
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 0
> siblings : 4
> core id  : 1
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4426.22
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 2
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 0
> siblings : 4
> core id  : 2
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4421.37
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 3
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 0
> siblings : 4
> core id  : 3
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4421.65
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 4
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 1
> siblings : 4
> core id  : 0
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4422.36
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 5
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 1
> siblings : 4
> core id  : 1
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4422.71
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 6
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 1
> siblings : 4
> core id  : 2
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4422.17
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 7
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 1
> siblings : 4
> core id  : 3
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4422.17
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> 
> 
> 
> MemTotal:      8182568 kB
> MemFree:       4535892 kB
> Buffers:        318232 kB
> Cached:        1583772 kB
> SwapCached:          0 kB
> Active:        2714400 kB
> Inactive:       730260 kB
> HighTotal:           0 kB
> HighFree:            0 kB
> LowTotal:      8182568 kB
> LowFree:       4535892 kB
> SwapTotal:     8289532 kB
> SwapFree:      8289380 kB
> Dirty:             340 kB
> Writeback:           0 kB
> AnonPages:     1542636 kB
> Mapped:          14588 kB
> Slab:           139788 kB
> PageTables:       7208 kB
> NFS_Unstable:        0 kB
> Bounce:              0 kB
> CommitLimit:  12380816 kB
> Committed_AS:  1679420 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed:      4600 kB
> VmallocChunk: 34359733707 kB
> HugePages_Total:     0
> HugePages_Free:      0
> HugePages_Rsvd:      0
> Hugepagesize:     2048 kB
> 
> 
> 
> Jack Morgenstein wrote:
>> On Tuesday 28 October 2008 21:11, Scott A. Friedman wrote:
>>> Hi
>>>
>>> This cluster has OFED 1.2.5.4 running on it. The ib_mthca kernel module
>>> reports the following on startup:
>>>
>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
>>>
>>> The cards in all (22) of the nodes we have seen this error on are as 
>>> follows:
>>>
>>> hca_id: mthca0
>>>          fw_ver:                         1.2.0
>>>          vendor_id:                      0x02c9
>>>          vendor_part_id:                 25204
>>>          hw_ver:                         0xA0
>>>          board_id:                       MT_03B0140001
>>>          phys_port_cnt:                  1
>>>
>>> It appears that when this happens the driver restarts (loads?) itself
>>> however the job running at the time of the error is, of course, killed.
>>> Scott
>> Scott,
>> We are trying to reproduce this here.  It would help if you could supply
>> the following info:
>>
>> Host model for hosts which are experiencing the failure:
>>  
>> Console output from the following linux commands:
>>   cat /etc/*rel*
>>   cat /etc/lilo.conf , or:  cat /boot/grub/menu.lst (if you are using grub)
>>   uname -a
>>   cat /proc/cpuinfo
>>   cat /proc/meminfo
>>
>> Also, what sort of job was running when the failure occurred:
>> -- which MPI are you using?
>> -- do you have a test example which we can run here to reproduce the problem?
>> Thanks in advance for your help!
>>
>> Jack Morgenstein
>> Senior Software Development Engineer
>> Mellanox


From sashak at voltaire.com  Mon Nov 10 11:51:29 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 10 Nov 2008 21:51:29 +0200
Subject: [ofa-general] Re: [PATCH 1/2] fix default configuration files path
In-Reply-To: <4912BD7C.1030603@Voltaire.COM>
References: <4912BCFC.8030407@Voltaire.COM> <4912BD7C.1030603@Voltaire.COM>
Message-ID: <20081110195129.GK313@sashak.voltaire.com>

On 11:48 Thu 06 Nov     , Doron Shoham wrote:
> fix default configuration files path in QoS_management_in_OpenSM.txt file
> from /usr/local/etc/opensm/ to /etc/opensm/
> 
> Signed-off-by: Doron Shoham <dorons at voltaire.com>
> ---
>  opensm/doc/QoS_management_in_OpenSM.txt |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/opensm/doc/QoS_management_in_OpenSM.txt b/opensm/doc/QoS_management_in_OpenSM.txt
> index ba1b4b1..1a48b1a 100644
> --- a/opensm/doc/QoS_management_in_OpenSM.txt
> +++ b/opensm/doc/QoS_management_in_OpenSM.txt
> @@ -20,7 +20,7 @@
>  
>  When QoS in OpenSM is enabled (-Q or --qos), OpenSM looks for QoS Policy file.
>  The default name of OpenSM QoS policy file is
> -/usr/local/etc/opensm/qos-policy.conf. The default may be changed by using -Y
> +/etc/opensm/qos-policy.conf. The default may be changed by using -Y
>  or --qos_policy_file option with OpenSM.

The OpenSM config dir is a configured value, so it could be
/usr/local/etc/opensm or /etc/opensm or something else.

Basically I'm fine with using '/etc/opensm', but then it should be
updated in other docs too (specifically in
doc/performance-manager-HOWTO.txt).

Another way to handle this is to make *.in templates for those docs where
the config path is used and generate the files at ./configure time (similar
to how it is done with the OpenSM man page). Probably it is overkill for
docs...

Thoughts?

Sasha


From sashak at voltaire.com  Mon Nov 10 11:58:17 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 10 Nov 2008 21:58:17 +0200
Subject: [ofa-general] Re: [PATCH] export osm_log_max in MB
In-Reply-To: <4912DC30.40309@Voltaire.COM>
References: <49101D1F.4040605@Voltaire.COM> <4912DC30.40309@Voltaire.COM>
Message-ID: <20081110195817.GL313@sashak.voltaire.com>

On 13:59 Thu 06 Nov     , Doron Shoham wrote:
> export the osm_log_max in MB when using 'opensm -c <conf>'
> 
> Signed-off-by: Doron Shoham <dorons at voltaire.com>

Both applied. Thanks.

Sasha


From sashak at voltaire.com  Mon Nov 10 12:13:33 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 10 Nov 2008 22:13:33 +0200
Subject: [ofa-general] Re: [PATCH] opensm/opensm/osm_state_mgr.c: Add check
	for valid physical port before using pointer.
In-Reply-To: <20081104095744.35893d4a.weiny2@llnl.gov>
References: <20081104095744.35893d4a.weiny2@llnl.gov>
Message-ID: <20081110201333.GM313@sashak.voltaire.com>

On 09:57 Tue 04 Nov     , Ira Weiny wrote:
> From 567c3893f24f4dc25ef5f4e74ef9deeb8ae541ad Mon Sep 17 00:00:00 2001
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Mon, 3 Nov 2008 14:47:50 -0800
> Subject: [PATCH] opensm/opensm/osm_state_mgr.c: Add check for valid physical port before using
>  pointer.
> 
>    There are times when PortInfo fails which leaves osm_node_t with invalid
>    osm_physp_t pointers.  In this case do not use an invalid pointer.
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

However, see the note below.

> ---
>  opensm/opensm/osm_state_mgr.c |    6 ++++++
>  1 files changed, 6 insertions(+), 0 deletions(-)
> 
> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
> index ba3b6bf..841438c 100644
> --- a/opensm/opensm/osm_state_mgr.c
> +++ b/opensm/opensm/osm_state_mgr.c
> @@ -542,6 +542,12 @@ static void __osm_state_mgr_get_node_desc(IN cl_map_item_t * const p_object,
>  
>  	/* get a physp to request from. */
>  	p_physp = osm_node_get_any_physp_ptr(p_node);
> +	if (!osm_physp_is_valid(p_physp)) {
> +		OSM_LOG(sm->p_log, OSM_LOG_ERROR,
> +			"__osm_state_mgr_get_node_desc: ERR 331C: "
> +			"Failed to get valid physical port object\n");
> +		goto exit;
> +	}

Actually it can be a valid case. For example, when a node was first
discovered via port A, then this port was disconnected and the same node
was discovered via port B - it is not a new node and node_info (where the
port number for osm_node_get_any_physp_ptr() is stored) will not be
updated.

Obviously the patch is fine. But probably we need a more general fix, for
example to redo osm_node_get_any_physp_ptr() so that it will not return
invalid ports. We need to review other osm_node_get_any_physp_ptr() usages.
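
One possible shape for such a more general helper, sketched in C under
stated assumptions: the structs below (demo_physp, demo_node) and the
function demo_node_get_any_valid_physp() are hypothetical stand-ins, not
osm_physp_t, osm_node_t or any existing OpenSM API. The idea is to try the
port recorded at discovery time first and, if it is no longer valid, fall
back to any other valid port instead of handing back an invalid one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a physical port and a node. */
struct demo_physp {
	int valid;          /* non-zero when the port object is usable */
	unsigned port_num;
};

struct demo_node {
	unsigned num_ports;        /* number of entries in ports[] */
	struct demo_physp *ports;  /* indexed by port slot */
	unsigned discovery_port;   /* slot recorded when the node was found */
};

/* Return the discovery port if it is still valid, otherwise the first
 * valid port found, or NULL when the node has no valid port at all. */
static struct demo_physp *demo_node_get_any_valid_physp(struct demo_node *node)
{
	unsigned i;

	assert(node != NULL);
	if (node->discovery_port < node->num_ports &&
	    node->ports[node->discovery_port].valid)
		return &node->ports[node->discovery_port];

	for (i = 0; i < node->num_ports; i++)
		if (node->ports[i].valid)
			return &node->ports[i];

	return NULL;
}
```

Callers would still need a NULL check, but a single check against NULL is
harder to forget than a separate validity test after every lookup.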

Sasha


From rdreier at cisco.com  Mon Nov 10 12:36:23 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 10 Nov 2008 12:36:23 -0800
Subject: [ofa-general] Re: [PATCH] IB/ehca: Fix suppression of port
	activation events
In-Reply-To: <200811071742.51867.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Fri, 7 Nov 2008 17:42:51 +0100")
References: <200806061835.43802.fenkes@de.ibm.com>
	<48499C11.7030504@gmail.com> <200811071742.51867.fenkes@de.ibm.com>
Message-ID: <adaod0nqpx4.fsf@cisco.com>

 > A previous fix introduced a regression where port activation events were
 > dropped unconditionally if port autodetection was not enabled. Fixed.

Is this a fix to "IB/ehca: Remove reference to special QP in case of
port activation failure"?  Because if so I can roll it into that patch,
since Linus hasn't pulled it yet.

 - R.


From boris at mellanox.com  Mon Nov 10 12:50:13 2008
From: boris at mellanox.com (Boris Shpolyansky)
Date: Mon, 10 Nov 2008 12:50:13 -0800
Subject: [ofa-general] ib_mthca catastrophic error detected
References: <4906645D.6010101@ucla.edu>
	<4907054E.9080205@mellanox.co.il><490763D0.5020002@ucla.edu><200811061154.02260.jackm@dev.mellanox.co.il>
	<491338D1.8050205@ucla.edu>
	<1E3DCD1C63492545881FACB6063A57C1030CF354@mtiexch01.mti.com>
	<49188F43.3050907@ucla.edu>
Message-ID: <1E3DCD1C63492545881FACB6063A57C1031A10D9@mtiexch01.mti.com>

OK, great!

Please, update us as soon as you have the entire cluster upgraded to the
new FW and have run more tests on it.

Thanks,
Boris Shpolyansky
Sr. Member of Technical Staff
Applications
Mellanox Technologies Inc.
2900 Stender Way
Santa Clara, CA 95054
Tel.: (408) 916 0014
Fax: (408) 970 3403
Cell: (408) 834 9365
www.mellanox.com

-----Original Message-----
From: Scott A. Friedman [mailto:friedman at ucla.edu] 
Sent: Monday, November 10, 2008 11:45 AM
To: Boris Shpolyansky
Cc: Jack Morgenstein; Matthew Finlay; general at lists.openfabrics.org
Subject: Re: [ofa-general] ib_mthca catastrophic error detected

Hi

No, no boot over IB - in fact there is no IPoIB configured on this 
cluster at all.

The firmware Matt sent seems to have fixed the problem as we have been
unable to reproduce since we flashed some test nodes. We are in the
process of flashing the remaining 100 or so nodes that have SDR cards as
jobs finish.

Scott

Boris Shpolyansky wrote:
> Scott,
> 
> Do you use any form of Boot-over-IB in this cluster?
> If so - what version/flavor of it?
> 
> Thanks,
> Boris Shpolyansky
> Sr. Member of Technical Staff
> Applications
> Mellanox Technologies Inc.
> 2900 Stender Way
> Santa Clara, CA 95054
> Tel.: (408) 916 0014
> Fax: (408) 970 3403
> Cell: (408) 834 9365
> www.mellanox.com
> 
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Scott A.
> Friedman
> Sent: Thursday, November 06, 2008 10:35 AM
> To: Jack Morgenstein
> Cc: Matthew Finlay; general at lists.openfabrics.org
> Subject: Re: [ofa-general] ib_mthca catastrophic error detected
> 
> Hi
> 
> We have been working with Matthew Finlay <Matt at mellanox.com> on this
> recently - you/we might pull all of this together. We are able to make
> any of our sdr cards have a catastrophic error - and are unable to do
> the same with our ddr cards. Matt has suggested that there is a firmware
> fix possibly?
> 
> Anyway, to answer your questions:
> 
> The hosts are Sun X2200M, but we have swapped a few around with some
> hosts we have from Aspen systems and the problem remains. I suppose the
> similarity is that they are all nForce based.
> 
> The MPI used was the latest OpenMPI - I will find the version, but I do
> not think it matters whether we are using OpenMPI or MVAPICH.
> 
> The job itself does not seem to matter either. The situation is after a
> node comes up it takes a very long time for the card to become ACTIVE.
> It seems to oscillate between ACTIVE and INIT. We have waited several
> minutes sometimes but can never be sure of when it will settle down. The
> queue certainly doesn't know and a job submitted to such a node will die
> as the cards will have a catastrophic error.
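
Since the queue cannot tell when a flapping port has settled, one possible
stopgap is a prolog check that holds the node until the port has read ACTIVE
for several consecutive samples. The following is only a sketch: the helper
names are made up for illustration, and it assumes the port state is exposed
as a line such as "4: ACTIVE" or "2: INIT" in a sysfs-style file. It does not
address the underlying catastrophic error.

```c
#include <assert.h>
#include <stdio.h>

#define IB_PORT_ACTIVE 4	/* state file prints e.g. "4: ACTIVE", "2: INIT" */

/* Parse the leading state number from a "state" line; -1 on error. */
static int parse_port_state(const char *line)
{
	int state;

	assert(line != NULL);
	return sscanf(line, "%d:", &state) == 1 ? state : -1;
}

/* Read the current port state from a sysfs-style file; -1 on error. */
static int read_port_state(const char *path)
{
	char buf[64];
	FILE *f = fopen(path, "r");
	int state = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		state = parse_port_state(buf);
	fclose(f);
	return state;
}

/*
 * Feed one sample. Returns 1 once `needed` consecutive samples were
 * ACTIVE; any other state (e.g. a drop back to INIT) resets the streak.
 */
static int port_is_stable(int state, unsigned *streak, unsigned needed)
{
	assert(streak != NULL);
	*streak = (state == IB_PORT_ACTIVE) ? *streak + 1 : 0;
	return *streak >= needed;
}
```

A prolog script could sample a path such as
/sys/class/infiniband/mthca0/ports/1/state (path assumed here, using the
mthca0 device name from this thread) once per second and release the node to
the queue only after port_is_stable() returns 1.
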
> 
> Scott
> 
> 
>  > Console output from the following linux commands:
>  >   cat /etc/*rel*
> 
> 
> Not a good idea...maybe this
> 
> #cat /etc/redhat-release
> CentOS release 5 (Final)
> 
>  >   cat /etc/lilo.conf , or:  cat /boot/grub/menu.lst (if you are using grub)
> 
> # grub.conf generated by anaconda
> #
> # Note that you do not have to rerun grub after making changes to this file
> # NOTICE:  You have a /boot partition.  This means that
> #          all kernel and initrd paths are relative to /boot/, eg.
> #          root (hd0,0)
> #          kernel /vmlinuz-version ro root=/dev/hda3
> #          initrd /initrd-version.img
> #boot=/dev/hda
> default=0
> timeout=5
> splashimage=(hd0,0)/grub/splash.xpm.gz
> hiddenmenu
> title CentOS (2.6.18-92.1.6.el5)
>   root (hd0,0)
>   kernel /vmlinuz-2.6.18-92.1.6.el5 ro root=LABEL=/ rhgb quiet
>   initrd /initrd-2.6.18-92.1.6.el5.img
> 
> 
>  >   uname -a
> 
> Linux n141 2.6.18-92.1.6.el5 #1 SMP Wed Jun 25 13:45:47 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
> 
> 
>  >   cat /proc/cpuinfo
>  >   cat /proc/meminfo
> 
> processor : 0
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 0
> siblings : 4
> core id  : 0
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4424.75
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 1
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 0
> siblings : 4
> core id  : 1
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4426.22
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 2
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 0
> siblings : 4
> core id  : 2
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4421.37
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 3
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 0
> siblings : 4
> core id  : 3
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4421.65
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 4
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 1
> siblings : 4
> core id  : 0
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4422.36
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 5
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 1
> siblings : 4
> core id  : 1
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4422.71
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 6
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 1
> siblings : 4
> core id  : 2
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4422.17
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> processor : 7
> vendor_id : AuthenticAMD
> cpu family   : 16
> model  : 2
> model name   : Quad-Core AMD Opteron(tm) Processor 2354
> stepping : 3
> cpu MHz  : 2200.000
> cache size   : 512 KB
> physical id  : 1
> siblings : 4
> core id  : 3
> cpu cores : 4
> fpu  : yes
> fpu_exception : yes
> cpuid level  : 5
> wp  : yes
> flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm 
> cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 
> 3dnowprefetch osvw
> bogomips : 4422.17
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate [8]
> 
> 
> 
> 
> MemTotal:      8182568 kB
> MemFree:       4535892 kB
> Buffers:        318232 kB
> Cached:        1583772 kB
> SwapCached:          0 kB
> Active:        2714400 kB
> Inactive:       730260 kB
> HighTotal:           0 kB
> HighFree:            0 kB
> LowTotal:      8182568 kB
> LowFree:       4535892 kB
> SwapTotal:     8289532 kB
> SwapFree:      8289380 kB
> Dirty:             340 kB
> Writeback:           0 kB
> AnonPages:     1542636 kB
> Mapped:          14588 kB
> Slab:           139788 kB
> PageTables:       7208 kB
> NFS_Unstable:        0 kB
> Bounce:              0 kB
> CommitLimit:  12380816 kB
> Committed_AS:  1679420 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed:      4600 kB
> VmallocChunk: 34359733707 kB
> HugePages_Total:     0
> HugePages_Free:      0
> HugePages_Rsvd:      0
> Hugepagesize:     2048 kB
> 
> 
> 
> Jack Morgenstein wrote:
>> On Tuesday 28 October 2008 21:11, Scott A. Friedman wrote:
>>> Hi
>>>
>>> This cluster has OFED 1.2.5.4 running on it. The ib_mthca kernel module
>>> reports the following on startup:
>>>
>>> ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)
>>>
>>> The cards in all (22) of the nodes we have seen this error on are as
>>> follows:
>>>
>>> hca_id: mthca0
>>>          fw_ver:                         1.2.0
>>>          vendor_id:                      0x02c9
>>>          vendor_part_id:                 25204
>>>          hw_ver:                         0xA0
>>>          board_id:                       MT_03B0140001
>>>          phys_port_cnt:                  1
>>>
>>> It appears that when this happens the driver restarts (loads?) itself
>>> however the job running at the time of the error is, of course, killed.
>>> Scott
>> Scott,
>> We are trying to reproduce this here.  It would help if you could supply
>> the following info:
>>
>> Host model for hosts which are experiencing the failure:
>>  
>> Console output from the following linux commands:
>>   cat /etc/*rel*
>>   cat /etc/lilo.conf , or:  cat /boot/grub/menu.lst (if you are using grub)
>>   uname -a
>>   cat /proc/cpuinfo
>>   cat /proc/meminfo
>>
>> Also, what sort of job was running when the failure occurred:
>> -- which MPI are you using?
>> -- do you have a test example which we can run here to reproduce the problem?
>> Thanks in advance for your help!
>>
>> Jack Morgenstein
>> Senior Software Development Engineer
>> Mellanox
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general


From sashak at voltaire.com  Mon Nov 10 13:02:33 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 10 Nov 2008 23:02:33 +0200
Subject: [ofa-general] Re: [opensm patch][2/2] verify config inputs when
	config file is rescanned
In-Reply-To: <1225404081.1197.534.camel@cardanus.llnl.gov>
References: <1225404081.1197.534.camel@cardanus.llnl.gov>
Message-ID: <20081110210233.GE3467@sashak.voltaire.com>

Hi Al,

On 15:01 Thu 30 Oct     , Al Chu wrote:
> Hey Sasha,
> 
> I noticed that after the config file is rescanned, the new potential
> inputs aren't checked for validity.  Patch is attached.
> 
> Al
> 
> -- 
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory

> From edfcd2de96c3525d1609b4c0f03c17ecc0495c18 Mon Sep 17 00:00:00 2001
> From: root <root at wopri.(none)>
> Date: Thu, 30 Oct 2008 13:58:55 -0700
> Subject: [PATCH] verify rescanned config input
> 
> 
> Signed-off-by: root <root at wopri.(none)>
                 ^^^^^^^^^^^^^^^^^^^^^^^^

I'm fine with this patch, but could you fix S-O-B line? Thanks.

Sasha


From chu11 at llnl.gov  Mon Nov 10 13:03:53 2008
From: chu11 at llnl.gov (Al Chu)
Date: Mon, 10 Nov 2008 13:03:53 -0800
Subject: [ofa-general] Re: [opensm patch] support dump_conf command in
	opensm console
In-Reply-To: <1226338962.13603.21.camel@cardanus.llnl.gov>
References: <1225759191.7307.9.camel@cardanus.llnl.gov>
	<20081109172518.GG30588@sashak.voltaire.com>
	<1226338962.13603.21.camel@cardanus.llnl.gov>
Message-ID: <1226351033.13603.23.camel@cardanus.llnl.gov>

Hey Sasha,

Attached is the re-worked patch.  Assumes changes from my "fix qos
config parsing bugs" patch are accepted.

Al

On Mon, 2008-11-10 at 09:42 -0800, Al Chu wrote:
> Hey Sasha,
> 
> On Sun, 2008-11-09 at 19:25 +0200, Sasha Khapyorsky wrote:
> > Hi Al,
> > 
> > On 16:39 Mon 03 Nov     , Al Chu wrote:
> > > Hey Sasha,
> > > 
> > > When config files are rescanned and loaded, there's no way to know
> > > whether the right configuration was actually reloaded.  A console command
> > > to dump the current config is a useful way to verify that new configs
> > > were loaded.
> > > 
> > > This patch assumes the fixes from my "fix qos config parsing bugs" patch
> > > are accepted.
> > 
> > Didn't pass over it, sorry about delay.
> > 
> > > 
> > > Al
> > > 
> > > -- 
> > > Albert Chu
> > > chu11 at llnl.gov
> > > Computer Scientist
> > > High Performance Systems Division
> > > Lawrence Livermore National Laboratory
> > 
> > > From 249607e47ec7ef1b92f9578cece90460418d12b8 Mon Sep 17 00:00:00 2001
> > > From: Albert Chu <chu11 at llnl.gov>
> > > Date: Mon, 3 Nov 2008 16:22:29 -0800
> > > Subject: [PATCH] support dump_conf console command
> > > 
> > > 
> > > Signed-off-by: Albert Chu <chu11 at llnl.gov>
> > > ---
> > >  opensm/opensm/osm_console.c |  158 +++++++++++++++++++++++++++++++++++++++++++
> > >  1 files changed, 158 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
> > > index d9bbbc2..8422655 100644
> > > --- a/opensm/opensm/osm_console.c
> > > +++ b/opensm/opensm/osm_console.c
> > > @@ -53,6 +53,10 @@
> > >  #include <complib/cl_passivelock.h>
> > >  #include <opensm/osm_perfmgr.h>
> > >  
> > > +#define NULL_STR "(null)"
> > > +
> > > +#define BOOLEAN_STR(__b) ((__b) ? "TRUE" : "FALSE")
> > > +
> > >  struct command {
> > >  	char *name;
> > >  	void (*help_function) (FILE * out, int detail);
> > > @@ -189,6 +193,14 @@ static void help_lidbalance(FILE * out, int detail)
> > >  	}
> > >  }
> > >  
> > > +static void help_dump_conf(FILE *out, int detail)
> > > +{
> > > +	fprintf(out, "dump_conf\n");
> > > +	if (detail) {
> > > +		fprintf(out, "dump current opensm configuration\n");
> > > +	}
> > > +}
> > > +
> > >  #ifdef ENABLE_OSM_PERF_MGR
> > >  static void help_perfmgr(FILE * out, int detail)
> > >  {
> > > @@ -1136,6 +1148,151 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> > >  }
> > >  #endif				/* ENABLE_OSM_PERF_MGR */
> > >  
> > > +static void dump_qos_options(osm_qos_options_t * opt,
> > > +			     osm_qos_options_t * dflt, 
> > > +			     char *prefix,
> > > +			     FILE * out)
> > > +{
> > > +	fprintf(out, "%s_max_vls : %u\n",
> > > +		prefix, opt->max_vls ? opt->max_vls : dflt->max_vls);
> > > +	fprintf(out, "%s_high_limit : %u\n",
> > > +		prefix, opt->high_limit >= 0 ? (unsigned)opt->high_limit : (unsigned)dflt->high_limit);
> > > +	fprintf(out, "%s_vlarb_high : %s\n",
> > > +		prefix, opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high);
> > > +	fprintf(out, "%s_vlarb_low : %s\n",
> > > +		prefix, opt->vlarb_low ? opt->vlarb_low : dflt->vlarb_low);
> > > +	fprintf(out, "%s_sl2vl : %s\n",
> > > +		prefix, opt->sl2vl ? opt->sl2vl : dflt->sl2vl);
> > > +}
> > > +
> > > +static void dump_conf_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> > > +{
> > 
> > Why not use the osm_subn_write_conf_file() function (wrapped by
> > dump_conf_parse())? I think we need to have the config dumping code
> > consolidated.
> 
> I had thought of that, but I didn't want all of the instructions and all
> the extra lines of output.  But I guess it's not that big of a deal in
> the end.  I'll send a new patch.
> 
> Al
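
The consolidation discussed above could look roughly like the sketch below:
one routine knows how to print the options to any FILE *, and both the
config-file writer and the console command delegate to it. All names and the
cut-down option struct are stand-ins for illustration, not the real opensm
types or signatures.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, cut-down stand-in for osm_subn_opt_t. */
struct subn_opt {
	unsigned sweep_interval;
	int qos;		/* boolean */
	const char *log_file;	/* may be NULL */
};

/* The single place that knows how to print options; 0 on success. */
static int subn_output_conf(FILE *out, const struct subn_opt *opt)
{
	assert(out != NULL && opt != NULL);
	if (fprintf(out, "sweep_interval %u\n", opt->sweep_interval) < 0 ||
	    fprintf(out, "qos %s\n", opt->qos ? "TRUE" : "FALSE") < 0 ||
	    fprintf(out, "log_file %s\n",
		    opt->log_file ? opt->log_file : "(null)") < 0)
		return -1;
	return 0;
}

/* File writer: opens the file and delegates to the shared routine. */
static int subn_write_conf_file(const char *path, const struct subn_opt *opt)
{
	FILE *f = fopen(path, "w");
	int rc;

	if (!f)
		return -1;
	rc = subn_output_conf(f, opt);
	if (fclose(f) != 0)
		rc = -1;
	return rc;
}

/* Console command (simplified signature): same routine, console stream. */
static void dump_conf_parse(const struct subn_opt *opt, FILE *out)
{
	(void)subn_output_conf(out, opt);
}
```

With this split, any option added later only needs to be printed in one
place, at the cost of the console output carrying whatever the file format
carries.
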
> 
> > Sasha
> > 
> > > +	osm_subn_opt_t * opt = &p_osm->subn.opt;
> > > +
> > > +	fprintf(out, "config_file : %s\n", 
> > > +		opt->config_file ? opt->config_file : NULL_STR);
> > > +	fprintf(out, "guid : 0x%016" PRIx64 "\n", opt->guid);
> > > +	fprintf(out, "m_key : 0x%016" PRIx64 "\n", opt->m_key);
> > > +	fprintf(out, "sm_key : 0x%016" PRIx64 "\n", opt->sm_key);
> > > +	fprintf(out, "sa_key : 0x%016" PRIx64 "\n", opt->sa_key);
> > > +	fprintf(out, "subnet_prefix : 0x%016" PRIx64 "\n", opt->subnet_prefix);
> > > +	fprintf(out, "m_key_lease_period : %u\n", opt->m_key_lease_period);
> > > +	fprintf(out, "sweep_interval : %u\n", opt->sweep_interval);
> > > +	fprintf(out, "max_wire_smps : %u\n", opt->max_wire_smps);
> > > +	fprintf(out, "transaction_timeout : %u\n", opt->transaction_timeout);
> > > +	fprintf(out, "sm_priority : %u\n", opt->sm_priority);
> > > +	fprintf(out, "lmc : %u\n", opt->lmc);
> > > +	fprintf(out, "lmc_esp0 : %s\n", 
> > > +		BOOLEAN_STR(opt->lmc_esp0));
> > > +	fprintf(out, "max_op_vls : %u\n", opt->max_op_vls);
> > > +	fprintf(out, "force_link_speed : %u\n", opt->force_link_speed);
> > > +	fprintf(out, "reassign_lids : %s\n", 
> > > +		BOOLEAN_STR(opt->reassign_lids));
> > > +	fprintf(out, "ignore_other_sm : %s\n", 
> > > +		BOOLEAN_STR(opt->ignore_other_sm));
> > > +	fprintf(out, "single_thread : %s\n", 
> > > +		BOOLEAN_STR(opt->single_thread));
> > > +	fprintf(out, "disable_multicast : %s\n", 
> > > +		BOOLEAN_STR(opt->disable_multicast));
> > > +	fprintf(out, "force_log_flush : %s\n", 
> > > +		BOOLEAN_STR(opt->force_log_flush));
> > > +	fprintf(out, "subnet_timeout : %u\n", opt->subnet_timeout);
> > > +	fprintf(out, "packet_life_time : %u\n", opt->packet_life_time);
> > > +	fprintf(out, "vl_stall_count : %u\n", opt->vl_stall_count);
> > > +	fprintf(out, "leaf_vl_stall_count : %u\n", opt->leaf_vl_stall_count);
> > > +	fprintf(out, "head_of_queue_lifetime : %u\n", opt->head_of_queue_lifetime);
> > > +	fprintf(out, "leaf_head_of_queue_lifetime : %u\n", opt->leaf_head_of_queue_lifetime);
> > > +	fprintf(out, "local_phy_errors_threshold : %u\n", opt->local_phy_errors_threshold);
> > > +	fprintf(out, "overrun_errors_threshold : %u\n", opt->overrun_errors_threshold);
> > > +	fprintf(out, "sminfo_polling_timeout : %u\n", opt->sminfo_polling_timeout);
> > > +	fprintf(out, "polling_retry_number : %u\n", opt->polling_retry_number);
> > > +	fprintf(out, "max_msg_fifo_timeout : %u\n", opt->max_msg_fifo_timeout);
> > > +	fprintf(out, "force_heavy_sweep : %s\n", 
> > > +		BOOLEAN_STR(opt->force_heavy_sweep));
> > > +	fprintf(out, "log_flags : 0x%02x\n", opt->log_flags);
> > > +	fprintf(out, "dump_files_dir : %s\n", 
> > > +		opt->dump_files_dir ? opt->dump_files_dir : NULL_STR);
> > > +	fprintf(out, "log_file : %s\n", 
> > > +		opt->log_file ? opt->log_file : NULL_STR);
> > > +	fprintf(out, "log_max_size : %lu\n", opt->log_max_size);
> > > +	fprintf(out, "partition_config_file : %s\n", 
> > > +		opt->partition_config_file ? opt->partition_config_file : NULL_STR);
> > > +	fprintf(out, "no_partition_enforcement : %s\n", 
> > > +		BOOLEAN_STR(opt->no_partition_enforcement));
> > > +	fprintf(out, "qos : %s\n", 
> > > +		BOOLEAN_STR(opt->qos));
> > > +	fprintf(out, "qos_policy_file : %s\n", 
> > > +		opt->qos_policy_file ? opt->qos_policy_file : NULL_STR);
> > > +	fprintf(out, "accum_log_file: %s\n", 
> > > +		BOOLEAN_STR(opt->accum_log_file));
> > > +	fprintf(out, "console : %s\n", 
> > > +		opt->console ? opt->console : NULL_STR);
> > > +	fprintf(out, "console_port : %u\n", opt->console_port);
> > > +	fprintf(out, "port_prof_ignore_file : %s\n", 
> > > +		opt->port_prof_ignore_file ? opt->port_prof_ignore_file : NULL_STR);
> > > +	fprintf(out, "port_profile_switch_nodes : %s\n", 
> > > +		BOOLEAN_STR(opt->port_profile_switch_nodes));
> > > +	fprintf(out, "sweep_on_trap : %s\n", 
> > > +		BOOLEAN_STR(opt->sweep_on_trap));
> > > +	fprintf(out, "routing_engine_names : %s\n", 
> > > +		opt->routing_engine_names ? opt->routing_engine_names : NULL_STR);
> > > +	fprintf(out, "use_ucast_cache : %s\n", 
> > > +		BOOLEAN_STR(opt->use_ucast_cache));
> > > +	fprintf(out, "connect_roots : %s\n", 
> > > +		BOOLEAN_STR(opt->connect_roots));
> > > +	fprintf(out, "lid_matrix_dump_file : %s\n", 
> > > +		opt->lid_matrix_dump_file ? opt->lid_matrix_dump_file : NULL_STR);
> > > +	fprintf(out, "lfts_file : %s\n", 
> > > +		opt->lfts_file ? opt->lfts_file : NULL_STR);
> > > +	fprintf(out, "root_guid_file : %s\n", 
> > > +		opt->root_guid_file ? opt->root_guid_file : NULL_STR);
> > > +	fprintf(out, "cn_guid_file : %s\n", 
> > > +		opt->cn_guid_file ? opt->cn_guid_file : NULL_STR);
> > > +	fprintf(out, "ids_guid_file : %s\n", 
> > > +		opt->ids_guid_file ? opt->ids_guid_file : NULL_STR);
> > > +	fprintf(out, "guid_routing_order_file : %s\n", 
> > > +		opt->guid_routing_order_file ? opt->guid_routing_order_file : NULL_STR);
> > > +	fprintf(out, "sa_db_file : %s\n", 
> > > +		opt->sa_db_file ? opt->sa_db_file : NULL_STR);
> > > +	fprintf(out, "exit_on_fatal : %s\n", 
> > > +		BOOLEAN_STR(opt->exit_on_fatal));
> > > +	fprintf(out, "honor_guid2lid_file : %s\n", 
> > > +		BOOLEAN_STR(opt->honor_guid2lid_file));
> > > +	fprintf(out, "daemon : %s\n", 
> > > +		BOOLEAN_STR(opt->daemon));
> > > +	fprintf(out, "sm_inactive : %s\n", 
> > > +		BOOLEAN_STR(opt->sm_inactive));
> > > +	fprintf(out, "babbling_port_policy : %s\n", 
> > > +		BOOLEAN_STR(opt->babbling_port_policy));
> > > +	dump_qos_options(&opt->qos_options, &opt->qos_options, "qos", out);
> > > +	dump_qos_options(&opt->qos_ca_options, &opt->qos_options, "qos_ca", out);
> > > +	dump_qos_options(&opt->qos_sw0_options, &opt->qos_options, "qos_sw0", out);
> > > +	dump_qos_options(&opt->qos_swe_options, &opt->qos_options, "qos_swe", out);
> > > +	dump_qos_options(&opt->qos_rtr_options, &opt->qos_options, "qos_rtr", out);
> > > +	fprintf(out, "enable_quirks : %s\n", 
> > > +		BOOLEAN_STR(opt->enable_quirks));
> > > +	fprintf(out, "no_clients_rereg : %s\n", 
> > > +		BOOLEAN_STR(opt->no_clients_rereg));
> > > +#ifdef ENABLE_OSM_PERF_MGR
> > > +	fprintf(out, "perfmgr : %s\n", 
> > > +		BOOLEAN_STR(opt->perfmgr));
> > > +	fprintf(out, "perfmgr_redir : %s\n", 
> > > +		BOOLEAN_STR(opt->perfmgr_redir));
> > > +	fprintf(out, "perfmgr_sweep_time_s : %u\n", opt->perfmgr_sweep_time_s);
> > > +	fprintf(out, "perfmgr_max_outstanding_queries : %u\n", opt->perfmgr_max_outstanding_queries);
> > > +	fprintf(out, "event_db_dump_file : %s\n", 
> > > +		opt->event_db_dump_file ? opt->event_db_dump_file : NULL_STR);
> > > +#endif
> > > +	fprintf(out, "event_plugin_name : %s\n", 
> > > +		opt->event_plugin_name ? opt->event_plugin_name : NULL_STR);
> > > +	fprintf(out, "node_name_map_name : %s\n", 
> > > +		opt->node_name_map_name ? opt->node_name_map_name : NULL_STR);
> > > +	fprintf(out, "prefix_routes_file : %s\n", 
> > > +		opt->prefix_routes_file ? opt->prefix_routes_file : NULL_STR);
> > > +	fprintf(out, "consolidate_ipv6_snm_req : %s\n", 
> > > +		BOOLEAN_STR(opt->consolidate_ipv6_snm_req));
> > > +}
> > > +
> > >  static void quit_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> > >  {
> > >  	osm_console_exit(&p_osm->console, &p_osm->log);
> > > @@ -1166,6 +1323,7 @@ static const struct command console_cmds[] = {
> > >  	{"portstatus", &help_portstatus, &portstatus_parse},
> > >  	{"switchbalance", &help_switchbalance, &switchbalance_parse},
> > >  	{"lidbalance", &help_lidbalance, &lidbalance_parse},
> > > +	{"dump_conf", &help_dump_conf, &dump_conf_parse},
> > >  	{"version", &help_version, &version_parse},
> > >  #ifdef ENABLE_OSM_PERF_MGR
> > >  	{"perfmgr", &help_perfmgr, &perfmgr_parse},
> > > -- 
> > > 1.5.4.5
> > > 
> > 
-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-support-dump_conf-console-command.patch
Type: text/x-patch
Size: 11769 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081110/99c99b00/attachment.bin>

From weiny2 at llnl.gov  Mon Nov 10 13:11:40 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 10 Nov 2008 13:11:40 -0800
Subject: [ofa-general] Re: [PATCH] opensm/opensm/osm_state_mgr.c: Add check
 for valid physical port before using pointer.
In-Reply-To: <20081110201333.GM313@sashak.voltaire.com>
References: <20081104095744.35893d4a.weiny2@llnl.gov>
	<20081110201333.GM313@sashak.voltaire.com>
Message-ID: <20081110131140.52561f42.weiny2@llnl.gov>

On Mon, 10 Nov 2008 22:13:33 +0200
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 09:57 Tue 04 Nov     , Ira Weiny wrote:
> > From 567c3893f24f4dc25ef5f4e74ef9deeb8ae541ad Mon Sep 17 00:00:00 2001
> > From: Ira Weiny <weiny2 at llnl.gov>
> > Date: Mon, 3 Nov 2008 14:47:50 -0800
> > Subject: [PATCH] opensm/opensm/osm_state_mgr.c: Add check for valid physical port before using
> >  pointer.
> > 
> >    There are times when PortInfo fails which leaves osm_node_t with invalid
> >    osm_physp_t pointers.  In this case do not use an invalid pointer.
> > 
> > Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> 
> Applied. Thanks.
> 
> However, see the note below.
> 
> > ---
> >  opensm/opensm/osm_state_mgr.c |    6 ++++++
> >  1 files changed, 6 insertions(+), 0 deletions(-)
> > 
> > diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
> > index ba3b6bf..841438c 100644
> > --- a/opensm/opensm/osm_state_mgr.c
> > +++ b/opensm/opensm/osm_state_mgr.c
> > @@ -542,6 +542,12 @@ static void __osm_state_mgr_get_node_desc(IN cl_map_item_t * const p_object,
> >  
> >  	/* get a physp to request from. */
> >  	p_physp = osm_node_get_any_physp_ptr(p_node);
> > +	if (!osm_physp_is_valid(p_physp)) {
> > +		OSM_LOG(sm->p_log, OSM_LOG_ERROR,
> > +			"__osm_state_mgr_get_node_desc: ERR 331C: "
> > +			"Failed to get valid physical port object\n");
> > +		goto exit;
> > +	}
> 
> Actually this can be a valid case. For example, when a node was first
> discovered via port A, this port was then disconnected, and the same node
> was rediscovered via port B, it is not a new node and node_info (where the
> port number for osm_node_get_any_physp_ptr() is stored) will not be
> updated.

Ah, good point, I just happened to see it when PortInfo failed.

> 
> Obviously the patch is fine. But we probably need a more general fix, for
> example reworking osm_node_get_any_physp_ptr() so that it does not return
> invalid ports. We need to review the other osm_node_get_any_physp_ptr() usages.

I was wondering if it would return invalid ports ever.  It would be easy for it
to return only valid ports but perhaps that should be another function to
preserve functionality?
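
A minimal sketch of the "another function" idea, using stand-in types rather
than the real osm_node_t and osm_physp_t: the existing lookup keeps returning
the port recorded at discovery time, and a separate helper falls back to any
port that is still valid. The struct layout, MAX_PORTS and both function
names are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PORTS 4	/* hypothetical fixed size for the sketch */

/* Stand-in for osm_physp_t: only the validity flag matters here. */
struct physp {
	int valid;
};

/* Stand-in for osm_node_t: remembers the port it was first discovered on. */
struct node {
	unsigned discovery_port;	/* set once, not refreshed later */
	struct physp physp[MAX_PORTS];
};

/* Existing behaviour: return the discovery port, valid or not. */
static struct physp *node_get_any_physp(struct node *n)
{
	assert(n != NULL && n->discovery_port < MAX_PORTS);
	return &n->physp[n->discovery_port];
}

/* New helper: prefer the discovery port, else any valid port, else NULL. */
static struct physp *node_get_any_valid_physp(struct node *n)
{
	struct physp *p = node_get_any_physp(n);
	unsigned i;

	if (p->valid)
		return p;
	for (i = 0; i < MAX_PORTS; i++)
		if (n->physp[i].valid)
			return &n->physp[i];
	return NULL;
}
```

Callers that must keep the old semantics stay on the first function; callers
such as the node description request could switch to the second and treat
NULL as "no usable port".
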

Ira


From chu11 at llnl.gov  Mon Nov 10 13:15:30 2008
From: chu11 at llnl.gov (Al Chu)
Date: Mon, 10 Nov 2008 13:15:30 -0800
Subject: [ofa-general] Re: [opensm patch][2/2] verify config inputs when
	config file is rescanned
In-Reply-To: <20081110210233.GE3467@sashak.voltaire.com>
References: <1225404081.1197.534.camel@cardanus.llnl.gov>
	<20081110210233.GE3467@sashak.voltaire.com>
Message-ID: <1226351730.13603.27.camel@cardanus.llnl.gov>

On Mon, 2008-11-10 at 23:02 +0200, Sasha Khapyorsky wrote:
> Hi Al,
> 
> On 15:01 Thu 30 Oct     , Al Chu wrote:
> > Hey Sasha,
> > 
> > I noticed that after the config file is rescanned, the new potential
> > inputs aren't checked for validity.  Patch is attached.
> > 
> > Al
> > 
> > -- 
> > Albert Chu
> > chu11 at llnl.gov
> > Computer Scientist
> > High Performance Systems Division
> > Lawrence Livermore National Laboratory
> 
> > From edfcd2de96c3525d1609b4c0f03c17ecc0495c18 Mon Sep 17 00:00:00 2001
> > From: root <root at wopri.(none)>
> > Date: Thu, 30 Oct 2008 13:58:55 -0700
> > Subject: [PATCH] verify rescanned config input
> > 
> > 
> > Signed-off-by: root <root at wopri.(none)>
>                  ^^^^^^^^^^^^^^^^^^^^^^^^
> 
> I'm fine with this patch, but could you fix S-O-B line? Thanks.

Oops.  New one is attached (I'll repost the [1/2] patch too).

Al

> Sasha
-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-verify-rescanned-config-input.patch
Type: text/x-patch
Size: 1047 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081110/fdec92b8/attachment.bin>

From chu11 at llnl.gov  Mon Nov 10 13:16:15 2008
From: chu11 at llnl.gov (Al Chu)
Date: Mon, 10 Nov 2008 13:16:15 -0800
Subject: [ofa-general] [opensm patch][1/2] fix qos config parsing bugs
In-Reply-To: <1225404078.1197.533.camel@cardanus.llnl.gov>
References: <1225404078.1197.533.camel@cardanus.llnl.gov>
Message-ID: <1226351775.13603.30.camel@cardanus.llnl.gov>

Hey Sasha,

New patch w/ proper "signed off by" line.

Al

On Thu, 2008-10-30 at 15:01 -0700, Al Chu wrote:
> Hey Sasha,
> 
> I found a bunch of qos config parsing issues, listed below:
> 
> 1)
> 
> If the user sets the qos default fields (e.g. qos_high_limit,
> qos_vlarb_high, etc.) but does not list the equivalent qos_ca, qos_swe,
> qos_rtr, etc. fields (e.g. qos_ca_high_limit,
> qos_sw0_vlarb_high), the values set in the qos default fields are not
> loaded into the CAs, switches, etc.  The reason is that in
> qos_build_config() we load defaults like this:
> 
> p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;
> 
> but we always set the fields to something non-NULL, so the dflt branch
> is never taken:
> 
> static void subn_set_default_qos_options(IN osm_qos_options_t * opt)
> {
>         opt->max_vls = OSM_DEFAULT_QOS_MAX_VLS;
>         opt->high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT;
>         opt->vlarb_high = OSM_DEFAULT_QOS_VLARB_HIGH;
>         opt->vlarb_low = OSM_DEFAULT_QOS_VLARB_LOW;
>         opt->sl2vl = OSM_DEFAULT_QOS_SL2VL;
> }
> 
> 2)
> 
> In qos_build_config() we load the high_limit like this:
> 
> cfg->vl_high_limit = (uint8_t) opt->high_limit;
> 
> So there is no way to tell the qos_ca, qos_swe, qos_rtr, etc. high_limit
> options to "go back to" the default high_limit.  It just assumes that
> whatever is input (or was set by default) is what you should use.
> 
> 3)
> 
> Some fields like qos_vlarb_high are assumed to be correctly set and can
> segfault opensm.
> 
> The attached patch fixes these up.  Obviously there's tons of ways to
> do this.  I decided to ...
> 
> A) only initialize qos_options to the real defaults
> 
> B) init all qos_*_options to sentinel values (-1, NULL, etc.) to
> indicate it should use the configured defaults if they aren't set by the
> user.  The high_limit was changed from an unsigned to an int b/c 0 is a
> valid high_limit value.
> 
> C) verify that the default qos inputs are definitely correct (i.e. can't
> be NULL).  Reset to hard coded defaults if need be.
> 
> D) load the default vs. non-default appropriately in QoS.
> 
> Al
> 
> P.S.  This patch does not rely on my previous "remove qos_max_vls
> config" patch.  I assume we're keeping the max_vls fields in this patch.
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-fix-qos-config-parsing-bugs.patch
Type: text/x-patch
Size: 21153 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081110/9196f015/attachment.bin>

From chu11 at llnl.gov  Mon Nov 10 13:41:04 2008
From: chu11 at llnl.gov (Al Chu)
Date: Mon, 10 Nov 2008 13:41:04 -0800
Subject: [ofa-general] [opensm patch][1/2] fix qos config parsing bugs
In-Reply-To: <1226351775.13603.30.camel@cardanus.llnl.gov>
References: <1225404078.1197.533.camel@cardanus.llnl.gov>
	<1226351775.13603.30.camel@cardanus.llnl.gov>
Message-ID: <1226353264.13603.37.camel@cardanus.llnl.gov>

On Mon, 2008-11-10 at 13:16 -0800, Al Chu wrote:
> Hey Sasha,
> 
> New patch w/ proper "signed off by" line.

Argh.  Repost, w/ right Author.  Sorry.

Al

> Al
> 
> On Thu, 2008-10-30 at 15:01 -0700, Al Chu wrote:
> > Hey Sasha,
> > 
> > I found a bunch of qos config parsing issues, listed below:
> > 
> > 1)
> > 
> > If the user sets the qos default fields (e.g. qos_high_limit,
> > qos_vlarb_high, etc.) but does not list the equivalent qos_ca, qos_swe,
> > qos_rtr, etc. fields (e.g. qos_ca_high_limit,
> > qos_sw0_vlarb_high), the values set in the qos default fields are not
> > loaded into the CAs, switches, etc.  The reason is that in
> > qos_build_config() we load defaults like this:
> > 
> > p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;
> > 
> > but we always set the fields to something non-NULL, so the dflt branch
> > is never taken:
> > 
> > static void subn_set_default_qos_options(IN osm_qos_options_t * opt)
> > {
> >         opt->max_vls = OSM_DEFAULT_QOS_MAX_VLS;
> >         opt->high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT;
> >         opt->vlarb_high = OSM_DEFAULT_QOS_VLARB_HIGH;
> >         opt->vlarb_low = OSM_DEFAULT_QOS_VLARB_LOW;
> >         opt->sl2vl = OSM_DEFAULT_QOS_SL2VL;
> > }
> > 
> > 2)
> > 
> > In qos_build_config() we load the high_limit like this:
> > 
> > cfg->vl_high_limit = (uint8_t) opt->high_limit;
> > 
> > So there is no way to tell the qos_ca, qos_swe, qos_rtr, etc. high_limit
> > options to "go back to" the default high_limit.  It just assumes that
> > whatever is input (or was set by default) is what you should use.
> > 
> > 3)
> > 
> > Some fields like qos_vlarb_high are assumed to be correctly set and can
> > segfault opensm.
> > 
> > The attached patch fixes these up.  Obviously there's tons of ways to
> > do this.  I decided to ...
> > 
> > A) only initialize qos_options to the real defaults
> > 
> > B) init all qos_*_options to sentinel values (-1, NULL, etc.) to
> > indicate it should use the configured defaults if they aren't set by the
> > user.  The high_limit was changed from an unsigned to an int b/c 0 is a
> > valid high_limit value.
> > 
> > C) verify that the default qos inputs are definitely correct (i.e. can't
> > be NULL).  Reset to hard coded defaults if need be.
> > 
> > D) load the default vs. non-default appropriately in QoS.
> > 
> > Al
> > 
> > P.S.  This patch does not rely on my previous "remove qos_max_vls
> > config" patch.  I assume we're keeping the max_vls fields in this patch.
> > 
-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-fix-qos-config-parsing-bugs.patch
Type: text/x-patch
Size: 21156 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081110/5f868a46/attachment.bin>

From chu11 at llnl.gov  Mon Nov 10 13:41:13 2008
From: chu11 at llnl.gov (Al Chu)
Date: Mon, 10 Nov 2008 13:41:13 -0800
Subject: [ofa-general] Re: [opensm patch][2/2] verify config inputs
	when config file is rescanned
In-Reply-To: <1226351730.13603.27.camel@cardanus.llnl.gov>
References: <1225404081.1197.534.camel@cardanus.llnl.gov>
	<20081110210233.GE3467@sashak.voltaire.com>
	<1226351730.13603.27.camel@cardanus.llnl.gov>
Message-ID: <1226353273.13603.39.camel@cardanus.llnl.gov>

Hey Sasha,

Sorry, repost, w/ the right Author.

Al

On Mon, 2008-11-10 at 13:15 -0800, Al Chu wrote:
> On Mon, 2008-11-10 at 23:02 +0200, Sasha Khapyorsky wrote:
> > Hi Al,
> > 
> > On 15:01 Thu 30 Oct     , Al Chu wrote:
> > > Hey Sasha,
> > > 
> > > I noticed that after the config file is rescanned, the new potential
> > > inputs aren't checked for validity.  Patch is attached.
> > > 
> > > Al
> > > 
> > > -- 
> > > Albert Chu
> > > chu11 at llnl.gov
> > > Computer Scientist
> > > High Performance Systems Division
> > > Lawrence Livermore National Laboratory
> > 
> > > From edfcd2de96c3525d1609b4c0f03c17ecc0495c18 Mon Sep 17 00:00:00 2001
> > > From: root <root at wopri.(none)>
> > > Date: Thu, 30 Oct 2008 13:58:55 -0700
> > > Subject: [PATCH] verify rescanned config input
> > > 
> > > 
> > > Signed-off-by: root <root at wopri.(none)>
> >                  ^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > I'm fine with this patch, but could you fix S-O-B line? Thanks.
> 
> Oops.  New one is attached (I'll repost the [1/2] patch too).
> 
> Al
> 
> > Sasha
-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-verify-rescanned-config-input.patch
Type: text/x-patch
Size: 1050 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081110/77332198/attachment.bin>

From chu11 at llnl.gov  Mon Nov 10 13:42:31 2008
From: chu11 at llnl.gov (Al Chu)
Date: Mon, 10 Nov 2008 13:42:31 -0800
Subject: [ofa-general] Re: [opensm patch] support dump_conf command in
	opensm console
In-Reply-To: <1226351033.13603.23.camel@cardanus.llnl.gov>
References: <1225759191.7307.9.camel@cardanus.llnl.gov>
	<20081109172518.GG30588@sashak.voltaire.com>
	<1226338962.13603.21.camel@cardanus.llnl.gov>
	<1226351033.13603.23.camel@cardanus.llnl.gov>
Message-ID: <1226353351.13603.42.camel@cardanus.llnl.gov>

Hey Sasha,

Sorry.  Repost patch w/ the right Author.

Al

On Mon, 2008-11-10 at 13:03 -0800, Al Chu wrote:
> Hey Sasha,
> 
> Attached is the re-worked patch.  Assumes changes from my "fix qos
> config parsing bugs" patch are accepted.
> 
> Al
> 
> On Mon, 2008-11-10 at 09:42 -0800, Al Chu wrote:
> > Hey Sasha,
> > 
> > On Sun, 2008-11-09 at 19:25 +0200, Sasha Khapyorsky wrote:
> > > Hi Al,
> > > 
> > > On 16:39 Mon 03 Nov     , Al Chu wrote:
> > > > Hey Sasha,
> > > > 
> > > > When config files are rescanned and loaded, there's no way to know
> > > > whether the right configuration was actually reloaded.  A console
> > > > command that dumps the current config is a useful way to verify that
> > > > a new config was loaded.
> > > > 
> > > > This patch assumes the fixes from my "fix qos config parsing bugs"
> > > > patch are accepted.
> > > 
> > > I haven't gone over it yet, sorry about the delay.
> > > 
> > > > 
> > > > Al
> > > > 
> > > > -- 
> > > > Albert Chu
> > > > chu11 at llnl.gov
> > > > Computer Scientist
> > > > High Performance Systems Division
> > > > Lawrence Livermore National Laboratory
> > > 
> > > > From 249607e47ec7ef1b92f9578cece90460418d12b8 Mon Sep 17 00:00:00 2001
> > > > From: Albert Chu <chu11 at llnl.gov>
> > > > Date: Mon, 3 Nov 2008 16:22:29 -0800
> > > > Subject: [PATCH] support dump_conf console command
> > > > 
> > > > 
> > > > Signed-off-by: Albert Chu <chu11 at llnl.gov>
> > > > ---
> > > >  opensm/opensm/osm_console.c |  158 +++++++++++++++++++++++++++++++++++++++++++
> > > >  1 files changed, 158 insertions(+), 0 deletions(-)
> > > > 
> > > > diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
> > > > index d9bbbc2..8422655 100644
> > > > --- a/opensm/opensm/osm_console.c
> > > > +++ b/opensm/opensm/osm_console.c
> > > > @@ -53,6 +53,10 @@
> > > >  #include <complib/cl_passivelock.h>
> > > >  #include <opensm/osm_perfmgr.h>
> > > >  
> > > > +#define NULL_STR "(null)"
> > > > +
> > > > +#define BOOLEAN_STR(__b) ((__b) ? "TRUE" : "FALSE")
> > > > +
> > > >  struct command {
> > > >  	char *name;
> > > >  	void (*help_function) (FILE * out, int detail);
> > > > @@ -189,6 +193,14 @@ static void help_lidbalance(FILE * out, int detail)
> > > >  	}
> > > >  }
> > > >  
> > > > +static void help_dump_conf(FILE *out, int detail)
> > > > +{
> > > > +	fprintf(out, "dump_conf\n");
> > > > +	if (detail) {
> > > > +		fprintf(out, "dump current opensm configuration\n");
> > > > +	}
> > > > +}
> > > > +
> > > >  #ifdef ENABLE_OSM_PERF_MGR
> > > >  static void help_perfmgr(FILE * out, int detail)
> > > >  {
> > > > @@ -1136,6 +1148,151 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> > > >  }
> > > >  #endif				/* ENABLE_OSM_PERF_MGR */
> > > >  
> > > > +static void dump_qos_options(osm_qos_options_t * opt,
> > > > +			     osm_qos_options_t * dflt, 
> > > > +			     char *prefix,
> > > > +			     FILE * out)
> > > > +{
> > > > +	fprintf(out, "%s_max_vls : %u\n",
> > > > +		prefix, opt->max_vls ? opt->max_vls : dflt->max_vls);
> > > > +	fprintf(out, "%s_high_limit : %u\n",
> > > > +		prefix, opt->high_limit >= 0 ? (unsigned)opt->high_limit : (unsigned)dflt->high_limit);
> > > > +	fprintf(out, "%s_vlarb_high : %s\n",
> > > > +		prefix, opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high);
> > > > +	fprintf(out, "%s_vlarb_low : %s\n",
> > > > +		prefix, opt->vlarb_low ? opt->vlarb_low : dflt->vlarb_low);
> > > > +	fprintf(out, "%s_sl2vl : %s\n",
> > > > +		prefix, opt->sl2vl ? opt->sl2vl : dflt->sl2vl);
> > > > +}
> > > > +
> > > > +static void dump_conf_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> > > > +{
> > > 
> > > Why not use the osm_subn_write_conf_file() function (wrapped by
> > > dump_conf_parse())? I think we need to keep the config dumping code
> > > consolidated.
> > 
> > I had thought of that, but I didn't want all of the instructions and all
> > the extra lines of output.  But I guess it's not that big of a deal in
> > the end.  I'll send a new patch.
> > 
> > Al
> > 
> > > Sasha
> > > 
> > > > +	osm_subn_opt_t * opt = &p_osm->subn.opt;
> > > > +
> > > > +	fprintf(out, "config_file : %s\n", 
> > > > +		opt->config_file ? opt->config_file : NULL_STR);
> > > > +	fprintf(out, "guid : 0x%016" PRIx64 "\n", opt->guid);
> > > > +	fprintf(out, "m_key : 0x%016" PRIx64 "\n", opt->m_key);
> > > > +	fprintf(out, "sm_key : 0x%016" PRIx64 "\n", opt->sm_key);
> > > > +	fprintf(out, "sa_key : 0x%016" PRIx64 "\n", opt->sa_key);
> > > > +	fprintf(out, "subnet_prefix : 0x%016" PRIx64 "\n", opt->subnet_prefix);
> > > > +	fprintf(out, "m_key_lease_period : %u\n", opt->m_key_lease_period);
> > > > +	fprintf(out, "sweep_interval : %u\n", opt->sweep_interval);
> > > > +	fprintf(out, "max_wire_smps : %u\n", opt->max_wire_smps);
> > > > +	fprintf(out, "transaction_timeout : %u\n", opt->transaction_timeout);
> > > > +	fprintf(out, "sm_priority : %u\n", opt->sm_priority);
> > > > +	fprintf(out, "lmc : %u\n", opt->lmc);
> > > > +	fprintf(out, "lmc_esp0 : %s\n", 
> > > > +		BOOLEAN_STR(opt->lmc_esp0));
> > > > +	fprintf(out, "max_op_vls : %u\n", opt->max_op_vls);
> > > > +	fprintf(out, "force_link_speed : %u\n", opt->force_link_speed);
> > > > +	fprintf(out, "reassign_lids : %s\n", 
> > > > +		BOOLEAN_STR(opt->reassign_lids));
> > > > +	fprintf(out, "ignore_other_sm : %s\n", 
> > > > +		BOOLEAN_STR(opt->ignore_other_sm));
> > > > +	fprintf(out, "single_thread : %s\n", 
> > > > +		BOOLEAN_STR(opt->single_thread));
> > > > +	fprintf(out, "disable_multicast : %s\n", 
> > > > +		BOOLEAN_STR(opt->disable_multicast));
> > > > +	fprintf(out, "force_log_flush : %s\n", 
> > > > +		BOOLEAN_STR(opt->force_log_flush));
> > > > +	fprintf(out, "subnet_timeout : %u\n", opt->subnet_timeout);
> > > > +	fprintf(out, "packet_life_time : %u\n", opt->packet_life_time);
> > > > +	fprintf(out, "vl_stall_count : %u\n", opt->vl_stall_count);
> > > > +	fprintf(out, "leaf_vl_stall_count : %u\n", opt->leaf_vl_stall_count);
> > > > +	fprintf(out, "head_of_queue_lifetime : %u\n", opt->head_of_queue_lifetime);
> > > > +	fprintf(out, "leaf_head_of_queue_lifetime : %u\n", opt->leaf_head_of_queue_lifetime);
> > > > +	fprintf(out, "local_phy_errors_threshold : %u\n", opt->local_phy_errors_threshold);
> > > > +	fprintf(out, "overrun_errors_threshold : %u\n", opt->overrun_errors_threshold);
> > > > +	fprintf(out, "sminfo_polling_timeout : %u\n", opt->sminfo_polling_timeout);
> > > > +	fprintf(out, "polling_retry_number : %u\n", opt->polling_retry_number);
> > > > +	fprintf(out, "max_msg_fifo_timeout : %u\n", opt->max_msg_fifo_timeout);
> > > > +	fprintf(out, "force_heavy_sweep : %s\n", 
> > > > +		BOOLEAN_STR(opt->force_heavy_sweep));
> > > > +	fprintf(out, "log_flags : 0x%02x\n", opt->log_flags);
> > > > +	fprintf(out, "dump_files_dir : %s\n", 
> > > > +		opt->dump_files_dir ? opt->dump_files_dir : NULL_STR);
> > > > +	fprintf(out, "log_file : %s\n", 
> > > > +		opt->log_file ? opt->log_file : NULL_STR);
> > > > +	fprintf(out, "log_max_size : %lu\n", opt->log_max_size);
> > > > +	fprintf(out, "partition_config_file : %s\n", 
> > > > +		opt->partition_config_file ? opt->partition_config_file : NULL_STR);
> > > > +	fprintf(out, "no_partition_enforcement : %s\n", 
> > > > +		BOOLEAN_STR(opt->no_partition_enforcement));
> > > > +	fprintf(out, "qos : %s\n", 
> > > > +		BOOLEAN_STR(opt->qos));
> > > > +	fprintf(out, "qos_policy_file : %s\n", 
> > > > +		opt->qos_policy_file ? opt->qos_policy_file : NULL_STR);
> > > > +	fprintf(out, "accum_log_file: %s\n", 
> > > > +		BOOLEAN_STR(opt->accum_log_file));
> > > > +	fprintf(out, "console : %s\n", 
> > > > +		opt->console ? opt->console : NULL_STR);
> > > > +	fprintf(out, "console_port : %u\n", opt->console_port);
> > > > +	fprintf(out, "port_prof_ignore_file : %s\n", 
> > > > +		opt->port_prof_ignore_file ? opt->port_prof_ignore_file : NULL_STR);
> > > > +	fprintf(out, "port_profile_switch_nodes : %s\n", 
> > > > +		BOOLEAN_STR(opt->port_profile_switch_nodes));
> > > > +	fprintf(out, "sweep_on_trap : %s\n", 
> > > > +		BOOLEAN_STR(opt->sweep_on_trap));
> > > > +	fprintf(out, "routing_engine_names : %s\n", 
> > > > +		opt->routing_engine_names ? opt->routing_engine_names : NULL_STR);
> > > > +	fprintf(out, "use_ucast_cache : %s\n", 
> > > > +		BOOLEAN_STR(opt->use_ucast_cache));
> > > > +	fprintf(out, "connect_roots : %s\n", 
> > > > +		BOOLEAN_STR(opt->connect_roots));
> > > > +	fprintf(out, "lid_matrix_dump_file : %s\n", 
> > > > +		opt->lid_matrix_dump_file ? opt->lid_matrix_dump_file : NULL_STR);
> > > > +	fprintf(out, "lfts_file : %s\n", 
> > > > +		opt->lfts_file ? opt->lfts_file : NULL_STR);
> > > > +	fprintf(out, "root_guid_file : %s\n", 
> > > > +		opt->root_guid_file ? opt->root_guid_file : NULL_STR);
> > > > +	fprintf(out, "cn_guid_file : %s\n", 
> > > > +		opt->cn_guid_file ? opt->cn_guid_file : NULL_STR);
> > > > +	fprintf(out, "ids_guid_file : %s\n", 
> > > > +		opt->ids_guid_file ? opt->ids_guid_file : NULL_STR);
> > > > +	fprintf(out, "guid_routing_order_file : %s\n", 
> > > > +		opt->guid_routing_order_file ? opt->guid_routing_order_file : NULL_STR);
> > > > +	fprintf(out, "sa_db_file : %s\n", 
> > > > +		opt->sa_db_file ? opt->sa_db_file : NULL_STR);
> > > > +	fprintf(out, "exit_on_fatal : %s\n", 
> > > > +		BOOLEAN_STR(opt->exit_on_fatal));
> > > > +	fprintf(out, "honor_guid2lid_file : %s\n", 
> > > > +		BOOLEAN_STR(opt->honor_guid2lid_file));
> > > > +	fprintf(out, "daemon : %s\n", 
> > > > +		BOOLEAN_STR(opt->daemon));
> > > > +	fprintf(out, "sm_inactive : %s\n", 
> > > > +		BOOLEAN_STR(opt->sm_inactive));
> > > > +	fprintf(out, "babbling_port_policy : %s\n", 
> > > > +		BOOLEAN_STR(opt->babbling_port_policy));
> > > > +	dump_qos_options(&opt->qos_options, &opt->qos_options, "qos", out);
> > > > +	dump_qos_options(&opt->qos_ca_options, &opt->qos_options, "qos_ca", out);
> > > > +	dump_qos_options(&opt->qos_sw0_options, &opt->qos_options, "qos_sw0", out);
> > > > +	dump_qos_options(&opt->qos_swe_options, &opt->qos_options, "qos_swe", out);
> > > > +	dump_qos_options(&opt->qos_rtr_options, &opt->qos_options, "qos_rtr", out);
> > > > +	fprintf(out, "enable_quirks : %s\n", 
> > > > +		BOOLEAN_STR(opt->enable_quirks));
> > > > +	fprintf(out, "no_clients_rereg : %s\n", 
> > > > +		BOOLEAN_STR(opt->no_clients_rereg));
> > > > +#ifdef ENABLE_OSM_PERF_MGR
> > > > +	fprintf(out, "perfmgr : %s\n", 
> > > > +		BOOLEAN_STR(opt->perfmgr));
> > > > +	fprintf(out, "perfmgr_redir : %s\n", 
> > > > +		BOOLEAN_STR(opt->perfmgr_redir));
> > > > +	fprintf(out, "perfmgr_sweep_time_s : %u\n", opt->perfmgr_sweep_time_s);
> > > > +	fprintf(out, "perfmgr_max_outstanding_queries : %u\n", opt->perfmgr_max_outstanding_queries);
> > > > +	fprintf(out, "event_db_dump_file : %s\n", 
> > > > +		opt->event_db_dump_file ? opt->event_db_dump_file : NULL_STR);
> > > > +#endif
> > > > +	fprintf(out, "event_plugin_name : %s\n", 
> > > > +		opt->event_plugin_name ? opt->event_plugin_name : NULL_STR);
> > > > +	fprintf(out, "node_name_map_name : %s\n", 
> > > > +		opt->node_name_map_name ? opt->node_name_map_name : NULL_STR);
> > > > +	fprintf(out, "prefix_routes_file : %s\n", 
> > > > +		opt->prefix_routes_file ? opt->prefix_routes_file : NULL_STR);
> > > > +	fprintf(out, "consolidate_ipv6_snm_req : %s\n", 
> > > > +		BOOLEAN_STR(opt->consolidate_ipv6_snm_req));
> > > > +}
> > > > +
> > > >  static void quit_parse(char **p_last, osm_opensm_t * p_osm, FILE * out)
> > > >  {
> > > >  	osm_console_exit(&p_osm->console, &p_osm->log);
> > > > @@ -1166,6 +1323,7 @@ static const struct command console_cmds[] = {
> > > >  	{"portstatus", &help_portstatus, &portstatus_parse},
> > > >  	{"switchbalance", &help_switchbalance, &switchbalance_parse},
> > > >  	{"lidbalance", &help_lidbalance, &lidbalance_parse},
> > > > +	{"dump_conf", &help_dump_conf, &dump_conf_parse},
> > > >  	{"version", &help_version, &version_parse},
> > > >  #ifdef ENABLE_OSM_PERF_MGR
> > > >  	{"perfmgr", &help_perfmgr, &perfmgr_parse},
> > > > -- 
> > > > 1.5.4.5
> > > > 
> > > 
-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-support-dump_conf-console-command.patch
Type: text/x-patch
Size: 11772 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081110/54bb3cb9/attachment.bin>

From rpearson at systemfabricworks.com  Mon Nov 10 13:47:58 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Mon, 10 Nov 2008 15:47:58 -0600
Subject: [ofa-general] opensm support for toroidal meshes
Message-ID: <000501c9437d$ffa7cd90$fef768b0$@com>

We have been involved in a project to deliver a large system based on a
toroidal mesh fabric. One of the requirements for this system is to
guarantee deadlock-free routing of the fabric. The lash routing engine
in opensm did not work in this case because the number of VLs it required
for the machine as configured was 12, which exceeded the number of VLs
supported by Mellanox switch ASICs. It turns out that if the port
assignments used by lash are ordered optimally, lash can successfully
route the fabric, but doing that reordering in the hardware is
impractical. The attached note describes an algorithm for automatically
recognizing when a Cartesian mesh fabric is a torus, determining its size,
and optimally reordering the ports within opensm so that lash can generate
a routing with the smallest number of VLs.

We have implemented a set of changes to opensm that implement this algorithm
and will submit them as patches. The note should help in understanding
the code.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lash_changes.doc
Type: application/msword
Size: 411136 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081110/8c8befd2/attachment.doc>

From rdreier at cisco.com  Mon Nov 10 14:37:08 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 10 Nov 2008 14:37:08 -0800
Subject: [ofa-general] Higher than usual latency (new baby)
Message-ID: <adavduvp5rf.fsf@cisco.com>

Hi everyone,

My wife gave birth to a son on November 6.  Everyone is healthy and
doing well.  But for obvious reasons you should expect me to be a lot
less responsive than usual for the next few weeks.

Thanks,
  Roland


From pradeeps at linux.vnet.ibm.com  Mon Nov 10 15:30:34 2008
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Mon, 10 Nov 2008 15:30:34 -0800
Subject: [ofa-general] [PATCH] ipoib: null tx/rx_ring skb pointers on free
In-Reply-To: <4916B318.50503@voltaire.com>
References: <20081106012307.GP31163@sgi.com>	<200811061712.50605.jackm@dev.mellanox.co.il>	<20081106164005.GS31163@sgi.com>
	<49147107.2090600@linux.vnet.ibm.com> <4916B318.50503@voltaire.com>
Message-ID: <4918C41A.9060609@linux.vnet.ibm.com>

Or Gerlitz wrote:
> Pradeep Satyanarayana wrote:
>> If I am not mistaken we saw a problem that showed similar
>> characteristics more than two years ago on IBM platforms. The same
>> issue of rx_ring reusing tx_ring skbs and so on and would show up only
>> under stress. This was with UD mode (before CM came into the picture)
>> and it turned out to be a driver issue. 
> Can you send pointer to the relevant thread / commit that solved this
> issue?
Or,

Even though I searched the archives, I could not locate that particular one.
I know that Nam submitted the patch, and that it was in the June/July 2006
time frame. It was a missing read memory barrier in the ehca driver. I am
copying him so that he can help.

Pradeep


From rpearson at systemfabricworks.com  Mon Nov 10 22:44:49 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Tue, 11 Nov 2008 00:44:49 -0600
Subject: [ofa-general] [PATCH] opensm: skeleton for toroidal mesh analysis
Message-ID: <000001c943c8$fef921f0$fceb65d0$@com>

Sasha, 

Here is the first patch in a series to implement the algorithm described in
the file lash_changes.doc.

This patch
      - creates a new command line flag --do_mesh_analysis and a new Boolean
        that is set if the flag is used.
      - adds code to main to implement the flag and option.
      - creates a new file osm_mesh.c to hold the algorithm code
      - moves declarations from osm_ucast_lash.c and osm_mesh.c into header
        files
      - adds these files to Makefile.am
      - adds a stub do_mesh_analysis() that is called from lash_core.

Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>

-----

diff --git a/opensm/include/opensm/osm_mesh.h b/opensm/include/opensm/osm_mesh.h
new file mode 100644
index 0000000..1467440
--- /dev/null
+++ b/opensm/include/opensm/osm_mesh.h
@@ -0,0 +1,46 @@
+/*
+ * Copyright (c) 2008      System Fabric Works, Inc.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ *      Declarations for mesh analysis
+ */
+
+#ifndef OSM_UCAST_MESH_H
+#define OSM_UCAST_MESH_H
+
+struct _lash;
+
+int do_mesh_analysis(struct _lash *p_lash);
+
+#endif
diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 7259587..2abe36d 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -215,6 +215,7 @@ typedef struct osm_subn_opt {
 	char *node_name_map_name;
 	char *prefix_routes_file;
 	boolean_t consolidate_ipv6_snm_req;
+	boolean_t do_mesh_analysis;
 } osm_subn_opt_t;
 /*
 * FIELDS
diff --git a/opensm/include/opensm/osm_ucast_lash.h b/opensm/include/opensm/osm_ucast_lash.h
new file mode 100644
index 0000000..646e9a3
--- /dev/null
+++ b/opensm/include/opensm/osm_ucast_lash.h
@@ -0,0 +1,100 @@
+/*
+ * Copyright (c) 2008      System Fabric Works, Inc.
+ * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2007      Simula Research Laboratory. All rights reserved.
+ * Copyright (c) 2007      Silicon Graphics Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ *      Declarations for LASH algorithm
+ */
+
+#ifndef OSM_UCAST_LASH_H
+#define OSM_UCAST_LASH_H
+
+enum {
+	UNQUEUED,
+	Q_MEMBER,
+	MST_MEMBER,
+	MAX_INT = 9999,
+	NONE = MAX_INT
+};
+
+typedef struct _cdg_vertex {
+	int num_dependencies;
+	struct _cdg_vertex **dependency;
+	int from;
+	int to;
+	int seen;
+	int temp;
+	int visiting_number;
+	struct _cdg_vertex *next;
+	int num_temp_depend;
+	int num_using_vertex;
+	int *num_using_this_depend;
+} cdg_vertex_t;
+
+typedef struct _reachable_dest {
+	int switch_id;
+	struct _reachable_dest *next;
+} reachable_dest_t;
+
+typedef struct _switch {
+	osm_switch_t *p_sw;
+	int *dij_channels;
+	int id;
+	int used_channels;
+	int q_state;
+	struct routing_table {
+		unsigned out_link;
+		unsigned lane;
+	} *routing_table;
+	unsigned int num_connections;
+	int *virtual_physical_port_table;
+	int *phys_connections;
+} switch_t;
+
+typedef struct _lash {
+	osm_opensm_t *p_osm;
+	int num_switches;
+	uint8_t vl_min;
+	int balance_limit;
+	switch_t **switches;
+	cdg_vertex_t ****cdg_vertex_matrix;
+	int *num_mst_in_lane;
+	int ***virtual_location;
+} lash_t;
+
+#endif
diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am
index 01573d2..7b9da18 100644
--- a/opensm/opensm/Makefile.am
+++ b/opensm/opensm/Makefile.am
@@ -31,7 +31,7 @@ opensm_SOURCES = main.c osm_console_io.c osm_console.c osm_db_files.c \
 		 osm_inform.c osm_lid_mgr.c osm_lin_fwd_rcv.c \
 		 osm_link_mgr.c osm_mcast_fwd_rcv.c \
 		 osm_mcast_mgr.c osm_mcast_tbl.c osm_mcm_info.c \
-		 osm_mcm_port.c osm_mtree.c osm_multicast.c osm_node.c \
+		 osm_mcm_port.c osm_mesh.c osm_mtree.c osm_multicast.c osm_node.c \
 		 osm_node_desc_rcv.c osm_node_info_rcv.c \
 		 osm_opensm.c osm_pkey.c osm_pkey_mgr.c osm_pkey_rcv.c \
 		 osm_port.c osm_port_info_rcv.c \
@@ -76,6 +76,7 @@ opensminclude_HEADERS = \
 	$(srcdir)/../include/opensm/osm_errors.h \
 	$(srcdir)/../include/opensm/osm_helper.h \
 	$(srcdir)/../include/opensm/osm_inform.h \
+	$(srcdir)/../include/opensm/osm_ucast_lash.h \
 	$(srcdir)/../include/opensm/osm_lid_mgr.h \
 	$(srcdir)/../include/opensm/osm_log.h \
 	$(srcdir)/../include/opensm/osm_mad_pool.h \
@@ -83,6 +84,7 @@ opensminclude_HEADERS = \
 	$(srcdir)/../include/opensm/osm_mcast_tbl.h \
 	$(srcdir)/../include/opensm/osm_mcm_info.h \
 	$(srcdir)/../include/opensm/osm_mcm_port.h \
+	$(srcdir)/../include/opensm/osm_mesh.h \
 	$(srcdir)/../include/opensm/osm_mtree.h \
 	$(srcdir)/../include/opensm/osm_multicast.h \
 	$(srcdir)/../include/opensm/osm_msgdef.h \
diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index 53648d6..63bd5a6 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -585,6 +585,7 @@ int main(int argc, char *argv[])
 #endif
 		{"prefix_routes_file", 1, NULL, 3},
 		{"consolidate_ipv6_snm_req", 0, NULL, 4},
+		{"do_mesh_analysis", 0, NULL, 5},
 		{NULL, 0, NULL, 0}	/* Required at the end of the array */
 	};
 
@@ -922,6 +923,9 @@ int main(int argc, char *argv[])
 		case 4:
 			opt.consolidate_ipv6_snm_req = TRUE;
 			break;
+		case 5:
+			opt.do_mesh_analysis = TRUE;
+			break;
 		case 'h':
 		case '?':
 		case ':':
diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
new file mode 100644
index 0000000..7943274
--- /dev/null
+++ b/opensm/opensm/osm_mesh.c
@@ -0,0 +1,65 @@
+/*
+ * Copyright (c) 2008      System Fabric Works, Inc.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ *      routines to analyze certain meshes
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif				/* HAVE_CONFIG_H */
+
+#include <stdio.h>
+#include <opensm/osm_switch.h>
+#include <opensm/osm_opensm.h>
+#include <opensm/osm_log.h>
+#include <opensm/osm_mesh.h>
+#include <opensm/osm_ucast_lash.h>
+
+/*
+ * do_mesh_analysis
+ */
+int do_mesh_analysis(lash_t *p_lash)
+{
+	int ret = 0;
+	osm_log_t *p_log = &p_lash->p_osm->log;
+
+	OSM_LOG_ENTER(p_log);
+
+	printf("lash: do_mesh_analysis stub called\n");
+
+	OSM_LOG_EXIT(p_log);
+
+	return ret;
+}
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index c082798..e10371c 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -52,64 +52,13 @@
 #include <opensm/osm_switch.h>
 #include <opensm/osm_opensm.h>
 #include <opensm/osm_log.h>
+#include <opensm/osm_mesh.h>
+#include <opensm/osm_ucast_lash.h>
 
 /* //////////////////////////// */
 /*  Local types                 */
 /* //////////////////////////// */
 
-enum {
-	UNQUEUED,
-	Q_MEMBER,
-	MST_MEMBER,
-	MAX_INT = 9999,
-	NONE = MAX_INT
-};
-
-typedef struct _cdg_vertex {
-	int num_dependencies;
-	struct _cdg_vertex **dependency;
-	int from;
-	int to;
-	int seen;
-	int temp;
-	int visiting_number;
-	struct _cdg_vertex *next;
-	int num_temp_depend;
-	int num_using_vertex;
-	int *num_using_this_depend;
-} cdg_vertex_t;
-
-typedef struct _reachable_dest {
-	int switch_id;
-	struct _reachable_dest *next;
-} reachable_dest_t;
-
-typedef struct _switch {
-	osm_switch_t *p_sw;
-	int *dij_channels;
-	int id;
-	int used_channels;
-	int q_state;
-	struct routing_table {
-		unsigned out_link;
-		unsigned lane;
-	} *routing_table;
-	unsigned int num_connections;
-	int *virtual_physical_port_table;
-	int *phys_connections;
-} switch_t;
-
-typedef struct _lash {
-	osm_opensm_t *p_osm;
-	int num_switches;
-	uint8_t vl_min;
-	int balance_limit;
-	switch_t **switches;
-	cdg_vertex_t ****cdg_vertex_matrix;
-	int *num_mst_in_lane;
-	int ***virtual_location;
-} lash_t;
-
 static cdg_vertex_t *create_cdg_vertex(unsigned num_switches)
 {
 	cdg_vertex_t *cdg_vertex = (cdg_vertex_t *) malloc(sizeof(cdg_vertex_t));
@@ -872,10 +821,15 @@ static int lash_core(lash_t * p_lash)
 	int output_link2, i_next_switch2;
 	int cycle_found2 = 0;
 	int status = 0;
-	int *switch_bitmap;	/* Bitmap to check if we have processed this pair */
+	int *switch_bitmap = NULL;	/* Bitmap to check if we have processed this pair */
 
 	OSM_LOG_ENTER(p_log);
 
+	if (p_lash->p_osm->subn.opt.do_mesh_analysis && do_mesh_analysis(p_lash)) {
+		OSM_LOG(p_log, OSM_LOG_ERROR, "Mesh analysis failed\n");
+		goto Exit;
+	}
+
 	for (i = 0; i < num_switches; i++) {
 
 		shortest_path(p_lash, i);
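
The main.c hunk above wires the new switch through a long-only getopt_long()
entry whose val (5) is matched in the option switch. A minimal standalone
sketch of that mechanism follows; demo_opts and demo_parse are hypothetical
names used only for illustration and are not part of OpenSM.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/* Hypothetical option holder; stands in for osm_subn_opt_t. */
struct demo_opts {
	int consolidate_ipv6_snm_req;
	int do_mesh_analysis;
};

/* Returns 0 on success, -1 on an unrecognised option. */
static int demo_parse(int argc, char *argv[], struct demo_opts *opt)
{
	/* flag == NULL in every entry, so getopt_long() returns val. */
	static const struct option long_option[] = {
		{"consolidate_ipv6_snm_req", no_argument, NULL, 4},
		{"do_mesh_analysis", no_argument, NULL, 5},
		{NULL, 0, NULL, 0}	/* Required at the end of the array */
	};
	int c;

	assert(opt != NULL);
	optind = 1;		/* start scanning at argv[1] */
	while ((c = getopt_long(argc, argv, "", long_option, NULL)) != -1) {
		switch (c) {
		case 4:
			opt->consolidate_ipv6_snm_req = 1;
			break;
		case 5:
			opt->do_mesh_analysis = 1;
			break;
		default:	/* '?' is returned for anything unknown */
			return -1;
		}
	}
	return 0;
}
```

Small integer val codes such as 4 and 5 keep long-only options from
colliding with the printable characters used for short options.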


From rpearson at systemfabricworks.com  Mon Nov 10 23:26:32 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Tue, 11 Nov 2008 01:26:32 -0600
Subject: [ofa-general] [PATCH][2] opensm: per mesh data
Message-ID: <000101c943ce$d2707880$77516980$@com>

Sasha,

Here is the second patch implementing the mesh analysis algorithm.

This patch:
      - creates a data structure, mesh_t, that holds per mesh information
      - adds a pointer to this structure in lash_t
      - creates methods to allocate and free memory for mesh_t
      - adds osm_ prefix to global routine names (oops)
      - calls create and cleanup methods

Regards,

Bob Pearson

Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
----
diff --git a/opensm/include/opensm/osm_mesh.h b/opensm/include/opensm/osm_mesh.h
index 1467440..8313614 100644
--- a/opensm/include/opensm/osm_mesh.h
+++ b/opensm/include/opensm/osm_mesh.h
@@ -41,6 +41,18 @@
 
 struct _lash;
 
-int do_mesh_analysis(struct _lash *p_lash);
+/*
+ * per fabric mesh info
+ */
+typedef struct _mesh {
+	int num_class;			/* number of switch classes */
+	int *class_type;		/* index of first switch found for each class */
+	int *class_count;		/* population of each class */
+	int dimension;			/* mesh dimension */
+	int *size;			/* an array to hold size of mesh */
+} mesh_t;
+
+void osm_mesh_cleanup(struct _lash *p_lash);
+int osm_do_mesh_analysis(struct _lash *p_lash);
 
 #endif
diff --git a/opensm/include/opensm/osm_ucast_lash.h b/opensm/include/opensm/osm_ucast_lash.h
index 646e9a3..1ae3bb6 100644
--- a/opensm/include/opensm/osm_ucast_lash.h
+++ b/opensm/include/opensm/osm_ucast_lash.h
@@ -95,6 +95,7 @@ typedef struct _lash {
 	cdg_vertex_t ****cdg_vertex_matrix;
 	int *num_mst_in_lane;
 	int ***virtual_location;
+	mesh_t *mesh;
 } lash_t;
 
 #endif
diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
index 7943274..c97925b 100644
--- a/opensm/opensm/osm_mesh.c
+++ b/opensm/opensm/osm_mesh.c
@@ -41,6 +41,7 @@
 #endif				/* HAVE_CONFIG_H */
 
 #include <stdio.h>
+#include <stdlib.h>
 #include <opensm/osm_switch.h>
 #include <opensm/osm_opensm.h>
 #include <opensm/osm_log.h>
@@ -48,15 +49,72 @@
 #include <opensm/osm_ucast_lash.h>
 
 /*
+ * osm_mesh_cleanup - free per mesh resources
+ */
+void osm_mesh_cleanup(lash_t *p_lash)
+{
+	mesh_t *mesh = p_lash->mesh;
+
+	if (mesh) {
+		if (mesh->class_type)
+			free(mesh->class_type);
+
+		if (mesh->class_count)
+			free(mesh->class_count);
+
+		free(mesh);
+
+		p_lash->mesh = NULL;
+	}
+}
+
+/*
+ * mesh_create - allocate per mesh resources
+ */
+static int mesh_create(lash_t *p_lash)
+{
+	osm_log_t *p_log = &p_lash->p_osm->log;
+	mesh_t *mesh;
+
+	if(!(mesh = p_lash->mesh = calloc(1, sizeof(mesh_t)))) {
+		OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating mesh - out of memory\n");
+		return -1;
+	}
+
+	if (!(mesh->class_type = calloc(p_lash->num_switches, sizeof(int)))) {
+		OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating mesh->class_type - out of memory\n");
+		free(mesh);
+		return -1;
+	}
+
+	if (!(mesh->class_count = calloc(p_lash->num_switches, sizeof(int)))) {
+		OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating mesh->class_count - out of memory\n");
+		free(mesh->class_type);
+		free(mesh);
+		return -1;
+	}
+
+	return 0;
+}
+
+/*
  * do_mesh_analysis
  */
-int do_mesh_analysis(lash_t *p_lash)
+int osm_do_mesh_analysis(lash_t *p_lash)
 {
 	int ret = 0;
 	osm_log_t *p_log = &p_lash->p_osm->log;
 
 	OSM_LOG_ENTER(p_log);
 
+	/*
+	 * allocate per mesh data structures
+	 */
+	if (mesh_create(p_lash)) {
+		OSM_LOG_EXIT(p_log);
+		return -1;
+	}
+
 	printf("lash: do_mesh_analysis stub called\n");
 
 	OSM_LOG_EXIT(p_log);
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index e10371c..3577cca 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -825,7 +825,7 @@ static int lash_core(lash_t * p_lash)
 
 	OSM_LOG_ENTER(p_log);
 
-	if (p_lash->p_osm->subn.opt.do_mesh_analysis && do_mesh_analysis(p_lash)) {
+	if (p_lash->p_osm->subn.opt.do_mesh_analysis && osm_do_mesh_analysis(p_lash)) {
 		OSM_LOG(p_log, OSM_LOG_ERROR, "Mesh analysis failed\n");
 		goto Exit;
 	}
@@ -1124,6 +1124,8 @@ static void lash_cleanup(lash_t * p_lash)
 		free(p_lash->switches);
 	}
 	p_lash->switches = NULL;
+
+	osm_mesh_cleanup(p_lash);
 }
 
 /*
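
One point to watch in a create/cleanup pair like mesh_create() and
osm_mesh_cleanup(): in the hunk above the error paths free mesh but leave
p_lash->mesh pointing at it, so a later osm_mesh_cleanup() from
lash_cleanup() would appear to free the same memory again. The sketch below
shows one way to keep a single error path; the demo_* types are reduced,
hypothetical stand-ins for mesh_t and lash_t, not the OpenSM definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Reduced, hypothetical stand-ins for mesh_t and lash_t. */
typedef struct demo_mesh {
	int *class_type;
	int *class_count;
} demo_mesh_t;

typedef struct demo_lash {
	int num_switches;
	demo_mesh_t *mesh;
} demo_lash_t;

/* Safe to call repeatedly and on a partially built mesh. */
static void demo_mesh_cleanup(demo_lash_t *p_lash)
{
	demo_mesh_t *mesh = p_lash->mesh;

	if (!mesh)
		return;
	free(mesh->class_type);		/* free(NULL) is a no-op */
	free(mesh->class_count);
	free(mesh);
	p_lash->mesh = NULL;		/* never leave a dangling pointer */
}

/* Returns 0 on success, -1 on allocation failure. */
static int demo_mesh_create(demo_lash_t *p_lash)
{
	demo_mesh_t *mesh;
	size_t n = (size_t)p_lash->num_switches;

	assert(p_lash->mesh == NULL && p_lash->num_switches > 0);
	if (!(mesh = p_lash->mesh = calloc(1, sizeof(*mesh))))
		return -1;
	if (!(mesh->class_type = calloc(n, sizeof(int))) ||
	    !(mesh->class_count = calloc(n, sizeof(int)))) {
		demo_mesh_cleanup(p_lash);	/* single error path */
		return -1;
	}
	return 0;
}
```

Routing every failure through the cleanup routine means the pointer in the
owning structure is reset in exactly one place.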


From rpearson at systemfabricworks.com  Tue Nov 11 00:06:03 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Tue, 11 Nov 2008 02:06:03 -0600
Subject: [ofa-general] [PATCH][3] opensm: per mesh node information
Message-ID: <000501c943d4$57b3f8f0$071bead0$@com>

Sasha,

This is the third patch implementing the mesh analysis algorithm

This patch
      - creates per mesh node (e.g. switch) data structure mesh_node_t
      - adds a pointer to mesh_node_t in the switch_t structure
      - implements create and cleanup methods for mesh_node_t
      - calls these in switch_create and switch_delete in *lash.c

Regards,

Bob Pearson

Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
----
diff --git a/opensm/include/opensm/osm_mesh.h b/opensm/include/opensm/osm_mesh.h
index 8313614..78af086 100644
--- a/opensm/include/opensm/osm_mesh.h
+++ b/opensm/include/opensm/osm_mesh.h
@@ -40,6 +40,39 @@
 #define OSM_UCAST_MESH_H
 
 struct _lash;
+struct _switch;
+
+enum mesh_node_type {
+	mesh_type_none,
+	mesh_type_cartesian,
+};
+
+/*
+ * per switch to switch link info
+ */
+typedef struct _link {
+	int switch_id;
+	int link_id;
+	int *ports;
+	int num_ports;
+	int next_port;
+} link_t;
+
+/*
+ * per switch node mesh info
+ */
+typedef struct _mesh_node {
+	unsigned int num_links;		/* number of 'links' to adjacent switches */
+	link_t **links;			/* per link information */
+	int *axes;			/* used to hold and reorder assigned axes */
+	int *coord;			/* mesh coordinates of switch */
+	int **matrix;			/* distances between adjacent switches */
+	int *poly;			/* characteristic polynomial of matrix */
+					/* used as an invariant classification */
+	enum mesh_node_type type;
+	int dimension;			/* apparent dimension of mesh around node */
+	int temp;			/* temporary holder for distance info */
+} mesh_node_t;
 
 /*
  * per fabric mesh info
@@ -55,4 +88,7 @@ typedef struct _mesh {
 void osm_mesh_cleanup(struct _lash *p_lash);
 int osm_do_mesh_analysis(struct _lash *p_lash);
 
+void osm_mesh_node_cleanup(struct _switch *sw);
+int osm_mesh_node_create(struct _lash *p_lash, struct _switch *sw);
+
 #endif
diff --git a/opensm/include/opensm/osm_ucast_lash.h b/opensm/include/opensm/osm_ucast_lash.h
index 1ae3bb6..c037571 100644
--- a/opensm/include/opensm/osm_ucast_lash.h
+++ b/opensm/include/opensm/osm_ucast_lash.h
@@ -81,6 +81,7 @@ typedef struct _switch {
 		unsigned out_link;
 		unsigned lane;
 	} *routing_table;
+	mesh_node_t *node;
 	unsigned int num_connections;
 	int *virtual_physical_port_table;
 	int *phys_connections;
diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
index c97925b..6ef397c 100644
--- a/opensm/opensm/osm_mesh.c
+++ b/opensm/opensm/osm_mesh.c
@@ -98,7 +98,7 @@ static int mesh_create(lash_t *p_lash)
 }
 
 /*
- * do_mesh_analysis
+ * osm_do_mesh_analysis
  */
 int osm_do_mesh_analysis(lash_t *p_lash)
 {
@@ -121,3 +121,83 @@ int osm_do_mesh_analysis(lash_t *p_lash)
 
 	return ret;
 }
+
+/*
+ * osm_mesh_node_cleanup - cleanup per switch resources
+ */
+void osm_mesh_node_cleanup(switch_t *sw)
+{
+	int i;
+	mesh_node_t *node = sw->node;
+	unsigned num_ports = sw->p_sw->num_ports;
+
+	if (node) {
+		if (node->links) {
+			for (i = 0; i < num_ports; i++) {
+				if (node->links[i]) {
+					if (node->links[i]->ports)
+						free(node->links[i]->ports);
+					free(node->links[i]);
+				}
+			}
+			free(node->links);
+		}
+
+		if (node->poly)
+			free(node->poly);
+
+		if (node->matrix) {
+			for (i = 0; i < node->num_links; i++) {
+				if (node->matrix[i])
+					free(node->matrix[i]);
+			}
+			free(node->matrix);
+		}
+
+		if (node->axes)
+			free(node->axes);
+
+		free(node);
+
+		sw->node = NULL;
+	}
+}
+
+/*
+ * osm_mesh_node_create - allocate per switch resources
+ */
+int osm_mesh_node_create(lash_t *p_lash, switch_t *sw)
+{
+	osm_log_t *p_log = &p_lash->p_osm->log;
+	int i;
+	mesh_node_t *node;
+	unsigned num_ports = sw->p_sw->num_ports;
+
+	if (!(node = sw->node = calloc(1, sizeof(mesh_node_t)))) {
+		OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating mesh node - out of memory\n");
+		return -1;
+	}
+
+	if (!(node->links = calloc(num_ports, sizeof(link_t *))))
+		goto err;
+
+	for (i = 0; i < num_ports; i++) {
+		if (!(node->links[i] = calloc(1, sizeof(link_t))) ||
+		    !(node->links[i]->ports = calloc(num_ports, sizeof(int))))
+			goto err;
+	}
+
+	if (!(node->axes = calloc(num_ports, sizeof(int))))
+		goto err;
+
+	for (i = 0; i < num_ports; i++) {
+		node->links[i]->switch_id = NONE;
+	}
+
+	return 0;
+
+err:
+	OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating mesh node - out of memory\n");
+	osm_mesh_node_cleanup(sw);
+	return -1;
+}
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index 3577cca..b9394af 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -651,6 +651,9 @@ static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw
 		sw->phys_connections[i] = NONE;
 	}
 
+	if (osm_mesh_node_create(p_lash, sw))
+		return NULL;
+
 	sw->p_sw = p_sw;
 	if (p_sw)
 		p_sw->priv = sw;
@@ -660,6 +663,8 @@ static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw
 
 static void switch_delete(switch_t * sw)
 {
+	osm_mesh_node_cleanup(sw);
+
 	if (sw->dij_channels)
 		free(sw->dij_channels);
 	if (sw->virtual_physical_port_table)
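
Because switch_create() returns a switch_t *, a failure in
osm_mesh_node_create() has to be reported to the caller as NULL, and the
partially built switch should be released on the way out. A minimal sketch
of that shape follows; all demo_* names are hypothetical, and the
inject_failure argument exists only so the error path can be exercised.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-ins for mesh_node_t and switch_t. */
typedef struct demo_node {
	int *axes;
} demo_node_t;

typedef struct demo_switch {
	unsigned num_ports;
	int *phys_connections;
	demo_node_t *node;
} demo_switch_t;

static void demo_node_cleanup(demo_switch_t *sw)
{
	if (sw->node) {
		free(sw->node->axes);
		free(sw->node);
		sw->node = NULL;
	}
}

/* inject_failure is a test hook: it simulates calloc() returning NULL. */
static int demo_node_create(demo_switch_t *sw, int inject_failure)
{
	if (!(sw->node = calloc(1, sizeof(*sw->node))))
		return -1;
	if (inject_failure ||
	    !(sw->node->axes = calloc(sw->num_ports, sizeof(int)))) {
		demo_node_cleanup(sw);
		return -1;
	}
	return 0;
}

static void demo_switch_delete(demo_switch_t *sw)
{
	demo_node_cleanup(sw);
	free(sw->phys_connections);
	free(sw);
}

/* Returns NULL on any failure, never an integer error code. */
static demo_switch_t *demo_switch_create(unsigned num_ports, int inject_failure)
{
	demo_switch_t *sw;

	assert(num_ports > 0);
	if (!(sw = calloc(1, sizeof(*sw))))
		return NULL;
	sw->num_ports = num_ports;
	if (!(sw->phys_connections = calloc(num_ports, sizeof(int))) ||
	    demo_node_create(sw, inject_failure)) {
		demo_switch_delete(sw);	/* releases whatever was built so far */
		return NULL;
	}
	return sw;
}
```

Callers then need only a single check of the returned pointer.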


From tziporet at mellanox.co.il  Tue Nov 11 01:02:45 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 11 Nov 2008 11:02:45 +0200
Subject: [ofa-general] OFED Nov 10 2008 meeting minutes on OFED 1.4 status
In-Reply-To: <458BC6B0F287034F92FE78908BD01CE84EF33FC0@mtlexch01.mtl.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD0FE6FE@mtlexch01.mtl.com>

OFED Nov 10 2008 meeting minutes on OFED 1.4 status:

Meeting minutes on the web:
http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes/

Meeting Summary:
==============
* RC4 will be released on Tuesday Nov 11 (today)
* RC5 will be released next week on Monday Nov 17
* GA is planned for Nov 24
* Everyone must send updates to the documents and release notes for RC5
* We may need a short meeting to follow up on RC5 and the GA release
next Monday - stay tuned

Details:
=======
* Since there are some critical bugs we must have RC5.
* We do not wish to delay the release to Dec., thus we all must focus on
these bugs and do everything to resolve them on time

Bugs to be fixed in RC5: 

Bug   Sev  Owner                             Summary - status
1323  blo  stefan.roscher at de.ibm.com       IB/ehca: possibility of kernel panic under certain circu... - in rc4
1370  blo  vlad at mellanox.co.il             Ping over IPoIB I/F fails after ifconfig down and up - ongoing
1364  cri  swise at opengridcomputing.com     system hang on rmmod cxgb3 in rhel4.7 - Steve please update
1365  cri  swise at opengridcomputing.com     Panic on loading iw_cxgb3 in RHEL 4.6 - Steve please update
1366  cri  swise at opengridcomputing.com     Panic during boot-up after an OFED install in RHEL 4.5 - Steve please update
1242  cri  yannick.cote at qlogic.com         kernel panic while running mpi2007 against ofed1.4 -- ib_... - ongoing
1289  maj  amirv at mellanox.co.il            Ib and ipoib doesnt respond while running multiple tests ... - ongoing
1349  maj  amirv at mellanox.co.il            Kernel panic on sdp - ongoing
1336  maj  vlad at mellanox.co.il             Can't to unloading the mlx4_ib module on ppc64 - we will try to reproduce


Tziporet


From FENKES at de.ibm.com  Tue Nov 11 01:04:04 2008
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Tue, 11 Nov 2008 10:04:04 +0100
Subject: [ofa-general] Re: [PATCH] IB/ehca: Fix suppression of port
	activation events
In-Reply-To: <adaod0nqpx4.fsf@cisco.com>
References: <200806061835.43802.fenkes@de.ibm.com>	<48499C11.7030504@gmail.com>
	<200811071742.51867.fenkes@de.ibm.com> <adaod0nqpx4.fsf@cisco.com>
Message-ID: <OFA167ACE2.B60DCF5C-ONC12574FE.00319074-C12574FE.0031CE5F@de.ibm.com>

Roland Dreier <rdreier at cisco.com> wrote on 10.11.2008 21:36:23:

>  > A previous fix introduced a regression where port activation events were
>  > dropped unconditionally if port autodetection was not enabled. Fixed.
> 
> Is this a fix to "IB/ehca: Remove reference to special QP in case of
> port activation failure"?  Because if so I can roll it into that patch,
> since Linus hasn't pulled it yet.

Yes, that would be splendid, thank you!

Cheers,
  Joachim


From vlad at lists.openfabrics.org  Tue Nov 11 03:20:07 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 11 Nov 2008 03:20:07 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081111-0200 daily build status
Message-ID: <20081111112007.D76FEE60BD5@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From sashak at voltaire.com  Tue Nov 11 04:08:43 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 11 Nov 2008 14:08:43 +0200
Subject: [ofa-general] Re: [PATCH] opensm: fix iser service-id used for SL
	assignment
In-Reply-To: <Pine.LNX.4.64.0811061456540.3153@zuben.voltaire.com>
References: <Pine.LNX.4.64.0811061456540.3153@zuben.voltaire.com>
Message-ID: <20081111120843.GC3927@sashak.voltaire.com>

On 14:57 Thu 06 Nov     , Or Gerlitz wrote:
> RFC3720 says:
> 
> The well-known user TCP port number for iSCSI connections assigned by IANA is 3260
> and this is the default iSCSI port. Implementations needing a system TCP port number
> may use port 860, the port assigned by IANA as the iSCSI system port; however in
> order to use port 860, it MUST be explicitly specified - implementations MUST NOT
> default to use of port 860, as 3260 is the only allowed default.
> 
> Hence the SID used by iser is 0x0000000001060CBC and not 0x000000000106035C
> 
> Signed-off-by: Or Gerlitz  <ogerlitz at voltaire.com>
> Signed-off-by: Eli Dorfman <elid at voltaire.com>

Applied. Thanks.

Sasha
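
For reference, the two service IDs quoted above differ only in their low 16
bits, which hold the TCP port (3260 = 0x0CBC, 860 = 0x035C). The sketch
below rebuilds both values under the assumption that the layout is a 0x01
prefix byte, then the IP protocol number (6 for TCP), then the port; that
layout is inferred from the two values in the message, and the demo_* names
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { DEMO_IPPROTO_TCP = 6 };	/* IANA protocol number for TCP */

/* Assumed layout: 0x00000000 | 0x01 | protocol (8 bits) | port (16 bits). */
static uint64_t demo_tcp_service_id(uint16_t port)
{
	return (UINT64_C(0x01) << 24) |
	       ((uint64_t)DEMO_IPPROTO_TCP << 16) |
	       (uint64_t)port;
}

/* Checks the sketch against the two values quoted in the thread. */
static void demo_self_check(void)
{
	/* default iSCSI port, the value the patch switches to */
	assert(demo_tcp_service_id(3260) == UINT64_C(0x0000000001060CBC));
	/* iSCSI system port, the value the patch replaces */
	assert(demo_tcp_service_id(860) == UINT64_C(0x000000000106035C));
}
```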


From kliteyn at dev.mellanox.co.il  Tue Nov 11 04:08:22 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 11 Nov 2008 14:08:22 +0200
Subject: [ofa-general] [PATCH] opensm/Makefile.am: install
	QoS_management_in_OpenSM.txt
Message-ID: <491975B6.4070105@dev.mellanox.co.il>

Hi Sasha,

Following the patch from yesterday - adding
QoS_management_in_OpenSM.txt to tarball.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/Makefile.am |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/Makefile.am b/opensm/Makefile.am
index f8b66b3..02c693d 100644
--- a/opensm/Makefile.am
+++ b/opensm/Makefile.am
@@ -21,7 +21,7 @@ endif
 man_MANS = man/opensm.8 man/osmtest.8

 various_scripts = $(wildcard scripts/*)
-docs = doc/performance-manager-HOWTO.txt
+docs = doc/performance-manager-HOWTO.txt doc/QoS_management_in_OpenSM.txt

 EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS) $(docs)

-- 
1.5.1.4


From sashak at voltaire.com  Tue Nov 11 04:16:54 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 11 Nov 2008 14:16:54 +0200
Subject: [ofa-general] Re: [PATCH] opensm/Makefile.am: install
	QoS_management_in_OpenSM.txt
In-Reply-To: <491975B6.4070105@dev.mellanox.co.il>
References: <491975B6.4070105@dev.mellanox.co.il>
Message-ID: <20081111121654.GF3927@sashak.voltaire.com>

On 14:08 Tue 11 Nov     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> Following the patch from yesterday - adding
> QoS_management_in_OpenSM.txt to tarball.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Nice finding. Applied. Thanks.

Sasha

> ---
>  opensm/Makefile.am |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/opensm/Makefile.am b/opensm/Makefile.am
> index f8b66b3..02c693d 100644
> --- a/opensm/Makefile.am
> +++ b/opensm/Makefile.am
> @@ -21,7 +21,7 @@ endif
>  man_MANS = man/opensm.8 man/osmtest.8
> 
>  various_scripts = $(wildcard scripts/*)
> -docs = doc/performance-manager-HOWTO.txt
> +docs = doc/performance-manager-HOWTO.txt doc/QoS_management_in_OpenSM.txt
> 
>  EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS) $(docs)
> 
> -- 
> 1.5.1.4
> 


From halr at obsidianresearch.com  Tue Nov 11 06:23:34 2008
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Tue, 11 Nov 2008 07:23:34 -0700
Subject: [ofa-general] [PATCH] OpenSM/osm_subnet.c: Fix log_max_size
	conversion to MB
Message-ID: <49199566.2010505@obsidianresearch.com>

Sasha,

This patch fixes the conversion of log_max_size to MB introduced by 
commit 9954ead20c84586c6daaec5a1fba835eda0b4738

It does not address the overflow issues introduced by that change though.

-- Hal
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch-osm-logfilesize1
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081111/2e017b1d/attachment.ksh>
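
The attached patch is not reproduced in this archive. As a general
illustration of the overflow concern mentioned above, a size given in MB can
be converted to bytes with an explicit range check before the multiply. The
sketch below is an assumption about one reasonable approach, not the OpenSM
code; demo_mb_to_bytes is a hypothetical helper.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Convert a size in MB to bytes.
 * Returns 0 on success, -1 if the result would not fit in unsigned long
 * (in which case *bytes is left untouched).
 */
static int demo_mb_to_bytes(unsigned long mb, unsigned long *bytes)
{
	const unsigned long mb_size = 1024UL * 1024UL;

	assert(bytes != NULL);
	if (mb > ULONG_MAX / mb_size)
		return -1;	/* mb * mb_size would wrap around */
	*bytes = mb * mb_size;
	return 0;
}
```

Dividing the maximum by the multiplier first avoids performing the
multiplication that might wrap.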

From vlad at mellanox.co.il  Tue Nov 11 07:27:46 2008
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 11 Nov 2008 17:27:46 +0200
Subject: [ofa-general] OFED-1.4-rc4 is available
Message-ID: <1226417266.18330.77.camel@vlad-laptop>

Hi, 
OFED-1.4-rc4 release is available on 
http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-rc4.tgz 


To get BUILD_ID run ofed_info 

Please report any issues in bugzilla https://bugs.openfabrics.org/ for
OFED 1.4 

Vladimir & Tziporet

======================================================================== 

Release information: 
------------------------------ 
Linux Operating Systems: 
       - RedHat EL4 up4:       2.6.9-42.ELsmp      * 
       - RedHat EL4 up5:       2.6.9-55.ELsmp 
       - RedHat EL4 up6:       2.6.9-67.ELsmp 
       - RedHat EL4 up7:       2.6.9-78.ELsmp 
       - RedHat EL5:           2.6.18-8.el5 
       - RedHat EL5 up1:       2.6.18-53.el5 
       - RedHat EL5 up2:       2.6.18-92.el5 
       - CentOS 5.2:           2.6.18-92.el5 
       - Fedora C9:            2.6.25-14.fc9         * 
       - SLES10:               2.6.16.21-0.8-smp 
       - SLES10 SP1:           2.6.16.46-0.12-smp 
       - SLES10 SP1 up1:       2.6.16.53-0.16-smp 
       - SLES10 SP2:           2.6.16.60-0.21-smp 
       - OpenSuSE 10.3:        2.6.22.5-31          * 
       - kernel.org:           2.6.26 and 2.6.27 

     * Minimal QA for these versions 

Systems: 
       * x86_64 
       * x86 
       * ia64 
       * ppc64 


Main Changes from OFED-1.4-rc3
==============================
- Updated MPI packages: mvapich-1.1.0-3128, mvapich2-1.2-1
- Updated bonding package: ib-bonding-0.9.0-33
- Updated uDAPL: compat-dapl-1.2.12-1, dapl-2.0.15-1
- Updated management packages:
      opensm-3.2.3, infiniband-diags-1.4.2,
      libibcommon-1.1.2, libibmad-1.2.2, libibumad-1.2.2
- NFS-RDMA to work on 2.6.26 and 2.6.27
- Cleanup compilation warning 

- 46 bugs fixed (see attached for details) 

- Kernel git tree changes: 


Tasks that should be completed for the rc5: 
================================ 
1. High priority bug fixes
2. Documentation update
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4-rc4-fixed-bugs.csv
Type: text/csv
Size: 5834 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081111/fee4ef7f/attachment.csv>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed_kernel-1.4-rc3_rc4.log
Type: text/x-log
Size: 40345 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081111/fee4ef7f/attachment.bin>

From Thomas.Talpey at netapp.com  Tue Nov 11 08:02:18 2008
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Tue, 11 Nov 2008 11:02:18 -0500
Subject: [ofa-general] NFS-RDMA (OFED1.4) with standard
  distributions ?
In-Reply-To: <7391130E01ED404FBD7A3C86731EEB7D20ECAB457D@GVW1087EXB.americas.hpqcorp.net>
References: <7391130E01ED404FBD7A3C86731EEB7D20EC0F8737@GVW1087EXB.americas.hpqcorp.net>
	<49160618.3050409@nasa.gov>
	<7391130E01ED404FBD7A3C86731EEB7D20ECAB457D@GVW1087EXB.americas.hpqcorp.net>
Message-ID: <RTPCLUEXC2-PRDzfoV10000010e@RTPMVEXC1-PRD.hq.netapp.com>

At 11:27 AM 11/10/2008, Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
>That's great, thanks.
>
>I ran some tests with the 2.6.27 kernel as server and client, and 
>basically it works fine.
>
>I could not find yet any situation where NFS-RDMA would outperform 
>NFS/IPoIB, at least when you compare apples to apples (same clients, 
>same server, same protocol, and not just write to/read from the 
>caches), and it even seems to have severe performance issues for 
>reading with files larger than the memory size of the client and the server.
>Hopefully this will improve when more users will be able to give 
>valuable feedback...

I have a couple of questions, and perhaps suggestions as well.
First the questions...

- Have you tried with a 2.6.28-rc4 client and server at all? There are
a number of significant NFS/RDMA improvements queued in kernel.org,
especially around RDMA memory registration as well as RDMA operation
scheduling. We've seen some significant throughput improvement even
for basic tunings.

- What type of storage are you using at the server, and have you
attempted to tune the server at all? For example, if you are storage
(spindle) limited, no network tuning is likely to help and you should
address that first. Also, there are tunings such as nfsd thread count,
export options, and adapter choice that can make a large difference.

Bottom line, you should be able to reach multi-hundred-MB/sec of
read/write throughput with NFS/RDMA, but there may be issues on
specific systems, or perhaps with the OFED1.4 code, that need to
be accounted for. If possible, you may want to set expectations
based on mainline, then try to duplicate them in the OFED backport.
The current OFED NFS/RDMA support is still evolving, while we consider
the mainline kernel.org version to be rather solid.

Tom.

>
>Fred.
>
>-----Original Message-----
>From: Jeff Becker [mailto:Jeffrey.C.Becker at nasa.gov]
>Sent: Saturday, 08 November, 2008 22:35
>To: Ciesielski, Frederic (EMEA HPC&OSLO CC)
>Cc: general at lists.openfabrics.org
>Subject: Re: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?
>
>Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
>> Is there any chance that the new NFS-RDMA features coming with OFED
>> 1.4 work with standard and current distributions, like RHEL5, SLES10 ?
>Not yet, but I'm working on it. I intend for NFSRDMA to work on 2.6.27 
>and 2.6.26 for OFED 1.4. The RHEL5 and SLES10 backports will likely be 
>done for OFED 1.4.1. Thanks.
>
>-jeff
>
>> Did anybody test this, or would pretend it is supposed to work ?
>>
>> I mean without building a 2.6.27 or equivalent kernel on top of it,
>> keeping almost full support from the vendors.
>>
>> Enhanced kernel modules may not be sufficient to work around the
>> limitations of old kernels...
>>
>>
>>


From rpearson at systemfabricworks.com  Tue Nov 11 08:44:50 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Tue, 11 Nov 2008 10:44:50 -0600
Subject: [ofa-general] [PATCH][4] opensm: vector and matrix utilities
Message-ID: <003201c9441c$d23ce8f0$76b6bad0$@com>

Sasha,

Here is the fourth patch in a series implementing the mesh analysis
algorithm.

This patch implements
      - create and cleanup methods for polynomial with integer coefficients
      - create and cleanup methods for square matrix with integer
        coefficients
      - create and cleanup methods for square matrix with polynomial
        coefficients
      - routine to compute the determinant of a matrix with polynomial
        coefficients

(Note the determinant is restricted to computing the characteristic
polynomial)
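
As a standalone illustration of the idea (not the patch code itself), here
is a minimal sketch of computing the characteristic polynomial det(A - x*I)
of a small integer matrix by cofactor expansion along the first row, with
every matrix entry held as a polynomial. The names poly_t, pm_det and
char_poly_sketch, and the fixed bound MAX_N, are assumptions made for this
example; the patch below allocates its polynomials and matrices dynamically.

```c
#include <assert.h>
#include <string.h>

#define MAX_N 6		/* hypothetical size bound for this sketch */

/* polynomial with integer coefficients; c[k] multiplies x^k */
typedef struct {
	int c[MAX_N + 1];
} poly_t;

/* determinant of the n x n polynomial matrix m, expanded along row 0 */
static poly_t pm_det(int n, poly_t m[MAX_N][MAX_N])
{
	poly_t det, q, sub[MAX_N][MAX_N];
	int i, j, k, col, x, y, sign = 1;

	memset(&det, 0, sizeof(det));
	if (n == 0) {
		det.c[0] = 1;
		return det;
	}
	if (n == 1)
		return m[0][0];

	for (col = 0; col < n; col++) {
		/* build the minor without row 0 and column col */
		for (i = 1, x = 0; i < n; i++, x++)
			for (j = 0, y = 0; j < n; j++)
				if (j != col)
					sub[x][y++] = m[i][j];

		q = pm_det(n - 1, sub);

		/* det += sign * m[0][col] * q (polynomial product) */
		for (j = 0; j <= MAX_N; j++)
			for (k = 0; j + k <= MAX_N; k++)
				det.c[j + k] += sign * m[0][col].c[j] * q.c[k];

		sign = -sign;
	}
	return det;
}

/* characteristic polynomial det(A - x*I) of an n x n integer matrix */
static poly_t char_poly_sketch(int n, int a[MAX_N][MAX_N])
{
	poly_t m[MAX_N][MAX_N];
	int i, j;

	assert(n >= 0 && n <= MAX_N);
	memset(m, 0, sizeof(m));
	for (i = 0; i < n; i++) {
		for (j = 0; j < n; j++)
			m[i][j].c[0] = a[i][j];
		m[i][i].c[1] = -1;	/* the -x on the diagonal */
	}
	return pm_det(n, m);
}
```

For the 2x2 matrix with 0 on the diagonal and 2 elsewhere this gives
x^2 - 4, stored lowest degree first as {-4, 0, 1}.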

Regards,

Bob Pearson

Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
----
diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
index 6ef397c..5dee1d0 100644
--- a/opensm/opensm/osm_mesh.c
+++ b/opensm/opensm/osm_mesh.c
@@ -49,6 +49,295 @@
 #include <opensm/osm_ucast_lash.h>
 
 /*
+ * poly_alloc
+ * 
+ * allocate a polynomial of degree n
+ */
+static int *poly_alloc(lash_t *p_lash, int n)
+{
+	osm_log_t *p_log = &p_lash->p_osm->log;
+	int *p;
+
+	if (!(p = calloc(n+1, sizeof(int)))) {
+		OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating poly - out of memory\n");
+	}
+
+	return p;
+}
+
+/*
+ * poly_diff
+ *
+ * return a nonzero value if polynomials differ else 0
+ */
+static int poly_diff(int n, int *p, switch_t *s)
+{
+	int i;
+
+	if (s->node->num_links != n)
+		return 1;
+
+	for (i = 0; i <= n; i++) {
+		if (s->node->poly[i] != p[i])
+			return 1;
+	}
+
+	return 0;
+}
+
+/*
+ * m_free
+ *
+ * free a square matrix of rank l
+ */
+static void m_free(int **m, int l)
+{
+	int i;
+
+	if (m) {
+		for (i = 0; i < l; i++) {
+			if (m[i])
+				free(m[i]);
+		}
+		free(m);
+	}
+}
+
+/*
+ * m_alloc
+ *
+ * allocate a square matrix of rank l
+ */
+static int **m_alloc(lash_t *p_lash, int l)
+{
+	osm_log_t *p_log = &p_lash->p_osm->log;
+	int i;
+	int **m = NULL;
+
+	do {
+		if (!(m = calloc(l, sizeof(int *))))
+			break;
+
+		for (i = 0; i < l; i++) {
+			if (!(m[i] = calloc(l, sizeof(int))))
+				break;
+		}
+		if (i != l)
+			break;
+
+		return m;
+	} while(0);
+
+	OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating matrix - out of memory\n");
+
+	m_free(m, l);
+	return NULL;
+}
+
+/*
+ * pm_free
+ *
+ * free a square matrix of rank l of polynomials
+ */
+static void pm_free(int ***m, int l)
+{
+	int i, j;
+
+	if (m) {
+		for (i = 0; i < l; i++) {
+			if (m[i]) {
+				for (j = 0; j < l; j++) {
+					if (m[i][j])
+						free(m[i][j]);
+				}
+				free(m[i]);
+			}
+		}
+		free(m);
+	}
+}
+
+/*
+ * pm_alloc
+ *
+ * allocate a square matrix of rank l of polynomials of degree n
+ */
+static int ***pm_alloc(lash_t *p_lash, int l, int n)
+{
+	osm_log_t *p_log = &p_lash->p_osm->log;
+	int i, j;
+	int ***m = NULL;
+
+	do {
+		if (!(m = calloc(l, sizeof(int **))))
+			break;
+
+		for (i = 0; i < l; i++) {
+			if (!(m[i] = calloc(l, sizeof(int *))))
+				break;
+
+			for (j = 0; j < l; j++) {
+				if (!(m[i][j] = calloc(n+1, sizeof(int))))
+					break;
+			}
+			if (j != l)
+				break;
+		}
+		if (i != l)
+			break;
+
+		return m;
+	} while(0);
+
+	OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating matrix - out of memory\n");
+
+	pm_free(m, l);
+	return NULL;
+}
+
+static int determinant(lash_t *p_lash, int n, int rank, int ***m, int *p);
+
+/*
+ * sub_determinant
+ *
+ * compute the determinant of a submatrix of matrix of rank l of polynomials of degree n
+ * with row and col removed in poly. caller must free poly
+ */
+static int sub_determinant(lash_t *p_lash, int n, int l, int row, int col, int ***matrix, int **poly)
+{
+	int ret = -1;
+	int ***m = NULL;
+	int *p = NULL;
+	int i, j, k, x, y;
+	int rank = l - 1;
+
+	do {
+		if (!(p = poly_alloc(p_lash, n))) {
+			break;
+		}
+
+		if (rank <= 0) {
+			p[0] = 1;
+			ret = 0;
+			break;
+		}
+
+		if (!(m = pm_alloc(p_lash, rank, n))) {
+			free(p);
+			p = NULL;
+			break;
+		}
+
+		x = 0;
+		for (i = 0; i < l; i++) {
+			if (i == row)
+				continue;
+
+			y = 0;
+			for (j = 0; j < l; j++) {
+				if (j == col)
+					continue;
+
+				for (k = 0; k <= n; k++)
+					m[x][y][k] = matrix[i][j][k];
+
+				y++;
+			}
+			x++;
+		}
+
+		if (determinant(p_lash, n, rank, m, p)) {
+			free(p);
+			p = NULL;
+			break;
+		}
+
+		ret = 0;
+	} while(0);
+
+	pm_free(m, rank);
+	*poly = p;
+	return ret;
+}
+
+/*
+ * determinant
+ *
+ * compute the determinant of matrix m of rank of polynomials of degree deg
+ * and add the result to polynomial p allocated by caller
+ */
+static int determinant(lash_t *p_lash, int deg, int rank, int ***m, int *p)
+{
+	int i, j, k;
+	int *q;
+	int sign = 1;
+
+	/*
+	 * handle simple case of 1x1 matrix
+	 */
+	if (rank == 1) {
+		for (i = 0; i <= deg; i++)
+			p[i] += m[0][0][i];
+	}
+
+	/*
+	 * handle simple case of 2x2 matrix
+	 */
+	else if (rank == 2) {
+		for (i = 0; i <= deg; i++) {
+			if (m[0][0][i] == 0)
+				continue;
+
+			for (j = 0; j <= deg; j++) {
+				if (m[1][1][j] == 0)
+					continue;
+
+				p[i+j] += m[0][0][i]*m[1][1][j];
+			}
+		}
+
+		for (i = 0; i <= deg; i++) {
+			if (m[0][1][i] == 0)
+				continue;
+
+			for (j = 0; j <= deg; j++) {
+				if (m[1][0][j] == 0)
+					continue;
+
+				p[i+j] -= m[0][1][i]*m[1][0][j];
+			}
+		}
+	}
+
+	/*
+	 * handle the general case
+	 */
+	else {
+		for (i = 0; i < rank; i++) {
+			if (sub_determinant(p_lash, deg, rank, 0, i, m, &q))
+				return -1;
+
+			for (j = 0; j <= deg; j++) {
+				if (m[0][i][j] == 0)
+					continue;
+
+				for (k = 0; k <= deg; k++) {
+					if (q[k] == 0)
+						continue;
+
+					p[j+k] += sign*m[0][i][j]*q[k];
+				}
+			}
+
+			free(q);
+			sign = -sign;
+		}
+	}
+
+	return 0;
+}
+
+/*
  * osm_mesh_cleanup - free per mesh resources
  */
 void osm_mesh_cleanup(lash_t *p_lash)


From rpearson at systemfabricworks.com  Tue Nov 11 08:59:58 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Tue, 11 Nov 2008 10:59:58 -0600
Subject: [ofa-general] [PATCH][5] opensm: compute local geometry
Message-ID: <003301c9441e$eed2f480$cc78dd80$@com>

Sasha,

Here is the fifth patch implementing the mesh analysis algorithm.

This patch implements
      - routine to compute characteristic polynomial of a matrix
      - routine to compute the local 'metric' around each switch
      - routine to classify switches into a histogram of local geometry
        classes
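
To make the local 'metric' concrete, here is a small standalone sketch of
the relaxation step: starting from one neighbor of a switch, find the
fewest hops to every other switch along paths that never enter the switch
under analysis. The adjacency-matrix representation and the names NSW, INF
and relax_avoiding are assumptions for this example; the patch below walks
the per-node link arrays instead.

```c
#include <assert.h>

#define NSW 4		/* hypothetical number of switches in this sketch */
#define INF 0x7fffffff

/*
 * dist[v] = fewest hops from src to v along paths that never enter
 * the switch 'avoid'; unreachable switches (and 'avoid') keep INF
 */
static void relax_avoiding(int adj[NSW][NSW], int src, int avoid, int dist[NSW])
{
	int u, v, change;

	assert(src != avoid);

	for (v = 0; v < NSW; v++)
		dist[v] = INF;
	dist[src] = 0;

	/* repeat until a full pass makes no improvement */
	do {
		change = 0;
		for (u = 0; u < NSW; u++) {
			if (dist[u] == INF)
				continue;
			for (v = 0; v < NSW; v++) {
				if (!adj[u][v] || v == avoid)
					continue;
				if (dist[u] + 1 < dist[v]) {
					dist[v] = dist[u] + 1;
					change++;
				}
			}
		}
	} while (change);
}
```

On a ring of four switches 0-1-2-3-0, the two neighbors of switch 0 are 1
and 3, and the distance from 1 to 3 that avoids 0 is 2. Filling one matrix
row per neighbor this way yields the matrix whose characteristic
polynomial is used to classify the switch.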

Regards,

Bob Pearson

Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
----
diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
index 7434fee..9254de3 100644
--- a/opensm/opensm/osm_mesh.c
+++ b/opensm/opensm/osm_mesh.c
@@ -338,6 +338,172 @@ static int determinant(lash_t *p_lash, int deg, int rank, int ***m, int *p)
 }
 
 /*
+ * char_poly
+ *
+ * compute the characteristic polynomial of matrix of rank
+ * by computing the determinant of m-x*I and return in poly
+ * as an array. caller must free poly
+ */
+static int char_poly(lash_t *p_lash, int rank, int **matrix, int **poly)
+{
+	int ret = -1;
+	int i, j;
+	int ***m = NULL;
+	int *p = NULL;
+	int deg = rank;
+
+	do {
+		if (!(p = poly_alloc(p_lash, deg))) {
+			break;
+		}
+
+		if (!(m = pm_alloc(p_lash, rank, deg))) {
+			free(p);
+			p = NULL;
+			break;
+		}
+
+		for (i = 0; i < rank; i++) {
+			for (j = 0; j < rank; j++) {
+				m[i][j][0] = matrix[i][j];
+			}
+			m[i][i][1] = -1;
+		}
+
+		if (determinant(p_lash, deg, rank, m, p)) {
+			free(p);
+			p = NULL;
+			break;
+		}
+
+		ret = 0;
+	} while(0);
+
+	pm_free(m, rank);
+	*poly = p;
+	return ret;
+}
+
+/*
+ * get_switch_metric
+ *
+ * compute the matrix of minimum distances between each of
+ * the adjacent switch nodes to sw along paths
+ * that do not go through sw. do calculation by
+ * relaxation method
+ * allocate space for the matrix and save in node_t structure
+ */
+static int get_switch_metric(lash_t *p_lash, int sw)
+{
+	int ret = -1;
+	int i, j, change;
+	int sw1, sw2, sw3;
+	switch_t *s = p_lash->switches[sw];
+	switch_t *s1, *s2, *s3;
+	int **m;
+	mesh_node_t *node = s->node;
+	int num_links = node->num_links;
+
+	do {
+		if (!(m = m_alloc(p_lash, num_links)))
+			break;
+
+		for (i = 0; i < num_links; i++) {
+			sw1 = node->links[i]->switch_id;
+			s1 = p_lash->switches[sw1];
+
+			/* make all distances big except s1 to itself */
+			for (sw2 = 0; sw2 < p_lash->num_switches; sw2++)
+				p_lash->switches[sw2]->node->temp = 0x7fffffff;
+
+			s1->node->temp = 0;
+
+			do {
+				change = 0;
+
+				for (sw2 = 0; sw2 < p_lash->num_switches; sw2++) {
+					s2 = p_lash->switches[sw2];
+					if (s2->node->temp == 0x7fffffff)
+						continue;
+					for (j = 0; j < s2->node->num_links; j++) {
+						sw3 = s2->node->links[j]->switch_id;
+						s3 = p_lash->switches[sw3];
+
+						if (sw3 == sw)
+							continue;
+
+						if ((s2->node->temp + 1) < s3->node->temp) {
+							s3->node->temp = s2->node->temp + 1;
+							change++;
+						}
+					}
+				}
+			} while(change);
+
+			for (j = 0; j < num_links; j++) {
+				sw2 = node->links[j]->switch_id;
+				s2 = p_lash->switches[sw2];
+				m[i][j] = s2->node->temp;
+			}
+		}
+
+		if (char_poly(p_lash, num_links, m, &node->poly)) {
+			m_free(m, num_links);
+			m = NULL;
+			break;
+		}
+
+		ret = 0;
+	} while(0);
+
+	node->matrix = m;
+	return ret;
+}
+
+/*
+ * classify_switch
+ *
+ * add switch to histogram of switch types
+ */
+static void classify_switch(lash_t *p_lash, int sw)
+{
+	int i;
+	switch_t *s = p_lash->switches[sw];
+	switch_t *s1;
+	mesh_t *mesh = p_lash->mesh;
+
+	for (i = 0; i < mesh->num_class; i++) {
+		s1 = p_lash->switches[mesh->class_type[i]];
+	
+		if (poly_diff(s->node->num_links, s->node->poly, s1))
+			continue;
+
+		mesh->class_count[i]++;
+		return;
+	}
+
+	mesh->class_type[mesh->num_class] = sw;
+	mesh->class_count[mesh->num_class] = 1;
+	mesh->num_class++;
+	return;
+}
+
+/*
+ * get_local_geometry
+ *
+ * analyze the local geometry around each switch
+ */
+static void get_local_geometry(lash_t *p_lash)
+{
+	int sw;
+
+	for (sw = 0; sw < p_lash->num_switches; sw++) {
+		get_switch_metric(p_lash, sw);
+		classify_switch(p_lash, sw);
+	}
+}
+
+/*
  * osm_mesh_cleanup - free per mesh resources
  */
 void osm_mesh_cleanup(lash_t *p_lash)
@@ -404,6 +570,12 @@ int osm_do_mesh_analysis(lash_t *p_lash)
 		return -1;
 	}
 
+	/*
+	 * get local metric and invariant for each switch
+	 * also classify each switch
+	 */
+	get_local_geometry(p_lash);
+
 	printf("lash: do_mesh_analysis stub called\n");
 
 	OSM_LOG_EXIT(p_log);


From rpearson at systemfabricworks.com  Tue Nov 11 09:10:31 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Tue, 11 Nov 2008 11:10:31 -0600
Subject: [ofa-general] ***SPAM*** 
Message-ID: <003701c94420$67840f80$368c2e80$@com>

Sasha,

Here is the sixth patch implementing the mesh analysis algorithm.

This patch implements
      - a table of polynomials for all 2D and 3D regular Cartesian meshes
      - a routine to classify each switch based on the table
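
The lookup amounts to comparing a switch's polynomial against each table
row until a sentinel is reached. A minimal standalone sketch follows, using
only the 2x2 and 2x2x2 rows from the table in the patch below; the names
poly_entry, poly_table and lookup_poly are hypothetical, and the patch
itself compares through poly_diff() against the per-switch node data.

```c
#include <assert.h>

#define MAX_DEGREE 10

struct poly_entry {
	int dimension;			/* -1 marks the end of the table */
	int degree;			/* degree of polynomial */
	int poly[MAX_DEGREE + 1];	/* coefficients, lowest degree first */
};

static const struct poly_entry poly_table[] = {
	{2, 2, {-4, 0, 1}},		/* 2x2 */
	{3, 3, {16, 12, 0, -1}},	/* 2x2x2 */
	{-1, 0, {0}},			/* sentinel */
};

/* return the index of the matching table row, or -1 if none matches */
static int lookup_poly(int degree, const int *poly)
{
	int i, k;

	assert(degree >= 0 && degree <= MAX_DEGREE);

	for (i = 0; poly_table[i].dimension != -1; i++) {
		if (poly_table[i].degree != degree)
			continue;

		for (k = 0; k <= degree; k++) {
			if (poly_table[i].poly[k] != poly[k])
				break;
		}
		if (k > degree)
			return i;
	}

	return -1;
}
```

A polynomial that is not in the table returns -1 here, which corresponds
to the "unknown" type 0 in the patch.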

Regards,

Bob Pearson

Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
----
diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
index 9254de3..30d09c2 100644
--- a/opensm/opensm/osm_mesh.c
+++ b/opensm/opensm/osm_mesh.c
@@ -48,6 +48,76 @@
 #include <opensm/osm_mesh.h>
 #include <opensm/osm_ucast_lash.h>
 
+#define MAX_DIMENSION (4)
+#define MAX_DEGREE (10)
+
+/*
+ * characteristic polynomials for 2d and 3d regular tori
+ * since 4 == 2x2 we choose to take 2x2
+ */
+struct _mesh_info {
+	int dimension;			/* dimension of the torus */
+	int size[MAX_DIMENSION];	/* size of the torus */
+	int degree;			/* degree of polynomial */
+	int poly[MAX_DEGREE+1];		/* polynomial */
+} mesh_info[] = {
+	{0, {0},       0, {0},					},
+
+	{2, {2, 2},    2, {-4, 0, 1},				},
+	{2, {3, 2},    3, {8, 9, 0, -1},			},
+	//{2, {4, 2},    3, {16, 12, 0, -1},			},
+	{2, {5, 2},    3, {24, 17, 0, -1},			},
+	{2, {6, 2},    3, {32, 24, 0, -1},			},
+	{2, {3, 3},    4, {-15, -32, -18, 0, 1},		},
+	//{2, {4, 3},    4, {-28, -48, -21, 0, 1},		},
+	{2, {5, 3},    4, {-39, -64, -26, 0, 1},		},
+	{2, {6, 3},    4, {-48, -80, -33, 0, 1},		},
+	//{2, {4, 4},    4, {-48, -64, -24, 0, 1},		},
+	//{2, {5, 4},    4, {-60, -80, -29, 0, 1},		},
+	//{2, {6, 4},    4, {-64, -96, -36, 0, 1},		},
+	{2, {5, 5},    4, {-63, -96, -34, 0, 1},		},
+	{2, {6, 5},    4, {-48, -112, -41, 0, 1},		},
+	{2, {6, 6},    4, {0, -128, -48, 0, 1},			},
+
+	{3, {2, 2, 2}, 3, {16, 12, 0, -1},			},
+	{3, {3, 2, 2}, 4, {-28, -48, -21, 0, 1},		},
+	{3, {4, 2, 2}, 4, {-48, -64, -24, 0, 1},		},
+	{3, {5, 2, 2}, 4, {-60, -80, -29, 0, 1},		},
+	{3, {6, 2, 2}, 4, {-64, -96, -36, 0, 1},		},
+	{3, {3, 3, 2}, 5, {48, 127, 112, 34, 0, -1},		},
+	{3, {4, 3, 2}, 5, {80, 180, 136, 37, 0, -1},		},
+	{3, {5, 3, 2}, 5, {96, 215, 160, 42, 0, -1},		},
+	{3, {6, 3, 2}, 5, {96, 232, 184, 49, 0, -1},		},
+	{3, {4, 4, 2}, 5, {128, 240, 160, 40, 0, -1},		},
+	{3, {5, 4, 2}, 5, {144, 276, 184, 45, 0, -1},		},
+	{3, {6, 4, 2}, 5, {128, 288, 208, 52, 0, -1},		},
+	{3, {5, 5, 2}, 5, {144, 303, 208, 50, 0, -1},		},
+	{3, {6, 5, 2}, 5, {96, 296, 232, 57, 0, -1},		},
+	{3, {6, 6, 2}, 5, {0, 256, 256, 64, 0, -1},		},
+	{3, {3, 3, 3}, 6, {-81, -288, -381, -224, -51, 0, 1},	},
+	{3, {4, 3, 3}, 6, {-132, -416, -487, -256, -54, 0, 1},	},
+	{3, {5, 3, 3}, 6, {-153, -480, -557, -288, -59, 0, 1},	},
+	{3, {6, 3, 3}, 6, {-144, -480, -591, -320, -66, 0, 1},	},
+	{3, {4, 4, 3}, 6, {-208, -576, -600, -288, -57, 0, 1},	},
+	{3, {5, 4, 3}, 6, {-228, -640, -671, -320, -62, 0, 1},	},
+	{3, {6, 4, 3}, 6, {-192, -608, -700, -352, -69, 0, 1},	},
+	{3, {5, 5, 3}, 6, {-225, -672, -733, -352, -67, 0, 1},	},
+	{3, {6, 5, 3}, 6, {-144, -576, -743, -384, -74, 0, 1},	},
+	{3, {6, 6, 3}, 6, {0, -384, -720, -416, -81, 0, 1},	},
+	{3, {4, 4, 4}, 6, {-320, -768, -720, -320, -60, 0, 1},	},
+	{3, {5, 4, 4}, 6, {-336, -832, -792, -352, -65, 0, 1},	},
+	{3, {6, 4, 4}, 6, {-256, -768, -816, -384, -72, 0, 1},	},
+	{3, {5, 5, 4}, 6, {-324, -864, -855, -384, -70, 0, 1},	},
+	{3, {6, 5, 4}, 6, {-192, -736, -860, -416, -77, 0, 1},	},
+	{3, {6, 6, 4}, 6, {0, -512, -832, -448, -84, 0, 1},	},
+	{3, {5, 5, 5}, 6, {-297, -864, -909, -416, -75, 0, 1},	},
+	{3, {6, 5, 5}, 6, {-144, -672, -895, -448, -82, 0, 1},	},
+	{3, {6, 6, 5}, 6, {0, -384, -848, -480, -89, 0, 1},	},
+	{3, {6, 6, 6}, 6, {0, 0, -768, -512, -96, 0, 1},	},
+
+	{-1, {0,}, 0, {0, },					},
+};
+
 /*
  * poly_alloc
  * 
@@ -489,6 +559,30 @@ static void classify_switch(lash_t *p_lash, int sw)
 }
 
 /*
+ * classify_mesh_type
+ *
+ * try to look up node polynomial in table
+ */
+static void classify_mesh_type(lash_t *p_lash, int sw)
+{
+	int i;
+	switch_t *s = p_lash->switches[sw];
+	struct _mesh_info *t;
+
+	for (i = 1; (t = &mesh_info[i])->dimension != -1; i++) {
+		if (poly_diff(t->degree, t->poly, s))
+			continue;
+
+		s->node->type = i;
+		s->node->dimension = t->dimension;
+		return;
+	}
+
+	s->node->type = 0;
+	return;
+}
+
+/*
  * get_local_geometry
  *
  * analyze the local geometry around each switch
@@ -500,6 +594,7 @@ static void get_local_geometry(lash_t *p_lash)
 	for (sw = 0; sw < p_lash->num_switches; sw++) {
 		get_switch_metric(p_lash, sw);
 		classify_switch(p_lash, sw);
+		classify_mesh_type(p_lash, sw);
 	}
 }
 

From rpearson at systemfabricworks.com  Tue Nov 11 09:28:40 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Tue, 11 Nov 2008 11:28:40 -0600
Subject: [ofa-general] [PATCH][7] opensm: build global geometry
Message-ID: <004401c94422$f04684e0$d0d38ea0$@com>

Sasha,

Here is the seventh patch implementing the mesh analysis algorithm.

This patch implements
      - routine to induce axes on mesh starting from seed node
      - code to report results of local analysis (should have been in
        patch 6)
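
For reference, the axis numbering used here (+x => 1, -x => 2, +y => 3,
-y => 4, ...) makes both the opposite direction and the dimension simple
arithmetic on the axis number. The sketch below isolates just that
arithmetic; negate_axis and axis_dimension are hypothetical names. The
opposite() routine in the patch additionally returns the axis unchanged
when an axis has only 2 nodes, which this sketch does not model.

```c
#include <assert.h>

/* axes are numbered from 1: +x => 1, -x => 2, +y => 3, -y => 4, ... */

/* opposite direction along the same dimension */
static int negate_axis(int axis)
{
	assert(axis >= 1);
	return 1 + (1 ^ (axis - 1));
}

/* zero-based dimension (0 = x, 1 = y, ...) an axis number belongs to */
static int axis_dimension(int axis)
{
	assert(axis >= 1);
	return (axis - 1) / 2;
}
```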

Regards,

Bob Pearson

Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
----
diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
index 30d09c2..65afae6 100644
--- a/opensm/opensm/osm_mesh.c
+++ b/opensm/opensm/osm_mesh.c
@@ -599,6 +599,239 @@ static void get_local_geometry(lash_t *p_lash)
 }
 
 /*
+ * seed_axes
+ *
+ * assign axes to the links of the seed switch
+ * assumes switch is of type cartesian mesh
+ * axes are numbered 1 to n i.e. +x => 1 -x => 2 etc.
+ * this assumes that if all distances are 2 that
+ * an axis has only 2 nodes so +A and -A collapse to +A
+ */
+static void seed_axes(lash_t *p_lash, int sw)
+{
+	mesh_node_t *node = p_lash->switches[sw]->node;
+	int n = node->num_links;
+	int i, j, c;
+
+	for (c = 1; c <= 2*node->dimension; c++) {
+		/*
+		 * find the next unassigned axis
+		 */
+		for (i = 0; i < n; i++) {
+			if (!node->axes[i])
+				break;
+		}
+
+		node->axes[i] = c++;
+
+		/*
+		 * find the matching opposite direction
+		 */
+		for (j = 0; j < n; j++) {
+			if (node->axes[j] || j == i)
+				continue;
+
+			if (node->matrix[i][j] != 2)
+				break;
+		}
+
+		if (j != n) {
+			node->axes[j] = c;
+		}
+	}
+}
+
+/*
+ * opposite
+ *
+ * compute the opposite of axis for switch
+ */
+static inline int opposite(switch_t *s, int axis)
+{
+	int i, j;
+	int negaxis = 1 + (1 ^ (axis - 1));
+
+	for (i = 0; i < s->node->num_links; i++) {
+		if (s->node->axes[i] == axis) {
+			for (j = 0; j < s->node->num_links; j++) {
+				if (j == i)
+					continue;
+				if (s->node->matrix[i][j] != 2)
+					return negaxis;
+			}
+
+			return axis;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * make_geometry
+ *
+ * induce a geometry on the switches
+ */
+static void make_geometry(lash_t *p_lash, int sw)
+{
+	osm_log_t *p_log = &p_lash->p_osm->log;
+	int num_switches = p_lash->num_switches;
+	int sw1, sw2;
+	switch_t *s, *s1, *s2, *seed;
+	int i, j, k, l, n, m;
+	int change;
+
+	/*
+	 * assign axes to seed switch
+	 */
+	seed_axes(p_lash, sw);
+	seed = p_lash->switches[sw];
+
+	/*
+	 * induce axes in other switches until
+	 * there is no more change
+	 */
+	do {
+		change = 0;
+
+		/* phase 1 opposites */
+		for (sw1 = 0; sw1 < num_switches; sw1++) {
+			s1 = p_lash->switches[sw1];
+			n = s1->node->num_links;
+
+			for (i = 0; i < n; i++) {
+				if (!s1->node->axes[i])
+					continue;
+
+				/*
+				 * can't tell across if more than one
+				 * likely looking link
+				 */
+				m = 0;
+				for (j = 0; j < n; j++) {
+					if (j == i)
+						continue;
+
+					if (s1->node->matrix[i][j] != 2)
+						m++;
+				}
+
+				if (m != 1) {
+					continue;
+				}
+
+				for (j = 0; j < n; j++) {
+					if (j == i)
+						continue;
+
+					if (s1->node->matrix[i][j] != 2) {
+						if (s1->node->axes[j]) {
+							if (s1->node->axes[j] != opposite(seed, s1->node->axes[i])) {
+								OSM_LOG(p_log, OSM_LOG_DEBUG, "phase 1 mismatch\n");
+							}
+						} else {
+							s1->node->axes[j] = opposite(seed, s1->node->axes[i]);
+							change++;
+						}
+					}
+				}
+			}
+		}
+
+		/* phase 2 switch to switch */
+		for (sw1 = 0; sw1 < num_switches; sw1++) {
+			s1 = p_lash->switches[sw1];
+			n = s1->node->num_links;
+
+			for (i = 0; i < n; i++) {
+				int l2 = s1->node->links[i]->link_id;
+
+				if (!s1->node->axes[i])
+					continue;
+
+				if (l2 == -1) {
+					printf("ERROR no reverse link\n");
+					continue;
+				}
+
+				sw2 = s1->node->links[i]->switch_id;
+				s2 = p_lash->switches[sw2];
+
+				if (!s2->node->axes[l2]) {
+					/*
+					 * set axis to opposite of s1->axes[i]
+					 */
+					s2->node->axes[l2] = opposite(seed, s1->node->axes[i]);
+					change++;
+				} else {
+					if (s2->node->axes[l2] != opposite(seed, s1->node->axes[i])) {
+						OSM_LOG(p_log, OSM_LOG_DEBUG, "phase 2 mismatch\n");
+					}
+				}
+			}
+		}
+
+		/* Phase 3 corners */
+		for (sw1 = 0; sw1 < num_switches; sw1++) {
+			s = p_lash->switches[sw1];
+			n = s->node->num_links;
+
+			for (i = 0; i < n; i++) {
+				if (!s->node->axes[i])
+					continue;
+
+				for (j = 0; j < n; j++) {
+					if (i == j || !s->node->axes[j] || s->node->matrix[i][j] != 2)
+						continue;
+
+					s1 = p_lash->switches[s->node->links[i]->switch_id];
+					s2 = p_lash->switches[s->node->links[j]->switch_id];
+
+					/*
+					 * find switch (other than s1) that neighbors i and j
+					 * have in common
+					 */
+					for (k = 0; k < s1->node->num_links; k++) {
+						if (s1->node->links[k]->switch_id == sw1)
+							continue;
+
+						for (l = 0; l < s2->node->num_links; l++) {
+							if (s2->node->links[l]->switch_id == sw1)
+								continue;
+
+							if (s1->node->links[k]->switch_id == s2->node->links[l]->switch_id) {
+								if (s1->node->axes[k]) {
+									if (s1->node->axes[k] != s->node->axes[j]) {
+										OSM_LOG(p_log, OSM_LOG_DEBUG, "phase 3 mismatch\n");
+									}
+								} else {
+									s1->node->axes[k] = s->node->axes[j];
+									change++;
+								}
+
+								if (s2->node->axes[l]) {
+									if (s2->node->axes[l] != s->node->axes[i]) {
+										OSM_LOG(p_log, OSM_LOG_DEBUG, "phase 3 mismatch\n");
+									}
+								} else {
+									s2->node->axes[l] = s->node->axes[i];
+									change++;
+								}
+								}
+								goto next_j;
+							}
+						}
+					}
+next_j:
+					;
+				}
+			}
+		}
+	} while(change);
+
+	return;
+}
+
+/*
  * osm_mesh_cleanup - free per mesh resources
  */
 void osm_mesh_cleanup(lash_t *p_lash)
@@ -652,8 +885,13 @@ static int mesh_create(lash_t *p_lash)
  */
 int osm_do_mesh_analysis(lash_t *p_lash)
 {
-	int ret = 0;
 	osm_log_t *p_log = &p_lash->p_osm->log;
+	int max_class = -1;
+	int max_class_num = 0;
+	int max_class_type = -1;
+	int i;
+	mesh_t *mesh;
+	switch_t *s;
 
 	OSM_LOG_ENTER(p_log);
 
@@ -671,11 +909,43 @@ int osm_do_mesh_analysis(lash_t *p_lash)
 	 */
 	get_local_geometry(p_lash);
 
-	printf("lash: do_mesh_analysis stub called\n");
+	mesh = p_lash->mesh;
+
+	/* find dominant switch class */
+	for (i = 0; i < mesh->num_class; i++) {
+		if (mesh->class_count[i] > max_class_num) {
+			max_class = i;
+			max_class_num = mesh->class_count[i];
+			max_class_type = mesh->class_type[i];
+		}
+	}
+
+	s = p_lash->switches[max_class_type];
+
+	printf("lash: found %d node type%s\n", mesh->num_class, (mesh->num_class == 1)? "" : "s");
+	printf("lash: %snode type is ", (mesh->num_class == 1)? "" : "most common ");
+
+	if (s->node->type) {
+		struct _mesh_info *t = &mesh_info[s->node->type];
+
+		for (i = 0; i < t->dimension; i++) {
+			printf("%s%d%s", i? "X" : "", t->size[i],
+				(t->size[i] == 6)? "+" : "");
+		}
+		printf(" mesh\n");
+
+		p_lash->mesh->dimension = t->dimension;
+	} else {
+		printf("unknown geometry\n");
+	}
+
+	if (s->node->type) {
+		make_geometry(p_lash, max_class_type);
+	}
 
 	OSM_LOG_EXIT(p_log);
 
-	return ret;
+	return 0;
 }
 
 /*


From sashak at voltaire.com  Tue Nov 11 09:28:58 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 11 Nov 2008 19:28:58 +0200
Subject: [ofa-general] Re: [PATCH] OpenSM/osm_subnet.c: Fix log_max_size
	conversion to MB
In-Reply-To: <49199566.2010505@obsidianresearch.com>
References: <49199566.2010505@obsidianresearch.com>
Message-ID: <20081111172858.GF30865@sashak.voltaire.com>

On 07:23 Tue 11 Nov     , Hal Rosenstock wrote:
> Sasha,
>
> This patch fixes the conversion of log_max_size to MB introduced by commit 
> 9954ead20c84586c6daaec5a1fba835eda0b4738
>
> It does not address the overflow issues introduced by that change though.
>
> -- Hal

> OpenSM/osm_subnet.c: Convert log_max_size to MB
> 
> Fixes commit 9954ead20c84586c6daaec5a1fba835eda0b4738
> which should precede commit 12b0e65b2dd198c1764ffb23dd8d6572f0fac5e6
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Nice catch. Applied. Thanks.

Sasha
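
For anyone following along: the old statement only evaluated the product
and discarded it, so log_max_size was never scaled; the compound assignment
stores the result. Despite the "convert to MB" comment, the multiplication
scales a value given in MB up to bytes. The overflow Hal mentions is not
fixed by this patch. One possible guard is sketched below, assuming the
field is a 32-bit unsigned value (as the opts_unpack_uint32 call suggests);
mb_to_bytes is a hypothetical helper, not an OpenSM function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * scale a size given in MB to bytes; returns 0 on success, or -1
 * without touching *bytes if the result would not fit in 32 bits
 */
static int mb_to_bytes(uint32_t mb, uint32_t *bytes)
{
	assert(bytes != 0);

	if (mb > UINT32_MAX / (1024 * 1024))
		return -1;

	*bytes = mb * 1024 * 1024;
	return 0;
}
```

With a 32-bit field the largest value that still fits is 4095 MB.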

> 
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index 750bdc6..5447e95 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -1278,7 +1278,7 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts)
>  		opts_unpack_uint32("log_max_size",
>  				   p_key, p_val,
>  				   (void *) & p_opts->log_max_size);
> -		p_opts->log_max_size * 1024 *1024; /* convert to MB */
> +		p_opts->log_max_size *= 1024 * 1024; /* convert to MB */
>  
>  		opts_unpack_charp("partition_config_file",
>  				  p_key, p_val, &p_opts->partition_config_file);


From rpearson at systemfabricworks.com  Tue Nov 11 09:37:14 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Tue, 11 Nov 2008 11:37:14 -0600
Subject: [ofa-general] [PATCH][8] opensm: measure size and reorder links
Message-ID: <004501c94424$23551620$69ff4260$@com>

Sasha,

Here is the eighth patch implementing the mesh analysis algorithm.

This patch implements
      - routine to reorder links and measure the size of the mesh
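
The measuring step can be pictured in one dimension: give the seed switch
coordinate 0, relax coordinates outward along the assigned axes, and take
max - min + 1 as the size. The sketch below does this for a short line of
switches. The plus/minus neighbor arrays and the names NSW, UNSET and
measure_line are assumptions for the example; ltmag follows the helper in
the patch, which prefers the smaller magnitude and, on a tie, the positive
value.

```c
#include <assert.h>

#define NSW 3		/* hypothetical number of switches in this sketch */
#define UNSET 0x7fffffff

/* true if |a| < |b|; on equal magnitude the positive value wins */
static int ltmag(int a, int b)
{
	int a1 = (a >= 0) ? a : -a;
	int b1 = (b >= 0) ? b : -b;

	return (a1 < b1) || (a1 == b1 && a > b);
}

/*
 * 1-D case: plus[s] and minus[s] give the neighbor of switch s in the
 * +x and -x direction, or -1 if there is none. Fills coord[] relative
 * to the seed and returns the extent max - min + 1.
 */
static int measure_line(const int plus[NSW], const int minus[NSW],
			int seed, int coord[NSW])
{
	int s, change, min = 0, max = 0;

	for (s = 0; s < NSW; s++)
		coord[s] = (s == seed) ? 0 : UNSET;

	/* relax until no coordinate improves */
	do {
		change = 0;
		for (s = 0; s < NSW; s++) {
			if (coord[s] == UNSET)
				continue;
			if (plus[s] >= 0 && ltmag(coord[s] + 1, coord[plus[s]])) {
				coord[plus[s]] = coord[s] + 1;
				change++;
			}
			if (minus[s] >= 0 && ltmag(coord[s] - 1, coord[minus[s]])) {
				coord[minus[s]] = coord[s] - 1;
				change++;
			}
		}
	} while (change);

	for (s = 0; s < NSW; s++) {
		if (coord[s] == UNSET)
			continue;
		if (coord[s] > max)
			max = coord[s];
		if (coord[s] < min)
			min = coord[s];
	}

	return max - min + 1;
}
```

For three switches in a row with the middle one as seed, the coordinates
come out as -1, 0, 1 and the measured size is 3.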

 
Regards,

Bob Pearson

Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
----
diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
index 65afae6..a248522 100644
--- a/opensm/opensm/osm_mesh.c
+++ b/opensm/opensm/osm_mesh.c
@@ -832,6 +832,183 @@ next_j:
 }
 
 /*
+ * return |a| < |b|
+ */
+static inline int ltmag(int a, int b)
+{
+	int a1 = (a >= 0)? a : -a;
+	int b1 = (b >= 0)? b : -b;
+
+	return (a1 < b1) || (a1 == b1 && a > b);
+}
+
+/*
+ * reorder_links
+ *
+ * reorder the links out of a switch in sign/dimension order
+ */
+static int reorder_links(lash_t *p_lash, int sw)
+{
+	osm_log_t *p_log = &p_lash->p_osm->log;
+	switch_t *s = p_lash->switches[sw];
+	mesh_node_t *node = s->node;
+	int n = node->num_links;
+	link_t **links;
+	int *axes;
+	int i, j;
+	int c;
+	int next = 0;
+
+	if (!(links = calloc(n, sizeof(link_t *)))) {
+		OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating temp array - out of memory\n");
+		return -1;
+	}
+
+	if (!(axes = calloc(n, sizeof(int)))) {
+		free(links);
+		OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating temp array - out of memory\n");
+		return -1;
+	}
+
+	/*
+	 * find the links with axes
+	 */
+	for (j = 1; j <= 2*node->dimension; j++) {
+		c = j;
+		if (node->coord[(c-1)/2] > 0)
+			c = opposite(s, c);
+
+		for (i = 0; i < n; i++) {
+			if (!node->links[i])
+				continue;
+			if (node->axes[i] == c) {
+				links[next] = node->links[i];
+				axes[next] = node->axes[i];
+				node->links[i] = NULL;
+				next++;
+			}
+		}
+	}
+
+	/*
+	 * get the rest
+	 */
+	for (i = 0; i < n; i++) {
+		if (!node->links[i])
+			continue;
+
+		links[next] = node->links[i];
+		axes[next] = node->axes[i];
+		node->links[i] = NULL;
+		next++;
+	}
+
+	for (i = 0; i < n; i++) {
+		node->links[i] = links[i];
+		node->axes[i] = axes[i];
+	}
+
+	free(links);
+	free(axes);
+
+	return 0;
+}
+
+/*
+ * measure geometry
+ */
+static int measure_geometry(lash_t *p_lash, int seed)
+{
+	int i, j, k;
+	int sw;
+	switch_t *s, *s1;
+	int change;
+	int dimension = p_lash->mesh->dimension;
+	int num_switches = p_lash->num_switches;
+	int assigned_axes = 0, unassigned_axes = 0;
+	int *max, *min;
+
+	for (sw = 0; sw < num_switches; sw++) {
+		s = p_lash->switches[sw];
+
+		s->node->coord = calloc(dimension, sizeof(int));
+		for (i = 0; i < dimension; i++)
+			s->node->coord[i] = (sw == seed)? 0 : 0x7fffffff;
+
+		for (i = 0; i < s->node->num_links; i++)
+			if (s->node->axes[i] == 0)
+				unassigned_axes++;
+			else
+				assigned_axes++;
+	}
+
+	printf("lash: %d/%d unassigned/assigned axes\n", unassigned_axes, assigned_axes);
+
+	do {
+		change = 0;
+
+		for (sw = 0; sw < num_switches; sw++) {
+			s = p_lash->switches[sw];
+
+			if (s->node->coord[0] == 0x7fffffff)
+				continue;
+
+			for (j = 0; j < s->node->num_links; j++) {
+				if (!s->node->axes[j])
+					continue;
+
+				s1 = p_lash->switches[s->node->links[j]->switch_id];
+
+				for (k = 0; k < dimension; k++) {
+					int coord = s->node->coord[k];
+					int axis = s->node->axes[j] - 1;
+
+					if (k == axis/2)
+						coord += (axis & 1)? -1 : +1;
+
+					if (ltmag(coord, s1->node->coord[k])) {
+						s1->node->coord[k] = coord;
+						change++;
+					}
+				}
+			}
+		}
+	} while (change);
+
+	for (sw = 0; sw < num_switches; sw++) {
+		if (reorder_links(p_lash, sw))
+			return -1;
+	}
+
+	max = calloc(dimension, sizeof(int));
+	min = calloc(dimension, sizeof(int));
+	p_lash->mesh->size = calloc(dimension, sizeof(int));
+
+	for (i = 0; i < dimension; i++) {
+		max[i] = -0x7fffffff;
+		min[i] = 0x7fffffff;
+	}
+
+	for (sw = 0; sw < num_switches; sw++) {
+		s = p_lash->switches[sw];
+
+		for (i = 0; i < dimension; i++) {
+			if (s->node->coord[i] == 0x7fffffff)
+				continue;
+			if (s->node->coord[i] > max[i])
+				max[i] = s->node->coord[i];
+			if (s->node->coord[i] < min[i])
+				min[i] = s->node->coord[i];
+		}
+	}
+
+	for (i = 0; i < dimension; i++)
+		p_lash->mesh->size[i] = max[i] - min[i] + 1;
+
+	return 0;
+}
+
+/*
  * osm_mesh_cleanup - free per mesh resources
  */
 void osm_mesh_cleanup(lash_t *p_lash)
@@ -941,6 +1118,14 @@ int osm_do_mesh_analysis(lash_t *p_lash)
 
 	if (s->node->type) {
 		make_geometry(p_lash, max_class_type);
+
+		if (measure_geometry(p_lash, max_class_type))
+			return -1;
+
+		printf("lash: found ");
+		for (i = 0; i < mesh->dimension; i++)
+			printf("%s%d", i? "X" : "", mesh->size[i]);
+		printf(" mesh\n");
 	}
 
 	OSM_LOG_EXIT(p_log);


From frederic.ciesielski at hp.com  Tue Nov 11 10:06:21 2008
From: frederic.ciesielski at hp.com (Ciesielski, Frederic (EMEA HPC&OSLO CC))
Date: Tue, 11 Nov 2008 18:06:21 +0000
Subject: FW: [ofa-general] NFS-RDMA (OFED1.4) with standard  distributions ?
Message-ID: <7391130E01ED404FBD7A3C86731EEB7D20ECAB4C0F@GVW1087EXB.americas.hpqcorp.net>

Well, I did not plan to test every possible kernel version; improvements are surely on their way, which just confirms the assumption that this 'technology' is not mature yet.

With IPoIB an NFS server can easily export (for instance) up to 1.2GB/s (at least this is what I can measure), with the data in the page cache. No problem up to that point at least.
I clearly understand the theoretical benefits of RDMA, and it is a clear improvement over TCP for MPI. However, the most drastic change for MPI is on the latency side, though the peak message bandwidth also improves, as one might expect for NFS.
Registration/deregistration issues are also well known to the MPI developers, and all this is certainly not that easy to manage in other areas.

Still, NFS-RDMA remains NFS. If the bottleneck is not in the transport, RDMA will not improve anything from the performance point of view.
Even worse, what I saw with the 2.6.27 kernel + OFED1.4-rc3 is the inability of NFS-RDMA to match the performance of NFS-TCP for some IOzone patterns, with a filesystem that can itself sustain several hundred MB/s (using exactly the same hardware and software in both cases). We are far from a pure IB bandwidth issue here; we are probably facing an issue in how the requests are handled, perhaps when paging occurs, I can't tell.
I could not find any tuning to solve the most obvious problem, i.e. the low read bandwidth, except mounting with '-o rsize=4096'; probably not what people expect, as this will have other effects. Anyway, this improves only the sequential read bandwidth.
But of course I will repeat my tests with the latest release of everything when I have time, still making sure I compare apples to apples...
Again, I'm sure improvements are on their way!

Fred.


-----Original Message-----
From: Talpey, Thomas [mailto:Thomas.Talpey at netapp.com]
Sent: Tuesday, 11 November, 2008 17:02
To: Ciesielski, Frederic (EMEA HPC&OSLO CC)
Cc: Jeff Becker; general at lists.openfabrics.org
Subject: RE: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?

At 11:27 AM 11/10/2008, Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
>That's great, thanks.
>
>I ran some tests with the 2.6.27 kernel as server and client, and
>basically it works fine.
>
>I have not yet found any situation where NFS-RDMA outperforms
>NFS/IPoIB, at least when you compare apples to apples (same clients,
>same server, same protocol, and not just writing to/reading from the
>caches), and it even seems to have severe performance issues when
>reading files larger than the memory size of the client and the server.
>Hopefully this will improve when more users are able to give
>valuable feedback...

I have a couple of questions, and perhaps suggestions as well.
First the questions...

- Have you tried with a 2.6.28-rc4 client and server at all? There are a number of significant NFS/RDMA improvements queued in kernel.org, especially around RDMA memory registration as well as RDMA operation scheduling. We've seen some significant throughput improvement even for basic tunings.

- What type of storage are you using at the server, and have you attempted to tune the server at all? For example, if you are storage
(spindle) limited, no network tuning is likely to help and you should address that first. Also, there are tunings such as nfsd thread count, export options, and adapter choice that can make a large difference.

Bottom line, you should be able to reach multi-hundred-MB/sec of read/write throughput with NFS/RDMA, but there may be issues on specific systems, or perhaps with the OFED1.4 code, that need to be accounted for. If possible, you may want to set expectations based on mainline, then try to duplicate them in the OFED backport.
The current OFED NFS/RDMA support is still evolving, while we consider the mainline kernel.org version to be rather solid.

Tom.

>
>Fred.
>
>-----Original Message-----
>From: Jeff Becker [mailto:Jeffrey.C.Becker at nasa.gov]
>Sent: Saturday, 08 November, 2008 22:35
>To: Ciesielski, Frederic (EMEA HPC&OSLO CC)
>Cc: general at lists.openfabrics.org
>Subject: Re: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?
>
>Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
>> Is there any chance that the new NFS-RDMA features coming with OFED
>> 1.4 work with standard and current distributions, like RHEL5, SLES10 ?
>Not yet, but I'm working on it. I intend for NFSRDMA to work on 2.6.27
>and 2.6.26 for OFED 1.4. The RHEL5 and SLES10 backports will likely be
>done for OFED 1.4.1. Thanks.
>
>-jeff
>
>> Did anybody test this, or would anybody claim it is supposed to work?
>>
>> I mean without building a 2.6.27 or equivalent kernel on top of it,
>> keeping almost full support from the vendors.
>>
>> Enhanced kernel modules may not be sufficient to work around the
>> limitations of old kernels...
>>
>>
>>


From Thomas.Talpey at netapp.com  Tue Nov 11 10:57:04 2008
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Tue, 11 Nov 2008 13:57:04 -0500
Subject: FW: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?
In-Reply-To: <7391130E01ED404FBD7A3C86731EEB7D20ECAB4C0F@GVW1087EXB.americas.hpqcorp.net>
References: <7391130E01ED404FBD7A3C86731EEB7D20ECAB4C0F@GVW1087EXB.americas.hpqcorp.net>
Message-ID: <RTPCLUEXC2-PRD1XvpF0000011c@RTPMVEXC1-PRD.hq.netapp.com>

At 01:06 PM 11/11/2008, Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
>Well, I did not plan to test every possible kernel version;
>improvements are surely on their way, which just confirms the
>assumption that this 'technology' is not mature yet.

First, let's be sure to separate NFS/RDMA OFED issues from core NFS/RDMA.
The OFED1.4 release is the first to support NFS/RDMA, and there are
certainly issues remaining in this new backport. Depending on which kernel
you're targeting, there can be other issues - SLES10 is 2.6.16-based, for
example, and RHEL5 is 2.6.18. The NFS code itself (not just NFS/RDMA) has
evolved significantly since then, and continues to do so.

>With IPoIB an NFS server can easily export (for instance) up to 
>1.2GB/s (at least this is what I can measure), with the data in the 
>page cache. No problem up to that point at least.

This is impressive, by the way. I have not seen any results with NFS/IPoIB
at this level. Most client machines run out of CPU far before this.

>I clearly understand the theoretical benefits of RDMA, and it is a clear
>improvement over TCP for MPI. However, the most drastic change for MPI is
>on the latency side, though the peak message bandwidth also improves,
>as one might expect for NFS.
>Registration/deregistration issues are also well known to the MPI
>developers, and all this is certainly not that easy to manage in other areas.
>
>Still, NFS-RDMA remains NFS. If the bottleneck is not in the
>transport, RDMA will not improve anything from the performance point of view.
>Even worse, what I saw with the 2.6.27 kernel + OFED1.4-rc3 is the
>inability of NFS-RDMA to match the performance of NFS-TCP for some
>IOzone patterns, with a filesystem that can itself sustain several
>hundred MB/s (using exactly the same hardware and software in both
>cases). We are far from a pure IB bandwidth issue here; we are probably
>facing an issue in how the requests are handled, perhaps when
>paging occurs, I can't tell.

I'd be very interested in any analysis of this which you may have done. One
thought that comes to mind is the possibility that your server's filesystem
performs less well at the 32KB read/write sizes that the NFS/RDMA client is
currently limited to. If you were measuring large-sequential workloads, then
you might be able to measure a difference, particularly when exporting the
filesystem in the default "sync" mode. NFS/TCP can send up to 1MB writes.
This is something we plan to address now that the FRMR memory registration
mode is available.

>I could not find any tuning to solve the most obvious problem, i.e.
>the low read bandwidth, except mounting with '-o rsize=4096';

Ouch! That will severely limit the client, forcing it to send MANY more RPC
requests. Did performance increase with this setting? For iozone with what
options?
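As a rough illustration of the request-count effect (a sketch only: the
1 GiB transfer size and the helper name rpcs_needed() are assumptions made
for the example, not measurements from this thread), the number of READ
RPCs scales inversely with rsize:

```c
#include <assert.h>

/*
 * Number of READ RPCs needed to move file_bytes when each RPC carries at
 * most rsize bytes. This is a plain ceiling division; it ignores
 * readahead, caching and any request coalescing done by the client.
 */
static unsigned long long rpcs_needed(unsigned long long file_bytes,
				      unsigned long long rsize)
{
	assert(rsize > 0);	/* guard against division by zero */
	return (file_bytes + rsize - 1) / rsize;
}
```

For a 1 GiB sequential read, rpcs_needed(1ULL << 30, 4096) gives 262144
requests, against 32768 at 32KB and 1024 at 1MB.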

>probably not what people expect, as this will have other effects.
>Anyway, this improves only the sequential read bandwidth.
>But of course I will repeat my tests with the latest release of
>everything when I have time, still making sure I compare apples to apples...
>Again, I'm sure improvements are on their way!

I would look forward to seeing your opinions of the new code, particularly for
the server performance. Thanks for the info so far!

Tom.


>
>Fred.
>
>
>-----Original Message-----
>From: Talpey, Thomas [mailto:Thomas.Talpey at netapp.com]
>Sent: Tuesday, 11 November, 2008 17:02
>To: Ciesielski, Frederic (EMEA HPC&OSLO CC)
>Cc: Jeff Becker; general at lists.openfabrics.org
>Subject: RE: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?
>
>At 11:27 AM 11/10/2008, Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
>>That's great, thanks.
>>
>>I ran some tests with the 2.6.27 kernel as server and client, and
>>basically it works fine.
>>
>>I have not yet found any situation where NFS-RDMA outperforms
>>NFS/IPoIB, at least when you compare apples to apples (same clients,
>>same server, same protocol, and not just writing to/reading from the
>>caches), and it even seems to have severe performance issues when
>>reading files larger than the memory size of the client and the server.
>>Hopefully this will improve when more users are able to give
>>valuable feedback...
>
>I have a couple of questions, and perhaps suggestions as well.
>First the questions...
>
>- Have you tried with a 2.6.28-rc4 client and server at all? There are 
>a number of significant NFS/RDMA improvements queued in kernel.org, 
>especially around RDMA memory registration as well as RDMA operation 
>scheduling. We've seen some significant throughput improvement even 
>for basic tunings.
>
>- What type of storage are you using at the server, and have you 
>attempted to tune the server at all? For example, if you are storage
>(spindle) limited, no network tuning is likely to help and you should 
>address that first. Also, there are tunings such as nfsd thread count, 
>export options, and adapter choice that can make a large difference.
>
>Bottom line, you should be able to reach multi-hundred-MB/sec of 
>read/write throughput with NFS/RDMA, but there may be issues on 
>specific systems, or perhaps with the OFED1.4 code, that need to be 
>accounted for. If possible, you may want to set expectations based on 
>mainline, then try to duplicate them in the OFED backport.
>The current OFED NFS/RDMA support is still evolving, while we consider 
>the mainline kernel.org version to be rather solid.
>
>Tom.
>
>>
>>Fred.
>>
>>-----Original Message-----
>>From: Jeff Becker [mailto:Jeffrey.C.Becker at nasa.gov]
>>Sent: Saturday, 08 November, 2008 22:35
>>To: Ciesielski, Frederic (EMEA HPC&OSLO CC)
>>Cc: general at lists.openfabrics.org
>>Subject: Re: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?
>>
>>Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
>>> Is there any chance that the new NFS-RDMA features coming with OFED
>>> 1.4 work with standard and current distributions, like RHEL5, SLES10 ?
>>Not yet, but I'm working on it. I intend for NFSRDMA to work on 2.6.27
>>and 2.6.26 for OFED 1.4. The RHEL5 and SLES10 backports will likely be
>>done for OFED 1.4.1. Thanks.
>>
>>-jeff
>>
>>> Did anybody test this, or would anybody claim it is supposed to work?
>>>
>>> I mean without building a 2.6.27 or equivalent kernel on top of it,
>>> keeping almost full support from the vendors.
>>>
>>> Enhanced kernel modules may not be sufficient to work around the
>>> limitations of old kernels...
>>>
>>>
>>>


From sashak at voltaire.com  Tue Nov 11 11:19:58 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 11 Nov 2008 21:19:58 +0200
Subject: [ofa-general] Re: [opensm patch][1/2] fix qos config parsing bugs
In-Reply-To: <1225404078.1197.533.camel@cardanus.llnl.gov>
References: <1225404078.1197.533.camel@cardanus.llnl.gov>
Message-ID: <20081111191958.GA8894@sashak.voltaire.com>

Hi Al,

On 15:01 Thu 30 Oct     , Al Chu wrote:
> 
> I found a bunch of qos config parsing issues, listed below:
> 
> 1)
> 
> If the user sets the qos default fields (i.e. qos_high_limit,
> qos_vlarb_high, etc.), but does not have the qos_ca, qos_swe, qos_rtr,
> etc. equivalent fields listed (i.e. qos_ca_high_limit,
> qos_sw0_vlarb_high), the values set in the qos default fields are not
> loaded into the CAs, switches, etc.  The reason is that in
> qos_build_config() we load defaults like this:
> 
> p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;
> 
> but we always set the fields to something non-NULL.
> 
> static void subn_set_default_qos_options(IN osm_qos_options_t * opt)
> {
>         opt->max_vls = OSM_DEFAULT_QOS_MAX_VLS;
>         opt->high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT;
>         opt->vlarb_high = OSM_DEFAULT_QOS_VLARB_HIGH;
>         opt->vlarb_low = OSM_DEFAULT_QOS_VLARB_LOW;
>         opt->sl2vl = OSM_DEFAULT_QOS_SL2VL;
> }

Yes, we are setting this to the default qos set (if not explicitly
specified by the user). So in the end we always have a valid set, no?
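
For reference, the case described in (1) can be reproduced in isolation.
This is only a sketch: struct qos_opts, resolve_vlarb() and the table
string are stand-ins invented for the example, not the OpenSM definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder for the compiled-in default; not the real
 * OSM_DEFAULT_QOS_VLARB_HIGH value. */
#define HARD_DEFAULT_VLARB "0:4"

/* Cut-down stand-in for osm_qos_options_t, keeping a single field. */
struct qos_opts {
	const char *vlarb_high;
};

/* Same shape as the fallback expression quoted above from
 * qos_build_config(): use the per-type value unless it is NULL. */
static const char *resolve_vlarb(const struct qos_opts *opt,
				 const struct qos_opts *dflt)
{
	assert(opt != NULL && dflt != NULL);
	return opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;
}
```

If the per-type field was pre-filled with HARD_DEFAULT_VLARB,
resolve_vlarb() returns that string and a user-supplied qos_vlarb_high in
dflt is never consulted; only a NULL per-type field reaches the fallback.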

> 2)
> 
> In qos_build_config() we load the high_limit like this:
> 
> cfg->vl_high_limit = (uint8_t) opt->high_limit;
> 
> So there is no way to tell the qos_ca, qos_swe, qos_rtr, etc. high_limit
> options to "go back to" the default high_limit.  It just assumes that
> whatever is input (or was set by default) is what you should use.

Right. What is the limitation here? That a user cannot set this to
"no value"? But she/he can just skip it.

> 3)
> 
> Some fields like qos_vlarb_high are assumed to be correctly set and can
> segfault opensm.

qos_build_config() assumes that valid parameters are used. And we are
using it this way (I hope :)); after all, it is not a library API.

> The attached patch fixes these up.  Obviously there's tons of ways to
> do this.  I decided to ...
> 
> A) only initialize qos_options to the real defaults
> 
> B) init all qos_*_options to sentinel values (-1, NULL, etc.) to
> indicate it should use the configured defaults if they aren't set by the
> user.  The high_limit was changed from an unsigned to an int b/c 0 is a
> valid high_limit value.
> 
> C) verify that the default qos inputs are definitely correct (i.e. can't
> be NULL).  Reset to hard coded defaults if need be.
> 
> D) load the default vs. non-default appropriately in QoS.
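
For high_limit, points (B) and (D) above amount to roughly the following.
This is a minimal sketch, assuming -1 as the "not set" sentinel as in the
patch; pick_high_limit() is a hypothetical helper and not an OpenSM
function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * A negative per-type value means "not set by the user", so the default
 * is used instead. Zero is a legal high_limit and must be kept as is,
 * which is why the sentinel test has to come before the narrowing cast.
 */
static uint8_t pick_high_limit(int opt_high_limit, int dflt_high_limit)
{
	assert(dflt_high_limit >= 0 && dflt_high_limit <= 255);
	if (opt_high_limit >= 0)
		return (uint8_t) opt_high_limit;
	return (uint8_t) dflt_high_limit;
}
```

With this, pick_high_limit(-1, 4) yields 4 and pick_high_limit(0, 4)
yields 0, whereas casting -1 straight to uint8_t would give 255.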

And I see that with this we have many more, sometimes non-trivial, flows,
and default values are spread over many places... :(

Sasha

> 
> Al
> 
> P.S.  This patch does not rely on my previous "remove qos_max_vls
> config" patch.  I assume we're keeping the max_vls fields in this patch.
> 
> -- 
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory

> From 00a15a1797b79fd5e3298d98742b6da3613fb9c3 Mon Sep 17 00:00:00 2001
> From: root <root at wopri.(none)>
> Date: Thu, 30 Oct 2008 09:32:29 -0700
> Subject: [PATCH] fix qos config parsing bugs
> 
> 
> Signed-off-by: root <root at wopri.(none)>
> ---
>  opensm/include/opensm/osm_subnet.h |   12 +-
>  opensm/opensm/osm_qos.c            |    6 +-
>  opensm/opensm/osm_subnet.c         |  467 ++++++++++++++++++++++--------------
>  3 files changed, 293 insertions(+), 192 deletions(-)
> 
> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
> index 7259587..11063b7 100644
> --- a/opensm/include/opensm/osm_subnet.h
> +++ b/opensm/include/opensm/osm_subnet.h
> @@ -99,7 +99,7 @@ struct osm_qos_policy;
>  */
>  typedef struct osm_qos_options {
>  	unsigned max_vls;
> -	unsigned high_limit;
> +	int high_limit;
>  	char *vlarb_high;
>  	char *vlarb_low;
>  	char *sl2vl;
> @@ -108,20 +108,20 @@ typedef struct osm_qos_options {
>  * FIELDS
>  *
>  *	max_vls
> -*		The number of maximum VLs on the Subnet
> +*		The number of maximum VLs on the Subnet (0 == use default)
>  *
>  *	high_limit
>  *		The limit of High Priority component of VL Arbitration
> -*		table (IBA 7.6.9)
> +*		table (IBA 7.6.9) (-1 == use default)
>  *
>  *	vlarb_high
> -*		High priority VL Arbitration table template.
> +*		High priority VL Arbitration table template. (NULL == use default)
>  *
>  *	vlarb_low
> -*		Low priority VL Arbitration table template.
> +*		Low priority VL Arbitration table template. (NULL == use default)
>  *
>  *	sl2vl
> -*		SL2VL Mapping table (IBA 7.6.6) template.
> +*		SL2VL Mapping table (IBA 7.6.6) template. (NULL == use default)
>  *
>  *********/
>  
> diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
> index 1679ae0..b451c25 100644
> --- a/opensm/opensm/osm_qos.c
> +++ b/opensm/opensm/osm_qos.c
> @@ -382,7 +382,11 @@ static void qos_build_config(struct qos_config *cfg,
>  	memset(cfg, 0, sizeof(*cfg));
>  
>  	cfg->max_vls = opt->max_vls > 0 ? opt->max_vls : dflt->max_vls;
> -	cfg->vl_high_limit = (uint8_t) opt->high_limit;
> +
> +	if (opt->high_limit >= 0)
> +		cfg->vl_high_limit = (uint8_t) opt->high_limit;
> +	else
> +		cfg->vl_high_limit = (uint8_t) dflt->high_limit;
>  
>  	p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;
>  	for (i = 0; i < 2 * IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; i++) {
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index 0422d0f..ab2ff9c 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -370,6 +370,15 @@ static void subn_set_default_qos_options(IN osm_qos_options_t * opt)
>  	opt->sl2vl = OSM_DEFAULT_QOS_SL2VL;
>  }
>  
> +static void subn_init_qos_options(IN osm_qos_options_t * opt)
> +{
> +	opt->max_vls = 0;
> +	opt->high_limit = -1;
> +	opt->vlarb_high = NULL;
> +	opt->vlarb_low = NULL;
> +	opt->sl2vl = NULL;
> +}
> +
>  /**********************************************************************
>   **********************************************************************/
>  void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
> @@ -458,10 +467,10 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
>  	p_opt->prefix_routes_file = OSM_DEFAULT_PREFIX_ROUTES_FILE;
>  	p_opt->consolidate_ipv6_snm_req = FALSE;
>  	subn_set_default_qos_options(&p_opt->qos_options);
> -	subn_set_default_qos_options(&p_opt->qos_ca_options);
> -	subn_set_default_qos_options(&p_opt->qos_sw0_options);
> -	subn_set_default_qos_options(&p_opt->qos_swe_options);
> -	subn_set_default_qos_options(&p_opt->qos_rtr_options);
> +	subn_init_qos_options(&p_opt->qos_ca_options);
> +	subn_init_qos_options(&p_opt->qos_sw0_options);
> +	subn_init_qos_options(&p_opt->qos_swe_options);
> +	subn_init_qos_options(&p_opt->qos_rtr_options);
>  }
>  
>  /**********************************************************************
> @@ -497,6 +506,7 @@ opts_unpack_net64(IN char *p_req_key,
>  	}
>  }
>  
> +
>  /**********************************************************************
>   **********************************************************************/
>  static void
> @@ -511,6 +521,20 @@ opts_unpack_uint32(IN char *p_req_key,
>  		}
>  	}
>  }
> +/**********************************************************************
> + **********************************************************************/
> +static void
> +opts_unpack_int32(IN char *p_req_key,
> +		  IN char *p_key, IN char *p_val_str, IN int32_t * p_val)
> +{
> +	if (!strcmp(p_req_key, p_key)) {
> +		int32_t val = strtol(p_val_str, NULL, 0);
> +		if (val != *p_val) {
> +			log_config_value(p_key, "%d", val);
> +			*p_val = val;
> +		}
> +	}
> +}
>  
>  /**********************************************************************
>   **********************************************************************/
> @@ -641,7 +665,7 @@ subn_parse_qos_options(IN const char *prefix,
>  	snprintf(name, sizeof(name), "%s_max_vls", prefix);
>  	opts_unpack_uint32(name, p_key, p_val_str, &opt->max_vls);
>  	snprintf(name, sizeof(name), "%s_high_limit", prefix);
> -	opts_unpack_uint32(name, p_key, p_val_str, &opt->high_limit);
> +	opts_unpack_int32(name, p_key, p_val_str, &opt->high_limit);
>  	snprintf(name, sizeof(name), "%s_vlarb_high", prefix);
>  	opts_unpack_charp(name, p_key, p_val_str, &opt->vlarb_high);
>  	snprintf(name, sizeof(name), "%s_vlarb_low", prefix);
> @@ -653,7 +677,9 @@ subn_parse_qos_options(IN const char *prefix,
>  static int
>  subn_dump_qos_options(FILE * file,
>  		      const char *set_name,
> -		      const char *prefix, osm_qos_options_t * opt)
> +		      const char *prefix, 
> +		      osm_qos_options_t * opt,
> +		      osm_qos_options_t * dflt)
>  {
>  	return fprintf(file, "# %s\n"
>  		       "%s_max_vls %u\n"
> @@ -662,10 +688,11 @@ subn_dump_qos_options(FILE * file,
>  		       "%s_vlarb_low %s\n"
>  		       "%s_sl2vl %s\n",
>  		       set_name,
> -		       prefix, opt->max_vls,
> -		       prefix, opt->high_limit,
> -		       prefix, opt->vlarb_high,
> -		       prefix, opt->vlarb_low, prefix, opt->sl2vl);
> +		       prefix, opt->max_vls > 0 ? opt->max_vls : dflt->max_vls,
> +		       prefix, opt->high_limit >= 0 ? opt->high_limit : dflt->high_limit,
> +		       prefix, opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high,
> +		       prefix, opt->vlarb_low ? opt->vlarb_low : dflt->vlarb_low, 
> +		       prefix, opt->sl2vl ? opt->sl2vl : dflt->sl2vl);
>  }
>  
>  /**********************************************************************
> @@ -833,169 +860,182 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
>  /**********************************************************************
>   **********************************************************************/
>  
> -static void subn_verify_max_vls(IN unsigned *max_vls, IN char *key)
> +static void subn_verify_max_vls(IN unsigned *max_vls, IN char *key, IN unsigned dflt)
>  {
>  	char buff[128];
>  
> -	if (*max_vls > 15) {
> +	if (!(*max_vls) || *max_vls > 15) {
>  		sprintf(buff, " Invalid Cached Option:%s=%u:"
> -			"Using Default:%u\n",
> -			key, *max_vls, OSM_DEFAULT_QOS_MAX_VLS);
> +			"Using Default\n",
> +			key, *max_vls);
>  		printf(buff);
>  		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> -		*max_vls = OSM_DEFAULT_QOS_MAX_VLS;
> +		*max_vls = dflt;
>  	}
>  }
>  
> -static void subn_verify_high_limit(IN unsigned *high_limit, IN char *key)
> +static void subn_verify_high_limit(IN int *high_limit, IN char *key, IN int dflt)
>  {
>  	char buff[128];
>  
> -	if (*high_limit > 255) {
> -		sprintf(buff, " Invalid Cached Option:%s=%u:"
> -			"Using Default:%u\n",
> -			key, *high_limit, OSM_DEFAULT_QOS_HIGH_LIMIT);
> +	if (*high_limit < 0 || *high_limit > 255) {
> +		sprintf(buff, " Invalid Cached Option:%s=%d:"
> +			"Using Default\n", key, *high_limit);
>  		printf(buff);
>  		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> -		*high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT;
> +		*high_limit = dflt;
>  	}
>  }
>  
> -static void subn_verify_vlarb(IN char *vlarb, IN char *key)
> +static void subn_verify_vlarb(IN char **vlarb, IN char *key, IN char *dflt)
>  {
> -	if (vlarb) {
> -		char buff[128];
> -		char *str, *tok, *end, *ptr;
> -		int count = 0;
> -
> -		str = (char *)malloc(strlen(vlarb) + 1);
> -		strcpy(str, vlarb);
> -
> -		tok = strtok_r(str, ",\n", &ptr);
> -		while (tok) {
> -			char *vl_str, *weight_str;
> -
> -			vl_str = tok;
> -			weight_str = strchr(tok, ':');
> -
> -			if (weight_str) {
> -				long vl, weight;
> -
> -				*weight_str = '\0';
> -				weight_str++;
> -
> -				vl = strtol(vl_str, &end, 0);
> -
> -				if (*end) {
> -					sprintf(buff,
> -						" Warning: Cached Option %s:vl=%s improperly formatted\n",
> -						key, vl_str);
> -					printf(buff);
> -					cl_log_event("OpenSM", CL_LOG_INFO,
> -						     buff, NULL, 0);
> -				} else if (vl < 0 || vl > 14) {
> -					sprintf(buff,
> -						" Warning: Cached Option %s:vl=%ld out of range\n",
> -						key, vl);
> -					printf(buff);
> -					cl_log_event("OpenSM", CL_LOG_INFO,
> -						     buff, NULL, 0);
> -				}
> -
> -				weight = strtol(weight_str, &end, 0);
> -
> -				if (*end) {
> -					sprintf(buff,
> -						" Warning: Cached Option %s:weight=%s improperly formatted\n",
> -						key, weight_str);
> -					printf(buff);
> -					cl_log_event("OpenSM", CL_LOG_INFO,
> -						     buff, NULL, 0);
> -				} else if (weight < 0 || weight > 255) {
> -					sprintf(buff,
> -						" Warning: Cached Option %s:weight=%ld out of range\n",
> -						key, weight);
> -					printf(buff);
> -					cl_log_event("OpenSM", CL_LOG_INFO,
> -						     buff, NULL, 0);
> -				}
> -			} else {
> -				sprintf(buff,
> -					" Warning: Cached Option %s:vl:weight=%s improperly formatted\n",
> -					key, tok);
> -				printf(buff);
> -				cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
> -					     0);
> -			}
> +	char buff[128];
> +	char *str, *tok, *end, *ptr;
> +	int count = 0;
>  
> -			count++;
> -			tok = strtok_r(NULL, ",\n", &ptr);
> -		}
> +	if (*vlarb == NULL) {
> +		sprintf(buff, " Invalid Cached Option:%s:"
> +			"Using Default\n", key);
> +		printf(buff);
> +		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> +		(*vlarb) = dflt;			
> +		return;
> +	}
>  
> -		if (count > 64) {
> -			sprintf(buff,
> -				" Warning: Cached Option %s: > 64 listed: "
> -				"excess vl:weight pairs will be dropped\n",
> -				key);
> -			printf(buff);
> -			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> -		}
> +	str = (char *)malloc(strlen(*vlarb) + 1);
> +	strcpy(str, *vlarb);
>  
> -		free(str);
> -	}
> -}
> +	tok = strtok_r(str, ",\n", &ptr);
> +	while (tok) {
> +		char *vl_str, *weight_str;
>  
> -static void subn_verify_sl2vl(IN char *sl2vl, IN char *key)
> -{
> -	if (sl2vl) {
> -		char buff[128];
> -		char *str, *tok, *end, *ptr;
> -		int count = 0;
> +		vl_str = tok;
> +		weight_str = strchr(tok, ':');
>  
> -		str = (char *)malloc(strlen(sl2vl) + 1);
> -		strcpy(str, sl2vl);
> +		if (weight_str) {
> +			long vl, weight;
>  
> -		tok = strtok_r(str, ",\n", &ptr);
> -		while (tok) {
> -			long vl = strtol(tok, &end, 0);
> +			*weight_str = '\0';
> +			weight_str++;
> +
> +			vl = strtol(vl_str, &end, 0);
>  
>  			if (*end) {
>  				sprintf(buff,
>  					" Warning: Cached Option %s:vl=%s improperly formatted\n",
> -					key, tok);
> +					key, vl_str);
>  				printf(buff);
> -				cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
> -					     0);
> -			} else if (vl < 0 || vl > 15) {
> +				cl_log_event("OpenSM", CL_LOG_INFO,
> +					     buff, NULL, 0);
> +			} else if (vl < 0 || vl > 14) {
>  				sprintf(buff,
>  					" Warning: Cached Option %s:vl=%ld out of range\n",
>  					key, vl);
>  				printf(buff);
> -				cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
> -					     0);
> +				cl_log_event("OpenSM", CL_LOG_INFO,
> +					     buff, NULL, 0);
>  			}
>  
> -			count++;
> -			tok = strtok_r(NULL, ",\n", &ptr);
> -		}
> +			weight = strtol(weight_str, &end, 0);
>  
> -		if (count < 16) {
> +			if (*end) {
> +				sprintf(buff,
> +					" Warning: Cached Option %s:weight=%s improperly formatted\n",
> +					key, weight_str);
> +				printf(buff);
> +				cl_log_event("OpenSM", CL_LOG_INFO,
> +					     buff, NULL, 0);
> +			} else if (weight < 0 || weight > 255) {
> +				sprintf(buff,
> +					" Warning: Cached Option %s:weight=%ld out of range\n",
> +					key, weight);
> +				printf(buff);
> +				cl_log_event("OpenSM", CL_LOG_INFO,
> +					     buff, NULL, 0);
> +			}
> +		} else {
>  			sprintf(buff,
> -				" Warning: Cached Option %s: < 16 VLs listed\n",
> -				key);
> +				" Warning: Cached Option %s:vl:weight=%s improperly formatted\n",
> +				key, tok);
>  			printf(buff);
> -			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> +			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
> +				     0);
>  		}
> -		if (count > 16) {
> +
> +		count++;
> +		tok = strtok_r(NULL, ",\n", &ptr);
> +	}
> +
> +	if (count > 64) {
> +		sprintf(buff,
> +			" Warning: Cached Option %s: > 64 listed: "
> +			"excess vl:weight pairs will be dropped\n",
> +			key);
> +		printf(buff);
> +		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> +	}
> +
> +	free(str);
> +}
> +
> +static void subn_verify_sl2vl(IN char **sl2vl, IN char *key, IN char *dflt)
> +{
> +	char buff[128];
> +	char *str, *tok, *end, *ptr;
> +	int count = 0;
> +
> +	if (*sl2vl == NULL) {
> +		sprintf(buff, " Invalid Cached Option:%s:"
> +			"Using Default\n", key);
> +		printf(buff);
> +		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> +		(*sl2vl) = dflt;			
> +		return;
> +	}
> +
> +	str = (char *)malloc(strlen(*sl2vl) + 1);
> +	strcpy(str, *sl2vl);
> +
> +	tok = strtok_r(str, ",\n", &ptr);
> +	while (tok) {
> +		long vl = strtol(tok, &end, 0);
> +
> +		if (*end) {
>  			sprintf(buff,
> -				" Warning: Cached Option %s: > 16 listed: "
> -				"excess VLs will be dropped\n", key);
> +				" Warning: Cached Option %s:vl=%s improperly formatted\n",
> +				key, tok);
>  			printf(buff);
> -			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> +			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
> +				     0);
> +		} else if (vl < 0 || vl > 15) {
> +			sprintf(buff,
> +				" Warning: Cached Option %s:vl=%ld out of range\n",
> +				key, vl);
> +			printf(buff);
> +			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
> +				     0);
>  		}
>  
> -		free(str);
> +		count++;
> +		tok = strtok_r(NULL, ",\n", &ptr);
> +	}
> +
> +	if (count < 16) {
> +		sprintf(buff,
> +			" Warning: Cached Option %s: < 16 VLs listed\n",
> +			key);
> +		printf(buff);
> +		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
>  	}
> +	if (count > 16) {
> +		sprintf(buff,
> +			" Warning: Cached Option %s: > 16 listed: "
> +			"excess VLs will be dropped\n", key);
> +		printf(buff);
> +		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> +	}
> +
> +	free(str);
>  }
>  
>  static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
> @@ -1046,61 +1086,113 @@ static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
>  	}
>  
>  	if (p_opts->qos) {
> +		/* the default options in qos_options must be correct.
> +		 * every other one need not be, b/c those will default
> +		 * back to whatever is in qos_options.
> +		 */
> +
>  		subn_verify_max_vls(&(p_opts->qos_options.max_vls),
> -				    "qos_max_vls");
> -		subn_verify_max_vls(&(p_opts->qos_ca_options.max_vls),
> -				    "qos_ca_max_vls");
> -		subn_verify_max_vls(&(p_opts->qos_sw0_options.max_vls),
> -				    "qos_sw0_max_vls");
> -		subn_verify_max_vls(&(p_opts->qos_swe_options.max_vls),
> -				    "qos_swe_max_vls");
> -		subn_verify_max_vls(&(p_opts->qos_rtr_options.max_vls),
> -				    "qos_rtr_max_vls");
> +				    "qos_max_vls",
> +				    OSM_DEFAULT_MAX_OP_VLS);
> +		if (p_opts->qos_ca_options.max_vls)
> +			subn_verify_max_vls(&(p_opts->qos_ca_options.max_vls),
> +					    "qos_ca_max_vls",
> +					    0);
> +		if (p_opts->qos_sw0_options.max_vls)
> +			subn_verify_max_vls(&(p_opts->qos_sw0_options.max_vls),
> +					    "qos_sw0_max_vls",
> +					    0);
> +		if (p_opts->qos_swe_options.max_vls)
> +			subn_verify_max_vls(&(p_opts->qos_swe_options.max_vls),
> +					    "qos_swe_max_vls",
> +					    0);
> +		if (p_opts->qos_rtr_options.max_vls)
> +			subn_verify_max_vls(&(p_opts->qos_rtr_options.max_vls),
> +					    "qos_rtr_max_vls",
> +					    0);
>  
>  		subn_verify_high_limit(&(p_opts->qos_options.high_limit),
> -				       "qos_high_limit");
> -		subn_verify_high_limit(&(p_opts->qos_ca_options.high_limit),
> -				       "qos_ca_high_limit");
> -		subn_verify_high_limit(&
> -				       (p_opts->qos_sw0_options.high_limit),
> -				       "qos_sw0_high_limit");
> -		subn_verify_high_limit(&
> -				       (p_opts->qos_swe_options.high_limit),
> -				       "qos_swe_high_limit");
> -		subn_verify_high_limit(&
> -				       (p_opts->qos_rtr_options.high_limit),
> -				       "qos_rtr_high_limit");
> -
> -		subn_verify_vlarb(p_opts->qos_options.vlarb_low,
> -				  "qos_vlarb_low");
> -		subn_verify_vlarb(p_opts->qos_ca_options.vlarb_low,
> -				  "qos_ca_vlarb_low");
> -		subn_verify_vlarb(p_opts->qos_sw0_options.vlarb_low,
> -				  "qos_sw0_vlarb_low");
> -		subn_verify_vlarb(p_opts->qos_swe_options.vlarb_low,
> -				  "qos_swe_vlarb_low");
> -		subn_verify_vlarb(p_opts->qos_rtr_options.vlarb_low,
> -				  "qos_rtr_vlarb_low");
> -
> -		subn_verify_vlarb(p_opts->qos_options.vlarb_high,
> -				  "qos_vlarb_high");
> -		subn_verify_vlarb(p_opts->qos_ca_options.vlarb_high,
> -				  "qos_ca_vlarb_high");
> -		subn_verify_vlarb(p_opts->qos_sw0_options.vlarb_high,
> -				  "qos_sw0_vlarb_high");
> -		subn_verify_vlarb(p_opts->qos_swe_options.vlarb_high,
> -				  "qos_swe_vlarb_high");
> -		subn_verify_vlarb(p_opts->qos_rtr_options.vlarb_high,
> -				  "qos_rtr_vlarb_high");
> -
> -		subn_verify_sl2vl(p_opts->qos_options.sl2vl, "qos_sl2vl");
> -		subn_verify_sl2vl(p_opts->qos_ca_options.sl2vl, "qos_ca_sl2vl");
> -		subn_verify_sl2vl(p_opts->qos_sw0_options.sl2vl,
> -				  "qos_sw0_sl2vl");
> -		subn_verify_sl2vl(p_opts->qos_swe_options.sl2vl,
> -				  "qos_swe_sl2vl");
> -		subn_verify_sl2vl(p_opts->qos_rtr_options.sl2vl,
> -				  "qos_rtr_sl2vl");
> +				       "qos_high_limit",
> +				       OSM_DEFAULT_QOS_HIGH_LIMIT);
> +		if (p_opts->qos_ca_options.high_limit >= 0)
> +			subn_verify_high_limit(&(p_opts->qos_ca_options.high_limit),
> +					       "qos_ca_high_limit",
> +					       -1);
> +		if (p_opts->qos_sw0_options.high_limit >= 0)
> +			subn_verify_high_limit(&
> +					       (p_opts->qos_sw0_options.high_limit),
> +					       "qos_sw0_high_limit",
> +					       -1);
> +		if (p_opts->qos_swe_options.high_limit >= 0)
> +			subn_verify_high_limit(&
> +					       (p_opts->qos_swe_options.high_limit),
> +					       "qos_swe_high_limit",
> +					       -1);
> +		if (p_opts->qos_rtr_options.high_limit >= 0)
> +			subn_verify_high_limit(&
> +					       (p_opts->qos_rtr_options.high_limit),
> +					       "qos_rtr_high_limit",
> +					       -1);
> +
> +		subn_verify_vlarb(&(p_opts->qos_options.vlarb_low),
> +				  "qos_vlarb_low",
> +				  OSM_DEFAULT_QOS_VLARB_LOW);
> +		if (p_opts->qos_ca_options.vlarb_low)
> +			subn_verify_vlarb(&(p_opts->qos_ca_options.vlarb_low),
> +					  "qos_ca_vlarb_low",
> +					  NULL);
> +		if (p_opts->qos_sw0_options.vlarb_low)
> +			subn_verify_vlarb(&(p_opts->qos_sw0_options.vlarb_low),
> +					  "qos_sw0_vlarb_low",
> +					  NULL);
> +		if (p_opts->qos_swe_options.vlarb_low)
> +			subn_verify_vlarb(&(p_opts->qos_swe_options.vlarb_low),
> +					  "qos_swe_vlarb_low",
> +					  NULL);
> +		if (p_opts->qos_rtr_options.vlarb_low)
> +			subn_verify_vlarb(&(p_opts->qos_rtr_options.vlarb_low),
> +					  "qos_rtr_vlarb_low",
> +					  NULL);
> +
> +		subn_verify_vlarb(&(p_opts->qos_options.vlarb_high),
> +				  "qos_vlarb_high",
> +				  OSM_DEFAULT_QOS_VLARB_HIGH);
> +		if (p_opts->qos_ca_options.vlarb_high)
> +			subn_verify_vlarb(&(p_opts->qos_ca_options.vlarb_high),
> +					  "qos_ca_vlarb_high",
> +					  NULL);
> +		if (p_opts->qos_sw0_options.vlarb_high)
> +			subn_verify_vlarb(&(p_opts->qos_sw0_options.vlarb_high),
> +					  "qos_sw0_vlarb_high",
> +					  NULL);
> +		if (p_opts->qos_swe_options.vlarb_high)
> +			subn_verify_vlarb(&(p_opts->qos_swe_options.vlarb_high),
> +					  "qos_swe_vlarb_high",
> +					  NULL);
> +		if (p_opts->qos_rtr_options.vlarb_high)
> +			subn_verify_vlarb(&(p_opts->qos_rtr_options.vlarb_high),
> +					  "qos_rtr_vlarb_high",
> +					  NULL);
> +
> +		subn_verify_sl2vl(&(p_opts->qos_options.sl2vl), 
> +				  "qos_sl2vl",
> +				  OSM_DEFAULT_QOS_SL2VL);
> +		if (p_opts->qos_ca_options.sl2vl)
> +			subn_verify_sl2vl(&(p_opts->qos_ca_options.sl2vl), 
> +					  "qos_ca_sl2vl",
> +					  NULL);
> +		if (p_opts->qos_sw0_options.sl2vl)
> +			subn_verify_sl2vl(&(p_opts->qos_sw0_options.sl2vl),
> +					  "qos_sw0_sl2vl",
> +					  NULL);
> +		if (p_opts->qos_swe_options.sl2vl)
> +			subn_verify_sl2vl(&(p_opts->qos_swe_options.sl2vl),
> +					  "qos_swe_sl2vl",
> +					  NULL);
> +		if (p_opts->qos_rtr_options.sl2vl)
> +			subn_verify_sl2vl(&(p_opts->qos_rtr_options.sl2vl),
> +					  "qos_rtr_sl2vl",
> +					  NULL);
>  	}
>  #ifdef ENABLE_OSM_PERF_MGR
>  	if (p_opts->perfmgr_sweep_time_s < 1) {
> @@ -1714,23 +1806,28 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts)
>  
>  	subn_dump_qos_options(opts_file,
>  			      "QoS default options", "qos",
> +			      &p_opts->qos_options,
>  			      &p_opts->qos_options);
>  	fprintf(opts_file, "\n");
>  	subn_dump_qos_options(opts_file,
>  			      "QoS CA options", "qos_ca",
> -			      &p_opts->qos_ca_options);
> +			      &p_opts->qos_ca_options,
> +			      &p_opts->qos_options);
>  	fprintf(opts_file, "\n");
>  	subn_dump_qos_options(opts_file,
>  			      "QoS Switch Port 0 options", "qos_sw0",
> -			      &p_opts->qos_sw0_options);
> +			      &p_opts->qos_sw0_options,
> +			      &p_opts->qos_options);
>  	fprintf(opts_file, "\n");
>  	subn_dump_qos_options(opts_file,
>  			      "QoS Switch external ports options", "qos_swe",
> -			      &p_opts->qos_swe_options);
> +			      &p_opts->qos_swe_options,
> +			      &p_opts->qos_options);
>  	fprintf(opts_file, "\n");
>  	subn_dump_qos_options(opts_file,
>  			      "QoS Router ports options", "qos_rtr",
> -			      &p_opts->qos_rtr_options);
> +			      &p_opts->qos_rtr_options,
> +			      &p_opts->qos_options);
>  	fprintf(opts_file, "\n");
>  
>  	fprintf(opts_file,
> -- 
> 1.5.4.5
> 
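
For anyone reading the quoted subn_verify_sl2vl() hunk in isolation, the
parsing loop can be sketched as a small standalone function. This is only
an illustration: check_sl2vl() is a hypothetical name, not something in
OpenSM, and it reports problems on stderr with a fixed format string
rather than through cl_log_event().

```c
#define _POSIX_C_SOURCE 200809L	/* for strtok_r */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of OpenSM): walks a comma-separated
 * SL2VL string, counts its entries and counts how many of them are
 * malformed or outside the 0..15 VL range.
 * Returns the number of entries, or -1 if the copy cannot be allocated.
 */
static int check_sl2vl(const char *sl2vl, int *bad)
{
	char *str, *tok, *end, *save;
	int count = 0;

	assert(sl2vl != NULL && bad != NULL);
	*bad = 0;

	/* strtok_r modifies its input, so work on a private copy */
	str = malloc(strlen(sl2vl) + 1);
	if (!str)
		return -1;
	strcpy(str, sl2vl);

	for (tok = strtok_r(str, ",\n", &save); tok;
	     tok = strtok_r(NULL, ",\n", &save)) {
		long vl = strtol(tok, &end, 0);

		/* *end != 0 means trailing junk, e.g. "3x" or "abc" */
		if (*end || vl < 0 || vl > 15) {
			/* the token is passed as data, never as the format */
			fprintf(stderr, "sl2vl entry '%s' rejected\n", tok);
			(*bad)++;
		}
		count++;
	}

	free(str);
	return count;
}
```

A caller would then compare the returned count against 16, as the patch
does, to warn about short or over-long lists.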


From landman at scalableinformatics.com  Tue Nov 11 12:17:56 2008
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 11 Nov 2008 15:17:56 -0500
Subject: FW: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?
In-Reply-To: <7391130E01ED404FBD7A3C86731EEB7D20ECAB4C0F@GVW1087EXB.americas.hpqcorp.net>
References: <7391130E01ED404FBD7A3C86731EEB7D20ECAB4C0F@GVW1087EXB.americas.hpqcorp.net>
Message-ID: <4919E874.4090409@scalableinformatics.com>


Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
> Well, I did not plan to test all the possible versions of the kernel;
> for sure improvements are on their way, which just confirms the
> assumption that this 'technology' is not mature yet.
> 
> With IPoIB an NFS server can easily export (for instance) up to
> 1.2GB/s (at least this is what I can measure), with the data in the
> page cache. No problem up to that point at least. I clearly

True ... but not so interesting to the actual data read/write case when 
it has to get back to spinning disk.

> understand the theoretical benefits of RDMA and it's a clear
> improvement over TCP, for MPI. However, the drastic change for MPI is
> even more on the latency side, though the peak message bandwidth is
> also improved as one might expect for NFS. 

Again, true, though NFS has to walk through transport protocol layers as 
well as NFS application layers.  This additional effort reduces 
performance considerably.

Add to this that you need (sadly) a copy of a buffer between the network 
stack and the disk stack.  RDMA eliminates one of these copies, but as 
far as I know, it doesn't talk directly to the disks (you can do 
something like this with SCST in the block modes if you don't mind iSCSI).

> Registration/deregistration issues are also well-known to the MPI
> developers, and all this is certainly not that easy to manage in
> other areas.
> 
> Still, NFS-RDMA remains NFS. If the bottleneck is not in the
> transport, nothing will be improved by RDMA from the performance
> point of view. Even worse, what I saw with the 2.6.27 kernel +
> OFED1.4-rc3 is the inability of NFS-RDMA to match the performance of
> NFS-TCP for some patterns of IOzone, with a filesystem able to

Hmmm.... Most of the (default) IOzone measurements we have done (and 
seen published) are bound almost entirely by system ram cache.  Indeed, 
we have had to go into the code and alter some of the constants to allow 
us to test greater than 16 MB records, and greater than 16 GB files. 
Otherwise all we measure is cache speed.

Could you elaborate on system parameters, and what measurements weren't 
up to par, as well as what options you used?

We see NFSoverRDMA on SDR achieving about 400 MB/s while NFS over IPoIB 
on the same hardware (identical actually) is about 200 MB/s on reads. 
With DDR IB, we ran a test between a pair of our JackRabbit machines, 
and found a sustained ~500-550 MB/s read, and about 400 MB/s or so 
write.  The underlying file system could handle well over 1 GB/s.

NFS over IPoIB wasn't close.

> sustain itself several hundreds of MB/s (using exactly the same
> hardware and software in both cases). We are far from a pure IB
> bandwidth issue here, we are just facing an issue in how the requests
> are handled probably, perhaps when paging occurs, I can't tell. I

I don't think this is the limitation.  I think it is more along the 
lines of copying buffers between different stacks ... kernel buffer to 
user space program and then back to kernel for net->ram->disk and 
vice-versa.

There are other issues as well which could be causing performance 
degradation, specifically on payload size.

FWIW:  This is a 2.6.27.5 kernel.

> could not find any tuning to solve the more obvious problem, i.e. the
> low bandwidth for reading, except mounting with '-o rsize=4096';
> probably not what people expect, as this will have other effects.
> Anyway this does improve only the sequential read bandwidth. But of
> course I will repeat my tests with the latest release of everything
> when I have time, still making sure I compare apples to apples... 
> Again, I'm sure improvements are on their way !
> 
> Fred.
> 
> 
> -----Original Message----- From: Talpey, Thomas
> [mailto:Thomas.Talpey at netapp.com] Sent: Tuesday, 11 November, 2008
> 17:02 To: Ciesielski, Frederic (EMEA HPC&OSLO CC) Cc: Jeff Becker;
> general at lists.openfabrics.org Subject: RE: [ofa-general] NFS-RDMA
> (OFED1.4) with standard distributions ?
> 
> At 11:27 AM 11/10/2008, Ciesielski, Frederic (EMEA HPC&OSLO CC)
> wrote:
>> That's great, thanks.
>> 
>> I ran some tests with the 2.6.27 kernel as server and client, and 
>> basically it works fine.
>> 
>> I could not find yet any situation where NFS-RDMA would outperform 
>> NFS/IPoIB, at least when you compare apples to apples (same
>> clients, same server, same protocol, and not just write to/read
>> from the caches), and it even seems to have severe performance
>> issues for reading with files larger than the memory size of the
>> client and the server. Hopefully this will improve when more users
>> will be able to give valuable feedback...
> 
> I have a couple of questions, and perhaps suggestions as well. First
> the questions...
> 
> - Have you tried with a 2.6.28-rc4 client and server at all? There
> are a number of significant NFS/RDMA improvements queued in
> kernel.org, especially around RDMA memory registration as well as
> RDMA operation scheduling. We've seen some significant throughput
> improvement even for basic tunings.
> 
> - What type of storage are you using at the server, and have you
> attempted to tune the server at all? For example, if you are storage 
> (spindle) limited, no network tuning is likely to help and you should
> address that first. Also, there are tunings such as nfsd thread
> count, export options, and adapter choice that can make a large
> difference.
> 
> Bottom line, you should be able to reach multi-hundred-MB/sec of
> read/write throughput with NFS/RDMA, but there may be issues on
> specific systems, or perhaps with the OFED1.4 code, that need to be
> accounted for. If possible, you may want to set expectations based on
> mainline, then try to duplicate them in the OFED backport. The
> current OFED NFS/RDMA support is still evolving, while we consider
> the mainline kernel.org version to be rather solid.
> 
> Tom.
> 
>> Fred.
>> 
>> -----Original Message----- From: Jeff Becker
>> [mailto:Jeffrey.C.Becker at nasa.gov] Sent: Saturday, 08 November,
>> 2008 22:35 To: Ciesielski, Frederic (EMEA HPC&OSLO CC) Cc:
>> general at lists.openfabrics.org Subject: Re: [ofa-general] NFS-RDMA
>> (OFED1.4) with standard distributions ?
>> 
>> Ciesielski, Frederic (EMEA HPC&OSLO CC) wrote:
>>> Is there any chance that the new NFS-RDMA features coming with
>>> OFED 1.4 work with standard and current distributions, like
>>> RHEL5, SLES10 ?
>> Not yet, but I'm working on it. I intend for NFSRDMA to work on
>> 2.6.27 and 2.6.26 for OFED 1.4. The RHEL5 and SLES10 backports will
>> likely be done for OFED 1.4.1. Thanks.
>> 
>> -jeff
>> 
>>> Did anybody test this, or would pretend it is supposed to work ?
>>> 
>>> I mean without building a 2.6.27 or equivalent kernel on top of
>>> it, keeping almost full support from the vendors.
>>> 
>>> Enhanced kernel modules may not be sufficient to work around the 
>>> limitations of old kernels...
>>> 
>>> 
>>> 
> 
> _______________________________________________ general mailing list 
> general at lists.openfabrics.org 
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


From sashak at voltaire.com  Tue Nov 11 12:26:48 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 11 Nov 2008 22:26:48 +0200
Subject: [ofa-general] Re: [opensm patch][2/2] verify config inputs
	when config file is rescanned
In-Reply-To: <1226353273.13603.39.camel@cardanus.llnl.gov>
References: <1225404081.1197.534.camel@cardanus.llnl.gov>
	<20081110210233.GE3467@sashak.voltaire.com>
	<1226351730.13603.27.camel@cardanus.llnl.gov>
	<1226353273.13603.39.camel@cardanus.llnl.gov>
Message-ID: <20081111202648.GB8894@sashak.voltaire.com>

On 13:41 Mon 10 Nov     , Al Chu wrote:
> Hey Sasha,
> 
> Sorry, repost, w/ the right Author.
> 
> Al
> 
> On Mon, 2008-11-10 at 13:15 -0800, Al Chu wrote:
> > On Mon, 2008-11-10 at 23:02 +0200, Sasha Khapyorsky wrote:
> > > Hi Al,
> > > 
> > > On 15:01 Thu 30 Oct     , Al Chu wrote:
> > > > Hey Sasha,
> > > > 
> > > > I noticed that after the config file is rescanned, the new potential
> > > > inputs aren't checked for validity.  Patch is attached.
> > > > 
> > > > Al
> > > > 
> > > > -- 
> > > > Albert Chu
> > > > chu11 at llnl.gov
> > > > Computer Scientist
> > > > High Performance Systems Division
> > > > Lawrence Livermore National Laboratory
> > > 
> > > > From edfcd2de96c3525d1609b4c0f03c17ecc0495c18 Mon Sep 17 00:00:00 2001
> > > > From: root <root at wopri.(none)>
> > > > Date: Thu, 30 Oct 2008 13:58:55 -0700
> > > > Subject: [PATCH] verify rescanned config input
> > > > 
> > > > 
> > > > Signed-off-by: root <root at wopri.(none)>
> > >                  ^^^^^^^^^^^^^^^^^^^^^^^^
> > > 
> > > I'm fine with this patch, but could you fix S-O-B line? Thanks.
> > 
> > Oops.  New one is attached (I'll repost the [1/2] patch too).
> > 
> > Al
> > 
> > > Sasha
> -- 
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory

> From a9f7ea0b667ff32a029593e954286c349fe499e7 Mon Sep 17 00:00:00 2001
> From: Albert Chu <chu11 at llnl.gov>
> Date: Mon, 10 Nov 2008 13:10:25 -0800
> Subject: [PATCH] verify rescanned config input
> 
> 
> Signed-off-by: Albert Chu <chu11 at llnl.gov>

Applied. Thanks.

Sasha


From rpearson at systemfabricworks.com  Tue Nov 11 12:33:52 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Tue, 11 Nov 2008 14:33:52 -0600
Subject: [ofa-general] [PATCH][9] opensm: lash preparation
Message-ID: <008701c9443c$cfc1f050$6f45d0f0$@com>

Sasha,

Here is the ninth patch implementing the mesh analysis algorithm.

This patch makes some minor cleanups in osm_ucast_lash.c in preparation for
next steps.
The main change is to minimize the occurrences of phys_connections.
Also there are a few nits:
      - delete banner for local variables that moved to ...lash.h
      - fix bad return value when osm_mesh_node_create fails
      - clear sw->p_sw->priv on switch cleanup
      - fix spelling error in comment
      - discover_network_properties returns an error which was not checked
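
The reason for routing every neighbour lookup through one accessor can be
sketched outside of OpenSM as follows. The sketch_* types and names are
reduced stand-ins made up for this illustration, not the real lash_t and
switch_t; only the shape of the accessor mirrors get_next_switch() in the
patch.

```c
#include <assert.h>

/* Reduced stand-ins for the LASH structures; only the fields used here. */
typedef struct {
	int num_connections;
	int *phys_connections;	/* link index -> neighbour switch id */
} sketch_switch_t;

typedef struct {
	int num_switches;
	sketch_switch_t **switches;
} sketch_lash_t;

/*
 * Single point of access to the connection table. If the table later
 * moves to another structure, only this function has to change.
 */
static inline int sketch_next_switch(const sketch_lash_t *lash, int sw,
				     int link)
{
	assert(sw >= 0 && sw < lash->num_switches);
	assert(link >= 0 && link < lash->switches[sw]->num_connections);
	return lash->switches[sw]->phys_connections[link];
}

/* Example caller: follow the same link index for a number of hops. */
static int sketch_walk(const sketch_lash_t *lash, int sw, int link, int hops)
{
	while (hops-- > 0)
		sw = sketch_next_switch(lash, sw, link);
	return sw;
}
```

Callers such as sketch_walk() never touch phys_connections directly, so a
change of representation stays local to the accessor.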

Regards,

Bob Pearson

Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
----
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index b9394af..95dbcc2 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -55,10 +55,6 @@
 #include <opensm/osm_mesh.h>
 #include <opensm/osm_ucast_lash.h>
 
-/* //////////////////////////// */
-/*  Local types                 */
-/* //////////////////////////// */
-
 static cdg_vertex_t *create_cdg_vertex(unsigned num_switches)
 {
 	cdg_vertex_t *cdg_vertex = (cdg_vertex_t *) malloc(sizeof(cdg_vertex_t));
@@ -150,6 +146,11 @@ static int cycle_exists(cdg_vertex_t * start, cdg_vertex_t * current,
 	return cycle_found;
 }
 
+static inline int get_next_switch(lash_t *p_lash, int sw, int link)
+{
+	return p_lash->switches[sw]->phys_connections[link];
+}
+
 static void remove_semipermanent_depend_for_sp(lash_t * p_lash, int sw,
 					       int dest_switch, int lane)
 {
@@ -161,7 +162,7 @@ static void remove_semipermanent_depend_for_sp(lash_t * p_lash, int sw,
 	int found;
 
 	output_link = switches[sw]->routing_table[dest_switch].out_link;
-	i_next_switch = switches[sw]->phys_connections[output_link];
+	i_next_switch = get_next_switch(p_lash, sw, output_link);
 
 	while (sw != dest_switch) {
 		v = cdg_vertex_matrix[lane][sw][i_next_switch];
@@ -177,8 +178,7 @@ static void remove_semipermanent_depend_for_sp(lash_t * p_lash, int sw,
 			if (i_next_switch != dest_switch) {
 				next_link =
 				    switches[i_next_switch]->routing_table[dest_switch].out_link;
-				i_next_next_switch =
-				    switches[i_next_switch]->phys_connections[next_link];
+				i_next_next_switch = get_next_switch(p_lash, i_next_switch, next_link);
 				found = 0;
 
 				for (i = 0; i < v->num_dependencies; i++)
@@ -211,8 +211,7 @@ static void remove_semipermanent_depend_for_sp(lash_t * p_lash, int sw,
 		output_link = switches[sw]->routing_table[dest_switch].out_link;
 
 		if (sw != dest_switch)
-			i_next_switch =
-			    switches[sw]->phys_connections[output_link];
+			i_next_switch = get_next_switch(p_lash, sw, output_link);
 	}
 }
 
@@ -312,7 +311,7 @@ static void generate_cdg_for_sp(lash_t * p_lash, int sw, int dest_switch,
 	cdg_vertex_t *v, *prev = NULL;
 
 	output_link = switches[sw]->routing_table[dest_switch].out_link;
-	next_switch = switches[sw]->phys_connections[output_link];
+	next_switch = get_next_switch(p_lash, sw, output_link);
 
 	while (sw != dest_switch) {
 
@@ -368,7 +367,7 @@ static void generate_cdg_for_sp(lash_t * p_lash, int sw, int dest_switch,
 
 		if (sw != dest_switch) {
 			CL_ASSERT(output_link != NONE);
-			next_switch = switches[sw]->phys_connections[output_link];
+			next_switch = get_next_switch(p_lash, sw, output_link);
 		}
 
 		prev = v;
@@ -384,7 +383,7 @@ static void set_temp_depend_to_permanent_for_sp(lash_t * p_lash, int sw,
 	cdg_vertex_t *v;
 
 	output_link = switches[sw]->routing_table[dest_switch].out_link;
-	next_switch = switches[sw]->phys_connections[output_link];
+	next_switch = get_next_switch(p_lash, sw, output_link);
 
 	while (sw != dest_switch) {
 		v = cdg_vertex_matrix[lane][sw][next_switch];
@@ -399,8 +398,7 @@ static void set_temp_depend_to_permanent_for_sp(lash_t * p_lash, int sw,
 		output_link = switches[sw]->routing_table[dest_switch].out_link;
 
 		if (sw != dest_switch)
-			next_switch =
-			    switches[sw]->phys_connections[output_link];
+			next_switch = get_next_switch(p_lash, sw, output_link);
 	}
 
 }
@@ -414,7 +412,7 @@ static void remove_temp_depend_for_sp(lash_t * p_lash, int sw, int dest_switch,
 	cdg_vertex_t *v;
 
 	output_link = switches[sw]->routing_table[dest_switch].out_link;
-	next_switch = switches[sw]->phys_connections[output_link];
+	next_switch = get_next_switch(p_lash, sw, output_link);
 
 	while (sw != dest_switch) {
 		v = cdg_vertex_matrix[lane][sw][next_switch];
@@ -439,8 +437,7 @@ static void remove_temp_depend_for_sp(lash_t * p_lash, int sw, int dest_switch,
 		output_link = switches[sw]->routing_table[dest_switch].out_link;
 
 		if (sw != dest_switch)
-			next_switch =
-			    switches[sw]->phys_connections[output_link];
+			next_switch = get_next_switch(p_lash, sw, output_link);
 
 	}
 }
@@ -502,10 +499,10 @@ static void balance_virtual_lanes(lash_t * p_lash, unsigned lanes_needed)
 		generate_cdg_for_sp(p_lash, dest, src, min_filled_lane);
 
 		output_link = p_lash->switches[src]->routing_table[dest].out_link;
-		next_switch = p_lash->switches[src]->phys_connections[output_link];
+		next_switch = get_next_switch(p_lash, src, output_link);
 
 		output_link2 = p_lash->switches[dest]->routing_table[src].out_link;
-		next_switch2 = p_lash->switches[dest]->phys_connections[output_link2];
+		next_switch2 = get_next_switch(p_lash, dest, output_link2);
 
 		CL_ASSERT(cdg_vertex_matrix[min_filled_lane][src][next_switch] != NULL);
 		CL_ASSERT(cdg_vertex_matrix[min_filled_lane][dest][next_switch2] != NULL);
@@ -652,7 +649,7 @@ static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw
 	}
 
 	if (osm_mesh_node_create(p_lash, sw))
-		return -1;
+		return NULL;
 
 	sw->p_sw = p_sw;
 	if (p_sw)
@@ -673,6 +670,8 @@ static void switch_delete(switch_t * sw)
 		free(sw->phys_connections);
 	if (sw->routing_table)
 		free(sw->routing_table);
+	if (sw->p_sw)
+		sw->p_sw->priv = NULL;
 	free(sw);
 }
 
@@ -875,9 +874,8 @@ static int lash_core(lash_t * p_lash)
 					output_link2 =
 					    switches[dest_switch]->routing_table[i].out_link;
 
-					i_next_switch = switches[i]->phys_connections[output_link];
-					i_next_switch2 =
-					    switches[dest_switch]->phys_connections[output_link2];
+					i_next_switch = get_next_switch(p_lash, i, output_link);
+					i_next_switch2 = get_next_switch(p_lash, dest_switch, output_link2);
 
 					CL_ASSERT(p_lash->
 
cdg_vertex_matrix[v_lane][i][i_next_switch] !=
@@ -1205,7 +1203,7 @@ static void process_switches(lash_t * p_lash)
 	osm_switch_t *p_sw, *p_next_sw;
 	osm_subn_t *p_subn = &p_lash->p_osm->subn;
 
-	/* Go through each swithc and process it. i.e build the connection
+	/* Go through each switch and process it. i.e build the connection
 	   structure required by LASH */
 	p_next_sw = (osm_switch_t *) cl_qmap_head(&p_subn->sw_guid_tbl);
 	while (p_next_sw != (osm_switch_t *) cl_qmap_end(&p_subn->sw_guid_tbl)) {
@@ -1229,7 +1227,9 @@ static int lash_process(void *context)
 	// everything starts here
 	lash_cleanup(p_lash);
 
-	discover_network_properties(p_lash);
+	return_status = discover_network_properties(p_lash);
+	if (return_status != IB_SUCCESS)
+		goto Exit;
 
 	return_status = init_lash_structures(p_lash);
 	if (return_status != IB_SUCCESS)


From sashak at voltaire.com  Tue Nov 11 12:42:14 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 11 Nov 2008 22:42:14 +0200
Subject: [ofa-general] Re: [opensm patch] support dump_conf command in
	opensm console
In-Reply-To: <1226353351.13603.42.camel@cardanus.llnl.gov>
References: <1225759191.7307.9.camel@cardanus.llnl.gov>
	<20081109172518.GG30588@sashak.voltaire.com>
	<1226338962.13603.21.camel@cardanus.llnl.gov>
	<1226351033.13603.23.camel@cardanus.llnl.gov>
	<1226353351.13603.42.camel@cardanus.llnl.gov>
Message-ID: <20081111204214.GC8894@sashak.voltaire.com>

On 13:42 Mon 10 Nov     , Al Chu wrote:
> Hey Sasha,
> 
> Sorry.  Repost patch w/ the right Author.
> 
> Al
> 
> On Mon, 2008-11-10 at 13:03 -0800, Al Chu wrote:
> > Hey Sasha,
> > 
> > Attached is the re-worked patch.  Assumes changes from my "fix qos
> > config parsing bugs" patch are accepted.
> > 
> > Al
> > 
> > On Mon, 2008-11-10 at 09:42 -0800, Al Chu wrote:
> > > Hey Sasha,
> > > 
> > > On Sun, 2008-11-09 at 19:25 +0200, Sasha Khapyorsky wrote:
> > > > Hi Al,
> > > > 
> > > > On 16:39 Mon 03 Nov     , Al Chu wrote:
> > > > > Hey Sasha,
> > > > > 
> > > > > When config files are rescanned and loaded, there's no way to know if
> > > > > the right configuration was actually reloaded or not.  A console command
> > > > > to dump the current config is a useful way to verify the loading of new
> > > > > configs or not.
> > > > > 
> > > > > This patch assumes the fixes from my "fix qos config parsing bugs" is
> > > > > accepted.
> > > > 
> > > > Didn't pass over it, sorry about delay.
> > > > 
> > > > > 
> > > > > Al
> > > > > 
> > > > > -- 
> > > > > Albert Chu
> > > > > chu11 at llnl.gov
> > > > > Computer Scientist
> > > > > High Performance Systems Division
> > > > > Lawrence Livermore National Laboratory
> > > > 
> > > > > From 249607e47ec7ef1b92f9578cece90460418d12b8 Mon Sep 17 00:00:00 2001
> > > > > From: Albert Chu <chu11 at llnl.gov>
> > > > > Date: Mon, 3 Nov 2008 16:22:29 -0800
> > > > > Subject: [PATCH] support dump_conf console command
> > > > > 
> > > > > 
> > > > > Signed-off-by: Albert Chu <chu11 at llnl.gov>

Rebased against master and applied. Thanks.

Sasha


From rpearson at systemfabricworks.com  Tue Nov 11 13:32:56 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Tue, 11 Nov 2008 15:32:56 -0600
Subject: [ofa-general] [PATCH][10] opensm: hook mesh code into lash code
Message-ID: <009d01c94445$10387f20$30a97d60$@com>

Sasha,

Here is the tenth patch implementing the mesh analysis algorithm.

This patch
      - hooks mesh code into lash
      - replaces sw->phys_connections by the equivalent switch->node->links
      - replaces sw->num_connections by the equivalent
        switch->node->num_links
      - replaces sw->virtual_physical_port_table by
        switch->node->links[]->ports

When the do_mesh_analysis flag is not set there is no change to the function
except to replace the variables with variables in node that have the same
size. In this case the port table in link_t will always have just one port.

When the do_mesh_analysis flag is set multiple physical links will collapse
to a single logical link with a port list with more than one element.

      - fixed bug: mesh was not set in osm_do_mesh_analysis
      - rewrote connect_switches to use variables in node
      - in the log message "Lane requirements (%d) exceed available lanes (%d)"
        the arguments were reversed; fixed
      - compute the physical egress port in routine get_next_port,
        which will use round robin if there is more than one
        physical link between switches
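
To make the collapse and the round-robin selection concrete, here is a
rough standalone sketch. It is not the code in this patch: the sketch_*
names, the fixed-size arrays, the SKETCH_MAX bound and the next_port
cursor field are all assumptions made for the illustration.

```c
#include <assert.h>

#define SKETCH_MAX 8	/* arbitrary bound for this sketch */

typedef struct {
	int switch_id;		/* neighbour switch */
	int num_ports;
	int next_port;		/* round-robin cursor (assumed field) */
	int ports[SKETCH_MAX];	/* physical ports behind this logical link */
} sketch_link_t;

typedef struct {
	int num_links;
	sketch_link_t links[SKETCH_MAX];
} sketch_node_t;

/*
 * Record a physical connection to `neighbour` on `phy_port`.
 * With collapse != 0, further connections to the same neighbour are
 * appended to the existing logical link instead of creating a new one.
 * Returns the logical link index, or -1 if the tables are full.
 */
static int sketch_connect(sketch_node_t *node, int neighbour, int phy_port,
			  int collapse)
{
	sketch_link_t *l;
	int i;

	if (collapse) {
		for (i = 0; i < node->num_links; i++) {
			l = &node->links[i];
			if (l->switch_id == neighbour) {
				if (l->num_ports >= SKETCH_MAX)
					return -1;
				l->ports[l->num_ports++] = phy_port;
				return i;
			}
		}
	}

	if (node->num_links >= SKETCH_MAX)
		return -1;

	l = &node->links[node->num_links];
	l->switch_id = neighbour;
	l->num_ports = 0;
	l->next_port = 0;
	l->ports[l->num_ports++] = phy_port;
	return node->num_links++;
}

/* Pick the egress port for a logical link, cycling through its ports. */
static int sketch_next_port(sketch_link_t *l)
{
	int port;

	assert(l->num_ports > 0);
	port = l->ports[l->next_port];
	l->next_port = (l->next_port + 1) % l->num_ports;
	return port;
}
```

With collapse set to 0 every physical connection gets its own link with a
single port, which corresponds to the case where the do_mesh_analysis flag
is not set.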

Regards,

Bob Pearson

Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
----
diff --git a/opensm/include/opensm/osm_ucast_lash.h b/opensm/include/opensm/osm_ucast_lash.h
index c037571..f3bde5d 100644
--- a/opensm/include/opensm/osm_ucast_lash.h
+++ b/opensm/include/opensm/osm_ucast_lash.h
@@ -82,9 +82,6 @@ typedef struct _switch {
 		unsigned lane;
 	} *routing_table;
 	mesh_node_t *node;
-	unsigned int num_connections;
-	int *virtual_physical_port_table;
-	int *phys_connections;
 } switch_t;
 
 typedef struct _lash {
diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
index a248522..fea9237 100644
--- a/opensm/opensm/osm_mesh.c
+++ b/opensm/opensm/osm_mesh.c
@@ -1080,6 +1080,8 @@ int osm_do_mesh_analysis(lash_t *p_lash)
 		return -1;
 	}
 
+	mesh = p_lash->mesh;
+
 	/*
 	 * get local metric and invariant for each switch
 	 * also classify each switch
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index 95dbcc2..34a4a62 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -67,16 +67,53 @@ static cdg_vertex_t *create_cdg_vertex(unsigned num_switches)
 static void connect_switches(lash_t * p_lash, int sw1, int sw2, int phy_port_1)
 {
 	osm_log_t *p_log = &p_lash->p_osm->log;
-	unsigned num = p_lash->switches[sw1]->num_connections;
+	unsigned num = p_lash->switches[sw1]->node->num_links;
+	switch_t *s1 = p_lash->switches[sw1];
+	mesh_node_t *node = s1->node;
+	switch_t *s2;
+	link_t *l;
+	int i;
+
+	/*
+	 * if doing mesh analysis:
+	 *  - do not consider connections to self
+	 *  - collapse multiple connections between
+	 *    pair of switches to a single logical link
+	 */
+	if (p_lash->p_osm->subn.opt.do_mesh_analysis) {
+		if (sw1 == sw2)
+			return;
+
+		/* see if we are already linked to sw2 */
+		for (i = 0; i < num; i++) {
+			l = node->links[i];
+
+			if (node->links[i]->switch_id == sw2) {
+				l->ports[l->num_ports++] = phy_port_1;
+				return;
+			}
+		}
+	}
 
-	p_lash->switches[sw1]->phys_connections[num] = sw2;
-	p_lash->switches[sw1]->virtual_physical_port_table[num] = phy_port_1;
-	p_lash->switches[sw1]->num_connections++;
+	l = node->links[num];
+	l->switch_id = sw2;
+	l->link_id = -1;
+	l->ports[l->num_ports++] = phy_port_1;
+
+	s2 = p_lash->switches[sw2];
+	for (i = 0; i < s2->node->num_links; i++) {
+		if (s2->node->links[i]->switch_id == sw1) {
+			s2->node->links[i]->link_id = num;
+			l->link_id = i;
+			break;
+		}
+	}
+
+	node->num_links++;
 
 	OSM_LOG(p_log, OSM_LOG_VERBOSE,
 		"LASH connect: %d, %d, %d\n", sw1, sw2,
 		phy_port_1);
-
 }
 
 static osm_switch_t *get_osm_switch_from_port(osm_port_t * port)
@@ -148,7 +185,7 @@ static int cycle_exists(cdg_vertex_t * start, cdg_vertex_t * current,
 
 static inline int get_next_switch(lash_t *p_lash, int sw, int link)
 {
-	return p_lash->switches[sw]->phys_connections[link];
+	return p_lash->switches[sw]->node->links[link]->switch_id;
 }
 
 static void remove_semipermanent_depend_for_sp(lash_t * p_lash, int sw,
@@ -233,8 +270,8 @@ static int get_phys_connection(switch_t *sw, int switch_to)
 {
 	unsigned int i = 0;
 
-	for (i = 0; i < sw->num_connections; i++)
-		if (sw->phys_connections[i] == switch_to)
+	for (i = 0; i < sw->node->num_links; i++)
+		if (sw->node->links[i]->switch_id == switch_to)
 			return i;
 	return i;
 }
@@ -252,8 +289,8 @@ static void shortest_path(lash_t * p_lash, int ir)
 
 	while (!cl_is_list_empty(&bfsq)) {
 		dequeue(&bfsq, &sw);
-		for (i = 0; i < sw->num_connections; i++) {
-			swi = switches[sw->phys_connections[i]];
+		for (i = 0; i < sw->node->num_links; i++) {
+			swi = switches[sw->node->links[i]->switch_id];
 			if (swi->q_state == UNQUEUED) {
 				enqueue(&bfsq, swi);
 				sw->dij_channels[sw->used_channels++] = swi->id;
@@ -614,25 +651,8 @@ static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw
 		return NULL;
 	}
 
-	sw->virtual_physical_port_table = malloc(num_ports * sizeof(int));
-	if (!sw->virtual_physical_port_table) {
-		free(sw->dij_channels);
-		free(sw);
-		return NULL;
-	}
-
-	sw->phys_connections = malloc(num_ports * sizeof(int));
-	if (!sw->phys_connections) {
-		free(sw->virtual_physical_port_table);
-		free(sw->dij_channels);
-		free(sw);
-		return NULL;
-	}
-
 	sw->routing_table = malloc(num_switches * sizeof(sw->routing_table[0]));
 	if (!sw->routing_table) {
-		free(sw->phys_connections);
-		free(sw->virtual_physical_port_table);
 		free(sw->dij_channels);
 		free(sw);
 		return NULL;
@@ -643,11 +663,6 @@ static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw
 		sw->routing_table[i].lane = NONE;
 	}
 
-	for (i = 0; i < num_ports; i++) {
-		sw->virtual_physical_port_table[i] = -1;
-		sw->phys_connections[i] = NONE;
-	}
-
 	if (osm_mesh_node_create(p_lash, sw))
 		return NULL;
 
@@ -664,10 +679,6 @@ static void switch_delete(switch_t * sw)
 
 	if (sw->dij_channels)
 		free(sw->dij_channels);
-	if (sw->virtual_physical_port_table)
-		free(sw->virtual_physical_port_table);
-	if (sw->phys_connections)
-		free(sw->phys_connections);
 	if (sw->routing_table)
 		free(sw->routing_table);
 	if (sw->p_sw)
@@ -972,7 +983,7 @@ Error_Not_Enough_Lanes:
 	status = -1;
 	OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D02: "
 		"Lane requirements (%d) exceed available lanes (%d)\n",
-		p_lash->vl_min, lanes_needed);
+		lanes_needed, p_lash->vl_min);
 Exit:
 	if (switch_bitmap)
 		free(switch_bitmap);
@@ -985,6 +996,21 @@ static unsigned get_lash_id(osm_switch_t * p_sw)
 	return ((switch_t *) p_sw->priv)->id;
 }
 
+int get_next_port(switch_t *sw, int link)
+{
+	link_t *l = sw->node->links[link];
+	int port = l->next_port++;
+
+	/*
+	 * note if not doing mesh analysis
+	 * then num_ports is always 1
+	 */
+	if (l->next_port >= l->num_ports)
+		l->next_port = 0;
+
+	return l->ports[port];
+}
+
 static void populate_fwd_tbls(lash_t * p_lash)
 {
 	osm_log_t *p_log = &p_lash->p_osm->log;
@@ -1036,9 +1062,7 @@ static void populate_fwd_tbls(lash_t * p_lash)
 				    (uint8_t) sw->
 				    routing_table[dst_lash_switch_id].out_link;
 				uint8_t physical_egress_port =
-				    (uint8_t) sw->
-				    virtual_physical_port_table
-				    [lash_egress_port];
+					get_next_port(sw, lash_egress_port);
 
 				p_sw->lft_buf[lid] = physical_egress_port;
 				OSM_LOG(p_log, OSM_LOG_VERBOSE,


From rpearson at systemfabricworks.com  Tue Nov 11 14:41:04 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Tue, 11 Nov 2008 16:41:04 -0600
Subject: [ofa-general] [PATCH][10] opensm: hook mesh code into lash (updated)
Message-ID: <00ad01c9444e$96e5f300$c4b1d900$@com>

Sasha,

Here is the tenth patch implementing the mesh analysis algorithm.
I am resending it because I inadvertently left a bug in the last version.

This patch
      - hooks mesh code into lash
      - replaces sw->phys_connections by the equivalent switch->node->links
      - replaces sw->num_connections by the equivalent
        switch->node->num_links
      - replaces sw->virtual_physical_port_table by
        switch->node->links[]->ports

When the do_mesh_analysis flag is not set, there is no functional change
except that the variables are replaced with variables in node that have the
same size. In this case the port table in link_t will always have just one port.

When the do_mesh_analysis flag is set, multiple physical links between a pair
of switches collapse to a single logical link whose port list has more than
one element.

      - fixed bug, mesh not set in osm_do_mesh_analysis
      - rewrote connect_switches to use variables in node
      - in the log message "Lane requirements (%d) exceed available lanes (%d)"
        the arguments were reversed; fixed
      - compute the physical egress port in routine get_next_port,
        which uses round robin if there is more than one
        physical link between switches
      - changed printf's to OSM_LOG's in mesh.c
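
To illustrate the round robin choice, here is a minimal standalone sketch.
It is not the code in the patch: struct sketch_link, SKETCH_MAX_PORTS and both
helper functions are hypothetical names, and only the fields num_ports,
next_port and ports mirror link_t.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_PORTS 8	/* hypothetical cap, not taken from the patch */

/* Simplified stand-in for link_t: one logical link to a neighbor switch. */
struct sketch_link {
	int switch_id;			/* neighbor switch index */
	int num_ports;			/* physical ports collapsed into this link */
	int next_port;			/* round robin cursor into ports[] */
	int ports[SKETCH_MAX_PORTS];	/* physical port numbers */
};

/* Record one more physical port on an existing logical link. */
static int sketch_add_port(struct sketch_link *l, int phys_port)
{
	assert(l != NULL);
	if (l->num_ports >= SKETCH_MAX_PORTS)
		return -1;
	l->ports[l->num_ports++] = phys_port;
	return 0;
}

/* Return the next physical port, cycling through ports[] in order. */
static int sketch_next_port(struct sketch_link *l)
{
	int port;

	assert(l != NULL);
	if (l->num_ports == 0)
		return -1;	/* nothing recorded yet */

	port = l->ports[l->next_port];
	l->next_port = (l->next_port + 1) % l->num_ports;
	return port;
}
```

With ports 3 and 7 recorded on one logical link, successive calls to
sketch_next_port() return 3, 7, 3, and so on. With a single port (the case
when do_mesh_analysis is off) every call returns that port.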

Regards,

Bob Pearson

Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
----
diff --git a/opensm/include/opensm/osm_ucast_lash.h b/opensm/include/opensm/osm_ucast_lash.h
index c037571..f3bde5d 100644
--- a/opensm/include/opensm/osm_ucast_lash.h
+++ b/opensm/include/opensm/osm_ucast_lash.h
@@ -82,9 +82,6 @@ typedef struct _switch {
 		unsigned lane;
 	} *routing_table;
 	mesh_node_t *node;
-	unsigned int num_connections;
-	int *virtual_physical_port_table;
-	int *phys_connections;
 } switch_t;
 
 typedef struct _lash {
diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
index a248522..dbe3eeb 100644
--- a/opensm/opensm/osm_mesh.c
+++ b/opensm/opensm/osm_mesh.c
@@ -750,7 +750,7 @@ static void make_geometry(lash_t *p_lash, int sw)
 					continue;
 
 				if (l2 == -1) {
-					printf("ERROR no reverse link\n");
+					OSM_LOG(p_log, OSM_LOG_DEBUG, "ERROR no reverse link\n");
 					continue;
 				}
 
@@ -919,6 +919,7 @@ static int reorder_links(lash_t *p_lash, int sw)
  */
 static int measure_geometry(lash_t *p_lash, int seed)
 {
+	osm_log_t *p_log = &p_lash->p_osm->log;
 	int i, j, k;
 	int sw;
 	switch_t *s, *s1;
@@ -942,7 +943,7 @@ static int measure_geometry(lash_t *p_lash, int seed)
 				assigned_axes++;
 	}
 
-	printf("lash: %d/%d unassigned/assigned axes\n", unassigned_axes, assigned_axes);
+	OSM_LOG(p_log, OSM_LOG_DEBUG, "%d/%d unassigned/assigned axes\n", unassigned_axes, assigned_axes);
 
 	do {
 		change = 0;
@@ -1069,8 +1070,7 @@ int osm_do_mesh_analysis(lash_t *p_lash)
 	int i;
 	mesh_t *mesh;
 	switch_t *s;
-
-	OSM_LOG_ENTER(p_log);
+	char buf[256], *p;
 
 	/*
 	 * allocate per mesh data structures
@@ -1080,6 +1080,8 @@ int osm_do_mesh_analysis(lash_t *p_lash)
 		return -1;
 	}
 
+	mesh = p_lash->mesh;
+
 	/*
 	 * get local metric and invariant for each switch
 	 * also classify each switch
@@ -1099,36 +1101,41 @@ int osm_do_mesh_analysis(lash_t *p_lash)
 
 	s = p_lash->switches[max_class_type];
 
-	printf("lash: found %d node type%s\n", mesh->num_class, (mesh->num_class == 1)? "" : "s");
-	printf("lash: %snode type is ", (mesh->num_class == 1)? "" : "most common ");
+	OSM_LOG(p_log, OSM_LOG_INFO, "found %d node type%s\n", mesh->num_class, (mesh->num_class == 1)? "" : "s");
+
+	p = buf;
+	p += sprintf( p, "%snode type is ", (mesh->num_class == 1)? "" : "most common ");
 
 	if (s->node->type) {
 		struct _mesh_info *t = &mesh_info[s->node->type];
 
 		for (i = 0; i < t->dimension; i++) {
-			printf("%s%d%s", i? "X" : "", t->size[i],
+			p += sprintf(p, "%s%d%s", i? " x " : "", t->size[i],
 				(t->size[i] == 6)? "+" : "");
 		}
-		printf(" mesh\n");
+		p += sprintf(p, " mesh\n");
 
 		p_lash->mesh->dimension = t->dimension;
 	} else {
-		printf("unknown geometry\n");
+		p += sprintf(p, "unknown geometry\n");
 	}
 
+	OSM_LOG(p_log, OSM_LOG_INFO, "%s", buf);
+
 	if (s->node->type) {
 		make_geometry(p_lash, max_class_type);
 
 		if (measure_geometry(p_lash, max_class_type))
 			return -1;
 
-		printf("lash: found ");
+		p = buf;
+		p += sprintf(p, "found ");
 		for (i = 0; i < mesh->dimension; i++)
-			printf("%s%d", i? "X" : "", mesh->size[i]);
-		printf(" mesh\n");
-	}
+			p += sprintf(p, "%s%d", i? " x " : "", mesh->size[i]);
+		p += sprintf(p, " mesh\n");
 
-	OSM_LOG_EXIT(p_log);
+		OSM_LOG(p_log, OSM_LOG_INFO, "%s", buf);
+	}
 
 	return 0;
 }
diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
index 95dbcc2..660ad56 100644
--- a/opensm/opensm/osm_ucast_lash.c
+++ b/opensm/opensm/osm_ucast_lash.c
@@ -67,16 +67,53 @@ static cdg_vertex_t *create_cdg_vertex(unsigned num_switches)
 static void connect_switches(lash_t * p_lash, int sw1, int sw2, int phy_port_1)
 {
 	osm_log_t *p_log = &p_lash->p_osm->log;
-	unsigned num = p_lash->switches[sw1]->num_connections;
+	unsigned num = p_lash->switches[sw1]->node->num_links;
+	switch_t *s1 = p_lash->switches[sw1];
+	mesh_node_t *node = s1->node;
+	switch_t *s2;
+	link_t *l;
+	int i;
+
+	/*
+	 * if doing mesh analysis:
+	 *  - do not consider connections to self
+	 *  - collapse multiple connections between
+	 *    pair of switches to a single logical link
+	 */
+	if (p_lash->p_osm->subn.opt.do_mesh_analysis) {
+		if (sw1 == sw2)
+			return;
+
+		/* see if we are already linked to sw2 */
+		for (i = 0; i < num; i++) {
+			l = node->links[i];
+
+			if (node->links[i]->switch_id == sw2) {
+				l->ports[l->num_ports++] = phy_port_1;
+				return;
+			}
+		}
+	}
+
+	l = node->links[num];
+	l->switch_id = sw2;
+	l->link_id = -1;
+	l->ports[l->num_ports++] = phy_port_1;
+
+	s2 = p_lash->switches[sw2];
+	for (i = 0; i < s2->node->num_links; i++) {
+		if (s2->node->links[i]->switch_id == sw1) {
+			s2->node->links[i]->link_id = num;
+			l->link_id = i;
+			break;
+		}
+	}
 
-	p_lash->switches[sw1]->phys_connections[num] = sw2;
-	p_lash->switches[sw1]->virtual_physical_port_table[num] = phy_port_1;
-	p_lash->switches[sw1]->num_connections++;
+	node->num_links++;
 
 	OSM_LOG(p_log, OSM_LOG_VERBOSE,
 		"LASH connect: %d, %d, %d\n", sw1, sw2,
 		phy_port_1);
-
 }
 
 static osm_switch_t *get_osm_switch_from_port(osm_port_t * port)
@@ -148,7 +185,7 @@ static int cycle_exists(cdg_vertex_t * start, cdg_vertex_t * current,
 
 static inline int get_next_switch(lash_t *p_lash, int sw, int link)
 {
-	return p_lash->switches[sw]->phys_connections[link];
+	return p_lash->switches[sw]->node->links[link]->switch_id;
 }
 
 static void remove_semipermanent_depend_for_sp(lash_t * p_lash, int sw,
@@ -233,8 +270,8 @@ static int get_phys_connection(switch_t *sw, int switch_to)
 {
 	unsigned int i = 0;
 
-	for (i = 0; i < sw->num_connections; i++)
-		if (sw->phys_connections[i] == switch_to)
+	for (i = 0; i < sw->node->num_links; i++)
+		if (sw->node->links[i]->switch_id == switch_to)
 			return i;
 	return i;
 }
@@ -252,8 +289,8 @@ static void shortest_path(lash_t * p_lash, int ir)
 
 	while (!cl_is_list_empty(&bfsq)) {
 		dequeue(&bfsq, &sw);
-		for (i = 0; i < sw->num_connections; i++) {
-			swi = switches[sw->phys_connections[i]];
+		for (i = 0; i < sw->node->num_links; i++) {
+			swi = switches[sw->node->links[i]->switch_id];
 			if (swi->q_state == UNQUEUED) {
 				enqueue(&bfsq, swi);
 				sw->dij_channels[sw->used_channels++] = swi->id;
@@ -614,25 +651,8 @@ static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw
 		return NULL;
 	}
 
-	sw->virtual_physical_port_table = malloc(num_ports * sizeof(int));
-	if (!sw->virtual_physical_port_table) {
-		free(sw->dij_channels);
-		free(sw);
-		return NULL;
-	}
-
-	sw->phys_connections = malloc(num_ports * sizeof(int));
-	if (!sw->phys_connections) {
-		free(sw->virtual_physical_port_table);
-		free(sw->dij_channels);
-		free(sw);
-		return NULL;
-	}
-
 	sw->routing_table = malloc(num_switches * sizeof(sw->routing_table[0]));
 	if (!sw->routing_table) {
-		free(sw->phys_connections);
-		free(sw->virtual_physical_port_table);
 		free(sw->dij_channels);
 		free(sw);
 		return NULL;
@@ -643,18 +663,13 @@ static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw
 		sw->routing_table[i].lane = NONE;
 	}
 
-	for (i = 0; i < num_ports; i++) {
-		sw->virtual_physical_port_table[i] = -1;
-		sw->phys_connections[i] = NONE;
-	}
-
-	if (osm_mesh_node_create(p_lash, sw))
-		return NULL;
-
 	sw->p_sw = p_sw;
 	if (p_sw)
 		p_sw->priv = sw;
 
+	if (osm_mesh_node_create(p_lash, sw))
+		return NULL;
+
 	return sw;
 }
 
@@ -664,10 +679,6 @@ static void switch_delete(switch_t * sw)
 
 	if (sw->dij_channels)
 		free(sw->dij_channels);
-	if (sw->virtual_physical_port_table)
-		free(sw->virtual_physical_port_table);
-	if (sw->phys_connections)
-		free(sw->phys_connections);
 	if (sw->routing_table)
 		free(sw->routing_table);
 	if (sw->p_sw)
@@ -972,7 +983,7 @@ Error_Not_Enough_Lanes:
 	status = -1;
 	OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D02: "
 		"Lane requirements (%d) exceed available lanes (%d)\n",
-		p_lash->vl_min, lanes_needed);
+		lanes_needed, p_lash->vl_min);
 Exit:
 	if (switch_bitmap)
 		free(switch_bitmap);
@@ -985,6 +996,21 @@ static unsigned get_lash_id(osm_switch_t * p_sw)
 	return ((switch_t *) p_sw->priv)->id;
 }
 
+int get_next_port(switch_t *sw, int link)
+{
+	link_t *l = sw->node->links[link];
+	int port = l->next_port++;
+
+	/*
+	 * note if not doing mesh analysis
+	 * then num_ports is always 1
+	 */
+	if (l->next_port >= l->num_ports)
+		l->next_port = 0;
+
+	return l->ports[port];
+}
+
 static void populate_fwd_tbls(lash_t * p_lash)
 {
 	osm_log_t *p_log = &p_lash->p_osm->log;
@@ -1036,9 +1062,7 @@ static void populate_fwd_tbls(lash_t * p_lash)
 				    (uint8_t) sw->
 				    routing_table[dst_lash_switch_id].out_link;
 				uint8_t physical_egress_port =
-				    (uint8_t) sw->
-				    virtual_physical_port_table
-				    [lash_egress_port];
+					get_next_port(sw, lash_egress_port);
 
 				p_sw->lft_buf[lid] = physical_egress_port;
 				OSM_LOG(p_log, OSM_LOG_VERBOSE,


From rpearson at systemfabricworks.com  Tue Nov 11 14:44:08 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Tue, 11 Nov 2008 16:44:08 -0600
Subject: [ofa-general] mesh analysis patch done.
Message-ID: <00ae01c9444f$02702960$07507c20$@com>

Forgot to mention that the 10th patch was the last one.
Take a look when you get a chance.

Regards,

Bob Pearson


From chu11 at llnl.gov  Tue Nov 11 15:57:52 2008
From: chu11 at llnl.gov (Al Chu)
Date: Tue, 11 Nov 2008 15:57:52 -0800
Subject: [ofa-general] Re: [opensm patch][1/2] fix qos config parsing bugs
In-Reply-To: <20081111191958.GA8894@sashak.voltaire.com>
References: <1225404078.1197.533.camel@cardanus.llnl.gov>
	<20081111191958.GA8894@sashak.voltaire.com>
Message-ID: <1226447872.6239.2.camel@cardanus.llnl.gov>

Hey Sasha,

On Tue, 2008-11-11 at 21:19 +0200, Sasha Khapyorsky wrote: 
> Hi Al,
> 
> On 15:01 Thu 30 Oct     , Al Chu wrote:
> > 
> > I found a bunch of qos config parsing issues, listed below:
> > 
> > 1)
> > 
> > If the user sets the qos default fields (i.e. qos_high_limit,
> > qos_vlarb_high, etc.), but does not have the qos_ca, qos_swe, qos_rtr,
> > etc. equivalent fields listed (i.e. qos_ca_high_limit,
> > qos_sw0_vlarb_high), the values set in the qos default fields are not
> > loaded into the CAs, switches, etc.  The reason is in qos_build_config()
> > we load defaults like this:
> > 
> > p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;
> > 
> > but we always set the fields to something non-NULL.
> > 
> > static void subn_set_default_qos_options(IN osm_qos_options_t * opt)
> > {
> >         opt->max_vls = OSM_DEFAULT_QOS_MAX_VLS;
> >         opt->high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT;
> >         opt->vlarb_high = OSM_DEFAULT_QOS_VLARB_HIGH;
> >         opt->vlarb_low = OSM_DEFAULT_QOS_VLARB_LOW;
> >         opt->sl2vl = OSM_DEFAULT_QOS_SL2VL;
> > }
> 
> Yes, we are setting this to the default qos set (if not explicitly
> specified by user). So finally we always have valid set. No?

Sorry, I may not have explained it well. Let's say I do this in the
config file.

qos_vlarb_high FOOBAR
# qos_ca_vlarb_high BLAH
qos_swe_vlarb_high XYZZY

I currently expect qos_ca_vlarb_high to use the value of FOOBAR because
I commented out the field.  But it uses OSM_DEFAULT_QOS_VLARB_HIGH
instead.  The reason is that qos_build_config() checks for NULL to
choose between default and non-default values.

p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;

In the above situation, where I've commented out several fields,
opt->vlarb_high is always non-NULL b/c it was set to
OSM_DEFAULT_QOS_VLARB_HIGH. Thus OSM_DEFAULT_QOS_VLARB_HIGH is used
instead of FOOBAR.

> > 2)
> > 
> > In qos_build_config() we load the high_limit like this:
> > 
> > cfg->vl_high_limit = (uint8_t) opt->high_limit;
> > 
> > So there is no way to tell the qos_ca, qos_swe, qos_rtr, etc. high_limit
> > options to "go back to" the default high_limit.  It just assumes that
> > whatever is input (or was set by default) is what you should use.
> 
> Right. What is a limitation here? That an user cannot set this to
> "no value"? But she/he can just skip it.

Similar to the above issue, let's say I want to do:

qos_high_limit 8
# qos_ca_high_limit 15
# qos_swe_high_limit 15

I want qos_ca_high_limit and qos_swe_high_limit to use whatever I set in
qos_high_limit.  But the code doesn't allow for this.
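
For an integer field the same idea might look like the sketch below. It
assumes -1 as the "not set" marker and a signed field, since 0 is a valid
high_limit. SKETCH_UNSET and sketch_pick_high_limit() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_UNSET (-1)	/* hypothetical marker for "not set by the user" */

/*
 * Use the per-type limit when it was set, else fall back to the
 * configured default. 0 is a legal limit, so it cannot mean "unset".
 */
static uint8_t sketch_pick_high_limit(int opt_limit, int dflt_limit)
{
	assert(dflt_limit >= 0 && dflt_limit <= 255);
	if (opt_limit <= SKETCH_UNSET)
		return (uint8_t) dflt_limit;
	return (uint8_t) opt_limit;
}
```

With qos_high_limit 8 and the per-type lines commented out, the per-type
fields stay at the marker and the sketch returns 8. An explicit per-type 0
is kept as 0.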

> 
> > 3)
> > 
> > Some fields like qos_vlarb_high are assumed to be correctly set and can
> > segfault opensm.
> 
> qos_build_config() assumes that valid parameters are used. And we are
> using it this way (I hope :)) (finally it is not library API).

I think the issue is the osm_subnet.c code did not properly check all
inputs, and subsequently some inputs used in qos_build_config() were
bad.  I think

qos_vlarb_high (null)

was something I tried that opensm seg-faulted on.  
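
A defensive check along these lines might catch that kind of input before it
reaches the parser. This is only a sketch under assumptions:
sketch_vlarb_ok() and sketch_vlarb_or_default() are hypothetical helpers, and
the vl range 0..14 and weight range 0..255 are the ranges that
subn_verify_vlarb() warns about.

```c
#include <assert.h>
#include <stdlib.h>

/* Return 1 if s is a well-formed "vl:weight,vl:weight,..." list, else 0. */
static int sketch_vlarb_ok(const char *s)
{
	char *end;
	long vl, weight;

	if (s == NULL || *s == '\0')
		return 0;

	for (;;) {
		vl = strtol(s, &end, 0);
		if (end == s || *end != ':' || vl < 0 || vl > 14)
			return 0;	/* no digits, no ':' or vl out of range */

		s = end + 1;
		weight = strtol(s, &end, 0);
		if (end == s || weight < 0 || weight > 255)
			return 0;	/* no digits or weight out of range */

		if (*end == '\0')
			return 1;	/* consumed the whole string */
		if (*end != ',')
			return 0;	/* trailing junk after a pair */
		s = end + 1;
	}
}

/* Fall back to a known-good default when the configured string is unusable. */
static const char *sketch_vlarb_or_default(const char *s, const char *dflt)
{
	assert(sketch_vlarb_ok(dflt));
	return sketch_vlarb_ok(s) ? s : dflt;
}
```

Unlike the strtok_r() based verification, this sketch does not copy or modify
the string, and it rejects the input outright rather than only warning.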

> > The attached patch fixes these up.  Obviously there's tons of ways to
> > do this.  I decided to ...
> > 
> > A) only initialize qos_options to the real defaults
> > 
> > B) init all qos_*_options to sentinel values (-1, NULL, etc.) to
> > indicate it should use the configured defaults if they aren't set by the
> > user.  The high_limit was changed from an unsigned to an int b/c 0 is a
> > valid high_limit value.
> > 
> > C) verify that the default qos inputs are definitely correct (i.e. can't
> > be NULL).  Reset to hard coded defaults if need be.
> > 
> > D) load the default vs. non-default appropriately in QoS.
> 
> And I see that we have here much more sometimes not-trivial flows and
> default values are spread over many places... :(

I will admit it's possible that I'm fixing something that shouldn't be
fixed in the code but only in the documentation.  Currently, the
documentation indicates to me the behavior I describe above. Should we
instead tell the user they must set each of the qos_ca*, qos_swe*, etc.
fields respectively and cannot assume the "default" fields can be used
to set those other fields?  Perhaps we should just remove those
"default" fields?

Al

> Sasha
> 
> > 
> > Al
> > 
> > P.S.  This patch does not rely on my previous "remove qos_max_vls
> > config" patch.  I assume we're keeping the max_vls fields in this patch.
> > 
> > -- 
> > Albert Chu
> > chu11 at llnl.gov
> > Computer Scientist
> > High Performance Systems Division
> > Lawrence Livermore National Laboratory
> 
> > From 00a15a1797b79fd5e3298d98742b6da3613fb9c3 Mon Sep 17 00:00:00 2001
> > From: root <root at wopri.(none)>
> > Date: Thu, 30 Oct 2008 09:32:29 -0700
> > Subject: [PATCH] fix qos config parsing bugs
> > 
> > 
> > Signed-off-by: root <root at wopri.(none)>
> > ---
> >  opensm/include/opensm/osm_subnet.h |   12 +-
> >  opensm/opensm/osm_qos.c            |    6 +-
> >  opensm/opensm/osm_subnet.c         |  467 ++++++++++++++++++++++--------------
> >  3 files changed, 293 insertions(+), 192 deletions(-)
> > 
> > diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
> > index 7259587..11063b7 100644
> > --- a/opensm/include/opensm/osm_subnet.h
> > +++ b/opensm/include/opensm/osm_subnet.h
> > @@ -99,7 +99,7 @@ struct osm_qos_policy;
> >  */
> >  typedef struct osm_qos_options {
> >  	unsigned max_vls;
> > -	unsigned high_limit;
> > +	int high_limit;
> >  	char *vlarb_high;
> >  	char *vlarb_low;
> >  	char *sl2vl;
> > @@ -108,20 +108,20 @@ typedef struct osm_qos_options {
> >  * FIELDS
> >  *
> >  *	max_vls
> > -*		The number of maximum VLs on the Subnet
> > +*		The number of maximum VLs on the Subnet (0 == use default)
> >  *
> >  *	high_limit
> >  *		The limit of High Priority component of VL Arbitration
> > -*		table (IBA 7.6.9)
> > +*		table (IBA 7.6.9) (-1 == use default)
> >  *
> >  *	vlarb_high
> > -*		High priority VL Arbitration table template.
> > +*		High priority VL Arbitration table template. (NULL == use default)
> >  *
> >  *	vlarb_low
> > -*		Low priority VL Arbitration table template.
> > +*		Low priority VL Arbitration table template. (NULL == use default)
> >  *
> >  *	sl2vl
> > -*		SL2VL Mapping table (IBA 7.6.6) template.
> > +*		SL2VL Mapping table (IBA 7.6.6) template. (NULL == use default)
> >  *
> >  *********/
> >  
> > diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
> > index 1679ae0..b451c25 100644
> > --- a/opensm/opensm/osm_qos.c
> > +++ b/opensm/opensm/osm_qos.c
> > @@ -382,7 +382,11 @@ static void qos_build_config(struct qos_config *cfg,
> >  	memset(cfg, 0, sizeof(*cfg));
> >  
> >  	cfg->max_vls = opt->max_vls > 0 ? opt->max_vls : dflt->max_vls;
> > -	cfg->vl_high_limit = (uint8_t) opt->high_limit;
> > +
> > +	if (opt->high_limit >= 0)
> > +		cfg->vl_high_limit = (uint8_t) opt->high_limit;
> > +	else
> > +		cfg->vl_high_limit = (uint8_t) dflt->high_limit;
> >  
> >  	p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;
> >  	for (i = 0; i < 2 * IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; i++) {
> > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> > index 0422d0f..ab2ff9c 100644
> > --- a/opensm/opensm/osm_subnet.c
> > +++ b/opensm/opensm/osm_subnet.c
> > @@ -370,6 +370,15 @@ static void subn_set_default_qos_options(IN osm_qos_options_t * opt)
> >  	opt->sl2vl = OSM_DEFAULT_QOS_SL2VL;
> >  }
> >  
> > +static void subn_init_qos_options(IN osm_qos_options_t * opt)
> > +{
> > +	opt->max_vls = 0;
> > +	opt->high_limit = -1;
> > +	opt->vlarb_high = NULL;
> > +	opt->vlarb_low = NULL;
> > +	opt->sl2vl = NULL;
> > +}
> > +
> >  /**********************************************************************
> >   **********************************************************************/
> >  void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
> > @@ -458,10 +467,10 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
> >  	p_opt->prefix_routes_file = OSM_DEFAULT_PREFIX_ROUTES_FILE;
> >  	p_opt->consolidate_ipv6_snm_req = FALSE;
> >  	subn_set_default_qos_options(&p_opt->qos_options);
> > -	subn_set_default_qos_options(&p_opt->qos_ca_options);
> > -	subn_set_default_qos_options(&p_opt->qos_sw0_options);
> > -	subn_set_default_qos_options(&p_opt->qos_swe_options);
> > -	subn_set_default_qos_options(&p_opt->qos_rtr_options);
> > +	subn_init_qos_options(&p_opt->qos_ca_options);
> > +	subn_init_qos_options(&p_opt->qos_sw0_options);
> > +	subn_init_qos_options(&p_opt->qos_swe_options);
> > +	subn_init_qos_options(&p_opt->qos_rtr_options);
> >  }
> >  
> >  /**********************************************************************
> > @@ -497,6 +506,7 @@ opts_unpack_net64(IN char *p_req_key,
> >  	}
> >  }
> >  
> > +
> >  /**********************************************************************
> >   **********************************************************************/
> >  static void
> > @@ -511,6 +521,20 @@ opts_unpack_uint32(IN char *p_req_key,
> >  		}
> >  	}
> >  }
> > +/**********************************************************************
> > + **********************************************************************/
> > +static void
> > +opts_unpack_int32(IN char *p_req_key,
> > +		  IN char *p_key, IN char *p_val_str, IN int32_t * p_val)
> > +{
> > +	if (!strcmp(p_req_key, p_key)) {
> > +		int32_t val = strtol(p_val_str, NULL, 0);
> > +		if (val != *p_val) {
> > +			log_config_value(p_key, "%d", val);
> > +			*p_val = val;
> > +		}
> > +	}
> > +}
> >  
> >  /**********************************************************************
> >   **********************************************************************/
> > @@ -641,7 +665,7 @@ subn_parse_qos_options(IN const char *prefix,
> >  	snprintf(name, sizeof(name), "%s_max_vls", prefix);
> >  	opts_unpack_uint32(name, p_key, p_val_str, &opt->max_vls);
> >  	snprintf(name, sizeof(name), "%s_high_limit", prefix);
> > -	opts_unpack_uint32(name, p_key, p_val_str, &opt->high_limit);
> > +	opts_unpack_int32(name, p_key, p_val_str, &opt->high_limit);
> >  	snprintf(name, sizeof(name), "%s_vlarb_high", prefix);
> >  	opts_unpack_charp(name, p_key, p_val_str, &opt->vlarb_high);
> >  	snprintf(name, sizeof(name), "%s_vlarb_low", prefix);
> > @@ -653,7 +677,9 @@ subn_parse_qos_options(IN const char *prefix,
> >  static int
> >  subn_dump_qos_options(FILE * file,
> >  		      const char *set_name,
> > -		      const char *prefix, osm_qos_options_t * opt)
> > +		      const char *prefix, 
> > +		      osm_qos_options_t * opt,
> > +		      osm_qos_options_t * dflt)
> >  {
> >  	return fprintf(file, "# %s\n"
> >  		       "%s_max_vls %u\n"
> > @@ -662,10 +688,11 @@ subn_dump_qos_options(FILE * file,
> >  		       "%s_vlarb_low %s\n"
> >  		       "%s_sl2vl %s\n",
> >  		       set_name,
> > -		       prefix, opt->max_vls,
> > -		       prefix, opt->high_limit,
> > -		       prefix, opt->vlarb_high,
> > -		       prefix, opt->vlarb_low, prefix, opt->sl2vl);
> > +		       prefix, opt->max_vls > 0 ? opt->max_vls : dflt->max_vls,
> > +		       prefix, opt->high_limit >= 0 ? opt->high_limit : dflt->high_limit,
> > +		       prefix, opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high,
> > +		       prefix, opt->vlarb_low ? opt->vlarb_low : dflt->vlarb_low, 
> > +		       prefix, opt->sl2vl ? opt->sl2vl : dflt->sl2vl);
> >  }
> >  
> >  /**********************************************************************
> > @@ -833,169 +860,182 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
> >  /**********************************************************************
> >   **********************************************************************/
> >  
> > -static void subn_verify_max_vls(IN unsigned *max_vls, IN char *key)
> > +static void subn_verify_max_vls(IN unsigned *max_vls, IN char *key, IN unsigned dflt)
> >  {
> >  	char buff[128];
> >  
> > -	if (*max_vls > 15) {
> > +	if (!(*max_vls) || *max_vls > 15) {
> >  		sprintf(buff, " Invalid Cached Option:%s=%u:"
> > -			"Using Default:%u\n",
> > -			key, *max_vls, OSM_DEFAULT_QOS_MAX_VLS);
> > +			"Using Default\n",
> > +			key, *max_vls);
> >  		printf(buff);
> >  		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> > -		*max_vls = OSM_DEFAULT_QOS_MAX_VLS;
> > +		*max_vls = dflt;
> >  	}
> >  }
> >  
> > -static void subn_verify_high_limit(IN unsigned *high_limit, IN char *key)
> > +static void subn_verify_high_limit(IN int *high_limit, IN char *key, IN int dflt)
> >  {
> >  	char buff[128];
> >  
> > -	if (*high_limit > 255) {
> > -		sprintf(buff, " Invalid Cached Option:%s=%u:"
> > -			"Using Default:%u\n",
> > -			key, *high_limit, OSM_DEFAULT_QOS_HIGH_LIMIT);
> > +	if (*high_limit < 0 || *high_limit > 255) {
> > +		sprintf(buff, " Invalid Cached Option:%s=%d:"
> > +			"Using Default\n", key, *high_limit);
> >  		printf(buff);
> >  		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> > -		*high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT;
> > +		*high_limit = dflt;
> >  	}
> >  }
> >  
> > -static void subn_verify_vlarb(IN char *vlarb, IN char *key)
> > +static void subn_verify_vlarb(IN char **vlarb, IN char *key, IN char *dflt)
> >  {
> > -	if (vlarb) {
> > -		char buff[128];
> > -		char *str, *tok, *end, *ptr;
> > -		int count = 0;
> > -
> > -		str = (char *)malloc(strlen(vlarb) + 1);
> > -		strcpy(str, vlarb);
> > -
> > -		tok = strtok_r(str, ",\n", &ptr);
> > -		while (tok) {
> > -			char *vl_str, *weight_str;
> > -
> > -			vl_str = tok;
> > -			weight_str = strchr(tok, ':');
> > -
> > -			if (weight_str) {
> > -				long vl, weight;
> > -
> > -				*weight_str = '\0';
> > -				weight_str++;
> > -
> > -				vl = strtol(vl_str, &end, 0);
> > -
> > -				if (*end) {
> > -					sprintf(buff,
> > -						" Warning: Cached Option %s:vl=%s improperly formatted\n",
> > -						key, vl_str);
> > -					printf(buff);
> > -					cl_log_event("OpenSM", CL_LOG_INFO,
> > -						     buff, NULL, 0);
> > -				} else if (vl < 0 || vl > 14) {
> > -					sprintf(buff,
> > -						" Warning: Cached Option %s:vl=%ld out of range\n",
> > -						key, vl);
> > -					printf(buff);
> > -					cl_log_event("OpenSM", CL_LOG_INFO,
> > -						     buff, NULL, 0);
> > -				}
> > -
> > -				weight = strtol(weight_str, &end, 0);
> > -
> > -				if (*end) {
> > -					sprintf(buff,
> > -						" Warning: Cached Option %s:weight=%s improperly formatted\n",
> > -						key, weight_str);
> > -					printf(buff);
> > -					cl_log_event("OpenSM", CL_LOG_INFO,
> > -						     buff, NULL, 0);
> > -				} else if (weight < 0 || weight > 255) {
> > -					sprintf(buff,
> > -						" Warning: Cached Option %s:weight=%ld out of range\n",
> > -						key, weight);
> > -					printf(buff);
> > -					cl_log_event("OpenSM", CL_LOG_INFO,
> > -						     buff, NULL, 0);
> > -				}
> > -			} else {
> > -				sprintf(buff,
> > -					" Warning: Cached Option %s:vl:weight=%s improperly formatted\n",
> > -					key, tok);
> > -				printf(buff);
> > -				cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
> > -					     0);
> > -			}
> > +	char buff[128];
> > +	char *str, *tok, *end, *ptr;
> > +	int count = 0;
> >  
> > -			count++;
> > -			tok = strtok_r(NULL, ",\n", &ptr);
> > -		}
> > +	if (*vlarb == NULL) {
> > +		sprintf(buff, " Invalid Cached Option:%s:"
> > +			"Using Default\n", key);
> > +		printf(buff);
> > +		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> > +		(*vlarb) = dflt;			
> > +		return;
> > +	}
> >  
> > -		if (count > 64) {
> > -			sprintf(buff,
> > -				" Warning: Cached Option %s: > 64 listed: "
> > -				"excess vl:weight pairs will be dropped\n",
> > -				key);
> > -			printf(buff);
> > -			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> > -		}
> > +	str = (char *)malloc(strlen(*vlarb) + 1);
> > +	strcpy(str, *vlarb);
> >  
> > -		free(str);
> > -	}
> > -}
> > +	tok = strtok_r(str, ",\n", &ptr);
> > +	while (tok) {
> > +		char *vl_str, *weight_str;
> >  
> > -static void subn_verify_sl2vl(IN char *sl2vl, IN char *key)
> > -{
> > -	if (sl2vl) {
> > -		char buff[128];
> > -		char *str, *tok, *end, *ptr;
> > -		int count = 0;
> > +		vl_str = tok;
> > +		weight_str = strchr(tok, ':');
> >  
> > -		str = (char *)malloc(strlen(sl2vl) + 1);
> > -		strcpy(str, sl2vl);
> > +		if (weight_str) {
> > +			long vl, weight;
> >  
> > -		tok = strtok_r(str, ",\n", &ptr);
> > -		while (tok) {
> > -			long vl = strtol(tok, &end, 0);
> > +			*weight_str = '\0';
> > +			weight_str++;
> > +
> > +			vl = strtol(vl_str, &end, 0);
> >  
> >  			if (*end) {
> >  				sprintf(buff,
> >  					" Warning: Cached Option %s:vl=%s improperly formatted\n",
> > -					key, tok);
> > +					key, vl_str);
> >  				printf(buff);
> > -				cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
> > -					     0);
> > -			} else if (vl < 0 || vl > 15) {
> > +				cl_log_event("OpenSM", CL_LOG_INFO,
> > +					     buff, NULL, 0);
> > +			} else if (vl < 0 || vl > 14) {
> >  				sprintf(buff,
> >  					" Warning: Cached Option %s:vl=%ld out of range\n",
> >  					key, vl);
> >  				printf(buff);
> > -				cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
> > -					     0);
> > +				cl_log_event("OpenSM", CL_LOG_INFO,
> > +					     buff, NULL, 0);
> >  			}
> >  
> > -			count++;
> > -			tok = strtok_r(NULL, ",\n", &ptr);
> > -		}
> > +			weight = strtol(weight_str, &end, 0);
> >  
> > -		if (count < 16) {
> > +			if (*end) {
> > +				sprintf(buff,
> > +					" Warning: Cached Option %s:weight=%s improperly formatted\n",
> > +					key, weight_str);
> > +				printf(buff);
> > +				cl_log_event("OpenSM", CL_LOG_INFO,
> > +					     buff, NULL, 0);
> > +			} else if (weight < 0 || weight > 255) {
> > +				sprintf(buff,
> > +					" Warning: Cached Option %s:weight=%ld out of range\n",
> > +					key, weight);
> > +				printf(buff);
> > +				cl_log_event("OpenSM", CL_LOG_INFO,
> > +					     buff, NULL, 0);
> > +			}
> > +		} else {
> >  			sprintf(buff,
> > -				" Warning: Cached Option %s: < 16 VLs listed\n",
> > -				key);
> > +				" Warning: Cached Option %s:vl:weight=%s improperly formatted\n",
> > +				key, tok);
> >  			printf(buff);
> > -			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> > +			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
> > +				     0);
> >  		}
> > -		if (count > 16) {
> > +
> > +		count++;
> > +		tok = strtok_r(NULL, ",\n", &ptr);
> > +	}
> > +
> > +	if (count > 64) {
> > +		sprintf(buff,
> > +			" Warning: Cached Option %s: > 64 listed: "
> > +			"excess vl:weight pairs will be dropped\n",
> > +			key);
> > +		printf(buff);
> > +		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> > +	}
> > +
> > +	free(str);
> > +}
> > +
> > +static void subn_verify_sl2vl(IN char **sl2vl, IN char *key, IN char *dflt)
> > +{
> > +	char buff[128];
> > +	char *str, *tok, *end, *ptr;
> > +	int count = 0;
> > +
> > +	if (*sl2vl == NULL) {
> > +		sprintf(buff, " Invalid Cached Option:%s:"
> > +			"Using Default\n", key);
> > +		printf(buff);
> > +		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> > +		(*sl2vl) = dflt;			
> > +		return;
> > +	}
> > +
> > +	str = (char *)malloc(strlen(*sl2vl) + 1);
> > +	strcpy(str, *sl2vl);
> > +
> > +	tok = strtok_r(str, ",\n", &ptr);
> > +	while (tok) {
> > +		long vl = strtol(tok, &end, 0);
> > +
> > +		if (*end) {
> >  			sprintf(buff,
> > -				" Warning: Cached Option %s: > 16 listed: "
> > -				"excess VLs will be dropped\n", key);
> > +				" Warning: Cached Option %s:vl=%s improperly formatted\n",
> > +				key, tok);
> >  			printf(buff);
> > -			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> > +			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
> > +				     0);
> > +		} else if (vl < 0 || vl > 15) {
> > +			sprintf(buff,
> > +				" Warning: Cached Option %s:vl=%ld out of range\n",
> > +				key, vl);
> > +			printf(buff);
> > +			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
> > +				     0);
> >  		}
> >  
> > -		free(str);
> > +		count++;
> > +		tok = strtok_r(NULL, ",\n", &ptr);
> > +	}
> > +
> > +	if (count < 16) {
> > +		sprintf(buff,
> > +			" Warning: Cached Option %s: < 16 VLs listed\n",
> > +			key);
> > +		printf(buff);
> > +		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> >  	}
> > +	if (count > 16) {
> > +		sprintf(buff,
> > +			" Warning: Cached Option %s: > 16 listed: "
> > +			"excess VLs will be dropped\n", key);
> > +		printf(buff);
> > +		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
> > +	}
> > +
> > +	free(str);
> >  }
> >  
> >  static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
> > @@ -1046,61 +1086,113 @@ static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
> >  	}
> >  
> >  	if (p_opts->qos) {
> > +		/* the default options in qos_options must be correct.
> > +		 * every other one need not be, b/c those will default
> > +		 * back to whatever is in qos_options.
> > +		 */
> > +
> >  		subn_verify_max_vls(&(p_opts->qos_options.max_vls),
> > -				    "qos_max_vls");
> > -		subn_verify_max_vls(&(p_opts->qos_ca_options.max_vls),
> > -				    "qos_ca_max_vls");
> > -		subn_verify_max_vls(&(p_opts->qos_sw0_options.max_vls),
> > -				    "qos_sw0_max_vls");
> > -		subn_verify_max_vls(&(p_opts->qos_swe_options.max_vls),
> > -				    "qos_swe_max_vls");
> > -		subn_verify_max_vls(&(p_opts->qos_rtr_options.max_vls),
> > -				    "qos_rtr_max_vls");
> > +				    "qos_max_vls",
> > +				    OSM_DEFAULT_MAX_OP_VLS);
> > +		if (p_opts->qos_ca_options.max_vls)
> > +			subn_verify_max_vls(&(p_opts->qos_ca_options.max_vls),
> > +					    "qos_ca_max_vls",
> > +					    0);
> > +		if (p_opts->qos_sw0_options.max_vls)
> > +			subn_verify_max_vls(&(p_opts->qos_sw0_options.max_vls),
> > +					    "qos_sw0_max_vls",
> > +					    0);
> > +		if (p_opts->qos_swe_options.max_vls)
> > +			subn_verify_max_vls(&(p_opts->qos_swe_options.max_vls),
> > +					    "qos_swe_max_vls",
> > +					    0);
> > +		if (p_opts->qos_rtr_options.max_vls)
> > +			subn_verify_max_vls(&(p_opts->qos_rtr_options.max_vls),
> > +					    "qos_rtr_max_vls",
> > +					    0);
> >  
> >  		subn_verify_high_limit(&(p_opts->qos_options.high_limit),
> > -				       "qos_high_limit");
> > -		subn_verify_high_limit(&(p_opts->qos_ca_options.high_limit),
> > -				       "qos_ca_high_limit");
> > -		subn_verify_high_limit(&
> > -				       (p_opts->qos_sw0_options.high_limit),
> > -				       "qos_sw0_high_limit");
> > -		subn_verify_high_limit(&
> > -				       (p_opts->qos_swe_options.high_limit),
> > -				       "qos_swe_high_limit");
> > -		subn_verify_high_limit(&
> > -				       (p_opts->qos_rtr_options.high_limit),
> > -				       "qos_rtr_high_limit");
> > -
> > -		subn_verify_vlarb(p_opts->qos_options.vlarb_low,
> > -				  "qos_vlarb_low");
> > -		subn_verify_vlarb(p_opts->qos_ca_options.vlarb_low,
> > -				  "qos_ca_vlarb_low");
> > -		subn_verify_vlarb(p_opts->qos_sw0_options.vlarb_low,
> > -				  "qos_sw0_vlarb_low");
> > -		subn_verify_vlarb(p_opts->qos_swe_options.vlarb_low,
> > -				  "qos_swe_vlarb_low");
> > -		subn_verify_vlarb(p_opts->qos_rtr_options.vlarb_low,
> > -				  "qos_rtr_vlarb_low");
> > -
> > -		subn_verify_vlarb(p_opts->qos_options.vlarb_high,
> > -				  "qos_vlarb_high");
> > -		subn_verify_vlarb(p_opts->qos_ca_options.vlarb_high,
> > -				  "qos_ca_vlarb_high");
> > -		subn_verify_vlarb(p_opts->qos_sw0_options.vlarb_high,
> > -				  "qos_sw0_vlarb_high");
> > -		subn_verify_vlarb(p_opts->qos_swe_options.vlarb_high,
> > -				  "qos_swe_vlarb_high");
> > -		subn_verify_vlarb(p_opts->qos_rtr_options.vlarb_high,
> > -				  "qos_rtr_vlarb_high");
> > -
> > -		subn_verify_sl2vl(p_opts->qos_options.sl2vl, "qos_sl2vl");
> > -		subn_verify_sl2vl(p_opts->qos_ca_options.sl2vl, "qos_ca_sl2vl");
> > -		subn_verify_sl2vl(p_opts->qos_sw0_options.sl2vl,
> > -				  "qos_sw0_sl2vl");
> > -		subn_verify_sl2vl(p_opts->qos_swe_options.sl2vl,
> > -				  "qos_swe_sl2vl");
> > -		subn_verify_sl2vl(p_opts->qos_rtr_options.sl2vl,
> > -				  "qos_rtr_sl2vl");
> > +				       "qos_high_limit",
> > +				       OSM_DEFAULT_QOS_HIGH_LIMIT);
> > +		if (p_opts->qos_ca_options.high_limit >= 0)
> > +			subn_verify_high_limit(&(p_opts->qos_ca_options.high_limit),
> > +					       "qos_ca_high_limit",
> > +					       -1);
> > +		if (p_opts->qos_sw0_options.high_limit >= 0)
> > +			subn_verify_high_limit(&
> > +					       (p_opts->qos_sw0_options.high_limit),
> > +					       "qos_sw0_high_limit",
> > +					       -1);
> > +		if (p_opts->qos_swe_options.high_limit >= 0)
> > +			subn_verify_high_limit(&
> > +					       (p_opts->qos_swe_options.high_limit),
> > +					       "qos_swe_high_limit",
> > +					       -1);
> > +		if (p_opts->qos_rtr_options.high_limit >= 0)
> > +			subn_verify_high_limit(&
> > +					       (p_opts->qos_rtr_options.high_limit),
> > +					       "qos_rtr_high_limit",
> > +					       -1);
> > +
> > +		subn_verify_vlarb(&(p_opts->qos_options.vlarb_low),
> > +				  "qos_vlarb_low",
> > +				  OSM_DEFAULT_QOS_VLARB_LOW);
> > +		if (p_opts->qos_ca_options.vlarb_low)
> > +			subn_verify_vlarb(&(p_opts->qos_ca_options.vlarb_low),
> > +					  "qos_ca_vlarb_low",
> > +					  NULL);
> > +		if (p_opts->qos_sw0_options.vlarb_low)
> > +			subn_verify_vlarb(&(p_opts->qos_sw0_options.vlarb_low),
> > +					  "qos_sw0_vlarb_low",
> > +					  NULL);
> > +		if (p_opts->qos_swe_options.vlarb_low)
> > +			subn_verify_vlarb(&(p_opts->qos_swe_options.vlarb_low),
> > +					  "qos_swe_vlarb_low",
> > +					  NULL);
> > +		if (p_opts->qos_rtr_options.vlarb_low)
> > +			subn_verify_vlarb(&(p_opts->qos_rtr_options.vlarb_low),
> > +					  "qos_rtr_vlarb_low",
> > +					  NULL);
> > +
> > +		subn_verify_vlarb(&(p_opts->qos_options.vlarb_high),
> > +				  "qos_vlarb_high",
> > +				  OSM_DEFAULT_QOS_VLARB_HIGH);
> > +		if (p_opts->qos_ca_options.vlarb_high)
> > +			subn_verify_vlarb(&(p_opts->qos_ca_options.vlarb_high),
> > +					  "qos_ca_vlarb_high",
> > +					  NULL);
> > +		if (p_opts->qos_sw0_options.vlarb_high)
> > +			subn_verify_vlarb(&(p_opts->qos_sw0_options.vlarb_high),
> > +					  "qos_sw0_vlarb_high",
> > +					  NULL);
> > +		if (p_opts->qos_swe_options.vlarb_high)
> > +			subn_verify_vlarb(&(p_opts->qos_swe_options.vlarb_high),
> > +					  "qos_swe_vlarb_high",
> > +					  NULL);
> > +		if (p_opts->qos_rtr_options.vlarb_high)
> > +			subn_verify_vlarb(&(p_opts->qos_rtr_options.vlarb_high),
> > +					  "qos_rtr_vlarb_high",
> > +					  NULL);
> > +
> > +		subn_verify_sl2vl(&(p_opts->qos_options.sl2vl), 
> > +				  "qos_sl2vl",
> > +				  OSM_DEFAULT_QOS_SL2VL);
> > +		if (p_opts->qos_ca_options.sl2vl)
> > +			subn_verify_sl2vl(&(p_opts->qos_ca_options.sl2vl), 
> > +					  "qos_ca_sl2vl",
> > +					  NULL);
> > +		if (p_opts->qos_sw0_options.sl2vl)
> > +			subn_verify_sl2vl(&(p_opts->qos_sw0_options.sl2vl),
> > +					  "qos_sw0_sl2vl",
> > +					  NULL);
> > +		if (p_opts->qos_swe_options.sl2vl)
> > +			subn_verify_sl2vl(&(p_opts->qos_swe_options.sl2vl),
> > +					  "qos_swe_sl2vl",
> > +					  NULL);
> > +		if (p_opts->qos_rtr_options.sl2vl)
> > +			subn_verify_sl2vl(&(p_opts->qos_rtr_options.sl2vl),
> > +					  "qos_rtr_sl2vl",
> > +					  NULL);
> >  	}
> >  #ifdef ENABLE_OSM_PERF_MGR
> >  	if (p_opts->perfmgr_sweep_time_s < 1) {
> > @@ -1714,23 +1806,28 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts)
> >  
> >  	subn_dump_qos_options(opts_file,
> >  			      "QoS default options", "qos",
> > +			      &p_opts->qos_options,
> >  			      &p_opts->qos_options);
> >  	fprintf(opts_file, "\n");
> >  	subn_dump_qos_options(opts_file,
> >  			      "QoS CA options", "qos_ca",
> > -			      &p_opts->qos_ca_options);
> > +			      &p_opts->qos_ca_options,
> > +			      &p_opts->qos_options);
> >  	fprintf(opts_file, "\n");
> >  	subn_dump_qos_options(opts_file,
> >  			      "QoS Switch Port 0 options", "qos_sw0",
> > -			      &p_opts->qos_sw0_options);
> > +			      &p_opts->qos_sw0_options,
> > +			      &p_opts->qos_options);
> >  	fprintf(opts_file, "\n");
> >  	subn_dump_qos_options(opts_file,
> >  			      "QoS Switch external ports options", "qos_swe",
> > -			      &p_opts->qos_swe_options);
> > +			      &p_opts->qos_swe_options,
> > +			      &p_opts->qos_options);
> >  	fprintf(opts_file, "\n");
> >  	subn_dump_qos_options(opts_file,
> >  			      "QoS Router ports options", "qos_rtr",
> > -			      &p_opts->qos_rtr_options);
> > +			      &p_opts->qos_rtr_options,
> > +			      &p_opts->qos_options);
> >  	fprintf(opts_file, "\n");
> >  
> >  	fprintf(opts_file,
> > -- 
> > 1.5.4.5
> > 
> 
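
The vl:weight checks in the quoted subn_verify_vlarb() hunk can be
sketched on their own as below. This is only an illustration under
stated assumptions: count_bad_vlarb_entries() is a hypothetical helper
that is not part of OpenSM, it counts problems instead of logging them,
and unlike the patch it also rejects empty fields. The ranges (vl 0..14,
weight 0..255) are the ones used in the patch.

```c
#define _POSIX_C_SOURCE 200809L	/* for strtok_r() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of OpenSM: counts the malformed or
 * out-of-range entries in a "vl:weight,vl:weight,..." string.
 * Ranges mirror the quoted patch: vl in 0..14, weight in 0..255.
 * Returns -1 if memory allocation fails. */
static int count_bad_vlarb_entries(const char *vlarb)
{
	char *str, *tok, *ptr, *end;
	int bad = 0;

	/* strtok_r() modifies its input, so work on a private copy */
	str = malloc(strlen(vlarb) + 1);
	if (!str)
		return -1;
	strcpy(str, vlarb);

	for (tok = strtok_r(str, ",\n", &ptr); tok;
	     tok = strtok_r(NULL, ",\n", &ptr)) {
		char *weight_str = strchr(tok, ':');
		long vl, weight;

		if (!weight_str) {	/* no ':' separator at all */
			bad++;
			continue;
		}
		*weight_str++ = '\0';	/* split token into vl and weight */

		vl = strtol(tok, &end, 0);
		if (end == tok || *end || vl < 0 || vl > 14) {
			bad++;
			continue;
		}
		weight = strtol(weight_str, &end, 0);
		if (end == weight_str || *end || weight < 0 || weight > 255)
			bad++;
	}

	free(str);
	return bad;
}
```

A caller could fall back to a default string whenever the count is
non-zero, which is roughly what the patch's dflt argument is for when
the option is missing.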
-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory


From nicolas.morey-chaisemartin at ext.bull.net  Tue Nov 11 22:36:06 2008
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Wed, 12 Nov 2008 07:36:06 +0100
Subject: [ofa-general] opensm: Routing on non-pure Fat-Tree
Message-ID: <491A7956.2000406@ext.bull.net>

Hello,

I am running some tests on routing a non-pure fat-tree network using
the fat-tree algorithm of OpenSM.
The network I am experimenting on is a 3-level fat tree with a pruned
3rd layer.
When I provide the root_guid_file, the algorithm works great!

The problem is that we would like to attach some service nodes directly
to the 3rd-level switches.
I have added the cn_guid_file so that the network is still recognized
as a fat tree.
OpenSM again manages to create the routing for the network. It provides
full connectivity, except that there are no routes between non-compute
nodes.
I understand that marking these nodes as non-compute nodes implies that
they won't talk to each other, but we still need some connectivity
between them to exchange a little data (pings and such).
A simple min-hop pass or similar should be enough to generate those
routes.
It will probably unbalance the number of routes going through the top
links, but those additional links carry virtually no traffic, so in
practice it shouldn't be a problem.

Is there any reason such behaviour hasn't been implemented yet? Should
it be?

Regards

Nicolas Morey-Chaisemartin


From ogerlitz at voltaire.com  Tue Nov 11 22:46:04 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 12 Nov 2008 08:46:04 +0200
Subject: [ofa-general] [PATCH] perftest: don't attach the sender QP
In-Reply-To: <Pine.LNX.4.64.0811041122560.20425@zuben.voltaire.com>
References: <Pine.LNX.4.64.0811041122560.20425@zuben.voltaire.com>
Message-ID: <491A7BAC.5030708@voltaire.com>

Or Gerlitz wrote:
> don't attach the sender QP to the MGID
>   
Oren,

Did you have a chance to look at this patch?

Or.
> Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
>
> Index: perftest-1.2/send_bw.c
> ===================================================================
> --- perftest-1.2.orig/send_bw.c
> +++ perftest-1.2/send_bw.c
> @@ -421,7 +421,7 @@ static struct pingpong_context *pp_init_
>  			return NULL;
>  		}
>
> -		if ((user_parm->connection_type==UD) && (user_parm->use_mcg)) {
> +		if ((user_parm->connection_type==UD) && (user_parm->use_mcg) && !user_parm->servername) {
>  			union ibv_gid gid;
>  			uint8_t mcg_gid[16] = MCG_GID;
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From vlad at lists.openfabrics.org  Wed Nov 12 03:23:53 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 12 Nov 2008 03:23:53 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081112-0200 daily build status
Message-ID: <20081112112353.69757E60D93@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From kliteyn at dev.mellanox.co.il  Wed Nov 12 07:57:58 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 12 Nov 2008 17:57:58 +0200
Subject: [ofa-general] opensm: Routing on non-pure Fat-Tree
In-Reply-To: <491A7956.2000406@ext.bull.net>
References: <491A7956.2000406@ext.bull.net>
Message-ID: <491AFD06.3010207@dev.mellanox.co.il>

Hi Nicolas,

Nicolas Morey Chaisemartin wrote:
> Hello,
> 
> I am running some tests on routing a non-pure fat-tree network using
> the fat-tree algorithm of OpenSM.
> The network I am experimenting on is a 3-level fat tree with a pruned
> 3rd layer.
> When I provide the root_guid_file, the algorithm works great!
>
> The problem is that we would like to attach some service nodes directly
> to the 3rd-level switches.
> I have added the cn_guid_file so that the network is still recognized
> as a fat tree.
> OpenSM again manages to create the routing for the network. It provides
> full connectivity, except that there are no routes between non-compute
> nodes.
> I understand that marking these nodes as non-compute nodes implies that
> they won't talk to each other, but we still need some connectivity
> between them to exchange a little data (pings and such).
> A simple min-hop pass or similar should be enough to generate those
> routes.
> It will probably unbalance the number of routes going through the top
> links, but those additional links carry virtually no traffic, so in
> practice it shouldn't be a problem.

Fat-tree should create full connectivity as long as there is an up/down
route between ports. Do you get connectivity between these nodes with
the up/down routing algorithm?
Try running it with the same root_guid_file.

-- Yevgeny

> Is there any reason such behaviour hasn't been implemented yet? Should
> it be?
> 
> Regards
> 
> Nicolas Morey-Chaisemartin
> 


From nicolas.morey-chaisemartin at ext.bull.net  Wed Nov 12 08:01:53 2008
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Wed, 12 Nov 2008 17:01:53 +0100
Subject: [ofa-general] opensm: Routing on non-pure Fat-Tree
In-Reply-To: <491AFD06.3010207@dev.mellanox.co.il>
References: <491A7956.2000406@ext.bull.net>
	<491AFD06.3010207@dev.mellanox.co.il>
Message-ID: <491AFDF1.2080607@ext.bull.net>

Yevgeny Kliteynik wrote:
> Hi Nicolas,
>
> Nicolas Morey Chaisemartin wrote:
>> Hello,
>>
>> I am running some tests on routing a non-pure fat-tree network using
>> the fat-tree algorithm of OpenSM.
>> The network I am experimenting on is a 3-level fat tree with a pruned
>> 3rd layer.
>> When I provide the root_guid_file, the algorithm works great!
>>
>> The problem is that we would like to attach some service nodes
>> directly to the 3rd-level switches.
>> I have added the cn_guid_file so that the network is still recognized
>> as a fat tree.
>> OpenSM again manages to create the routing for the network. It
>> provides full connectivity, except that there are no routes between
>> non-compute nodes.
>> I understand that marking these nodes as non-compute nodes implies
>> that they won't talk to each other, but we still need some
>> connectivity between them to exchange a little data (pings and such).
>> A simple min-hop pass or similar should be enough to generate those
>> routes.
>> It will probably unbalance the number of routes going through the top
>> links, but those additional links carry virtually no traffic, so in
>> practice it shouldn't be a problem.
>
> Fat-tree should create full connectivity as long as there is an up/down
> route between ports. Do you get connectivity between these nodes with
> up/down routing algorithm?
> Try running it with the same root_guid_file.
>
> -- Yevgeny

Well, the route would be more down/up compared to the rest of the transfers.
(I'm not sure I was clear, but when I talk about the 3rd level, I mean the
top level, the 1st level being the switches just above the compute nodes.)
I'll try this tomorrow.

Thanks

Nicolas
>
>> Is there any reason such behaviour hasn't been implemented yet? Should
>> it be?
>>
>> Regards
>>
>> Nicolas Morey-Chaisemartin
>>
>
>
>


From michael.heinz at qlogic.com  Wed Nov 12 08:52:29 2008
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Wed, 12 Nov 2008 10:52:29 -0600
Subject: [ofa-general]
	fork() failing in mvapich1 and mvapich2, using OFED 1.4
Message-ID: <C07C40DB2364324799506DE8FF12F8D886B885@EPEXCH1.qlogic.org>

I'm not sure when this stopped working, but I'm getting a complaint from
our QA people that our fork() test program is failing with mvapich1 and
mvapich2 when tested with OFED 1.4. When I tested with OFED 1.3.1, I got
a similar result:


[root at panic mpi_fork]$ mpirun_rsh -np 2 panic homer mpi_fork 128 1024
Exit code -3 signaled from homer
Abort signaled by rank 0: [panic:0] Got completion with error
IBV_WC_LOC_LEN_ERR, code=1, dest rank=1

Killing remote processes...MPI process terminated unexpectedly
DONE


This is the program that generates the failure:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <unistd.h>
#include <sys/wait.h>
#include <mpi.h>


#define MYBUFSIZE (4*1024*1028)
#define MAX_REQ_NUM 100000

char s_buf1[MYBUFSIZE];
char r_buf1[MYBUFSIZE];


MPI_Request request[MAX_REQ_NUM];
MPI_Status my_stat[MAX_REQ_NUM];

int main(int argc,char *argv[])
{
    int  myid, numprocs, i;
    int size, loop, page_size;
    char *s_buf, *r_buf;
    double t_start=0.0, t_end=0.0, t=0.0;


    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);

    if ( argc < 3 ) {
       fprintf(stderr, "Usage: mpi_fork loop msg_size\n");
       MPI_Finalize();
       return 0;
    }
    size=atoi(argv[2]);
    loop = atoi(argv[1]);

    if(size > MYBUFSIZE){
         fprintf(stderr, "Maximum message size is %d\n",MYBUFSIZE);
         MPI_Finalize();
         return 0;
    }

    if(loop > MAX_REQ_NUM){
         fprintf(stderr, "Maximum number of iterations is %d\n",
                 MAX_REQ_NUM);
         MPI_Finalize();
         return 0;
    }

    page_size = getpagesize();

    s_buf = (char*)(((unsigned long)s_buf1 + (page_size -1))/page_size *
page_size);
    r_buf = (char*)(((unsigned long)r_buf1 + (page_size -1))/page_size *
page_size);

    assert( (s_buf != NULL) && (r_buf != NULL) );

    for ( i=0; i<size; i++ ){
           s_buf[i]='a';
           r_buf[i]='b';
    }

    /*warmup */
    if (myid == 0)
    {
        for ( i=0; i< loop; i++ ) {
            MPI_Isend(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD,
request+i);
        }

        MPI_Waitall(loop, request, my_stat);
        MPI_Recv(r_buf, 4, MPI_CHAR, 1, 101, MPI_COMM_WORLD,
&my_stat[0]);

    }else{
        for ( i=0; i< loop; i++ ) {
            MPI_Irecv(r_buf, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
                      request+i);
        }
        MPI_Waitall(loop, request, my_stat);
        MPI_Send(s_buf, 4, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
    }
    // fork a child process and make sure it lives beyond parent touching
    // pages
    // if fork is not properly handled in stack, parent would get a copy
    // of its registered/locked pages (such as qp wqes) on 1st access
    // and problems such as Local Length Error would be reported by HCA
    if (fork() == 0) {
        // child exists but doesn't touch anything, parent still owns pages
        sleep(10);
        // exec another program
        execlp("date", "date", NULL);
        // just in case exec fails
        exit(0);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if (myid == 0)
    {
        t_start=MPI_Wtime();
        for ( i=0; i< loop; i++ ) {
            MPI_Isend(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD,
request+i);
        }

        MPI_Waitall(loop, request, my_stat);
        MPI_Recv(r_buf, 4, MPI_CHAR, 1, 101, MPI_COMM_WORLD,
&my_stat[0]);

        t_end=MPI_Wtime();
        t = t_end - t_start;

    }else{
        for ( i=0; i< loop; i++ ) {
            MPI_Irecv(r_buf, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
                      request+i);
        }
        MPI_Waitall(loop, request, my_stat);
        MPI_Send(s_buf, 4, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
    }

    if ( myid == 0 ) {
       double tmp;
       tmp = ((size*1.0)/1.0e6)*loop;
       fprintf(stdout,"%d\t%f\n", size, tmp/t);
    }
    {
        int status;
        int ret;

        ret = wait(&status);
        if (ret == -1 || ! WIFEXITED(status) || WEXITSTATUS(status) !=
0)
        {
           fprintf(stdout,"ERROR: child failure: ret=%d, status=0x%x, "
                   "exit_status=%d\n", ret, status, WEXITSTATUS(status));
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
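
The page alignment of s_buf and r_buf in the program above uses a
round-up-to-multiple computation. A minimal sketch of that arithmetic
follows; align_up() and align_slack() are hypothetical names introduced
only for illustration. Note that rounding the start address up can skip
as many as page_size - 1 bytes, so the aligned buffer has that many
fewer usable bytes than MYBUFSIZE, and a message size equal to MYBUFSIZE
could run past the end of the array if the array does not happen to
start on a page boundary.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: round addr up to the next multiple of align
 * (align must be non-zero). Same arithmetic as the s_buf/r_buf setup. */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
	return (addr + (align - 1)) / align * align;
}

/* Bytes skipped at the front of a buffer starting at addr once its
 * start is rounded up; always less than align. */
static size_t align_slack(uintptr_t addr, uintptr_t align)
{
	return (size_t)(align_up(addr, align) - addr);
}
```

With these, a stricter size check would compare size against
MYBUFSIZE minus align_slack((uintptr_t)s_buf1, page_size) instead of
against MYBUFSIZE alone.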

 
--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania


From koop at cse.ohio-state.edu  Wed Nov 12 09:13:00 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Wed, 12 Nov 2008 12:13:00 -0500 (EST)
Subject: [ofa-general] Re: [mvapich-discuss] fork() failing in mvapich1 and
 mvapich2, using OFED 1.4
In-Reply-To: <C07C40DB2364324799506DE8FF12F8D886B885@EPEXCH1.qlogic.org>
Message-ID: <Pine.GSO.4.40.0811121211440.7711-100000@kappa.cse.ohio-state.edu>

Hi Mike,

In order to enable fork support you need to set an additional
environment variable. See Section 7.1.2 in the User Guide for more information:

http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-350007.1.2

Thanks,

Matt
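
The user guide section itself is not quoted in this thread, so the
following is only a sketch under an assumption: that the variable in
question is IBV_FORK_SAFE, which libibverbs documents (together with
RDMAV_FORK_SAFE) as having the same effect as calling ibv_fork_init().
request_fork_safety() is a hypothetical helper, not an MVAPICH or
libibverbs function, and whether MVAPICH needs anything beyond this
variable is not established here.

```c
#include <stdlib.h>

/* Hypothetical helper, not an MVAPICH or libibverbs API.
 * Assumption: IBV_FORK_SAFE is the variable the guide refers to.
 * Sets it to "1" unless the environment already defines it.
 * Returns 0 on success, -1 on failure (as setenv() does). */
static int request_fork_safety(void)
{
	return setenv("IBV_FORK_SAFE", "1", 0);
}
```

For this to have any effect it would have to be called before
MPI_Init(), since the verbs library reads the variable when it
initializes. Exporting the variable in the job's environment instead
avoids touching the program at all.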


On Wed, 12 Nov 2008, Mike Heinz wrote:

> I'm not sure when this stopped working, but I'm getting a complaint from
> our QA people that our fork() test program is failing with mvapich1 and
> mvapich2 when tested with OFED 1.4. When I tested with OFED 1.3.1, I got
> a similar result:
>
>
> [root at panic mpi_fork]$ mpirun_rsh -np 2 panic homer mpi_fork 128 1024
> Exit code -3 signaled from homer
> Abort signaled by rank 0: [panic:0] Got completion with error
> IBV_WC_LOC_LEN_ERR, code=1, dest rank=1
>
> Killing remote processes...MPI process terminated unexpectedly
> DONE
>
>
> This is the program that generates the failure:
>
> #include <stdlib.h>
> #include <math.h>
> #include <assert.h>
> #include <sys/wait.h>
>
>
> #define MYBUFSIZE (4*1024*1028)
> #define MAX_REQ_NUM 100000
>
> char s_buf1[MYBUFSIZE];
> char r_buf1[MYBUFSIZE];
>
>
> MPI_Request request[MAX_REQ_NUM];
> MPI_Status my_stat[MAX_REQ_NUM];
>
> int main(int argc,char *argv[])
> {
>     int  myid, numprocs, i;
>     int size, loop, page_size;
>     char *s_buf, *r_buf;
>     double t_start=0.0, t_end=0.0, t=0.0;
>
>
>     MPI_Init(&argc,&argv);
>     MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
>     MPI_Comm_rank(MPI_COMM_WORLD,&myid);
>
>     if ( argc < 3 ) {
>        fprintf(stderr, "Usage: mpi_fork loop msg_size\n");
>        MPI_Finalize();
>        return 0;
>     }
>     size=atoi(argv[2]);
>     loop = atoi(argv[1]);
>
>     if(size > MYBUFSIZE){
>          fprintf(stderr, "Maximum message size is %d\n",MYBUFSIZE);
>          MPI_Finalize();
>          return 0;
>     }
>
>     if(loop > MAX_REQ_NUM){
>          fprintf(stderr, "Maximum number of iterations is
> %d\n",MAX_REQ_NUM);
>          MPI_Finalize();
>          return 0;
>     }
>
>     page_size = getpagesize();
>
>     s_buf = (char*)(((unsigned long)s_buf1 + (page_size -1))/page_size *
> page_size);
>     r_buf = (char*)(((unsigned long)r_buf1 + (page_size -1))/page_size *
> page_size);
>
>     assert( (s_buf != NULL) && (r_buf != NULL) );
>
>     for ( i=0; i<size; i++ ){
>            s_buf[i]='a';
>            r_buf[i]='b';
>     }
>
>     /*warmup */
>     if (myid == 0)
>     {
>         for ( i=0; i< loop; i++ ) {
>             MPI_Isend(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD,
> request+i);
>         }
>
>         MPI_Waitall(loop, request, my_stat);
>         MPI_Recv(r_buf, 4, MPI_CHAR, 1, 101, MPI_COMM_WORLD,
> &my_stat[0]);
>
>     }else{
>         for ( i=0; i< loop; i++ ) {
>         MPI_Irecv(r_buf, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
> request+i);
>         }
>     MPI_Waitall(loop, request, my_stat);
>         MPI_Send(s_buf, 4, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
>     }
>     // fork a child process and make sure it lives beyond parent
> touching pages
>     // if fork is not properly handled in stack, parent would get a copy
>     // of its registered/locked pages (such as qp wqes) on 1st access
>     // and problems such as Local Length Error would be reported by HCA
>     if (fork() == 0) {
>         // child exists but doesn't touch anything, parent still owns
> pages
>         sleep(10);
>         // exec another program
>         execlp("date", "date", NULL);
>         // just in case exec fails
>         exit(0);
>     }
>
>     MPI_Barrier(MPI_COMM_WORLD);
>
>     if (myid == 0)
>     {
>         t_start=MPI_Wtime();
>         for ( i=0; i< loop; i++ ) {
>             MPI_Isend(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD,
> request+i);
>         }
>
>         MPI_Waitall(loop, request, my_stat);
>         MPI_Recv(r_buf, 4, MPI_CHAR, 1, 101, MPI_COMM_WORLD,
> &my_stat[0]);
>
>         t_end=MPI_Wtime();
>         t = t_end - t_start;
>
>     }else{
>         for ( i=0; i< loop; i++ ) {
>         MPI_Irecv(r_buf, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
> request+i);
>         }
>     MPI_Waitall(loop, request, my_stat);
>         MPI_Send(s_buf, 4, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
>     }
>
>     if ( myid == 0 ) {
>        double tmp;
>        tmp = ((size*1.0)/1.0e6)*loop;
>        fprintf(stdout,"%d\t%f\n", size, tmp/t);
>     }
>     {
>         int status;
>         int ret;
>
>         ret = wait(&status);
>         if (ret == -1 || ! WIFEXITED(status) || WEXITSTATUS(status) !=
> 0)
>         {
>            fprintf(stdout,"ERROR: child failure: ret=%d, status=0x%x,
> exit_status=%d\n", ret, status, WEXITSTATUS(status));
>         }
>     }
>
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }
>
>
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation
> King of Prussia, Pennsylvania
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


From michael.heinz at qlogic.com  Wed Nov 12 09:22:22 2008
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Wed, 12 Nov 2008 11:22:22 -0600
Subject: [ofa-general] RE: [mvapich-discuss] fork() failing in mvapich1 and
	mvapich2, using OFED 1.4
In-Reply-To: <Pine.GSO.4.40.0811121211440.7711-100000@kappa.cse.ohio-state.edu>
References: <C07C40DB2364324799506DE8FF12F8D886B885@EPEXCH1.qlogic.org>
	<Pine.GSO.4.40.0811121211440.7711-100000@kappa.cse.ohio-state.edu>
Message-ID: <C07C40DB2364324799506DE8FF12F8D886B88F@EPEXCH1.qlogic.org>

Thanks for the reply, Matt. 


--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-----Original Message-----
From: Matthew Koop [mailto:koop at cse.ohio-state.edu] 
Sent: Wednesday, November 12, 2008 12:13 PM
To: Mike Heinz
Cc: mvapich-discuss at cse.ohio-state.edu; general at lists.openfabrics.org
Subject: Re: [mvapich-discuss] fork() failing in mvapich1 and mvapich2,
using OFED 1.4

Hi Mike,

In order to enable fork support you need to set an additional
environment variable. See Section 7.1.2 in the User Guide for more information:

http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-350007.1.2

Thanks,

Matt


On Wed, 12 Nov 2008, Mike Heinz wrote:

> I'm not sure when this stopped working, but I'm getting a complaint 
> from our QA people that our fork() test program is failing with 
> mvapich1 and
> mvapich2 when tested with OFED 1.4. When I tested with OFED 1.3.1, I 
> got a similar result:
>
>
> [root at panic mpi_fork]$ mpirun_rsh -np 2 panic homer mpi_fork 128 1024
> Exit code -3 signaled from homer
> Abort signaled by rank 0: [panic:0] Got completion with error IBV_WC_LOC_LEN_ERR, code=1, dest rank=1
>
> Killing remote processes...MPI process terminated unexpectedly
> DONE
>
>
> This is the program that generates the failure:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <math.h>
> #include <assert.h>
> #include <sys/wait.h>
> #include <mpi.h>
>
>
> #define MYBUFSIZE (4*1024*1028)
> #define MAX_REQ_NUM 100000
>
> char s_buf1[MYBUFSIZE];
> char r_buf1[MYBUFSIZE];
>
>
> MPI_Request request[MAX_REQ_NUM];
> MPI_Status my_stat[MAX_REQ_NUM];
>
> int main(int argc,char *argv[])
> {
>     int  myid, numprocs, i;
>     int size, loop, page_size;
>     char *s_buf, *r_buf;
>     double t_start=0.0, t_end=0.0, t=0.0;
>
>
>     MPI_Init(&argc,&argv);
>     MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
>     MPI_Comm_rank(MPI_COMM_WORLD,&myid);
>
>     if ( argc < 3 ) {
>        fprintf(stderr, "Usage: mpi_fork loop msg_size\n");
>        MPI_Finalize();
>        return 0;
>     }
>     size=atoi(argv[2]);
>     loop = atoi(argv[1]);
>
>     if(size > MYBUFSIZE){
>          fprintf(stderr, "Maximum message size is %d\n",MYBUFSIZE);
>          MPI_Finalize();
>          return 0;
>     }
>
>     if(loop > MAX_REQ_NUM){
>          fprintf(stderr, "Maximum number of iterations is %d\n",MAX_REQ_NUM);
>          MPI_Finalize();
>          return 0;
>     }
>
>     page_size = getpagesize();
>
>     s_buf = (char*)(((unsigned long)s_buf1 + (page_size -1))/page_size * page_size);
>     r_buf = (char*)(((unsigned long)r_buf1 + (page_size -1))/page_size * page_size);
>
>     assert( (s_buf != NULL) && (r_buf != NULL) );
>
>     for ( i=0; i<size; i++ ){
>            s_buf[i]='a';
>            r_buf[i]='b';
>     }
>
>     /*warmup */
>     if (myid == 0)
>     {
>         for ( i=0; i< loop; i++ ) {
>             MPI_Isend(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD,
> request+i);
>         }
>
>         MPI_Waitall(loop, request, my_stat);
>         MPI_Recv(r_buf, 4, MPI_CHAR, 1, 101, MPI_COMM_WORLD,
> &my_stat[0]);
>
>     }else{
>         for ( i=0; i< loop; i++ ) {
>         MPI_Irecv(r_buf, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
> request+i);
>         }
>     MPI_Waitall(loop, request, my_stat);
>         MPI_Send(s_buf, 4, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
>     }
>     // fork a child process and make sure it lives beyond parent touching pages
>     // if fork is not properly handled in stack, parent would get a copy
>     // of its registered/locked pages (such as qp wqes) on 1st access
>     // and problems such as Local Length Error would be reported by HCA
>     if (fork() == 0) {
>         // child exists but doesn't touch anything, parent still owns pages
>         sleep(10);
>         // exec another program
>         execlp("date", "date", NULL);
>         // just in case exec fails
>         exit(0);
>     }
>
>     MPI_Barrier(MPI_COMM_WORLD);
>
>     if (myid == 0)
>     {
>         t_start=MPI_Wtime();
>         for ( i=0; i< loop; i++ ) {
>             MPI_Isend(s_buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD,
> request+i);
>         }
>
>         MPI_Waitall(loop, request, my_stat);
>         MPI_Recv(r_buf, 4, MPI_CHAR, 1, 101, MPI_COMM_WORLD,
> &my_stat[0]);
>
>         t_end=MPI_Wtime();
>         t = t_end - t_start;
>
>     }else{
>         for ( i=0; i< loop; i++ ) {
>         MPI_Irecv(r_buf, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD,
> request+i);
>         }
>     MPI_Waitall(loop, request, my_stat);
>         MPI_Send(s_buf, 4, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
>     }
>
>     if ( myid == 0 ) {
>        double tmp;
>        tmp = ((size*1.0)/1.0e6)*loop;
>        fprintf(stdout,"%d\t%f\n", size, tmp/t);
>     }
>     {
>         int status;
>         int ret;
>
>         ret = wait(&status);
>         if (ret == -1 || ! WIFEXITED(status) || WEXITSTATUS(status) !=
> 0)
>         {
>            fprintf(stdout,"ERROR: child failure: ret=%d, status=0x%x, exit_status=%d\n", ret, status, WEXITSTATUS(status));
>         }
>     }
>
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }
>
>
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation
> King of Prussia, Pennsylvania
>


From rdreier at cisco.com  Wed Nov 12 10:20:40 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 12 Nov 2008 10:20:40 -0800
Subject: [ofa-general] Re: [PATCH 2.6.28] RDMA/cxgb3: deadlock in iw_cxgb3
	can cause hang when configuring interface.
In-Reply-To: <20081106230642.28808.66765.stgit@dell3.ogc.int> (Steve Wise's
	message of "Thu, 06 Nov 2008 17:06:42 -0600")
References: <20081106230642.28808.66765.stgit@dell3.ogc.int>
Message-ID: <adazlk4n6vb.fsf@cisco.com>

Looks good, applied.

However, I think it's a little yucky to call ethtool ops without rtnl,
although it is of course perfectly safe in this case.  It might be nicer
to introduce a new cxgb3 <-> iw_cxgb3 interface that returns the
firmware version, which can also be used to implement the get_drvinfo
ethtool op as well.  That would let you avoid fw_vers_string_to_u64() as
well -- it is a little silly at the moment how cxgb3 converts to a
string and then iw_cxgb3 parses that string.

But that's all much lower priority than just fixing a deadlock.

 - R.
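
To illustrate the string round trip described above, here is a minimal sketch. The packing layout (major above bit 32, minor in bits 16-31, micro in bits 0-15) and the names fw_pack() and fw_parse() are assumptions for illustration only, not the actual cxgb3 or iw_cxgb3 code. A direct interface would hand over the numeric fields (or the packed value) and skip the parse step entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical packing of a firmware version into one 64-bit value. */
uint64_t fw_pack(unsigned major, unsigned minor, unsigned micro)
{
    assert(minor <= 0xffff && micro <= 0xffff);
    return ((uint64_t)major << 32) | ((uint64_t)minor << 16) | micro;
}

/* The detour a direct interface would avoid: one side formats the
 * version as "major.minor.micro", the other parses it back.
 * Returns 0 on success, -1 if the string does not match. */
int fw_parse(const char *s, uint64_t *out)
{
    unsigned major, minor, micro;

    assert(s != NULL && out != NULL);
    if (sscanf(s, "%u.%u.%u", &major, &minor, &micro) != 3)
        return -1;
    if (minor > 0xffff || micro > 0xffff)
        return -1;
    *out = fw_pack(major, minor, micro);
    return 0;
}
```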


From sashak at voltaire.com  Wed Nov 12 10:54:57 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 12 Nov 2008 20:54:57 +0200
Subject: [ofa-general] Re: [PATCH] opensm/opensm/osm_state_mgr.c: Add check
	for valid physical port before using pointer.
In-Reply-To: <20081110131140.52561f42.weiny2@llnl.gov>
References: <20081104095744.35893d4a.weiny2@llnl.gov>
	<20081110201333.GM313@sashak.voltaire.com>
	<20081110131140.52561f42.weiny2@llnl.gov>
Message-ID: <20081112185457.GD27271@sashak.voltaire.com>

Hi Ira,

On 13:11 Mon 10 Nov     , Ira Weiny wrote:
> > 
> > Actually it can be a valid case. For example, when a node was first
> > discovered via port A, then this port was disconnected and the same node
> > was discovered via port B - it is not a new node and node_info (where the
> > port number for osm_node_get_any_physp_ptr() is stored) will not be
> > updated.
> 
> Ah, good point, I just happened to see it when PortInfo failed.
> 
> > 
> > Obviously the patch is fine. But probably we need a more general fix, for
> > example to redo osm_node_get_any_physp_ptr() so that it will not return
> > invalid ports. Need to review other osm_node_get_any_physp_ptr() usages.
> 
> I was wondering if it would return invalid ports ever.  It would be easy for it
> to return only valid ports but perhaps that should be another function to
> preserve functionality?

Perhaps. OTOH osm_node_get_any_physp_ptr() is used in very few places. I think
first we need to review all those cases, then we will know better how to
handle this.

Sasha


From chu11 at llnl.gov  Wed Nov 12 11:26:56 2008
From: chu11 at llnl.gov (Al Chu)
Date: Wed, 12 Nov 2008 11:26:56 -0800
Subject: [ofa-general] opensm - can/cannot set alternate default pkey?
Message-ID: <1226518016.7156.15.camel@cardanus.llnl.gov>

Before I run off and write a patch I shouldn't, I thought I'd ask.

In 10.9.1.2 of the spec, it states, "The P_Key value of 0xFFFF shall
represent the default partition key."

(I couldn't find the glossary in the spec about what "shall" means, but
I assume it means "must" or "required" like RFCs.)

Does this mean that a P_Key of 0xFFFF must be in the P_Key_Table?
Currently, it seems that in opensm, no matter how you write your
partition.conf file, 0xFFFF will always be in the P_Key_Table.  This is
because opensm inserts this in its internal list by default, and
nothing (as far as I can find) can remove it/get rid of it out of that
internal list.

This seems wrong to me, but I'm getting confused on the wording.

Thanks,
Al

-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory


From kliteyn at dev.mellanox.co.il  Wed Nov 12 12:00:10 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 12 Nov 2008 22:00:10 +0200
Subject: [ofa-general] opensm: Routing on non-pure Fat-Tree
In-Reply-To: <491AFDF1.2080607@ext.bull.net>
References: <491A7956.2000406@ext.bull.net>
	<491AFD06.3010207@dev.mellanox.co.il>
	<491AFDF1.2080607@ext.bull.net>
Message-ID: <491B35CA.8020904@dev.mellanox.co.il>

Nicolas Morey Chaisemartin wrote:
> Yevgeny Kliteynik a écrit :
>> Hi Nicolas,
>>
>> Nicolas Morey Chaisemartin wrote:
>>> Hello,
>>>
>>> I am conducting some tests on routing a non-pure fat-tree network using
>>> the fat tree algorithm of OpenSM.
>>> The network I am experimenting on is a 3 level fat tree, with a
>>> pruned 3rd layer.
>>> By providing the root_guid_file, the algorithm works great!
>>>
>>> The problem is, we would like to add some service nodes directly on
>>> the 3rd level switches.
>>> I have added the cn_guid_file so the network is still recognized as a
>>> fat tree.
>>> OpenSM once more manages to create the routing for the network. It
>>> provides full connectivity,
>>> except there are no routes between non-compute nodes.
>>> I understand that marking these nodes as non-compute nodes implies
>>> they won't talk to each other, but we still need a bit of
>>> connectivity between them to exchange a little data (pings and such).
>>> A simple min-hop or similar should be enough to generate those routes.
>>> It will probably unbalance the number of routes going through
>>> the top links, but those additional routes carry virtually no traffic
>>> at all, so in practice it shouldn't be a problem.
>>
>> Fat-tree should create full connectivity as long as there is an up/down
>> route between ports. Do you get connectivity between these nodes with
>> up/down routing algorithm?
>> Try running it with the same root_guid_file.
>>
>> -- Yevgeny
> 
> Well the route would be more down/up compared to the rest of the transfer.
> (I'm not sure I was clear, but when I talk of the 3rd level, I mean the top
> level, the 1st level being the switches just above the compute nodes.)

Oh, OK. I was thinking the opposite. So you connect these
non-CNs to spine switches.

> I'll try this tomorrow

No need :)
Fat-tree is a variation of up/down routing. As such, down/up
routes are not allowed. You won't have connectivity between
these nodes in either fat-tree or up/down routing.
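
To make the restriction concrete, here is a minimal sketch in C. It is not OpenSM code: is_up_down_path() and the rank numbering (1 = switches just above the compute nodes, larger = closer to the roots, as in Nicolas's description) are assumptions for illustration. A path is legal only if it never climbs again once it has started to descend.

```c
#include <assert.h>
#include <stddef.h>

/* ranks[i] is the tree level of the i-th switch along a path.
 * Returns 1 if the path is a legal up/down route, 0 if it contains
 * a down-then-up turn. */
int is_up_down_path(const int *ranks, size_t n)
{
    int going_down = 0;
    size_t i;

    assert(ranks != NULL || n == 0);
    for (i = 1; i < n; i++) {
        if (ranks[i] < ranks[i - 1])
            going_down = 1;            /* first downward hop seen */
        else if (ranks[i] > ranks[i - 1] && going_down)
            return 0;                  /* climbing after descending */
    }
    return 1;
}
```

With this numbering, a compute-node path such as 1, 2, 3, 2, 1 is accepted, while a path between two service nodes on different top-level switches has to look like 3, 2, 3 and is rejected.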

>>> Is there any reason such a behavior hasn't been implemented yet?

The idea of allowing only up/down routes is to prevent credit
loops in the fabric.

>>> Should there be one?

I guess it is possible, but these down/up routes will create
credit loops, so any traffic between these "special" nodes is
potentially bad for the fabric.
Note that there's already a "connect roots" option in the
up/down routing which violates the up/down rule, but this is
only between switches, so I believe that the only traffic
that uses these routes is management traffic.

-- Yevgeny

>>> Regards
>>>
>>> Nicolas Morey-Chaisemartin
>>>
>>> _______________________________________________
>>> general mailing list
>>> general at lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>
>>> To unsubscribe, please visit 
>>> http://openib.org/mailman/listinfo/openib-general
>>>
>>
>>
>>
> 
> 


From akepner at sgi.com  Wed Nov 12 14:18:46 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 12 Nov 2008 14:18:46 -0800
Subject: [ofa-general] opensm: bad multicast forwarding table entries
Message-ID: <20081112221846.GE25248@sgi.com>


Here's a description of a problem we're seeing where multicast 
forwarding tables are apparently getting set up incorrectly. I'd 
appreciate any debug help from the opensm experts out there.

On large clusters (>1000 nodes or so) we often see hundreds of errors 
from 'ibdiagnet -r' like the following (this is the simplest example 
I could find):

-I- Multicast Group:0xC069 has:2 switches and:2 HCAs
-E- Disconnected switch:S0800690000002e51/U1 in group:0xC069
-E- Disconnected HCA:r4i2n10/U1

These have invariably been multicast groups associated with IPv6 
solicited node multicast addresses, e.g., in this case 'saquery -m' 
shows only a single member, "r5lead":

MCMemberRecord member dump:
                MGID....................0xff12601bffff0000 : 0x00000001ff26d289
                Mlid....................0xC069
                PortGid.................0xfe80000000000000 : 0x0002c9020026d289
                ScopeState..............0x1
                ProxyJoin...............0x0
                NodeDescription.........r5lead HCA-1

ibdiagnet shows that "r5lead" is connected to the switch with lid 
1609, port 24:

Switch  24 "S-0800690000002db4"         # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 1609 lmc 0
[24]    "H-0002c9020026d288"[1](2c9020026d289)          # "r5lead HCA-1" lid 1576 4xDDR

and the multicast forwarding table (from 'dump_mfts.sh') is consistent:

Multicast mlids [0xc000-0xc3ff] of switch Lid 1609 guid 0x0800690000002db4 (MT47396 Infiniscale-III Mellanox Technologies):
            0                   1                   2
     Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
 MLid
....
0xc069                                                      x


So far, so good. But we also have r4i2n10, connected to the switch with 
lid 1533 port 7:

switchguid=0x800690000002e50(800690000002e50)
Switch  24 "S-0800690000002e50"         # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 1533 lmc 0
......
[7]     "H-003048c2438a0000"[1](3048c2438a0001)                 # "r4i2n10 HCA-1" lid 771 4xDDR

with this mft entry:

Multicast mlids [0xc000-0xc3ff] of switch Lid 1533 guid 0x0800690000002e50 (MT47396 Infiniscale-III Mellanox Technologies):
            0                   1                   2
     Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
 MLid
.....
0xc069                    x

Any idea why "r4i2n10", with PortGid fe80::3048c2438a0001 would have a 
mft entry for the multicast group with MGID ff12601bffff::1ff26d289?
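
For reference, this is how I would expect the MGID above to be derived, sketched in C. It assumes the IPoIB mapping of an IPv6 solicited-node address (ff02::1:ffXX:XXXX) into an MGID, and that the interface's IPv6 address is autoconfigured so its low 24 bits come from the port GUID. snm_mgid() is a made-up helper name, not an OpenSM or IPoIB function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build the MGID for the IPv6 solicited-node multicast group of an
 * interface whose address ends in the low 24 bits of `port_guid`. */
void snm_mgid(uint64_t port_guid, uint16_t pkey, uint8_t mgid[16])
{
    assert(mgid != NULL);
    memset(mgid, 0, 16);
    mgid[0]  = 0xff;                    /* multicast GID prefix */
    mgid[1]  = 0x12;                    /* flags = 1, scope = 2 (link-local) */
    mgid[2]  = 0x60;                    /* 0x601b: IPv6 signature */
    mgid[3]  = 0x1b;
    mgid[4]  = (uint8_t)(pkey >> 8);    /* partition key */
    mgid[5]  = (uint8_t)(pkey & 0xff);
    /* bytes 6..15 carry the low 80 bits of ff02::1:ffXX:XXXX */
    mgid[11] = 0x01;
    mgid[12] = 0xff;
    mgid[13] = (uint8_t)(port_guid >> 16);
    mgid[14] = (uint8_t)(port_guid >> 8);
    mgid[15] = (uint8_t)port_guid;
}
```

Under those assumptions, r5lead's port GUID 0x0002c9020026d289 yields ff12601bffff0000:00000001ff26d289, matching the saquery output, whereas r4i2n10's port GUID 0x003048c2438a0001 would yield a group ending in ff8a0001.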

Anyone else seen similar?

-- 
Arthur


From hal.rosenstock at gmail.com  Wed Nov 12 14:46:18 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Nov 2008 17:46:18 -0500
Subject: [ofa-general] opensm - can/cannot set alternate default pkey?
In-Reply-To: <1226518016.7156.15.camel@cardanus.llnl.gov>
References: <1226518016.7156.15.camel@cardanus.llnl.gov>
Message-ID: <f0e08f230811121446l569b180bqe62b0083eafddba6@mail.gmail.com>

Hi Al,

On Wed, Nov 12, 2008 at 2:26 PM, Al Chu <chu11 at llnl.gov> wrote:
> Before I run off and write a patch I shouldn't, I thought I'd ask.

I don't think there's a need (see below).

> In 10.9.1.2 of the spec, it states, "The P_Key value of 0xFFFF shall
> represent the default partition key."

Default in this sense is referring to the default partition (and it is
not changeable in the same sense other defaults are).

All end ports _must_ be a member of the default partition either as a
full or limited member. This is needed for SA communication. See p.882
Table 185 P_KeyTable (initialization) for one citation on this. There
are others in the spec.

> (I couldn't find the glossary in the spec about what "shall" means, but
> I assume it means "must" or "required" like RFCs.)

Yes.

> Does this mean that a P_Key of 0xFFFF must be in the P_Key_Table?

Either 0xffff or 0x7fff must be in the P_KeyTable of every end port.

> Currently, it seems that in opensm, no matter how you write your
> partition.conf file, 0xFFFF will always be in the P_Key_Table.  This is
> because opensm inserts this in its internal list by default, and
> nothing (as far as I can find) can remove it/get rid of it out of that
> internal list.

That's being a full member of the default partition. You should be
able to change this to be a limited member of the default partition
too.
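
A small sketch of the distinction, in C. The top bit of a P_Key is the membership bit (1 = full, 0 = limited) and the low 15 bits are the partition's base value, so 0xffff and 0x7fff are the same default partition with different membership. The function and macro names below are made up for illustration and are not OpenSM identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_MEMBERSHIP_BIT 0x8000u  /* set = full member, clear = limited */
#define PKEY_BASE_MASK      0x7fffu
#define DEFAULT_PKEY_BASE   0x7fffu  /* base value of the default partition */

int pkey_is_full_member(uint16_t pkey)
{
    return (pkey & PKEY_MEMBERSHIP_BIT) != 0;
}

uint16_t pkey_base(uint16_t pkey)
{
    return (uint16_t)(pkey & PKEY_BASE_MASK);
}

/* Scan a port's P_Key table for the default partition.
 * Returns 2 for full membership, 1 for limited only, 0 if absent. */
int default_partition_membership(const uint16_t *table, size_t n)
{
    int best = 0;
    size_t i;

    assert(table != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (pkey_base(table[i]) != DEFAULT_PKEY_BASE)
            continue;
        if (pkey_is_full_member(table[i]))
            return 2;
        best = 1;
    }
    return best;
}
```

A table for which this returns 0 would violate the requirement that every end port belongs to the default partition.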

-- Hal

> This seems wrong to me, but I'm getting confused on the wording.

> Thanks,
> Al
>
> --
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
>


From hal.rosenstock at gmail.com  Wed Nov 12 14:46:57 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Nov 2008 17:46:57 -0500
Subject: Re: [ofa-general] opensm: bad multicast forwarding table
	entries
In-Reply-To: <20081112221846.GE25248@sgi.com>
References: <20081112221846.GE25248@sgi.com>
Message-ID: <f0e08f230811121446g396e1343ia2242d607bed123d@mail.gmail.com>

On Wed, Nov 12, 2008 at 5:18 PM,  <akepner at sgi.com> wrote:
>
> Here's a description of a problem we're seeing where multicast
> forwarding tables are apparently getting set up incorrectly. I'd
> appreciate any debug help from the opensm experts out there.
>
> On large clusters (>1000 nodes or so) we often see hundreds of errors
> from 'ibdiagnet -r' like the following (this is the simplest example
> I could find):
>
> -I- Multicast Group:0xC069 has:2 switches and:2 HCAs
> -E- Disconnected switch:S0800690000002e51/U1 in group:0xC069
> -E- Disconnected HCA:r4i2n10/U1
>
> These have invariably been multicast groups associated with IPv6
> solicited node multicast addresses, e.g., in this case 'saquery -m'
> shows only a single member, "r5lead":
>
> MCMemberRecord member dump:
>                MGID....................0xff12601bffff0000 : 0x00000001ff26d289
>                Mlid....................0xC069
>                PortGid.................0xfe80000000000000 : 0x0002c9020026d289
>                ScopeState..............0x1
>                ProxyJoin...............0x0
>                NodeDescription.........r5lead HCA-1
>
> ibdiagnet shows that "r5lead" is connected to the switch with lid
> 1609, port 24:
>
> Switch  24 "S-0800690000002db4"         # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 1609 lmc 0
> [24]    "H-0002c9020026d288"[1](2c9020026d289)          # "r5lead HCA-1" lid 1576 4xDDR
>
> and the multicast forwarding table (from 'dump_mfts.sh') is consistent:
>
> Multicast mlids [0xc000-0xc3ff] of switch Lid 1609 guid 0x0800690000002db4 (MT47396 Infiniscale-III Mellanox Technologies):
>            0                   1                   2
>     Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
>  MLid
> ....
> 0xc069                                                      x
>
>
> So far, so good. But we also have r4i2n10, connected to the switch with
> lid 1533 port 7:
>
> switchguid=0x800690000002e50(800690000002e50)
> Switch  24 "S-0800690000002e50"         # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 1533 lmc 0
> ......
> [7]     "H-003048c2438a0000"[1](3048c2438a0001)                 # "r4i2n10 HCA-1" lid 771 4xDDR
>
> with this mft entry:
>
> Multicast mlids [0xc000-0xc3ff] of switch Lid 1533 guid 0x0800690000002e50 (MT47396 Infiniscale-III Mellanox Technologies):
>            0                   1                   2
>     Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
>  MLid
> .....
> 0xc069                    x
>
> Any idea why "r4i2n10", with PortGid fe80::3048c2438a0001 would have a
> mft entry for the multicast group with MGID ff12601bffff::1ff26d289?

Are you using the consolidate IPv6 SNM (solicited node multicast)
option in OpenSM ?

-- Hal

> Anyone else seen similar?
>
> --
> Arthur
>


From chu11 at llnl.gov  Wed Nov 12 14:59:53 2008
From: chu11 at llnl.gov (Al Chu)
Date: Wed, 12 Nov 2008 14:59:53 -0800
Subject: [ofa-general] opensm - can/cannot set alternate default pkey?
In-Reply-To: <f0e08f230811121446l569b180bqe62b0083eafddba6@mail.gmail.com>
References: <1226518016.7156.15.camel@cardanus.llnl.gov>
	<f0e08f230811121446l569b180bqe62b0083eafddba6@mail.gmail.com>
Message-ID: <1226530793.7156.25.camel@cardanus.llnl.gov>

Hey Hal,

Now it's making more sense to me.  Thanks for clearing it up.

Al

On Wed, 2008-11-12 at 17:46 -0500, Hal Rosenstock wrote:
> Hi Al,
> 
> On Wed, Nov 12, 2008 at 2:26 PM, Al Chu <chu11 at llnl.gov> wrote:
> > Before I run off and write a patch I shouldn't, I thought I'd ask.
> 
> I don't think there's a need (see below).
> 
> > In 10.9.1.2 of the spec, it states, "The P_Key value of 0xFFFF shall
> > represent the default partition key."
> 
> Default in this sense is referring to the default partition (and it is
> not changeable in the same sense other defaults are).
> 
> All end ports _must_ be a member of the default partition either as a
> full or limited member. This is needed for SA communication. See p.882
> Table 185 P_KeyTable (initialization) for one citation on this. There
> are others in the spec.
>
> > (I couldn't find the glossary in the spec about what "shall" means, but
> > I assume it means "must" or "required" like RFCs.)
> 
> Yes.
> 
> > Does this mean that a P_Key of 0xFFFF must be in the P_Key_Table?
> 
> Either 0xffff or 0x7fff must be in the P_KeyTable of every end port.
> 
> > Currently, it seems that in opensm, no matter how you write your
> > partition.conf file, 0xFFFF will always be in the P_Key_Table.  This is
> > because opensm inserts this in its internal list by default, and
> > nothing (as far as I can find) can remove it/get rid of it out of that
> > internal list.
> 
> That's being a full member of the default partition. You should be
> able to change this to be a limited member of the default partition
> too.
> 
> -- Hal
> 
> > This seems wrong to me, but I'm getting confused on the wording.
> 
> > Thanks,
> > Al
> >
> > --
> > Albert Chu
> > chu11 at llnl.gov
> > Computer Scientist
> > High Performance Systems Division
> > Lawrence Livermore National Laboratory
> >
> 
-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory


From akepner at sgi.com  Wed Nov 12 15:00:13 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 12 Nov 2008 15:00:13 -0800
Subject: [ofa-general] opensm: bad multicast forwarding table entries
In-Reply-To: <f0e08f230811121446g396e1343ia2242d607bed123d@mail.gmail.com>
References: <20081112221846.GE25248@sgi.com>
	<f0e08f230811121446g396e1343ia2242d607bed123d@mail.gmail.com>
Message-ID: <20081112230013.GF25248@sgi.com>

On Wed, Nov 12, 2008 at 05:46:57PM -0500, Hal Rosenstock wrote:
> ...
> Are you using the consolidate IPv6 SNM (solicited node multicast)
> option in OpenSM ?
> 

No, we're generally using OFED 1.3-1.3.1 vintage code, which 
doesn't have that option. (In fact, this is the first I've 
heard of it.)

-- 
Arthur


From hal.rosenstock at gmail.com  Wed Nov 12 15:14:15 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Nov 2008 18:14:15 -0500
Subject: [ofa-general] opensm: bad multicast forwarding table entries
In-Reply-To: <20081112230013.GF25248@sgi.com>
References: <20081112221846.GE25248@sgi.com>
	<f0e08f230811121446g396e1343ia2242d607bed123d@mail.gmail.com>
	<20081112230013.GF25248@sgi.com>
Message-ID: <f0e08f230811121514q6c797e7ub5d2491dc51ab82c@mail.gmail.com>

On Wed, Nov 12, 2008 at 6:00 PM,  <akepner at sgi.com> wrote:
> On Wed, Nov 12, 2008 at 05:46:57PM -0500, Hal Rosenstock wrote:
>> ...
>> Are you using the consolidate IPv6 SNM (solicited node multicast)
>> option in OpenSM ?
>>
>
> No, we're generally using OFED 1.3-1.3.1 vintage code, which
> doesn't have that option. (In fact, this is the first I've
> heard of it.)

OK; that at least level sets this. I'm not sure about what's changed
in this area but I'll respond some more to the original post.

-- Hal

> Arthur
>
>


From hal.rosenstock at gmail.com  Wed Nov 12 15:27:29 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Nov 2008 18:27:29 -0500
Subject: Re: [ofa-general] opensm: bad multicast forwarding table
	entries
In-Reply-To: <20081112221846.GE25248@sgi.com>
References: <20081112221846.GE25248@sgi.com>
Message-ID: <f0e08f230811121527p258e47cft416f09a2b1e9ea14@mail.gmail.com>

On Wed, Nov 12, 2008 at 5:18 PM,  <akepner at sgi.com> wrote:
>
> Here's a description of a problem we're seeing where multicast
> forwarding tables are apparently getting set up incorrectly. I'd
> appreciate any debug help from the opensm experts out there.
>
> On large clusters (>1000 nodes or so) we often see hundreds of errors
> from 'ibdiagnet -r' like the following (this is the simplest example
> I could find):
>
> -I- Multicast Group:0xC069 has:2 switches and:2 HCAs
> -E- Disconnected switch:S0800690000002e51/U1 in group:0xC069
> -E- Disconnected HCA:r4i2n10/U1

Is it really an error to have a multicast group like this ? I agree
it's not needed to route if there's only 1 member port.

Can you describe the scenario under which this occurs ? Are things
steady state or are there changes going on in the subnet ? Any errors
in the opensm log ?

> These have invariably been multicast groups associated with IPv6
> solicited node multicast addresses, e.g., in this case 'saquery -m'
> shows only a single member, "r5lead":
>
> MCMemberRecord member dump:
>                MGID....................0xff12601bffff0000 : 0x00000001ff26d289
>                Mlid....................0xC069
>                PortGid.................0xfe80000000000000 : 0x0002c9020026d289
>                ScopeState..............0x1
>                ProxyJoin...............0x0
>                NodeDescription.........r5lead HCA-1
>
> ibdiagnet shows that "r5lead" is connected to the switch with lid
> 1609, port 24:
>
> Switch  24 "S-0800690000002db4"         # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 1609 lmc 0
> [24]    "H-0002c9020026d288"[1](2c9020026d289)          # "r5lead HCA-1" lid 1576 4xDDR
>
> and the multicast forwarding table (from 'dump_mfts.sh') is consistent:
>
> Multicast mlids [0xc000-0xc3ff] of switch Lid 1609 guid 0x0800690000002db4 (MT47396 Infiniscale-III Mellanox Technologies):
>            0                   1                   2
>     Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
>  MLid
> ....
> 0xc069                                                      x
>
>
> So far, so good. But we also have r4i2n10, connected to the switch with
> lid 1533 port 7:
>
> switchguid=0x800690000002e50(800690000002e50)
> Switch  24 "S-0800690000002e50"         # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 1533 lmc 0
> ......
> [7]     "H-003048c2438a0000"[1](3048c2438a0001)                 # "r4i2n10 HCA-1" lid 771 4xDDR
>
> with this mft entry:
>
> Multicast mlids [0xc000-0xc3ff] of switch Lid 1533 guid 0x0800690000002e50 (MT47396 Infiniscale-III Mellanox Technologies):
>            0                   1                   2
>     Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
>  MLid
> .....
> 0xc069                    x
>
> Any idea why "r4i2n10", with PortGid fe80::3048c2438a0001 would have a
> mft entry for the multicast group with MGID ff12601bffff::1ff26d289?

The MFT entry is based on an MLID and not the MGID. What does saquery
-g show ? Does it show one or more than one MGID with an MLID of
0xc069 ? Also, does saquery -m 0xc069 show one member ?

I don't think OpenSM does this but if the multicast groups are
disjoint, the same MLID could be used for two different groups (MGIDs)
in different parts of the subnet.

Sasha is probably best to comment on what has changed in this area. Is
it possible to try this with the latest OpenSM to see if this has been
fixed ?

-- Hal

> Anyone else seen similar?
>
> --
> Arthur
>


From akepner at sgi.com  Wed Nov 12 15:54:31 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 12 Nov 2008 15:54:31 -0800
Subject: [ofa-general] opensm: bad multicast forwarding table entries
In-Reply-To: <f0e08f230811121527p258e47cft416f09a2b1e9ea14@mail.gmail.com>
References: <20081112221846.GE25248@sgi.com>
	<f0e08f230811121527p258e47cft416f09a2b1e9ea14@mail.gmail.com>
Message-ID: <20081112235431.GG25248@sgi.com>

On Wed, Nov 12, 2008 at 06:27:29PM -0500, Hal Rosenstock wrote:

Thanks for having a look at this, Hal.

> On Wed, Nov 12, 2008 at 5:18 PM,  <akepner at sgi.com> wrote:
> > .....
> > -I- Multicast Group:0xC069 has:2 switches and:2 HCAs
> > -E- Disconnected switch:S0800690000002e51/U1 in group:0xC069
> > -E- Disconnected HCA:r4i2n10/U1
> 
> Is it really an error to have a multicast group like this ? 

Well, 'ibdiagnet -r' reports it as an error. 

> ... I agree
> it's not needed to route if there's only 1 member port.
> 
> Can you describe the scenario under which this occurs ? Are things
> steady state or are there changes going on in the subnet ? Any errors
> in the opensm log ?

As far as I know, this is steady state behavior. I'll check about 
opensm logging any errors.

> ..... 
> > So far, so good. But we also have r4i2n10, connected to the switch with
> > lid 1533 port 7:
> >
> > switchguid=0x800690000002e50(800690000002e50)
> > Switch  24 "S-0800690000002e50"         # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 1533 lmc 0
> > ......
> > [7]     "H-003048c2438a0000"[1](3048c2438a0001)                 # "r4i2n10 HCA-1" lid 771 4xDDR
> >
> > with this mft entry:
> >
> > Multicast mlids [0xc000-0xc3ff] of switch Lid 1533 guid 0x0800690000002e50 (MT47396 Infiniscale-III Mellanox Technologies):
> >            0                   1                   2
> >     Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
> >  MLid
> > .....
> > 0xc069                    x
> >
> > Any idea why "r4i2n10", with PortGid fe80::3048c2438a0001 would have a
> > mft entry for the multicast group with MGID ff12601bffff::1ff26d289?
> 
> The MFT entry is based on an MLID and not the MGID. What does saquery
> -g show ? Does it show one or more than one MGID with an MLID of
> 0xc069 ? 

Will also try to get this information.

> Also, does saquery -m 0xc069 show one member ?

Yes, only one member.

> 
> I don't think OpenSM does this but if the multicast groups are
> disjoint, the same MLID could be used for two different groups (MGIDs)
> in different parts of the subnet.
> 

Oh, that'd be confusing.

> Sasha is probably best to comment on what has changed in this area. Is
> it possible to try this with the latest OpenSM to see if this has been
> fixed ?
> 

I doubt that this alone would be important enough to get the 
customer to try upgrading opensm, but I can let them know it's 
an option - especially if there's good reason to think it'd 
help.

-- 
Arthur


From sashak at voltaire.com  Wed Nov 12 16:03:36 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Nov 2008 02:03:36 +0200
Subject: [ofa-general] [PATCH] opensm/osm_subnet.c: consolidate logging code
In-Reply-To: <20081111202648.GB8894@sashak.voltaire.com>
References: <1225404081.1197.534.camel@cardanus.llnl.gov>
	<20081110210233.GE3467@sashak.voltaire.com>
	<1226351730.13603.27.camel@cardanus.llnl.gov>
	<1226353273.13603.39.camel@cardanus.llnl.gov>
	<20081111202648.GB8894@sashak.voltaire.com>
Message-ID: <20081113000336.GE27271@sashak.voltaire.com>


Consolidate code like:

	char buff[128];

	sprintf(buff, fmt, ...);
	printf("%s", buff);
	cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);

into single log_report() function.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_subnet.c |  169 ++++++++++++++++----------------------------
 1 files changed, 60 insertions(+), 109 deletions(-)
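
Note (illustrative only, not part of the patch): a minimal standalone
sketch of the "format once, report twice" helper described above.
sketch_log_event() and last_event are hypothetical stand-ins for the
cl_log_event() sink, so the fragment builds without complib. The sketch
prints through "%s", as in the snippet in the description, so that a '%'
in the already formatted text is not interpreted a second time.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for the system log sink (cl_log_event() in the
 * patch); it only remembers the last message so it can be inspected. */
static char last_event[128];

static void sketch_log_event(const char *msg)
{
	snprintf(last_event, sizeof(last_event), "%s", msg);
}

/* Format the message once, then hand the same text to stdout and to the
 * log sink.  Returns the length the full message needed, so a caller can
 * detect truncation (a return value >= 128). */
static int sketch_log_report(const char *fmt, ...)
{
	char buf[128];
	va_list args;
	int n;

	assert(fmt != NULL);

	va_start(args, fmt);
	n = vsnprintf(buf, sizeof(buf), fmt, args);
	va_end(args);

	/* "%s" keeps any '%' left in buf from being parsed as a format */
	printf("%s", buf);
	sketch_log_event(buf);

	return n;
}
```

A caller would use it like the converted call sites, for example
sketch_log_report(" Invalid Cached Option:%s=%u:Using Default:%u\n",
key, value, dflt), where key, value and dflt are the caller's own
variables.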

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 71ba7f5..666c93c 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -468,6 +468,17 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
 
 /**********************************************************************
  **********************************************************************/
+static void log_report(const char *fmt, ...)
+{
+	char buf[128];
+	va_list args;
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+	printf(buf);
+	cl_log_event("OpenSM", CL_LOG_INFO, buf, NULL, 0);
+}
+
 static void log_config_value(char *name, const char *fmt, ...)
 {
 	char buf[128];
@@ -839,28 +850,20 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
 
 static void subn_verify_max_vls(IN unsigned *max_vls, IN char *key)
 {
-	char buff[128];
-
 	if (*max_vls > 15) {
-		sprintf(buff, " Invalid Cached Option:%s=%u:"
-			"Using Default:%u\n",
-			key, *max_vls, OSM_DEFAULT_QOS_MAX_VLS);
-		printf(buff);
-		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
+		log_report(" Invalid Cached Option:%s=%u:"
+			   "Using Default:%u\n",
+			   key, *max_vls, OSM_DEFAULT_QOS_MAX_VLS);
 		*max_vls = OSM_DEFAULT_QOS_MAX_VLS;
 	}
 }
 
 static void subn_verify_high_limit(IN unsigned *high_limit, IN char *key)
 {
-	char buff[128];
-
 	if (*high_limit > 255) {
-		sprintf(buff, " Invalid Cached Option:%s=%u:"
-			"Using Default:%u\n",
-			key, *high_limit, OSM_DEFAULT_QOS_HIGH_LIMIT);
-		printf(buff);
-		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
+		log_report(" Invalid Cached Option:%s=%u:"
+			   "Using Default:%u\n",
+			   key, *high_limit, OSM_DEFAULT_QOS_HIGH_LIMIT);
 		*high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT;
 	}
 }
@@ -868,7 +871,6 @@ static void subn_verify_high_limit(IN unsigned *high_limit, IN char *key)
 static void subn_verify_vlarb(IN char *vlarb, IN char *key)
 {
 	if (vlarb) {
-		char buff[128];
 		char *str, *tok, *end, *ptr;
 		int count = 0;
 
@@ -890,60 +892,39 @@ static void subn_verify_vlarb(IN char *vlarb, IN char *key)
 
 				vl = strtol(vl_str, &end, 0);
 
-				if (*end) {
-					sprintf(buff,
+				if (*end)
+					log_report(
 						" Warning: Cached Option %s:vl=%s improperly formatted\n",
 						key, vl_str);
-					printf(buff);
-					cl_log_event("OpenSM", CL_LOG_INFO,
-						     buff, NULL, 0);
-				} else if (vl < 0 || vl > 14) {
-					sprintf(buff,
+				else if (vl < 0 || vl > 14)
+					log_report(
 						" Warning: Cached Option %s:vl=%ld out of range\n",
 						key, vl);
-					printf(buff);
-					cl_log_event("OpenSM", CL_LOG_INFO,
-						     buff, NULL, 0);
-				}
 
 				weight = strtol(weight_str, &end, 0);
 
-				if (*end) {
-					sprintf(buff,
+				if (*end)
+					log_report(
 						" Warning: Cached Option %s:weight=%s improperly formatted\n",
 						key, weight_str);
-					printf(buff);
-					cl_log_event("OpenSM", CL_LOG_INFO,
-						     buff, NULL, 0);
-				} else if (weight < 0 || weight > 255) {
-					sprintf(buff,
+				else if (weight < 0 || weight > 255)
+					log_report(
 						" Warning: Cached Option %s:weight=%ld out of range\n",
 						key, weight);
-					printf(buff);
-					cl_log_event("OpenSM", CL_LOG_INFO,
-						     buff, NULL, 0);
-				}
-			} else {
-				sprintf(buff,
+			} else
+				log_report(
 					" Warning: Cached Option %s:vl:weight=%s improperly formatted\n",
 					key, tok);
-				printf(buff);
-				cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
-					     0);
-			}
 
 			count++;
 			tok = strtok_r(NULL, ",\n", &ptr);
 		}
 
-		if (count > 64) {
-			sprintf(buff,
+		if (count > 64)
+			log_report(
 				" Warning: Cached Option %s: > 64 listed: "
 				"excess vl:weight pairs will be dropped\n",
 				key);
-			printf(buff);
-			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
-		}
 
 		free(str);
 	}
@@ -952,7 +933,6 @@ static void subn_verify_vlarb(IN char *vlarb, IN char *key)
 static void subn_verify_sl2vl(IN char *sl2vl, IN char *key)
 {
 	if (sl2vl) {
-		char buff[128];
 		char *str, *tok, *end, *ptr;
 		int count = 0;
 
@@ -963,40 +943,26 @@ static void subn_verify_sl2vl(IN char *sl2vl, IN char *key)
 		while (tok) {
 			long vl = strtol(tok, &end, 0);
 
-			if (*end) {
-				sprintf(buff,
+			if (*end)
+				log_report(
 					" Warning: Cached Option %s:vl=%s improperly formatted\n",
 					key, tok);
-				printf(buff);
-				cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
-					     0);
-			} else if (vl < 0 || vl > 15) {
-				sprintf(buff,
+			else if (vl < 0 || vl > 15)
+				log_report(
 					" Warning: Cached Option %s:vl=%ld out of range\n",
 					key, vl);
-				printf(buff);
-				cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL,
-					     0);
-			}
 
 			count++;
 			tok = strtok_r(NULL, ",\n", &ptr);
 		}
 
-		if (count < 16) {
-			sprintf(buff,
-				" Warning: Cached Option %s: < 16 VLs listed\n",
-				key);
-			printf(buff);
-			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
-		}
-		if (count > 16) {
-			sprintf(buff,
-				" Warning: Cached Option %s: > 16 listed: "
-				"excess VLs will be dropped\n", key);
-			printf(buff);
-			cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
-		}
+		if (count < 16)
+			log_report(" Warning: Cached Option %s: < 16 VLs "
+				   "listed\n", key);
+
+		if (count > 16)
+			log_report(" Warning: Cached Option %s: > 16 listed: "
+				   "excess VLs will be dropped\n", key);
 
 		free(str);
 	}
@@ -1004,33 +970,24 @@ static void subn_verify_sl2vl(IN char *sl2vl, IN char *key)
 
 static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
 {
-	char buff[128];
-
 	if (p_opts->lmc > 7) {
-		sprintf(buff, " Invalid Cached Option Value:lmc = %u:"
-			"Using Default:%u\n", p_opts->lmc, OSM_DEFAULT_LMC);
-		printf(buff);
-		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
+		log_report(" Invalid Cached Option Value:lmc = %u:"
+			   "Using Default:%u\n", p_opts->lmc, OSM_DEFAULT_LMC);
 		p_opts->lmc = OSM_DEFAULT_LMC;
 	}
 
 	if (15 < p_opts->sm_priority) {
-		sprintf(buff, " Invalid Cached Option Value:sm_priority = %u:"
-			"Using Default:%u\n",
-			p_opts->sm_priority, OSM_DEFAULT_SM_PRIORITY);
-		printf(buff);
-		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
+		log_report(" Invalid Cached Option Value:sm_priority = %u:"
+			   "Using Default:%u\n",
+			   p_opts->sm_priority, OSM_DEFAULT_SM_PRIORITY);
 		p_opts->sm_priority = OSM_DEFAULT_SM_PRIORITY;
 	}
 
 	if ((15 < p_opts->force_link_speed) ||
 	    (p_opts->force_link_speed > 7 && p_opts->force_link_speed < 15)) {
-		sprintf(buff,
-			" Invalid Cached Option Value:force_link_speed = %u:"
-			"Using Default:%u\n", p_opts->force_link_speed,
-			IB_PORT_LINK_SPEED_ENABLED_MASK);
-		printf(buff);
-		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
+		log_report(" Invalid Cached Option Value:force_link_speed = %u:"
+			   "Using Default:%u\n", p_opts->force_link_speed,
+			   IB_PORT_LINK_SPEED_ENABLED_MASK);
 		p_opts->force_link_speed = IB_PORT_LINK_SPEED_ENABLED_MASK;
 	}
 
@@ -1041,11 +998,9 @@ static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
 	    && strcmp(p_opts->console, OSM_REMOTE_CONSOLE)
 #endif
 	    ) {
-		sprintf(buff, " Invalid Cached Option Value:console = %s"
-			", Using Default:%s\n",
-			p_opts->console, OSM_DEFAULT_CONSOLE);
-		printf(buff);
-		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
+		log_report(" Invalid Cached Option Value:console = %s"
+			   ", Using Default:%s\n",
+			   p_opts->console, OSM_DEFAULT_CONSOLE);
 		p_opts->console = OSM_DEFAULT_CONSOLE;
 	}
 
@@ -1108,22 +1063,18 @@ static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
 	}
 #ifdef ENABLE_OSM_PERF_MGR
 	if (p_opts->perfmgr_sweep_time_s < 1) {
-		sprintf(buff,
-			" Invalid Cached Option Value:perfmgr_sweep_time_s = %u"
-			"Using Default:%u\n", p_opts->perfmgr_sweep_time_s,
-			OSM_PERFMGR_DEFAULT_SWEEP_TIME_S);
-		printf(buff);
-		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
+		log_report(" Invalid Cached Option Value:perfmgr_sweep_time_s "
+			   "= %u Using Default:%u\n",
+			   p_opts->perfmgr_sweep_time_s,
+			   OSM_PERFMGR_DEFAULT_SWEEP_TIME_S);
 		p_opts->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S;
 	}
 	if (p_opts->perfmgr_max_outstanding_queries < 1) {
-		sprintf(buff,
-			" Invalid Cached Option Value:perfmgr_max_outstanding_queries = %u"
-			"Using Default:%u\n",
-			p_opts->perfmgr_max_outstanding_queries,
-			OSM_PERFMGR_DEFAULT_MAX_OUTSTANDING_QUERIES);
-		printf(buff);
-		cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0);
+		log_report(" Invalid Cached Option Value:"
+			   "perfmgr_max_outstanding_queries = %u"
+			   " Using Default:%u\n",
+			   p_opts->perfmgr_max_outstanding_queries,
+			   OSM_PERFMGR_DEFAULT_MAX_OUTSTANDING_QUERIES);
 		p_opts->perfmgr_max_outstanding_queries =
 		    OSM_PERFMGR_DEFAULT_MAX_OUTSTANDING_QUERIES;
 	}
-- 
1.6.0.3.517.g759a


From sashak at voltaire.com  Wed Nov 12 16:04:10 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Nov 2008 02:04:10 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_subnet.c: consolidate logging
	code
In-Reply-To: <20081113000336.GE27271@sashak.voltaire.com>
References: <1225404081.1197.534.camel@cardanus.llnl.gov>
	<20081110210233.GE3467@sashak.voltaire.com>
	<1226351730.13603.27.camel@cardanus.llnl.gov>
	<1226353273.13603.39.camel@cardanus.llnl.gov>
	<20081111202648.GB8894@sashak.voltaire.com>
	<20081113000336.GE27271@sashak.voltaire.com>
Message-ID: <20081113000410.GF27271@sashak.voltaire.com>

>From c7fd1c7668acc5f5c1819f23b35a0baad0c09045 Mon Sep 17 00:00:00 2001
From: Sasha Khapyorsky <sashak at voltaire.com>
Date: Thu, 13 Nov 2008 01:20:07 +0200
Subject: [PATCH] opensm/osm_subnet.c: use strdup() function

Instead of malloc() and strcpy() use strdup() function.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_subnet.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)
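
Note (illustrative only, not part of the patch): strdup() returns NULL
when the allocation fails, just as malloc() does, so a caller that wants
to be defensive can check the result before tokenizing. The sketch below
assumes a POSIX environment; sketch_count_tokens() is a hypothetical
helper that mirrors the copy-then-strtok_r() pattern used in the verify
functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Count the comma-separated tokens in list without modifying the
 * caller's string.  Returns -1 if the private copy cannot be allocated. */
static int sketch_count_tokens(const char *list)
{
	char *str, *tok, *ptr;
	int count = 0;

	assert(list != NULL);

	str = strdup(list);	/* replaces malloc(strlen() + 1) + strcpy() */
	if (!str)
		return -1;

	for (tok = strtok_r(str, ",\n", &ptr); tok;
	     tok = strtok_r(NULL, ",\n", &ptr))
		count++;

	free(str);
	return count;
}
```

The copy is needed because strtok_r() writes NUL bytes into the string
it scans, and the option value may point to a static default.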

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 666c93c..cd8c8e5 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -611,8 +611,7 @@ opts_unpack_charp(IN char *p_req_key,
 				  Ignore the possible memory leak here;
 				  the pointer may be to a static default.
 				*/
-				*p_val = (char *)malloc(strlen(p_val_str) + 1);
-				strcpy(*p_val, p_val_str);
+				*p_val = strdup(p_val_str);
 			}
 		}
 	}
@@ -874,8 +873,7 @@ static void subn_verify_vlarb(IN char *vlarb, IN char *key)
 		char *str, *tok, *end, *ptr;
 		int count = 0;
 
-		str = (char *)malloc(strlen(vlarb) + 1);
-		strcpy(str, vlarb);
+		str = strdup(vlarb);
 
 		tok = strtok_r(str, ",\n", &ptr);
 		while (tok) {
@@ -936,8 +934,7 @@ static void subn_verify_sl2vl(IN char *sl2vl, IN char *key)
 		char *str, *tok, *end, *ptr;
 		int count = 0;
 
-		str = (char *)malloc(strlen(sl2vl) + 1);
-		strcpy(str, sl2vl);
+		str = strdup(sl2vl);
 
 		tok = strtok_r(str, ",\n", &ptr);
 		while (tok) {
-- 
1.6.0.3.517.g759a


From sashak at voltaire.com  Wed Nov 12 16:05:28 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Nov 2008 02:05:28 +0200
Subject: [ofa-general] [PATCH] opensm/osm_subnet.c: consolidate qos
	parameters verification code
In-Reply-To: <20081113000336.GE27271@sashak.voltaire.com>
References: <1225404081.1197.534.camel@cardanus.llnl.gov>
	<20081110210233.GE3467@sashak.voltaire.com>
	<1226351730.13603.27.camel@cardanus.llnl.gov>
	<1226353273.13603.39.camel@cardanus.llnl.gov>
	<20081111202648.GB8894@sashak.voltaire.com>
	<20081113000336.GE27271@sashak.voltaire.com>
Message-ID: <20081113000528.GG27271@sashak.voltaire.com>


Consolidate qos config parameters verification code.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_subnet.c |  150 +++++++++++++++++---------------------------
 1 files changed, 58 insertions(+), 92 deletions(-)
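
Note (illustrative only, not part of the patch): the idea is that each
check receives the option set's prefix ("qos", "qos_ca", ...) and builds
the full option name inside the message, so one helper can verify any of
the five sets. A reduced sketch with two numeric fields follows; the
struct, the function name and the SKETCH_* defaults are hypothetical
placeholders, not OpenSM's definitions.

```c
#include <assert.h>
#include <stdio.h>

#define SKETCH_DEFAULT_MAX_VLS 15u	/* placeholder default */
#define SKETCH_DEFAULT_HIGH_LIMIT 0u	/* placeholder default */

struct sketch_qos_set {
	unsigned max_vls;
	unsigned high_limit;
};

/* Verify one option set; prefix selects the name printed in warnings.
 * Returns how many fields were reset to their defaults. */
static int sketch_verify_set(struct sketch_qos_set *set, const char *prefix)
{
	int fixed = 0;

	assert(set != NULL && prefix != NULL);

	if (set->max_vls > 15) {
		printf(" Invalid Cached Option:%s_max_vls=%u:"
		       "Using Default:%u\n",
		       prefix, set->max_vls, SKETCH_DEFAULT_MAX_VLS);
		set->max_vls = SKETCH_DEFAULT_MAX_VLS;
		fixed++;
	}

	if (set->high_limit > 255) {
		printf(" Invalid Cached Option:%s_high_limit=%u:"
		       "Using Default:%u\n",
		       prefix, set->high_limit, SKETCH_DEFAULT_HIGH_LIMIT);
		set->high_limit = SKETCH_DEFAULT_HIGH_LIMIT;
		fixed++;
	}

	return fixed;
}
```

With this shape, the caller needs one call per set instead of one call
per field per set.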

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index cd8c8e5..006d14e 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -847,27 +847,28 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
 /**********************************************************************
  **********************************************************************/
 
-static void subn_verify_max_vls(IN unsigned *max_vls, IN char *key)
+static void subn_verify_max_vls(unsigned *max_vls, const char *prefix)
 {
 	if (*max_vls > 15) {
-		log_report(" Invalid Cached Option:%s=%u:"
+		log_report(" Invalid Cached Option:%s_max_vls=%u:"
 			   "Using Default:%u\n",
-			   key, *max_vls, OSM_DEFAULT_QOS_MAX_VLS);
+			   prefix, *max_vls, OSM_DEFAULT_QOS_MAX_VLS);
 		*max_vls = OSM_DEFAULT_QOS_MAX_VLS;
 	}
 }
 
-static void subn_verify_high_limit(IN unsigned *high_limit, IN char *key)
+static void subn_verify_high_limit(unsigned *high_limit, const char *prefix)
 {
 	if (*high_limit > 255) {
-		log_report(" Invalid Cached Option:%s=%u:"
+		log_report(" Invalid Cached Option:%s_high_limit=%u:"
 			   "Using Default:%u\n",
-			   key, *high_limit, OSM_DEFAULT_QOS_HIGH_LIMIT);
+			   prefix, *high_limit, OSM_DEFAULT_QOS_HIGH_LIMIT);
 		*high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT;
 	}
 }
 
-static void subn_verify_vlarb(IN char *vlarb, IN char *key)
+static void subn_verify_vlarb(char *vlarb, const char *prefix,
+			      const char *suffix)
 {
 	if (vlarb) {
 		char *str, *tok, *end, *ptr;
@@ -891,44 +892,48 @@ static void subn_verify_vlarb(IN char *vlarb, IN char *key)
 				vl = strtol(vl_str, &end, 0);
 
 				if (*end)
-					log_report(
-						" Warning: Cached Option %s:vl=%s improperly formatted\n",
-						key, vl_str);
+					log_report(" Warning: Cached Option "
+						   "%s_vlarb_%s:vl=%s "
+						   "improperly formatted\n",
+						   prefix, suffix, vl_str);
 				else if (vl < 0 || vl > 14)
-					log_report(
-						" Warning: Cached Option %s:vl=%ld out of range\n",
-						key, vl);
+					log_report(" Warning: Cached Option "
+						   "%s_vlarb_%s:vl=%ld out "
+						   "of range\n",
+						   prefix, suffix, vl);
 
 				weight = strtol(weight_str, &end, 0);
 
 				if (*end)
-					log_report(
-						" Warning: Cached Option %s:weight=%s improperly formatted\n",
-						key, weight_str);
+					log_report(" Warning: Cached Option "
+						   "%s_vlarb_%s:weight=%s "
+						   "improperly formatted\n",
+						   prefix, suffix, weight_str);
 				else if (weight < 0 || weight > 255)
-					log_report(
-						" Warning: Cached Option %s:weight=%ld out of range\n",
-						key, weight);
+					log_report(" Warning: Cached Option "
+						   "%s_vlarb_%s:weight=%ld "
+						   "out of range\n",
+						   prefix, suffix, weight);
 			} else
-				log_report(
-					" Warning: Cached Option %s:vl:weight=%s improperly formatted\n",
-					key, tok);
+				log_report(" Warning: Cached Option "
+					   "%s_vlarb_%s:vl:weight=%s "
+					   "improperly formatted\n",
+					   prefix, suffix, tok);
 
 			count++;
 			tok = strtok_r(NULL, ",\n", &ptr);
 		}
 
 		if (count > 64)
-			log_report(
-				" Warning: Cached Option %s: > 64 listed: "
-				"excess vl:weight pairs will be dropped\n",
-				key);
+			log_report(" Warning: Cached Option %s_vlarb_%s: "
+				   "> 64 listed: excess vl:weight pairs "
+				   "will be dropped\n", prefix, suffix);
 
 		free(str);
 	}
 }
 
-static void subn_verify_sl2vl(IN char *sl2vl, IN char *key)
+static void subn_verify_sl2vl(char *sl2vl, const char *prefix)
 {
 	if (sl2vl) {
 		char *str, *tok, *end, *ptr;
@@ -941,30 +946,40 @@ static void subn_verify_sl2vl(IN char *sl2vl, IN char *key)
 			long vl = strtol(tok, &end, 0);
 
 			if (*end)
-				log_report(
-					" Warning: Cached Option %s:vl=%s improperly formatted\n",
-					key, tok);
+				log_report(" Warning: Cached Option %s_sl2vl:"
+					   "vl=%s improperly formatted\n",
+					   prefix, tok);
 			else if (vl < 0 || vl > 15)
-				log_report(
-					" Warning: Cached Option %s:vl=%ld out of range\n",
-					key, vl);
+				log_report(" Warning: Cached Option %s_sl2vl:"
+					   "vl=%ld out of range\n",
+					   prefix, vl);
 
 			count++;
 			tok = strtok_r(NULL, ",\n", &ptr);
 		}
 
 		if (count < 16)
-			log_report(" Warning: Cached Option %s: < 16 VLs "
-				   "listed\n", key);
+			log_report(" Warning: Cached Option %s_sl2vl: < 16 VLs "
+				   "listed\n", prefix);
 
 		if (count > 16)
-			log_report(" Warning: Cached Option %s: > 16 listed: "
-				   "excess VLs will be dropped\n", key);
+			log_report(" Warning: Cached Option %s_sl2vl: "
+				   "> 16 listed: excess VLs will be dropped\n",
+				   prefix);
 
 		free(str);
 	}
 }
 
+static void subn_verify_qos_set(osm_qos_options_t *set, const char *prefix)
+{
+	subn_verify_max_vls(&set->max_vls, prefix);
+	subn_verify_high_limit(&set->high_limit, prefix);
+	subn_verify_vlarb(set->vlarb_low, prefix, "low");
+	subn_verify_vlarb(set->vlarb_high, prefix, "high");
+	subn_verify_sl2vl(set->sl2vl, prefix);
+}
+
 static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
 {
 	if (p_opts->lmc > 7) {
@@ -1002,62 +1017,13 @@ static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
 	}
 
 	if (p_opts->qos) {
-		subn_verify_max_vls(&(p_opts->qos_options.max_vls),
-				    "qos_max_vls");
-		subn_verify_max_vls(&(p_opts->qos_ca_options.max_vls),
-				    "qos_ca_max_vls");
-		subn_verify_max_vls(&(p_opts->qos_sw0_options.max_vls),
-				    "qos_sw0_max_vls");
-		subn_verify_max_vls(&(p_opts->qos_swe_options.max_vls),
-				    "qos_swe_max_vls");
-		subn_verify_max_vls(&(p_opts->qos_rtr_options.max_vls),
-				    "qos_rtr_max_vls");
-
-		subn_verify_high_limit(&(p_opts->qos_options.high_limit),
-				       "qos_high_limit");
-		subn_verify_high_limit(&(p_opts->qos_ca_options.high_limit),
-				       "qos_ca_high_limit");
-		subn_verify_high_limit(&
-				       (p_opts->qos_sw0_options.high_limit),
-				       "qos_sw0_high_limit");
-		subn_verify_high_limit(&
-				       (p_opts->qos_swe_options.high_limit),
-				       "qos_swe_high_limit");
-		subn_verify_high_limit(&
-				       (p_opts->qos_rtr_options.high_limit),
-				       "qos_rtr_high_limit");
-
-		subn_verify_vlarb(p_opts->qos_options.vlarb_low,
-				  "qos_vlarb_low");
-		subn_verify_vlarb(p_opts->qos_ca_options.vlarb_low,
-				  "qos_ca_vlarb_low");
-		subn_verify_vlarb(p_opts->qos_sw0_options.vlarb_low,
-				  "qos_sw0_vlarb_low");
-		subn_verify_vlarb(p_opts->qos_swe_options.vlarb_low,
-				  "qos_swe_vlarb_low");
-		subn_verify_vlarb(p_opts->qos_rtr_options.vlarb_low,
-				  "qos_rtr_vlarb_low");
-
-		subn_verify_vlarb(p_opts->qos_options.vlarb_high,
-				  "qos_vlarb_high");
-		subn_verify_vlarb(p_opts->qos_ca_options.vlarb_high,
-				  "qos_ca_vlarb_high");
-		subn_verify_vlarb(p_opts->qos_sw0_options.vlarb_high,
-				  "qos_sw0_vlarb_high");
-		subn_verify_vlarb(p_opts->qos_swe_options.vlarb_high,
-				  "qos_swe_vlarb_high");
-		subn_verify_vlarb(p_opts->qos_rtr_options.vlarb_high,
-				  "qos_rtr_vlarb_high");
-
-		subn_verify_sl2vl(p_opts->qos_options.sl2vl, "qos_sl2vl");
-		subn_verify_sl2vl(p_opts->qos_ca_options.sl2vl, "qos_ca_sl2vl");
-		subn_verify_sl2vl(p_opts->qos_sw0_options.sl2vl,
-				  "qos_sw0_sl2vl");
-		subn_verify_sl2vl(p_opts->qos_swe_options.sl2vl,
-				  "qos_swe_sl2vl");
-		subn_verify_sl2vl(p_opts->qos_rtr_options.sl2vl,
-				  "qos_rtr_sl2vl");
+		subn_verify_qos_set(&p_opts->qos_options, "qos");
+		subn_verify_qos_set(&p_opts->qos_ca_options, "qos_ca");
+		subn_verify_qos_set(&p_opts->qos_sw0_options, "qos_sw0");
+		subn_verify_qos_set(&p_opts->qos_swe_options, "qos_swe");
+		subn_verify_qos_set(&p_opts->qos_rtr_options, "qos_rtr");
 	}
+
 #ifdef ENABLE_OSM_PERF_MGR
 	if (p_opts->perfmgr_sweep_time_s < 1) {
 		log_report(" Invalid Cached Option Value:perfmgr_sweep_time_s "
-- 
1.6.0.3.517.g759a


From sashak at voltaire.com  Wed Nov 12 16:19:44 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Nov 2008 02:19:44 +0200
Subject: [ofa-general] [PATCH] opensm/osm_subnet.c: move
	osm_subn_rescan_conf_files() function
In-Reply-To: <20081113000528.GG27271@sashak.voltaire.com>
References: <1225404081.1197.534.camel@cardanus.llnl.gov>
	<20081110210233.GE3467@sashak.voltaire.com>
	<1226351730.13603.27.camel@cardanus.llnl.gov>
	<1226353273.13603.39.camel@cardanus.llnl.gov>
	<20081111202648.GB8894@sashak.voltaire.com>
	<20081113000336.GE27271@sashak.voltaire.com>
	<20081113000528.GG27271@sashak.voltaire.com>
Message-ID: <20081113001944.GH27271@sashak.voltaire.com>


Move osm_subn_rescan_conf_files() function.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_subnet.c |  116 +++++++++++++++++++++-----------------------
 1 files changed, 56 insertions(+), 60 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 006d14e..8569043 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -71,8 +71,6 @@
 
 static const char null_str[] = "(null)";
 
-static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts);
-
 /**********************************************************************
  **********************************************************************/
 void osm_subn_construct(IN osm_subn_t * const p_subn)
@@ -788,64 +786,6 @@ osm_parse_prefix_routes_file(IN osm_subn_t * const p_subn)
 
 /**********************************************************************
  **********************************************************************/
-int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
-{
-	FILE *opts_file;
-	char line[1024];
-	char *p_key, *p_val, *p_last;
-
-	if (!p_subn->opt.config_file)
-		return 0;
-
-	opts_file = fopen(p_subn->opt.config_file, "r");
-	if (!opts_file) {
-		if (errno == ENOENT)
-			return 1;
-		OSM_LOG(&p_subn->p_osm->log, OSM_LOG_ERROR,
-			"cannot open file \'%s\': %s\n",
-			p_subn->opt.config_file, strerror(errno));
-		return -1;
-	}
-
-	while (fgets(line, 1023, opts_file) != NULL) {
-		/* get the first token */
-		p_key = strtok_r(line, " \t\n", &p_last);
-		if (p_key) {
-			p_val = strtok_r(NULL, " \t\n", &p_last);
-
-			subn_parse_qos_options("qos",
-					       p_key, p_val,
-					       &p_subn->opt.qos_options);
-
-			subn_parse_qos_options("qos_ca",
-					       p_key, p_val,
-					       &p_subn->opt.qos_ca_options);
-
-			subn_parse_qos_options("qos_sw0",
-					       p_key, p_val,
-					       &p_subn->opt.qos_sw0_options);
-
-			subn_parse_qos_options("qos_swe",
-					       p_key, p_val,
-					       &p_subn->opt.qos_swe_options);
-
-			subn_parse_qos_options("qos_rtr",
-					       p_key, p_val,
-					       &p_subn->opt.qos_rtr_options);
-
-		}
-	}
-	fclose(opts_file);
-
-	subn_verify_conf_file(&p_subn->opt);
-
-	osm_parse_prefix_routes_file(p_subn);
-
-	return 0;
-}
-
-/**********************************************************************
- **********************************************************************/
 
 static void subn_verify_max_vls(unsigned *max_vls, const char *prefix)
 {
@@ -1308,6 +1248,62 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts)
 	return 0;
 }
 
+int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
+{
+	FILE *opts_file;
+	char line[1024];
+	char *p_key, *p_val, *p_last;
+
+	if (!p_subn->opt.config_file)
+		return 0;
+
+	opts_file = fopen(p_subn->opt.config_file, "r");
+	if (!opts_file) {
+		if (errno == ENOENT)
+			return 1;
+		OSM_LOG(&p_subn->p_osm->log, OSM_LOG_ERROR,
+			"cannot open file \'%s\': %s\n",
+			p_subn->opt.config_file, strerror(errno));
+		return -1;
+	}
+
+	while (fgets(line, 1023, opts_file) != NULL) {
+		/* get the first token */
+		p_key = strtok_r(line, " \t\n", &p_last);
+		if (p_key) {
+			p_val = strtok_r(NULL, " \t\n", &p_last);
+
+			subn_parse_qos_options("qos",
+					       p_key, p_val,
+					       &p_subn->opt.qos_options);
+
+			subn_parse_qos_options("qos_ca",
+					       p_key, p_val,
+					       &p_subn->opt.qos_ca_options);
+
+			subn_parse_qos_options("qos_sw0",
+					       p_key, p_val,
+					       &p_subn->opt.qos_sw0_options);
+
+			subn_parse_qos_options("qos_swe",
+					       p_key, p_val,
+					       &p_subn->opt.qos_swe_options);
+
+			subn_parse_qos_options("qos_rtr",
+					       p_key, p_val,
+					       &p_subn->opt.qos_rtr_options);
+
+		}
+	}
+	fclose(opts_file);
+
+	subn_verify_conf_file(&p_subn->opt);
+
+	osm_parse_prefix_routes_file(p_subn);
+
+	return 0;
+}
+
 /**********************************************************************
  **********************************************************************/
 int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
-- 
1.6.0.3.517.g759a


From sashak at voltaire.com  Wed Nov 12 16:24:03 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Nov 2008 02:24:03 +0200
Subject: [ofa-general] Re: [opensm patch][1/2] fix qos config parsing bugs
In-Reply-To: <1226447872.6239.2.camel@cardanus.llnl.gov>
References: <1225404078.1197.533.camel@cardanus.llnl.gov>
	<20081111191958.GA8894@sashak.voltaire.com>
	<1226447872.6239.2.camel@cardanus.llnl.gov>
Message-ID: <20081113002403.GI27271@sashak.voltaire.com>

Hi Al,

On 15:57 Tue 11 Nov     , Al Chu wrote:
> 
> Sorry, I may not have explained it well. Let's say I do this in the
> config file.
> 
> qos_vlarb_high FOOBAR
> # qos_ca_vlarb_high BLAH
> qos_swe_vlarb_high XYZZY
> 
> I currently expect qos_ca_vlarb_high to use the value of FOOBAR because
> I commented out the field.  But it uses OSM_DEFAULT_QOS_HIGH_LIMIT
> instead.  The reason is because qos_build_config() checks for NULL to
> use default vs. non-default values.
> 
> p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;
> 
> Under the above situation where I've commented out several fields,
> opt->vlarb_high is always non-NULL b/c it was set to
> OSM_DEFAULT_QOS_HIGH_LIMIT. Thus OSM_DEFAULT_QOS_HIGH_LIMIT is used
> instead of FOOBAR.
> 
> > > 2)
> > > 
> > > In qos_build_config() we load the high_limit like this:
> > > 
> > > cfg->vl_high_limit = (uint8_t) opt->high_limit;
> > > 
> > > So there is no way to tell the qos_ca, qos_swe, qos_rtr, etc. high_limit
> > > options to "go back to" the default high_limit.  It just assumes that
> > > whatever is input (or was set by default) is what you should use.
> > 
> > Right. What is the limitation here? That a user cannot set this to
> > "no value"? But she/he can just skip it.
> 
> Similar to the above issue, let's say I want to do:
> 
> qos_high_limit 8
> # qos_ca_high_limit 15
> # qos_swe_high_limit 15
> 
> I want qos_ca_high_limit and qos_swe_high_limit to use whatever I set in
> qos_high_limit.  But the code doesn't allow for this.
> 
> > 
> > > 3)
> > > 
> > > Some fields like qos_vlarb_high are assumed to be correctly set and can
> > > segfault opensm.
> > 
> > qos_build_config() assumes that valid parameters are used. And we are
> > using it this way (I hope :)) (after all, it is not a library API).
> 
> I think the issue is the osm_subnet.c code did not properly check all
> inputs, and subsequently some inputs used in qos_build_config() were
> bad.  I think
> 
> qos_vlarb_high (null)
> 
> was something I tried that opensm seg-faulted on.  

Ok. I see now.

Probably it would be simpler just to generate valid qos parameter sets
right after the parser (at verification time)? Like in your modified (and
rebased against recent patches) patch below?

Sasha


>From a973a8a1ea6c805cf07965d86731ae04510266ce Mon Sep 17 00:00:00 2001
From: Al Chu <chu11 at llnl.gov>
Date: Mon, 10 Nov 2008 13:41:04 -0800
Subject: [PATCH] fix qos config parsing bugs

Signed-off-by: Albert Chu <chu11 at llnl.gov>
Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_subnet.h |   12 +-
 opensm/opensm/osm_qos.c            |    6 +-
 opensm/opensm/osm_subnet.c         |  298 ++++++++++++++++++++---------------
 3 files changed, 181 insertions(+), 135 deletions(-)
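
Note (illustrative only, not part of the patch): the fallback discussed
in this thread relies on "unset" sentinels, which the header comments in
this patch give as 0 for max_vls, -1 for high_limit and NULL for the
string templates. A per-type set left at its sentinel then takes the
value from the general "qos" set. The sketch below shows that resolution
step in isolation; the struct and sketch_resolve() are hypothetical
names, and the struct is reduced to three fields.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_qos_opts {
	unsigned max_vls;	/* 0 == not set */
	int high_limit;		/* -1 == not set */
	const char *vlarb_high;	/* NULL == not set */
};

/* Fill every field of opt that is still at its "not set" sentinel with
 * the corresponding value from dflt.  Explicit values are kept. */
static void sketch_resolve(struct sketch_qos_opts *opt,
			   const struct sketch_qos_opts *dflt)
{
	assert(opt != NULL && dflt != NULL);

	if (opt->max_vls == 0)
		opt->max_vls = dflt->max_vls;
	if (opt->high_limit < 0)
		opt->high_limit = dflt->high_limit;
	if (!opt->vlarb_high)
		opt->vlarb_high = dflt->vlarb_high;
}
```

With the example config from the discussion (qos_vlarb_high FOOBAR,
qos_ca_vlarb_high commented out, qos_swe_vlarb_high XYZZY), the "qos_ca"
set would resolve to FOOBAR and the "qos_swe" set would keep XYZZY.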

diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index a16cbce..2bcd232 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -100,7 +100,7 @@ struct osm_qos_policy;
 */
 typedef struct osm_qos_options {
 	unsigned max_vls;
-	unsigned high_limit;
+	int high_limit;
 	char *vlarb_high;
 	char *vlarb_low;
 	char *sl2vl;
@@ -109,20 +109,20 @@ typedef struct osm_qos_options {
 * FIELDS
 *
 *	max_vls
-*		The number of maximum VLs on the Subnet
+*		The number of maximum VLs on the Subnet (0 == use default)
 *
 *	high_limit
 *		The limit of High Priority component of VL Arbitration
-*		table (IBA 7.6.9)
+*		table (IBA 7.6.9) (-1 == use default)
 *
 *	vlarb_high
-*		High priority VL Arbitration table template.
+*		High priority VL Arbitration table template. (NULL == use default)
 *
 *	vlarb_low
-*		Low priority VL Arbitration table template.
+*		Low priority VL Arbitration table template. (NULL == use default)
 *
 *	sl2vl
-*		SL2VL Mapping table (IBA 7.6.6) template.
+*		SL2VL Mapping table (IBA 7.6.6) template. (NULL == use default)
 *
 *********/
 
diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
index 1679ae0..b451c25 100644
--- a/opensm/opensm/osm_qos.c
+++ b/opensm/opensm/osm_qos.c
@@ -382,7 +382,11 @@ static void qos_build_config(struct qos_config *cfg,
 	memset(cfg, 0, sizeof(*cfg));
 
 	cfg->max_vls = opt->max_vls > 0 ? opt->max_vls : dflt->max_vls;
-	cfg->vl_high_limit = (uint8_t) opt->high_limit;
+
+	if (opt->high_limit >= 0)
+		cfg->vl_high_limit = (uint8_t) opt->high_limit;
+	else
+		cfg->vl_high_limit = (uint8_t) dflt->high_limit;
 
 	p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;
 	for (i = 0; i < 2 * IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; i++) {
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 8569043..1c9777e 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -370,6 +370,15 @@ static void subn_set_default_qos_options(IN osm_qos_options_t * opt)
 	opt->sl2vl = OSM_DEFAULT_QOS_SL2VL;
 }
 
+static void subn_init_qos_options(IN osm_qos_options_t * opt)
+{
+	opt->max_vls = 0;
+	opt->high_limit = -1;
+	opt->vlarb_high = NULL;
+	opt->vlarb_low = NULL;
+	opt->sl2vl = NULL;
+}
+
 /**********************************************************************
  **********************************************************************/
 void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
@@ -457,11 +466,11 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
 	p_opt->no_clients_rereg = FALSE;
 	p_opt->prefix_routes_file = OSM_DEFAULT_PREFIX_ROUTES_FILE;
 	p_opt->consolidate_ipv6_snm_req = FALSE;
-	subn_set_default_qos_options(&p_opt->qos_options);
-	subn_set_default_qos_options(&p_opt->qos_ca_options);
-	subn_set_default_qos_options(&p_opt->qos_sw0_options);
-	subn_set_default_qos_options(&p_opt->qos_swe_options);
-	subn_set_default_qos_options(&p_opt->qos_rtr_options);
+	subn_init_qos_options(&p_opt->qos_options);
+	subn_init_qos_options(&p_opt->qos_ca_options);
+	subn_init_qos_options(&p_opt->qos_sw0_options);
+	subn_init_qos_options(&p_opt->qos_swe_options);
+	subn_init_qos_options(&p_opt->qos_rtr_options);
 }
 
 /**********************************************************************
@@ -526,6 +535,21 @@ opts_unpack_uint32(IN char *p_req_key,
 /**********************************************************************
  **********************************************************************/
 static void
+opts_unpack_int32(IN char *p_req_key,
+		  IN char *p_key, IN char *p_val_str, IN int32_t * p_val)
+{
+	if (!strcmp(p_req_key, p_key)) {
+		int32_t val = strtol(p_val_str, NULL, 0);
+		if (val != *p_val) {
+			log_config_value(p_key, "%d", val);
+			*p_val = val;
+		}
+	}
+}
+
+/**********************************************************************
+ **********************************************************************/
+static void
 opts_unpack_uint16(IN char *p_req_key,
 		   IN char *p_key, IN char *p_val_str, IN uint16_t * p_val)
 {
@@ -651,7 +675,7 @@ subn_parse_qos_options(IN const char *prefix,
 	snprintf(name, sizeof(name), "%s_max_vls", prefix);
 	opts_unpack_uint32(name, p_key, p_val_str, &opt->max_vls);
 	snprintf(name, sizeof(name), "%s_high_limit", prefix);
-	opts_unpack_uint32(name, p_key, p_val_str, &opt->high_limit);
+	opts_unpack_int32(name, p_key, p_val_str, &opt->high_limit);
 	snprintf(name, sizeof(name), "%s_vlarb_high", prefix);
 	opts_unpack_charp(name, p_key, p_val_str, &opt->vlarb_high);
 	snprintf(name, sizeof(name), "%s_vlarb_low", prefix);
@@ -786,138 +810,142 @@ osm_parse_prefix_routes_file(IN osm_subn_t * const p_subn)
 
 /**********************************************************************
  **********************************************************************/
-
-static void subn_verify_max_vls(unsigned *max_vls, const char *prefix)
+static void subn_verify_max_vls(unsigned *max_vls, const char *prefix, unsigned dflt)
 {
-	if (*max_vls > 15) {
-		log_report(" Invalid Cached Option:%s_max_vls=%u:"
-			   "Using Default:%u\n",
-			   prefix, *max_vls, OSM_DEFAULT_QOS_MAX_VLS);
-		*max_vls = OSM_DEFAULT_QOS_MAX_VLS;
+	if (!(*max_vls) || *max_vls > 15) {
+		log_report(" Invalid Cached Option: %s_max_vls=%u: "
+			   "Using Default = %u\n", prefix, *max_vls, dflt);
+		*max_vls = dflt;
 	}
 }
 
-static void subn_verify_high_limit(unsigned *high_limit, const char *prefix)
+static void subn_verify_high_limit(int *high_limit, const char *prefix, int dflt)
 {
-	if (*high_limit > 255) {
-		log_report(" Invalid Cached Option:%s_high_limit=%u:"
-			   "Using Default:%u\n",
-			   prefix, *high_limit, OSM_DEFAULT_QOS_HIGH_LIMIT);
-		*high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT;
+	if (*high_limit < 0 || *high_limit > 255) {
+		log_report(" Invalid Cached Option: %s_high_limit=%d: "
+			   "Using Default: %d\n", prefix, *high_limit, dflt);
+		*high_limit = dflt;
 	}
 }
 
-static void subn_verify_vlarb(char *vlarb, const char *prefix,
-			      const char *suffix)
+static void subn_verify_vlarb(char **vlarb, const char *prefix,
+			      const char *suffix, char *dflt)
 {
-	if (vlarb) {
-		char *str, *tok, *end, *ptr;
-		int count = 0;
-
-		str = strdup(vlarb);
-
-		tok = strtok_r(str, ",\n", &ptr);
-		while (tok) {
-			char *vl_str, *weight_str;
-
-			vl_str = tok;
-			weight_str = strchr(tok, ':');
-
-			if (weight_str) {
-				long vl, weight;
-
-				*weight_str = '\0';
-				weight_str++;
-
-				vl = strtol(vl_str, &end, 0);
-
-				if (*end)
-					log_report(" Warning: Cached Option "
-						   "%s_vlarb_%s:vl=%s "
-						   "improperly formatted\n",
-						   prefix, suffix, vl_str);
-				else if (vl < 0 || vl > 14)
-					log_report(" Warning: Cached Option "
-						   "%s_vlarb_%s:vl=%ld out "
-						   "of range\n",
-						   prefix, suffix, vl);
-
-				weight = strtol(weight_str, &end, 0);
-
-				if (*end)
-					log_report(" Warning: Cached Option "
-						   "%s_vlarb_%s:weight=%s "
-						   "improperly formatted\n",
-						   prefix, suffix, weight_str);
-				else if (weight < 0 || weight > 255)
-					log_report(" Warning: Cached Option "
-						   "%s_vlarb_%s:weight=%ld "
-						   "out of range\n",
-						   prefix, suffix, weight);
-			} else
-				log_report(" Warning: Cached Option "
-					   "%s_vlarb_%s:vl:weight=%s "
-					   "improperly formatted\n",
-					   prefix, suffix, tok);
+	char *str, *tok, *end, *ptr;
+	int count = 0;
+
+	if (*vlarb == NULL) {
+		log_report(" Invalid Cached Option: %s_vlarb_%s: "
+			   "Using Default\n", prefix, suffix);
+		*vlarb = dflt;
+		return;
+	}
 
-			count++;
-			tok = strtok_r(NULL, ",\n", &ptr);
-		}
+	str = strdup(*vlarb);
+
+	tok = strtok_r(str, ",\n", &ptr);
+	while (tok) {
+		char *vl_str, *weight_str;
 
-		if (count > 64)
-			log_report(" Warning: Cached Option %s_vlarb_%s: "
-				   "> 64 listed: excess vl:weight pairs "
-				   "will be dropped\n", prefix, suffix);
+		vl_str = tok;
+		weight_str = strchr(tok, ':');
 
-		free(str);
+		if (weight_str) {
+			long vl, weight;
+
+			*weight_str = '\0';
+			weight_str++;
+
+			vl = strtol(vl_str, &end, 0);
+
+			if (*end)
+				log_report(" Warning: Cached Option "
+					   "%s_vlarb_%s:vl=%s"
+					   " improperly formatted\n",
+					   prefix, suffix, vl_str);
+			else if (vl < 0 || vl > 14)
+				log_report(" Warning: Cached Option "
+					   "%s_vlarb_%s:vl=%ld out of range\n",
+					   prefix, suffix, vl);
+
+			weight = strtol(weight_str, &end, 0);
+
+			if (*end)
+				log_report(" Warning: Cached Option "
+					   "%s_vlarb_%s:weight=%s "
+					   "improperly formatted\n",
+					   prefix, suffix, weight_str);
+			else if (weight < 0 || weight > 255)
+				log_report(" Warning: Cached Option "
+					   "%s_vlarb_%s:weight=%ld "
+					   "out of range\n",
+					   prefix, suffix, weight);
+		} else
+			log_report(" Warning: Cached Option "
+				   "%s_vlarb_%s:vl:weight=%s "
+				   "improperly formatted\n",
+				   prefix, suffix, tok);
+
+		count++;
+		tok = strtok_r(NULL, ",\n", &ptr);
 	}
+
+	if (count > 64)
+		log_report(" Warning: Cached Option %s_vlarb_%s: > 64 listed:"
+			   " excess vl:weight pairs will be dropped\n",
+			   prefix, suffix);
+
+	free(str);
 }
 
-static void subn_verify_sl2vl(char *sl2vl, const char *prefix)
+static void subn_verify_sl2vl(char **sl2vl, const char *prefix, char *dflt)
 {
-	if (sl2vl) {
-		char *str, *tok, *end, *ptr;
-		int count = 0;
+	char *str, *tok, *end, *ptr;
+	int count = 0;
+
+	if (*sl2vl == NULL) {
+		log_report(" Invalid Cached Option: %s_sl2vl: Using Default\n",
+			   prefix);
+		*sl2vl = dflt;
+		return;
+	}
 
-		str = strdup(sl2vl);
+	str = strdup(*sl2vl);
 
-		tok = strtok_r(str, ",\n", &ptr);
-		while (tok) {
-			long vl = strtol(tok, &end, 0);
+	tok = strtok_r(str, ",\n", &ptr);
+	while (tok) {
+		long vl = strtol(tok, &end, 0);
 
-			if (*end)
-				log_report(" Warning: Cached Option %s_sl2vl:"
-					   "vl=%s improperly formatted\n",
-					   prefix, tok);
-			else if (vl < 0 || vl > 15)
-				log_report(" Warning: Cached Option %s_sl2vl:"
-					   "vl=%ld out of range\n",
-					   prefix, vl);
-
-			count++;
-			tok = strtok_r(NULL, ",\n", &ptr);
-		}
+		if (*end)
+			log_report(" Warning: Cached Option %s_sl2vl:vl=%s "
+				   "improperly formatted\n", prefix, tok);
+		else if (vl < 0 || vl > 15)
+			log_report(" Warning: Cached Option %s_sl2vl:vl=%ld "
+				   "out of range\n", prefix, vl);
 
-		if (count < 16)
-			log_report(" Warning: Cached Option %s_sl2vl: < 16 VLs "
-				   "listed\n", prefix);
+		count++;
+		tok = strtok_r(NULL, ",\n", &ptr);
+	}
 
-		if (count > 16)
-			log_report(" Warning: Cached Option %s_sl2vl: "
-				   "> 16 listed: excess VLs will be dropped\n",
-				   prefix);
+	if (count < 16)
+		log_report(" Warning: Cached Option %s_sl2vl: < 16 VLs "
+			   "listed\n", prefix);
 
-		free(str);
-	}
+	if (count > 16)
+		log_report(" Warning: Cached Option %s_sl2vl: > 16 listed: "
+			   "excess VLs will be dropped\n", prefix);
+
+	free(str);
 }
 
-static void subn_verify_qos_set(osm_qos_options_t *set, const char *prefix)
+static void subn_verify_qos_set(osm_qos_options_t *set, const char *prefix,
+				osm_qos_options_t *dflt)
 {
-	subn_verify_max_vls(&set->max_vls, prefix);
-	subn_verify_high_limit(&set->high_limit, prefix);
-	subn_verify_vlarb(set->vlarb_low, prefix, "low");
-	subn_verify_vlarb(set->vlarb_high, prefix, "high");
-	subn_verify_sl2vl(set->sl2vl, prefix);
+	subn_verify_max_vls(&set->max_vls, prefix, dflt->max_vls);
+	subn_verify_high_limit(&set->high_limit, prefix, dflt->high_limit);
+	subn_verify_vlarb(&set->vlarb_low, prefix, "low", dflt->vlarb_low);
+	subn_verify_vlarb(&set->vlarb_high, prefix, "high", dflt->vlarb_high);
+	subn_verify_sl2vl(&set->sl2vl, prefix, dflt->sl2vl);
 }
 
 static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
@@ -957,11 +985,24 @@ static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
 	}
 
 	if (p_opts->qos) {
-		subn_verify_qos_set(&p_opts->qos_options, "qos");
-		subn_verify_qos_set(&p_opts->qos_ca_options, "qos_ca");
-		subn_verify_qos_set(&p_opts->qos_sw0_options, "qos_sw0");
-		subn_verify_qos_set(&p_opts->qos_swe_options, "qos_swe");
-		subn_verify_qos_set(&p_opts->qos_rtr_options, "qos_rtr");
+		osm_qos_options_t dflt;
+
+		/* The default options in qos_options must be correct.
+		 * The per-type option sets need not be, because their unset
+		 * fields fall back to whatever is in qos_options.
+		 */
+
+		subn_set_default_qos_options(&dflt);
+
+		subn_verify_qos_set(&p_opts->qos_options, "qos", &dflt);
+		subn_verify_qos_set(&p_opts->qos_ca_options, "qos_ca",
+				    &p_opts->qos_options);
+		subn_verify_qos_set(&p_opts->qos_sw0_options, "qos_sw0",
+				    &p_opts->qos_options);
+		subn_verify_qos_set(&p_opts->qos_swe_options, "qos_swe",
+				    &p_opts->qos_options);
+		subn_verify_qos_set(&p_opts->qos_rtr_options, "qos_rtr",
+				    &p_opts->qos_options);
 	}
 
 #ifdef ENABLE_OSM_PERF_MGR
@@ -1267,30 +1308,31 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
 		return -1;
 	}
 
+	subn_init_qos_options(&p_subn->opt.qos_options);
+	subn_init_qos_options(&p_subn->opt.qos_ca_options);
+	subn_init_qos_options(&p_subn->opt.qos_sw0_options);
+	subn_init_qos_options(&p_subn->opt.qos_swe_options);
+	subn_init_qos_options(&p_subn->opt.qos_rtr_options);
+
 	while (fgets(line, 1023, opts_file) != NULL) {
 		/* get the first token */
 		p_key = strtok_r(line, " \t\n", &p_last);
 		if (p_key) {
 			p_val = strtok_r(NULL, " \t\n", &p_last);
 
-			subn_parse_qos_options("qos",
-					       p_key, p_val,
+			subn_parse_qos_options("qos", p_key, p_val,
 					       &p_subn->opt.qos_options);
 
-			subn_parse_qos_options("qos_ca",
-					       p_key, p_val,
+			subn_parse_qos_options("qos_ca", p_key, p_val,
 					       &p_subn->opt.qos_ca_options);
 
-			subn_parse_qos_options("qos_sw0",
-					       p_key, p_val,
+			subn_parse_qos_options("qos_sw0", p_key, p_val,
 					       &p_subn->opt.qos_sw0_options);
 
-			subn_parse_qos_options("qos_swe",
-					       p_key, p_val,
+			subn_parse_qos_options("qos_swe", p_key, p_val,
 					       &p_subn->opt.qos_swe_options);
 
-			subn_parse_qos_options("qos_rtr",
-					       p_key, p_val,
+			subn_parse_qos_options("qos_rtr", p_key, p_val,
 					       &p_subn->opt.qos_rtr_options);
 
 		}
-- 
1.6.0.3.517.g759a
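
The idea behind the patch above, as far as it can be read from the diff, is that
every QoS option set starts out "unset", and any field the config file leaves
unset is later filled from a fallback set: the built-in defaults for qos_*, and
the resolved qos_* values for qos_ca_*, qos_sw0_*, qos_swe_* and qos_rtr_*. A
minimal sketch of that pattern follows. The struct and function names here are
hypothetical and trimmed down for illustration; this is not the OpenSM code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for osm_qos_options_t. */
struct qos_opts {
	unsigned max_vls;	/* 0 means "not set" */
	int high_limit;		/* -1 means "not set" */
	const char *sl2vl;	/* NULL means "not set" */
};

/* Mark every field as "not set" before the config file is parsed. */
static void qos_opts_init(struct qos_opts *opt)
{
	assert(opt != NULL);
	opt->max_vls = 0;
	opt->high_limit = -1;
	opt->sl2vl = NULL;
}

/* Fill each unset or out-of-range field of 'opt' from 'dflt'. */
static void qos_opts_resolve(struct qos_opts *opt, const struct qos_opts *dflt)
{
	assert(opt != NULL && dflt != NULL);
	if (opt->max_vls == 0 || opt->max_vls > 15)
		opt->max_vls = dflt->max_vls;
	if (opt->high_limit < 0 || opt->high_limit > 255)
		opt->high_limit = dflt->high_limit;
	if (opt->sl2vl == NULL)
		opt->sl2vl = dflt->sl2vl;
}
```

The order of resolution matters: the generic set is resolved against the
built-in defaults first, and only then is each per-type set resolved against
the generic one, so a value given once as qos_* reaches every per-type set
that does not override it.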


From sashak at voltaire.com  Wed Nov 12 16:39:12 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Nov 2008 02:39:12 +0200
Subject: [ofa-general] opensm: bad multicast forwarding table entries
In-Reply-To: <20081112235431.GG25248@sgi.com>
References: <20081112221846.GE25248@sgi.com>
	<f0e08f230811121527p258e47cft416f09a2b1e9ea14@mail.gmail.com>
	<20081112235431.GG25248@sgi.com>
Message-ID: <20081113003912.GJ27271@sashak.voltaire.com>

On 15:54 Wed 12 Nov     , akepner at sgi.com wrote:
> > ..... 
> > > So far, so good. But we also have r4i2n10, connected to the switch with
> > > lid 1533 port 7:
> > >
> > > switchguid=0x800690000002e50(800690000002e50)
> > > Switch  24 "S-0800690000002e50"         # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 1533 lmc 0
> > > ......
> > > [7]     "H-003048c2438a0000"[1](3048c2438a0001)                 # "r4i2n10 HCA-1" lid 771 4xDDR
> > >
> > > with this mft entry:
> > >
> > > Multicast mlids [0xc000-0xc3ff] of switch Lid 1533 guid 0x0800690000002e50 (MT47396 Infiniscale-III Mellanox Technologies):
> > >            0                   1                   2
> > >     Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
> > >  MLid
> > > .....
> > > 0xc069                    x
> > >
> > > Any idea why "r4i2n10", with PortGid fe80::3048c2438a0001 would have a
> > > mft entry for the multicast group with MGID ff12601bffff::1ff26d289?

Any chance that port "r4i2n10" joins MGID ff12601bffff::1ff26d289 as
non-member?

You can run OpenSM with the -V flag and track all joins.

Sasha


From hal.rosenstock at gmail.com  Wed Nov 12 18:08:09 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 12 Nov 2008 21:08:09 -0500
Subject: Re: [ofa-general] opensm: bad multicast forwarding table
	entries
In-Reply-To: <20081113003912.GJ27271@sashak.voltaire.com>
References: <20081112221846.GE25248@sgi.com>
	<f0e08f230811121527p258e47cft416f09a2b1e9ea14@mail.gmail.com>
	<20081112235431.GG25248@sgi.com>
	<20081113003912.GJ27271@sashak.voltaire.com>
Message-ID: <f0e08f230811121808w41594741h859e37feff777f75@mail.gmail.com>

On Wed, Nov 12, 2008 at 7:39 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 15:54 Wed 12 Nov     , akepner at sgi.com wrote:
>> > .....
>> > > So far, so good. But we also have r4i2n10, connected to the switch with
>> > > lid 1533 port 7:
>> > >
>> > > switchguid=0x800690000002e50(800690000002e50)
>> > > Switch  24 "S-0800690000002e50"         # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 1533 lmc 0
>> > > ......
>> > > [7]     "H-003048c2438a0000"[1](3048c2438a0001)                 # "r4i2n10 HCA-1" lid 771 4xDDR
>> > >
>> > > with this mft entry:
>> > >
>> > > Multicast mlids [0xc000-0xc3ff] of switch Lid 1533 guid 0x0800690000002e50 (MT47396 Infiniscale-III Mellanox Technologies):
>> > >            0                   1                   2
>> > >     Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
>> > >  MLid
>> > > .....
>> > > 0xc069                    x
>> > >
>> > > Any idea why "r4i2n10", with PortGid fe80::3048c2438a0001 would have a
>> > > mft entry for the multicast group with MGID ff12601bffff::1ff26d289?
>
> Any chance that port "r4i2n10" joins MGID ff12601bffff::1ff26d289 as
> non-member?

Wouldn't saquery -m show this member too? Arthur said there was only
1 member indicated.

-- Hal

> You can run OpenSM with the -V flag and track all joins.
>
> Sasha
>


From sashak at voltaire.com  Wed Nov 12 18:14:03 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Nov 2008 04:14:03 +0200
Subject: [ofa-general] [PATCH] opensm/osm_sa_mcmember_record: return a real
	port JoinState on update
Message-ID: <20081113021403.GK27271@sashak.voltaire.com>


When a port's JoinState is updated by an MCMember leave request, the
response should carry the real (new) JoinState. This fix addresses bug#1373.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_sa_mcmember_record.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
index 878d21e..4ca5896 100644
--- a/opensm/opensm/osm_sa_mcmember_record.c
+++ b/opensm/opensm/osm_sa_mcmember_record.c
@@ -1095,12 +1095,10 @@ __osm_mcmr_rcv_leave_mgrp(IN osm_sa_t * sa,
 		goto Exit;
 	}
 
-	mcmember_rec.scope_state = p_mcm_port->scope_state;
 	/* remove port or update join state */
 	removed = osm_mgrp_remove_port(sa->p_subn, sa->p_log, p_mgrp, p_mcm_port,
 				       p_recvd_mcmember_rec->scope_state&0x0F);
-	if (removed)
-		mcmember_rec.scope_state = p_mcm_port->scope_state;
+	mcmember_rec.scope_state = p_mcm_port->scope_state;
 
 	CL_PLOCK_RELEASE(sa->p_lock);
 
-- 
1.6.0.3.517.g759a


From sashak at voltaire.com  Wed Nov 12 18:34:51 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Nov 2008 04:34:51 +0200
Subject: [ofa-general] opensm: bad multicast forwarding table entries
In-Reply-To: <f0e08f230811121808w41594741h859e37feff777f75@mail.gmail.com>
References: <20081112221846.GE25248@sgi.com>
	<f0e08f230811121527p258e47cft416f09a2b1e9ea14@mail.gmail.com>
	<20081112235431.GG25248@sgi.com>
	<20081113003912.GJ27271@sashak.voltaire.com>
	<f0e08f230811121808w41594741h859e37feff777f75@mail.gmail.com>
Message-ID: <20081113023451.GL27271@sashak.voltaire.com>

On 21:08 Wed 12 Nov     , Hal Rosenstock wrote:
> >
> > Any chance that port "r4i2n10" joins MGID ff12601bffff::1ff26d289 as
> > non-member?
> 
> Wouldn't saquery -m show this member too? Arthur said there was only
> 1 member indicated.

Yes, I think you are right and it should. I need to check, though.

Sasha


From ogerlitz at voltaire.com  Wed Nov 12 23:20:57 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 13 Nov 2008 09:20:57 +0200 (IST)
Subject: [ofa-general] rate assignment for path queries
Message-ID: <Pine.LNX.4.64.0811130912140.27833@zuben.voltaire.com>

Hi Yevgeny,

If opensm has no match on any qos-assignment rule (e.g. when there's no
qos-config file), my understanding is that when it serves an SA path query,
the "qos related fields" of the partition are used.

For example, I have set the following partition config file which assigns
<sl=1,rate=2> to the 0x8001 partition, and run without any qos file.

Default=0x7fff,ipoib : ALL=full;
RED=0x8001, ipoib, sl=1, rate=2, defmember=full : ALL=full;
RED=0x8002, ipoib, sl=2, rate=3, defmember=full : ALL=full;

When a path query is issued, sl=1 is indeed returned, but I see that
rate=6 (20 Gb/s) is returned where I configured rate=2 (2.5 Gb/s).

Have I done anything wrong? Is it a known issue? What does it mean
when the SM prints "min rate = 6"?

Or.


Nov 13 02:12:49 219374 [42803940] 0x08 -> PathRecord dump:
				service_id..............0x0000000000000000
				dgid....................0xfe80000000000000 : 0x0002c90300026be7
				sgid....................0xfe80000000000000 : 0x0002c90300026be3
				dlid....................0x0
				slid....................0x0
				hop_flow_raw............0x0
				tclass..................0x0
				num_path_revers.........0x1
				pkey....................0x8001
				qos_class...............0x0
				sl......................0x0
				mtu.....................0x3
				rate....................0x0
				pkt_life................0x0
				preference..............0x0
				resv2...................0x0
				resv3...................0x0
Nov 13 02:12:49 219386 [42803940] 0x10 -> __osm_pr_rcv_check_mcast_dest: [
Nov 13 02:12:49 219390 [42803940] 0x10 -> __osm_pr_rcv_check_mcast_dest: ]
Nov 13 02:12:49 219394 [42803940] 0x08 -> osm_pr_rcv_process: Unicast destination requested
Nov 13 02:12:49 219398 [42803940] 0x10 -> __osm_pr_rcv_get_end_points: [
Nov 13 02:12:49 219403 [42803940] 0x10 -> __osm_pr_rcv_get_end_points: ]
Nov 13 02:12:49 219407 [42803940] 0x10 -> __osm_pr_rcv_process_pair: [
Nov 13 02:12:49 219411 [42803940] 0x10 -> __osm_pr_rcv_get_port_pair_paths: [
Nov 13 02:12:49 219415 [42803940] 0x08 -> __osm_pr_rcv_get_port_pair_paths: Src port 0x0002c90300026be3, Dst port 0x0002c90300026be7
Nov 13 02:12:49 219420 [42803940] 0x10 -> osm_port_share_pkey: [
Nov 13 02:12:49 219424 [42803940] 0x10 -> osm_port_share_pkey: ]
Nov 13 02:12:49 219428 [42803940] 0x10 -> osm_port_share_pkey: [
Nov 13 02:12:49 219432 [42803940] 0x10 -> osm_port_share_pkey: ]
Nov 13 02:12:49 219436 [42803940] 0x10 -> osm_port_share_pkey: [
Nov 13 02:12:49 219440 [42803940] 0x10 -> osm_port_share_pkey: ]
Nov 13 02:12:49 219444 [42803940] 0x08 -> __osm_pr_rcv_get_port_pair_paths: Src LIDs [0x7-0x7], Dest LIDs [0x8-0x8]
Nov 13 02:12:49 219449 [42803940] 0x10 -> __osm_pr_rcv_get_lid_pair_path: [
Nov 13 02:12:49 219453 [42803940] 0x08 -> __osm_pr_rcv_get_lid_pair_path: Src LID 0x7, Dest LID 0x8
Nov 13 02:12:49 219458 [42803940] 0x10 -> __osm_pr_rcv_get_path_parms: [
Nov 13 02:12:49 219464 [42803940] 0x08 -> __osm_pr_rcv_get_path_parms: Path min MTU = 4, min rate = 6
Nov 13 02:12:49 219471 [42803940] 0x08 -> __osm_pr_rcv_get_path_parms: Path params: mtu = 4, rate = 6, packet lifetime = 18, pkey = 0x8001, sl = 1
Nov 13 02:12:49 219476 [42803940] 0x10 -> __osm_pr_rcv_get_path_parms: ]
Nov 13 02:12:49 219480 [42803940] 0x10 -> __osm_pr_rcv_get_path_parms: [
Nov 13 02:12:49 219484 [42803940] 0x08 -> __osm_pr_rcv_get_path_parms: Path min MTU = 4, min rate = 6
Nov 13 02:12:49 219489 [42803940] 0x08 -> __osm_pr_rcv_get_path_parms: Path params: mtu = 4, rate = 6, packet lifetime = 18, pkey = 0x8001, sl = 1


From kliteyn at dev.mellanox.co.il  Thu Nov 13 01:02:54 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 13 Nov 2008 11:02:54 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_sa_mcmember_record: return a
 real port	JoinState on update
In-Reply-To: <20081113021403.GK27271@sashak.voltaire.com>
References: <20081113021403.GK27271@sashak.voltaire.com>
Message-ID: <491BED3E.3060104@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> When a port's JoinState is updated by an MCMember leave request, the
> response should carry the real (new) JoinState. This fix addresses bug#1373.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
>  opensm/opensm/osm_sa_mcmember_record.c |    4 +---
>  1 files changed, 1 insertions(+), 3 deletions(-)
> 
> diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
> index 878d21e..4ca5896 100644
> --- a/opensm/opensm/osm_sa_mcmember_record.c
> +++ b/opensm/opensm/osm_sa_mcmember_record.c
> @@ -1095,12 +1095,10 @@ __osm_mcmr_rcv_leave_mgrp(IN osm_sa_t * sa,
>  		goto Exit;
>  	}
>  
> -	mcmember_rec.scope_state = p_mcm_port->scope_state;
>  	/* remove port or update join state */
>  	removed = osm_mgrp_remove_port(sa->p_subn, sa->p_log, p_mgrp, p_mcm_port,
>  				       p_recvd_mcmember_rec->scope_state&0x0F);
> -	if (removed)
> -		mcmember_rec.scope_state = p_mcm_port->scope_state;
> +	mcmember_rec.scope_state = p_mcm_port->scope_state;

I did the exact same fix last night :)

-- Yevgeny

>  	CL_PLOCK_RELEASE(sa->p_lock);
>  


From dorfman.eli at gmail.com  Thu Nov 13 01:18:07 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Thu, 13 Nov 2008 11:18:07 +0200
Subject: [ofa-general] [PATCH] opensm/osm_sa_path_record.c print
	port guids in error message
Message-ID: <491BF0CF.4060306@gmail.com>

Print port GUIDs in the error messages when the ports share no PKey.

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 opensm/opensm/osm_sa_path_record.c |   15 ++++++++++++---
 1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c
index fc425d5..b100384 100644
--- a/opensm/opensm/osm_sa_path_record.c
+++ b/opensm/opensm/osm_sa_path_record.c
@@ -596,7 +596,10 @@ __osm_pr_rcv_get_path_parms(IN osm_sa_t * sa,
 		pkey = p_pr->pkey;
 		if (!osm_physp_share_this_pkey(p_src_physp, p_dest_physp, pkey)) {
 			OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1F1A: "
-				"Ports do not share specified PKey 0x%04x\n",
+				"Ports 0x%016" PRIx64 " 0x%016" PRIx64
+				" do not share specified PKey 0x%04x\n",
+				cl_ntoh64(osm_physp_get_port_guid(p_src_physp)),
+				cl_ntoh64(osm_physp_get_port_guid(p_dest_physp)),
 				cl_ntoh16(pkey));
 			status = IB_NOT_FOUND;
 			goto Exit;
@@ -618,7 +621,10 @@ __osm_pr_rcv_get_path_parms(IN osm_sa_t * sa,
 						     p_src_physp, p_dest_physp);
 		if (!pkey) {
 			OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1F1E: "
-				"Ports do not share PKeys defined by QoS level\n");
+				"Ports 0x%016" PRIx64 " 0x%016" PRIx64
+				" do not share PKeys defined by QoS level\n",
+				cl_ntoh64(osm_physp_get_port_guid(p_src_physp)),
+				cl_ntoh64(osm_physp_get_port_guid(p_dest_physp)));
 			status = IB_NOT_FOUND;
 			goto Exit;
 		}
@@ -630,7 +636,10 @@ __osm_pr_rcv_get_path_parms(IN osm_sa_t * sa,
 		pkey = osm_physp_find_common_pkey(p_src_physp, p_dest_physp);
 		if (!pkey) {
 			OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1F1B: "
-				"Ports do not have any shared PKeys\n");
+				"Ports 0x%016" PRIx64 " 0x%016" PRIx64
+				" do not have any shared PKeys\n",
+				cl_ntoh64(osm_physp_get_port_guid(p_src_physp)),
+				cl_ntoh64(osm_physp_get_port_guid(p_dest_physp)));
 			status = IB_NOT_FOUND;
 			goto Exit;
 		}
-- 
1.5.5


From ogerlitz at voltaire.com  Thu Nov 13 02:37:55 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 13 Nov 2008 12:37:55 +0200 (IST)
Subject: [ofa-general] using qos_X vs qos_ca_X / qos_swe_X directives
Message-ID: <Pine.LNX.4.64.0811131229570.27833@zuben.voltaire.com>

Hi Yevgeny,

I noticed that when I use qos_X directives in the opensm config file, the SM
does not apply them on the fabric; it applies the "default values (hard-coded
in OpenSM initialization)" instead. When I use qos_ca_X and qos_swe_X
directives, they are applied on the fabric. I have checked this with both
3.1.11 and 3.2.2 (in their ofed 1.3.1 / ofed 1.4 form).

Or.

For example, try

qos_max_vls 4
qos_high_limit 255
qos_vlarb_high 0:128,1:64,2:32
qos_vlarb_low  0:1,1:2,2:3,3:4
qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

vs

qos_ca_max_vls 4
qos_ca_high_limit 255
qos_ca_vlarb_high 0:128,1:64,2:32
qos_ca_vlarb_low  0:1,1:2,2:3,3:4
qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7


From vlad at lists.openfabrics.org  Thu Nov 13 03:18:45 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 13 Nov 2008 03:18:45 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081113-0200 daily build status
Message-ID: <20081113111845.C6FFEE60323@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From sashak at voltaire.com  Thu Nov 13 04:23:26 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Nov 2008 14:23:26 +0200
Subject: [ofa-general] [PATCH] opensm/osm_sa_path_record.c print port
	guids in error message
In-Reply-To: <491BF0CF.4060306@gmail.com>
References: <491BF0CF.4060306@gmail.com>
Message-ID: <20081113122326.GT27271@sashak.voltaire.com>

On 11:18 Thu 13 Nov     , Eli Dorfman wrote:
> Print port GUIDs in the error messages when the ports share no PKey.
> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Thu Nov 13 05:17:03 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Nov 2008 15:17:03 +0200
Subject: [ofa-general] Re: rate assignment for path queries
In-Reply-To: <Pine.LNX.4.64.0811130912140.27833@zuben.voltaire.com>
References: <Pine.LNX.4.64.0811130912140.27833@zuben.voltaire.com>
Message-ID: <20081113131703.GV27271@sashak.voltaire.com>

Hi Or,

On 09:20 Thu 13 Nov     , Or Gerlitz wrote:
> 
> If opensm has no match on any qos-assignment rule (e.g. when there's no
> qos-config file), my understanding is that when it serves an SA path query,
> the "qos related fields" of the partition are used.
> 
> For example, I have set the following partition config file which assigns
> <sl=1,rate=2> to the 0x8001 partition, and run without any qos file.
> 
> Default=0x7fff,ipoib : ALL=full;
> RED=0x8001, ipoib, sl=1, rate=2, defmember=full : ALL=full;
> RED=0x8002, ipoib, sl=2, rate=3, defmember=full : ALL=full;
> 
> When a path query is issued, sl=1 is indeed returned, but I see that
> rate=6 (20 Gb/s) is returned where I configured rate=2 (2.5 Gb/s).

To the best of my knowledge, rate=2 in the partition config file applies to
the corresponding IPoIB multicast group for this partition, not to
PathRecord. In PathRecord you get the maximum available rate on the
requested path.

> Have I done anything wrong? Is it a known issue? What does it mean
> when the SM prints "min rate = 6"?

Here "min rate" means the minimal common rate on the path.
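
As a rough illustration of "minimal common rate", the sketch below picks the
slowest link among the hops of a path. It is only a sketch under assumptions:
the function names are hypothetical, and the rate table is the PathRecord rate
encoding as I understand it from the IBA spec (the thread itself only confirms
2 = 2.5 Gb/s and 6 = 20 Gb/s). Rates are compared by bandwidth rather than by
encoded value, because the encoding is not ordered by speed (5 means 5 Gb/s,
which is slower than 3, 10 Gb/s).

```c
#include <assert.h>
#include <stddef.h>

/* Map an encoded rate to tenths of Gb/s; 0 for unknown encodings. */
static unsigned rate_to_decigbps(unsigned rate)
{
	/* Assumed encoding: 2=2.5, 3=10, 4=30, 5=5, 6=20, 7=40, ... Gb/s */
	static const unsigned table[] = {
		[2] = 25, [3] = 100, [4] = 300, [5] = 50, [6] = 200,
		[7] = 400, [8] = 600, [9] = 800, [10] = 1200
	};

	return rate < sizeof(table) / sizeof(table[0]) ? table[rate] : 0;
}

/* Return the encoded rate of the slowest link among n hops (n >= 1). */
static unsigned path_min_rate(const unsigned *link_rates, size_t n)
{
	unsigned min;
	size_t i;

	assert(link_rates != NULL && n > 0);
	min = link_rates[0];
	for (i = 1; i < n; i++)
		if (rate_to_decigbps(link_rates[i]) < rate_to_decigbps(min))
			min = link_rates[i];
	return min;
}
```

For a path whose links all run at 4x DDR, every hop has rate 6, so the result
is 6, which matches the "min rate = 6" line in the log.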

Sasha



From sashak at voltaire.com  Thu Nov 13 05:22:36 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Nov 2008 15:22:36 +0200
Subject: [ofa-general] using qos_X vs qos_ca_X / qos_swe_X directives
In-Reply-To: <Pine.LNX.4.64.0811131229570.27833@zuben.voltaire.com>
References: <Pine.LNX.4.64.0811131229570.27833@zuben.voltaire.com>
Message-ID: <20081113132236.GW27271@sashak.voltaire.com>

On 12:37 Thu 13 Nov     , Or Gerlitz wrote:
> Hi Yevgeny,
> 
> I noted that when I use qos_X directives in the opensm config file, they are not
> applied by the SM on the fabric, but rather the "default values (hard-coded in
> OpenSM initialization)". When I use qos_ca_X and qos_swe_X directives, they
> are applied on the fabric. I have checked this with both 3.1.11 and 3.2.2
> (in their ofed 1.3.1 / ofed 1.4 form).

Yes, this is a "feature" (bug). We are discussing this right now in the
thread
http://lists.openfabrics.org/pipermail/general/2008-November/055394.html

Sasha

> 
> Or.
> 
> E.g
> 
> try
> 
> qos_max_vls 4
> qos_high_limit 255
> qos_vlarb_high 0:128,1:64,2:32
> qos_vlarb_low  0:1,1:2,2:3,3:4
> qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> 
> vs
> 
> qos_ca_max_vls 4
> qos_ca_high_limit 255
> qos_ca_vlarb_high 0:128,1:64,2:32
> qos_ca_vlarb_low  0:1,1:2,2:3,3:4
> qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From hal.rosenstock at gmail.com  Thu Nov 13 05:38:57 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 13 Nov 2008 08:38:57 -0500
Subject: [ofa-general] Re: [PATCH] opensm/osm_sa_mcmember_record: return a
	real port JoinState on update
In-Reply-To: <20081113021403.GK27271@sashak.voltaire.com>
References: <20081113021403.GK27271@sashak.voltaire.com>
Message-ID: <f0e08f230811130538p2d2c2750h245993234773ae54@mail.gmail.com>

On Wed, Nov 12, 2008 at 9:14 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> When the port JoinState is updated by an MCMember leave request, the
> response should have the real (new) JoinState. This fix addresses bug#1373.
>
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
>  opensm/opensm/osm_sa_mcmember_record.c |    4 +---
>  1 files changed, 1 insertions(+), 3 deletions(-)
>
> diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
> index 878d21e..4ca5896 100644
> --- a/opensm/opensm/osm_sa_mcmember_record.c
> +++ b/opensm/opensm/osm_sa_mcmember_record.c
> @@ -1095,12 +1095,10 @@ __osm_mcmr_rcv_leave_mgrp(IN osm_sa_t * sa,
>                goto Exit;
>        }
>
> -       mcmember_rec.scope_state = p_mcm_port->scope_state;
>        /* remove port or update join state */
>        removed = osm_mgrp_remove_port(sa->p_subn, sa->p_log, p_mgrp, p_mcm_port,
>                                       p_recvd_mcmember_rec->scope_state&0x0F);
> -       if (removed)
> -               mcmember_rec.scope_state = p_mcm_port->scope_state;
> +       mcmember_rec.scope_state = p_mcm_port->scope_state;

In looking at this, this is really only compliant if done for trusted
requests (and there are other trust issues with SA MCMemberRecord).
This issue clearly predates the patch.
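
For readers following the patch, here is a minimal sketch of the JoinState bookkeeping it concerns. It assumes the low nibble of scope_state carries the JoinState bits and the high nibble the scope, as the `& 0x0F` mask in the diff suggests. The helper names are hypothetical and this is not the OpenSM code.

```c
#include <assert.h>
#include <stdint.h>

#define JOIN_STATE_MASK 0x0F	/* low nibble: JoinState bits */
#define SCOPE_MASK      0xF0	/* high nibble: scope */

/*
 * Hypothetical helper: clear the JoinState bits named in a leave request
 * and return the updated scope_state, i.e. the value the response should
 * carry whether or not the port ended up removed from the group.
 */
static uint8_t scope_state_after_leave(uint8_t current, uint8_t leave_req)
{
	uint8_t remaining = (uint8_t)(current & JOIN_STATE_MASK &
				      ~(leave_req & JOIN_STATE_MASK));

	/* invariant: only JoinState bits can remain in the low part */
	assert((remaining & ~JOIN_STATE_MASK) == 0);
	return (uint8_t)((current & SCOPE_MASK) | remaining);
}

/* Nonzero when no JoinState bits remain, i.e. the port left entirely. */
static int port_removed(uint8_t scope_state)
{
	return (scope_state & JOIN_STATE_MASK) == 0;
}
```

With a port holding 0x23, a leave request of 0x22 would leave 0x21 in this model, and that updated value is what the response should report.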

-- Hal

>
>        CL_PLOCK_RELEASE(sa->p_lock);
>
> --
> 1.6.0.3.517.g759a
>
>


From hal.rosenstock at gmail.com  Thu Nov 13 05:39:51 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 13 Nov 2008 08:39:51 -0500
Subject: ***SPAM*** [ofa-general] [PATCH] opensm/osm_sa_path_record.c
	print port guids in error message
In-Reply-To: <491BF0CF.4060306@gmail.com>
References: <491BF0CF.4060306@gmail.com>
Message-ID: <f0e08f230811130539u1a307aeby92b3d31acda5f016@mail.gmail.com>

On Thu, Nov 13, 2008 at 4:18 AM, Eli Dorfman <dorfman.eli at gmail.com> wrote:
> print port guids in error message when there is no shared pkey between the ports.
>
> Signed-off-by: Eli Dorfman <elid at voltaire.com>
> ---
>  opensm/opensm/osm_sa_path_record.c |   15 ++++++++++++---
>  1 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c
> index fc425d5..b100384 100644
> --- a/opensm/opensm/osm_sa_path_record.c
> +++ b/opensm/opensm/osm_sa_path_record.c
> @@ -596,7 +596,10 @@ __osm_pr_rcv_get_path_parms(IN osm_sa_t * sa,
>                pkey = p_pr->pkey;
>                if (!osm_physp_share_this_pkey(p_src_physp, p_dest_physp, pkey)) {
>                        OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1F1A: "
> -                               "Ports do not share specified PKey 0x%04x\n",
> +                               "Ports 0x%016" PRIx64 " 0x%016" PRIx64
> +                               " do not share specified PKey 0x%04x\n",
> +                               cl_ntoh64(osm_physp_get_port_guid(p_src_physp)),
> +                               cl_ntoh64(osm_physp_get_port_guid(p_dest_physp)),
>                                cl_ntoh16(pkey));
>                        status = IB_NOT_FOUND;
>                        goto Exit;
> @@ -618,7 +621,10 @@ __osm_pr_rcv_get_path_parms(IN osm_sa_t * sa,
>                                                     p_src_physp, p_dest_physp);
>                if (!pkey) {
>                        OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1F1E: "
> -                               "Ports do not share PKeys defined by QoS level\n");
> +                               "Ports 0x%016" PRIx64 " 0x%016" PRIx64
> +                               " do not share PKeys defined by QoS level\n",
> +                               cl_ntoh64(osm_physp_get_port_guid(p_src_physp)),
> +                               cl_ntoh64(osm_physp_get_port_guid(p_dest_physp)));
>                        status = IB_NOT_FOUND;
>                        goto Exit;
>                }
> @@ -630,7 +636,10 @@ __osm_pr_rcv_get_path_parms(IN osm_sa_t * sa,
>                pkey = osm_physp_find_common_pkey(p_src_physp, p_dest_physp);
>                if (!pkey) {
>                        OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1F1B: "
> -                               "Ports do not have any shared PKeys\n");
> +                               "Ports 0x%016" PRIx64 " 0x%016" PRIx64
> +                               " do not have any shared PKeys\n",
> +                               cl_ntoh64(osm_physp_get_port_guid(p_src_physp)),
> +                               cl_ntoh64(osm_physp_get_port_guid(p_dest_physp)));
>                        status = IB_NOT_FOUND;
>                        goto Exit;
>                }

A nit but IMO these messages would best be consistent with the ones
which are similar in osm_sa_multipath_record.c

-- Hal

> 1.5.5
>


From ogerlitz at voltaire.com  Thu Nov 13 05:43:26 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 13 Nov 2008 15:43:26 +0200
Subject: [ofa-general] Re: rate assignment for path queries
In-Reply-To: <20081113131703.GV27271@sashak.voltaire.com>
References: <Pine.LNX.4.64.0811130912140.27833@zuben.voltaire.com>
	<20081113131703.GV27271@sashak.voltaire.com>
Message-ID: <491C2EFE.4060900@voltaire.com>

Sasha Khapyorsky wrote:
>> RED=0x8001, ipoib, sl=1, rate=2, defmember=full : ALL=full;
>>
>> When a path query is issued, sl=1 is indeed returned, but I see that
>> rate=6 (20 Gb/s) is returned where I configured rate=2 (2.5 Gb/s).
> To the best of my knowledge, rate=2 in the partition config file relates
> to the corresponding IPoIB multicast group for this partition, not to
> PathRecord. In PathRecord you get the maximum available rate on the
> requested path.
I understand your comment about the relation to multicast join and not
path queries. However, currently, when there's no rule in the
qos-config file (or no file) that matches the path query, the SM does
provide the SL assigned to the partition (specified in the query)
through the pkey file, but it doesn't do so for the rate. So you say that
for a QoS = <SL, Rate> assignment one should use the qos-policy file; so
be it.
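
To make the rate numbers in this exchange concrete, here is a small sketch that decodes them, assuming the usual InfiniBand PathRecord rate encoding (2 = 2.5 Gb/s, 6 = 20 Gb/s, and so on). The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical decoder for the PathRecord rate field.
 * Returns the rate in units of 0.1 Gb/s (so 25 means 2.5 Gb/s),
 * or 0 for an encoding this sketch does not know.
 */
static unsigned ib_rate_to_tenths_gbps(uint8_t rate)
{
	switch (rate) {
	case 2:  return 25;	/* 2.5 Gb/s */
	case 3:  return 100;	/* 10 Gb/s */
	case 4:  return 300;	/* 30 Gb/s */
	case 5:  return 50;	/* 5 Gb/s */
	case 6:  return 200;	/* 20 Gb/s */
	case 7:  return 400;	/* 40 Gb/s */
	case 8:  return 600;	/* 60 Gb/s */
	case 9:  return 800;	/* 80 Gb/s */
	case 10: return 1200;	/* 120 Gb/s */
	default: return 0;	/* unknown to this sketch */
	}
}
```

Note that the encoding is not ordered by speed (5 is slower than 3), so comparing raw rate values numerically would be misleading.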

Or.


From kliteyn at dev.mellanox.co.il  Thu Nov 13 06:23:25 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 13 Nov 2008 16:23:25 +0200
Subject: [ofa-general] [PATCH] osmtest/osmt_multicast.c: some refinements to
 the multicast flow
Message-ID: <491C385D.9090909@dev.mellanox.co.il>

Hi Sasha,

Here are some osmtest refinements (multicast flow) that
I did while debugging the recent two multicast bugs in
opensm: some comment fixes, creating a group that was
removed because the last full member left, and adding one
query to check that an invalid delete request really fails.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/osmtest/osmt_multicast.c |   64 ++++++++++++++++++++++++++++++++++----
 1 files changed, 57 insertions(+), 7 deletions(-)

diff --git a/opensm/osmtest/osmt_multicast.c b/opensm/osmtest/osmt_multicast.c
index a397142..57a8772 100644
--- a/opensm/osmtest/osmt_multicast.c
+++ b/opensm/osmtest/osmt_multicast.c
@@ -1813,7 +1813,7 @@ ib_api_status_t osmt_run_mcast_flow(IN osmtest_t * const p_osmt)

 	/* Lets try another valid join scope state */
 	OSM_LOG(&p_osmt->log, OSM_LOG_INFO,
-		"Checking new MGID creation with valid join state (o15.0.1.9)...\n");
+		"Checking new MGID creation with valid join state (o15.0.2.3)...\n");

 	mc_req_rec.mgid = good_mgid;
 	mc_req_rec.mgid.raw[12] = 0xFB;
@@ -1853,7 +1853,7 @@ ib_api_status_t osmt_run_mcast_flow(IN osmtest_t * const p_osmt)
 	    IB_MCR_COMPMASK_MGID |
 	    IB_MCR_COMPMASK_PORT_GID | IB_MCR_COMPMASK_JOIN_STATE;

-	status = osmt_send_mcast_request(p_osmt, 0x1,	/* User Defined query */
+	status = osmt_send_mcast_request(p_osmt, 0x1,	/* SubnAdmSet */
 					 &mc_req_rec, comp_mask, &res_sa_mad);
 	if (status != IB_SUCCESS) {
 		OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, "ERR 02CC: "
@@ -1862,6 +1862,16 @@ ib_api_status_t osmt_run_mcast_flow(IN osmtest_t * const p_osmt)
 		goto Exit;
 	}

+	p_mc_res = ib_sa_mad_get_payload_ptr(&res_sa_mad);
+	if ((p_mc_res->scope_state & 0x7) != 0x7) {
+		OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, "ERR 02D0: "
+			"Validating JoinState update failed. "
+			"Expected 0x27 got 0x%02X\n",
+			p_mc_res->scope_state);
+		status = IB_ERROR;
+		goto Exit;
+	}
+
 	/* o15.0.1.11: */
 	/* - Try to join into a MGID that exists with JoinState=SendOnlyMember -  */
 	/*   see that it updates JoinState. What is the routing change? */
@@ -1869,12 +1879,24 @@ ib_api_status_t osmt_run_mcast_flow(IN osmtest_t * const p_osmt)
 		"Checking Retry of existing MGID - See JoinState update (o15.0.1.11)...\n");

 	mc_req_rec.mgid = good_mgid;
-	mc_req_rec.scope_state = 0x22;	/* link-local scope, send only  member */

+	/* first, make sure  that the group exists */
+	mc_req_rec.scope_state = 0x21;
 	status = osmt_send_mcast_request(p_osmt, 1,
 					 &mc_req_rec, comp_mask, &res_sa_mad);
 	if (status != IB_SUCCESS) {
 		OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, "ERR 02CD: "
+			"Failed to create/join as full member - got %s/%s\n",
+			ib_get_err_str(status),
+			ib_get_mad_status_str((ib_mad_t *) (&res_sa_mad)));
+		goto Exit;
+	}
+
+	mc_req_rec.scope_state = 0x22;	/* link-local scope, non-member */
+	status = osmt_send_mcast_request(p_osmt, 1,
+					 &mc_req_rec, comp_mask, &res_sa_mad);
+	if (status != IB_SUCCESS) {
+		OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, "ERR 02D1: "
 			"Failed to update existing MGID - got %s/%s\n",
 			ib_get_err_str(status),
 			ib_get_mad_status_str((ib_mad_t *) (&res_sa_mad)));
@@ -1899,15 +1921,33 @@ ib_api_status_t osmt_run_mcast_flow(IN osmtest_t * const p_osmt)
 	mc_req_rec.rate =
 	    IB_LINK_WIDTH_ACTIVE_1X | IB_PATH_SELECTOR_GREATER_THAN << 6;
 	mc_req_rec.mgid = good_mgid;
-	/* link-local scope, non member (so we should not be able to delete) */
-	/*  but the FullMember bit should be gone */
+
 	OSM_LOG(&p_osmt->log, OSM_LOG_INFO,
 		"Checking Partially delete JoinState (o15.0.1.14)...\n");
-	mc_req_rec.scope_state = 0x22;
+
+	/* link-local scope, both non-member bits,
+	   so we should not be able to delete) */
+	mc_req_rec.scope_state = 0x26;
+	OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, EXPECTING_ERRORS_START "\n");
 	status = osmt_send_mcast_request(p_osmt, 0,
 					 &mc_req_rec, comp_mask, &res_sa_mad);
-	if ((status != IB_SUCCESS) || (p_mc_res->scope_state != 0x21)) {
+	OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, EXPECTING_ERRORS_END "\n");
+
+	if (status != IB_REMOTE_ERROR) {
 		OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, "ERR 02CF: "
+			"Expected to fail partially update JoinState, "
+			"but got %s\n",
+			ib_get_err_str(status));
+		status = IB_ERROR;
+		goto Exit;
+	}
+
+	/* link-local scope, NonMember bit, the FullMember bit should stay */
+	mc_req_rec.scope_state = 0x22;
+	status = osmt_send_mcast_request(p_osmt, 0,
+					 &mc_req_rec, comp_mask, &res_sa_mad);
+	if (status != IB_SUCCESS) {
+		OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, "ERR 02D3: "
 			"Failed to partially update JoinState : %s/%s\n",
 			ib_get_err_str(status),
 			ib_get_mad_status_str((ib_mad_t *) (&res_sa_mad)));
@@ -1915,6 +1955,16 @@ ib_api_status_t osmt_run_mcast_flow(IN osmtest_t * const p_osmt)
 		goto Exit;
 	}

+	p_mc_res = ib_sa_mad_get_payload_ptr(&res_sa_mad);
+	if (p_mc_res->scope_state != 0x21) {
+		OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, "ERR 02D4: "
+			"Failed to partially update JoinState : "
+			"JoinState = 0x%02X, expected 0x%02X\n",
+			p_mc_res->scope_state, 0x21);
+		status = IB_ERROR;
+		goto Exit;
+	}
+
 	/* So far successfully delete state - Now change it */
 	mc_req_rec.mgid = good_mgid;
 	mc_req_rec.scope_state = 0x24;	/* link-local scope, send only  member */
-- 
1.5.1.4


From tziporet at dev.mellanox.co.il  Thu Nov 13 07:04:16 2008
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 13 Nov 2008 17:04:16 +0200
Subject: [ofa-general][PATCH 1/3]mlx4: Multiple completion vectors support
In-Reply-To: <490DD27C.4070109@pobox.com>
References: <4907348E.7060508@mellanox.co.il>
	<490A8FA9.7080802@pobox.com>	<aday7047jos.fsf@cisco.com>
	<490DA91A.1030703@pobox.com>	<adaprlew1wd.fsf@cisco.com>
	<490DD27C.4070109@pobox.com>
Message-ID: <491C41F0.3080304@mellanox.co.il>

Jeff Garzik wrote:
> Roland Dreier wrote:
>> In general I think I have a bigger chance of merging more mlx4_core
>> stuff through my tree, so it will probably be smoother in terms of
>> conflicts etc. if I carry this patch.
>
>
> Fine by me...
>
What is the status of this?
I know it's in mlx4_core, but it is mainly needed for mlnx_en and has
minimal impact on the IB side.
I think Roland is on new-baby vacation, so what is the resolution?

Thanks
Tziporet


From dorfman.eli at gmail.com  Thu Nov 13 07:51:22 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Thu, 13 Nov 2008 17:51:22 +0200
Subject: ***SPAM*** [ofa-general] [PATCH] opensm/osm_mcast_tbl.c wrong max
	mcast lid cause the sm to set invalid MFT block.
Message-ID: <491C4CFA.8000006@gmail.com>

A wrong max mcast LID causes the SM to set an invalid MFT block.
When an mcmember tries to set an mcast LID beyond the mcast capability
(e.g. 0xc400), the SM accepts this and tries to set an invalid block.
Signed-off-by: Eli Dorfman <elid at voltaire.com>

---
 opensm/opensm/osm_mcast_tbl.c          |    6 +++---
 opensm/opensm/osm_sa_mcmember_record.c |    2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/opensm/opensm/osm_mcast_tbl.c b/opensm/opensm/osm_mcast_tbl.c
index 92fbb63..17fb69c 100644
--- a/opensm/opensm/osm_mcast_tbl.c
+++ b/opensm/opensm/osm_mcast_tbl.c
@@ -81,7 +81,7 @@ osm_mcast_tbl_init(IN osm_mcast_tbl_t * const p_tbl,
 						IB_MCAST_BLOCK_SIZE) /
 					IB_MCAST_BLOCK_SIZE) - 1);
 
-	p_tbl->max_mlid_ho = (uint16_t) (IB_LID_MCAST_START_HO + capacity);
+	p_tbl->max_mlid_ho = (uint16_t) (IB_LID_MCAST_START_HO + capacity - 1);
 
 	/*
 	   The number of bytes needed in the mask table is:
@@ -216,7 +216,7 @@ osm_mcast_tbl_set_block(IN osm_mcast_tbl_t * const p_tbl,
 
 	mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
 
-	if (mlid_start_ho + IB_MCAST_BLOCK_SIZE > p_tbl->max_mlid_ho)
+	if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
 		return (IB_INVALID_PARAMETER);
 
 	for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++)
@@ -274,7 +274,7 @@ osm_mcast_tbl_get_block(IN osm_mcast_tbl_t * const p_tbl,
 
 	mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
 
-	if (mlid_start_ho + IB_MCAST_BLOCK_SIZE > p_tbl->max_mlid_ho)
+	if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
 		return (IB_INVALID_PARAMETER);
 
 	for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++)
diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
index 5dd286a..6007b06 100644
--- a/opensm/opensm/osm_sa_mcmember_record.c
+++ b/opensm/opensm/osm_sa_mcmember_record.c
@@ -846,7 +846,7 @@ osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa,
 	mlid = __get_new_mlid(sa, mcm_rec.mlid);
 	if (mlid == 0) {
 		OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B19: "
-			"__get_new_mlid failed\n");
+			"__get_new_mlid failed request mlid 0x%04x\n", mcm_rec.mlid);
 		status = IB_SA_MAD_STATUS_NO_RESOURCES;
 		goto Exit;
 	}
-- 
1.5.5
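
The off-by-one this patch addresses can be illustrated with a standalone sketch. It assumes multicast LIDs start at 0xC000 and that an MFT block covers 32 MLIDs. The macro and function names are hypothetical, and unlike the patched code this sketch works on absolute LIDs throughout.

```c
#include <assert.h>
#include <stdint.h>

#define MCAST_LID_START   0xC000u	/* first multicast LID (assumed) */
#define MCAST_BLOCK_SIZE  32u		/* MLIDs per MFT block (assumed) */

/* Highest valid MLID (inclusive) for a table holding `capacity` entries. */
static uint16_t max_mlid(uint16_t capacity)
{
	assert(capacity > 0);
	return (uint16_t)(MCAST_LID_START + capacity - 1);
}

/*
 * Nonzero when the whole block starting at `mlid_start` fits in the table.
 * The last LID of the block is mlid_start + MCAST_BLOCK_SIZE - 1, which is
 * why both the table bound and the block bound carry a "- 1".
 */
static int block_in_range(uint32_t mlid_start, uint16_t capacity)
{
	return mlid_start + MCAST_BLOCK_SIZE - 1 <= max_mlid(capacity);
}
```

For a hypothetical capacity of 1024 entries the last valid MLID is 0xC3FF, so a block starting at 0xC3E0 fits while one starting at 0xC400 does not.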


From jackm at mellanox.co.il  Thu Nov 13 08:06:32 2008
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Thu, 13 Nov 2008 18:06:32 +0200
Subject: [ofa-general] Re: [PATCH] mlx4/profile.c: fix warning
	res_namedefined but not used
In-Reply-To: <adad4hbti66.fsf@cisco.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EADF195AA@mtlexch01.mtl.com>

This looks fine to me.

- Jack

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of 
> Roland Dreier
> Sent: Tuesday, November 04, 2008 9:17 PM
> To: Alexander Beregalov
> Cc: general at lists.openfabrics.org
> Subject: [ofa-general] Re: [PATCH] mlx4/profile.c: fix 
> warning res_namedefined but not used
> 
> 
> Thanks.  What if we fix this like the following instead -- 
> change mlx4_dbg so it always looks to the compiler like it 
> uses all its parameters?  This generates the same code for 
> me, and looks cleaner in that it actually reduces the amount 
> of #ifdef'ed stuff.
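
A user-space sketch of the idea described above: when debugging is compiled out, the level becomes the constant 0, so the macro body is dead code but still references its arguments. All names here are hypothetical stand-ins (fprintf instead of dev_printk, C99 __VA_ARGS__ instead of the kernel's named variadic form).

```c
#include <assert.h>
#include <stdio.h>

#ifdef EXAMPLE_DEBUG
extern int example_debug_level;		/* set at run time */
#else
#define example_debug_level (0)		/* constant: the branch is dead code */
#endif

/* The arguments are always visible to the compiler, debug build or not. */
#define example_dbg(...)					\
	do {							\
		if (example_debug_level)			\
			fprintf(stderr, __VA_ARGS__);		\
	} while (0)

/* `name` is used only by the debug print, yet is not "unused". */
static int count_chars(const char *s)
{
	const char *name = "count_chars";
	int n = 0;

	assert(s != NULL);
	while (s[n] != '\0')
		n++;
	example_dbg("%s: %d characters\n", name, n);
	return n;
}
```

Because there is a single macro definition, the two build configurations cannot drift apart, and the compiler still type-checks the format arguments in non-debug builds.
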
> ---
>  drivers/net/mlx4/mlx4.h |    9 +++------
>  1 files changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
> index fa431fa..56a2e21 100644
> --- a/drivers/net/mlx4/mlx4.h
> +++ b/drivers/net/mlx4/mlx4.h
> @@ -87,6 +87,9 @@ enum {
>  
>  #ifdef CONFIG_MLX4_DEBUG
>  extern int mlx4_debug_level;
> +#else /* CONFIG_MLX4_DEBUG */
> +#define mlx4_debug_level	(0)
> +#endif /* CONFIG_MLX4_DEBUG */
>  
>  #define mlx4_dbg(mdev, format, arg...)					\
>  	do {								\
> @@ -94,12 +97,6 @@ extern int mlx4_debug_level;
>  			dev_printk(KERN_DEBUG, &mdev->pdev->dev, format, ## arg); \
>  	} while (0)
>  
> -#else /* CONFIG_MLX4_DEBUG */
> -
> -#define mlx4_dbg(mdev, format, arg...) do { (void) mdev; } while (0)
> -
> -#endif /* CONFIG_MLX4_DEBUG */
> -
>  #define mlx4_err(mdev, format, arg...) \
>  	dev_err(&mdev->pdev->dev, format, ## arg)
>  #define mlx4_info(mdev, format, arg...) \ 


From sashak at voltaire.com  Thu Nov 13 08:41:18 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 13 Nov 2008 18:41:18 +0200
Subject: [ofa-general] Re: [PATCH] osmtest/osmt_multicast.c: some refinements
	to the multicast flow
In-Reply-To: <491C385D.9090909@dev.mellanox.co.il>
References: <491C385D.9090909@dev.mellanox.co.il>
Message-ID: <20081113164118.GY27271@sashak.voltaire.com>

On 16:23 Thu 13 Nov     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> Here are some osmtest refinements (multicast flow) that
> I did while debugging the recent two multicast bugs in
> opensm: some comment fixes, creating a group that was
> removed because the last full member left, and adding one
> query to check that an invalid delete request really fails.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From chu11 at llnl.gov  Thu Nov 13 09:20:02 2008
From: chu11 at llnl.gov (Al Chu)
Date: Thu, 13 Nov 2008 09:20:02 -0800
Subject: [ofa-general] Re: [opensm patch][1/2] fix qos config parsing bugs
In-Reply-To: <20081113002403.GI27271@sashak.voltaire.com>
References: <1225404078.1197.533.camel@cardanus.llnl.gov>
	<20081111191958.GA8894@sashak.voltaire.com>
	<1226447872.6239.2.camel@cardanus.llnl.gov>
	<20081113002403.GI27271@sashak.voltaire.com>
Message-ID: <1226596802.7156.41.camel@cardanus.llnl.gov>

Hey Sasha,

On Thu, 2008-11-13 at 02:24 +0200, Sasha Khapyorsky wrote:
> Hi Al,
> 
> On 15:57 Tue 11 Nov     , Al Chu wrote:
> > 
> > Sorry, I may not have explained it well. Let's say I do this in the
> > config file.
> > 
> > qos_vlarb_high FOOBAR
> > # qos_ca_vlarb_high BLAH
> > qos_swe_vlarb_high XYZZY
> > 
> > I currently expect qos_ca_vlarb_high to use the value of FOOBAR because
> > I commented out the field.  But it uses OSM_DEFAULT_QOS_HIGH_LIMIT
> > instead.  The reason is because qos_build_config() checks for NULL to
> > use default vs. non-default values.
> > 
> > p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;
> > 
> > Under the above situation where I've commented out several fields,
> > opt->vlarb_high is always non-NULL b/c it was set to
> > OSM_DEFAULT_QOS_HIGH_LIMIT. Thus OSM_DEFAULT_QOS_HIGH_LIMIT is used
> > instead of FOOBAR.
> > 
> > > > 2)
> > > > 
> > > > In qos_build_config() we load the high_limit like this:
> > > > 
> > > > cfg->vl_high_limit = (uint8_t) opt->high_limit;
> > > > 
> > > > So there is no way to tell the qos_ca, qos_swe, qos_rtr, etc. high_limit
> > > > options to "go back to" the default high_limit.  It just assumes that
> > > > whatever is input (or was set by default) is what you should use.
> > > 
> > > Right. What is the limitation here? That a user cannot set this to
> > > "no value"? But she/he can just skip it.
> > 
> > Similar to the above issue, let's say I want to do:
> > 
> > qos_high_limit 8
> > # qos_ca_high_limit 15
> > # qos_swe_high_limit 15
> > 
> > I want qos_ca_high_limit and qos_swe_high_limit to use whatever I set in
> > qos_high_limit.  But the code doesn't allow for this.
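
The fallback behaviour asked for here can be sketched as follows, assuming each option set starts out "unset" rather than pre-filled with defaults. The struct and function names are hypothetical simplifications, not the OpenSM definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for one set of QoS options. */
struct qos_opts {
	int high_limit;		/* -1 means "not set"; 0..255 are real values */
	const char *vlarb_high;	/* NULL means "not set" */
};

/* Start every option set as "nothing configured" rather than as defaults. */
static void qos_opts_init(struct qos_opts *opt)
{
	assert(opt != NULL);
	opt->high_limit = -1;
	opt->vlarb_high = NULL;
}

/* Fill each unset field of `opt` from `fallback`; set fields are kept. */
static void qos_opts_inherit(struct qos_opts *opt,
			     const struct qos_opts *fallback)
{
	assert(opt != NULL && fallback != NULL);
	if (opt->high_limit < 0)
		opt->high_limit = fallback->high_limit;
	if (opt->vlarb_high == NULL)
		opt->vlarb_high = fallback->vlarb_high;
}
```

The integer field needs a sentinel outside 0..255 because 0 is itself a legal high limit, whereas the string field can use NULL. If the per-type sets were pre-filled with defaults instead, neither test would ever fire and the generic qos_X values would be ignored, which is the behaviour reported in this thread.
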
> > 
> > > 
> > > > 3)
> > > > 
> > > > Some fields like qos_vlarb_high are assumed to be correctly set and can
> > > > segfault opensm.
> > > 
> > > qos_build_config() assumes that valid parameters are used. And we are
> > > using it this way (I hope :)) (after all, it is not a library API).
> > 
> > I think the issue is the osm_subnet.c code did not properly check all
> > inputs, and subsequently some inputs used in qos_build_config() were
> > bad.  I think
> > 
> > qos_vlarb_high (null)
> > 
> > was something I tried that opensm seg-faulted on.  
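
A minimal sketch of the kind of input check that would reject such a value before it is used, assuming the "vl:weight,vl:weight" syntax shown earlier in the thread with vl in 0..14 and weight in 0..255. The function is hypothetical and stricter than a warn-and-continue approach.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical validator for a "vl:weight,vl:weight,..." string.
 * Returns the number of well-formed pairs, or -1 if the string is NULL,
 * empty, or contains a malformed or out-of-range pair.
 */
static int check_vlarb(const char *s)
{
	int count = 0;

	if (s == NULL || *s == '\0')
		return -1;

	while (*s != '\0') {
		char *end;
		long vl, weight;

		/* VL number, terminated by ':' */
		vl = strtol(s, &end, 0);
		if (end == s || *end != ':' || vl < 0 || vl > 14)
			return -1;
		s = end + 1;

		/* weight, terminated by ',' or end of string */
		weight = strtol(s, &end, 0);
		if (end == s || weight < 0 || weight > 255)
			return -1;
		if (*end != ',' && *end != '\0')
			return -1;

		count++;
		s = (*end == ',') ? end + 1 : end;
	}
	return count;
}
```

A caller could fall back to a known-good default whenever this returns -1, so later code never sees a NULL or unparsable table string.
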
> 
> Ok. I see now.
> 
> Probably it will be simpler just to generate valid qos parameter sets
> right after the parser (at verification time)?

Ahh, I see what you did.  It's much cleaner this way.

> Like in your modified (and
> rebased against recent patches) patch below?

Patch looks good to me.

Thanks,
Al

> 
> Sasha
> 
> 
> From a973a8a1ea6c805cf07965d86731ae04510266ce Mon Sep 17 00:00:00 2001
> From: Al Chu <chu11 at llnl.gov>
> Date: Mon, 10 Nov 2008 13:41:04 -0800
> Subject: [PATCH] fix qos config parsing bugs
> 
> Signed-off-by: Albert Chu <chu11 at llnl.gov>
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
>  opensm/include/opensm/osm_subnet.h |   12 +-
>  opensm/opensm/osm_qos.c            |    6 +-
>  opensm/opensm/osm_subnet.c         |  298 ++++++++++++++++++++---------------
>  3 files changed, 181 insertions(+), 135 deletions(-)
> 
> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
> index a16cbce..2bcd232 100644
> --- a/opensm/include/opensm/osm_subnet.h
> +++ b/opensm/include/opensm/osm_subnet.h
> @@ -100,7 +100,7 @@ struct osm_qos_policy;
>  */
>  typedef struct osm_qos_options {
>  	unsigned max_vls;
> -	unsigned high_limit;
> +	int high_limit;
>  	char *vlarb_high;
>  	char *vlarb_low;
>  	char *sl2vl;
> @@ -109,20 +109,20 @@ typedef struct osm_qos_options {
>  * FIELDS
>  *
>  *	max_vls
> -*		The number of maximum VLs on the Subnet
> +*		The number of maximum VLs on the Subnet (0 == use default)
>  *
>  *	high_limit
>  *		The limit of High Priority component of VL Arbitration
> -*		table (IBA 7.6.9)
> +*		table (IBA 7.6.9) (-1 == use default)
>  *
>  *	vlarb_high
> -*		High priority VL Arbitration table template.
> +*		High priority VL Arbitration table template. (NULL == use default)
>  *
>  *	vlarb_low
> -*		Low priority VL Arbitration table template.
> +*		Low priority VL Arbitration table template. (NULL == use default)
>  *
>  *	sl2vl
> -*		SL2VL Mapping table (IBA 7.6.6) template.
> +*		SL2VL Mapping table (IBA 7.6.6) template. (NULL == use default)
>  *
>  *********/
>  
> diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
> index 1679ae0..b451c25 100644
> --- a/opensm/opensm/osm_qos.c
> +++ b/opensm/opensm/osm_qos.c
> @@ -382,7 +382,11 @@ static void qos_build_config(struct qos_config *cfg,
>  	memset(cfg, 0, sizeof(*cfg));
>  
>  	cfg->max_vls = opt->max_vls > 0 ? opt->max_vls : dflt->max_vls;
> -	cfg->vl_high_limit = (uint8_t) opt->high_limit;
> +
> +	if (opt->high_limit >= 0)
> +		cfg->vl_high_limit = (uint8_t) opt->high_limit;
> +	else
> +		cfg->vl_high_limit = (uint8_t) dflt->high_limit;
>  
>  	p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;
>  	for (i = 0; i < 2 * IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; i++) {
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index 8569043..1c9777e 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -370,6 +370,15 @@ static void subn_set_default_qos_options(IN osm_qos_options_t * opt)
>  	opt->sl2vl = OSM_DEFAULT_QOS_SL2VL;
>  }
>  
> +static void subn_init_qos_options(IN osm_qos_options_t * opt)
> +{
> +	opt->max_vls = 0;
> +	opt->high_limit = -1;
> +	opt->vlarb_high = NULL;
> +	opt->vlarb_low = NULL;
> +	opt->sl2vl = NULL;
> +}
> +
>  /**********************************************************************
>   **********************************************************************/
>  void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
> @@ -457,11 +466,11 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
>  	p_opt->no_clients_rereg = FALSE;
>  	p_opt->prefix_routes_file = OSM_DEFAULT_PREFIX_ROUTES_FILE;
>  	p_opt->consolidate_ipv6_snm_req = FALSE;
> -	subn_set_default_qos_options(&p_opt->qos_options);
> -	subn_set_default_qos_options(&p_opt->qos_ca_options);
> -	subn_set_default_qos_options(&p_opt->qos_sw0_options);
> -	subn_set_default_qos_options(&p_opt->qos_swe_options);
> -	subn_set_default_qos_options(&p_opt->qos_rtr_options);
> +	subn_init_qos_options(&p_opt->qos_options);
> +	subn_init_qos_options(&p_opt->qos_ca_options);
> +	subn_init_qos_options(&p_opt->qos_sw0_options);
> +	subn_init_qos_options(&p_opt->qos_swe_options);
> +	subn_init_qos_options(&p_opt->qos_rtr_options);
>  }
>  
>  /**********************************************************************
> @@ -526,6 +535,21 @@ opts_unpack_uint32(IN char *p_req_key,
>  /**********************************************************************
>   **********************************************************************/
>  static void
> +opts_unpack_int32(IN char *p_req_key,
> +		  IN char *p_key, IN char *p_val_str, IN int32_t * p_val)
> +{
> +	if (!strcmp(p_req_key, p_key)) {
> +		int32_t val = strtol(p_val_str, NULL, 0);
> +		if (val != *p_val) {
> +			log_config_value(p_key, "%d", val);
> +			*p_val = val;
> +		}
> +	}
> +}
> +
> +/**********************************************************************
> + **********************************************************************/
> +static void
>  opts_unpack_uint16(IN char *p_req_key,
>  		   IN char *p_key, IN char *p_val_str, IN uint16_t * p_val)
>  {
> @@ -651,7 +675,7 @@ subn_parse_qos_options(IN const char *prefix,
>  	snprintf(name, sizeof(name), "%s_max_vls", prefix);
>  	opts_unpack_uint32(name, p_key, p_val_str, &opt->max_vls);
>  	snprintf(name, sizeof(name), "%s_high_limit", prefix);
> -	opts_unpack_uint32(name, p_key, p_val_str, &opt->high_limit);
> +	opts_unpack_int32(name, p_key, p_val_str, &opt->high_limit);
>  	snprintf(name, sizeof(name), "%s_vlarb_high", prefix);
>  	opts_unpack_charp(name, p_key, p_val_str, &opt->vlarb_high);
>  	snprintf(name, sizeof(name), "%s_vlarb_low", prefix);
> @@ -786,138 +810,142 @@ osm_parse_prefix_routes_file(IN osm_subn_t * const p_subn)
>  
>  /**********************************************************************
>   **********************************************************************/
> -
> -static void subn_verify_max_vls(unsigned *max_vls, const char *prefix)
> +static void subn_verify_max_vls(unsigned *max_vls, const char *prefix, unsigned dflt)
>  {
> -	if (*max_vls > 15) {
> -		log_report(" Invalid Cached Option:%s_max_vls=%u:"
> -			   "Using Default:%u\n",
> -			   prefix, *max_vls, OSM_DEFAULT_QOS_MAX_VLS);
> -		*max_vls = OSM_DEFAULT_QOS_MAX_VLS;
> +	if (!(*max_vls) || *max_vls > 15) {
> +		log_report(" Invalid Cached Option: %s_max_vls=%u: "
> +			   "Using Default = %u\n", prefix, *max_vls, dflt);
> +		*max_vls = dflt;
>  	}
>  }
>  
> -static void subn_verify_high_limit(unsigned *high_limit, const char *prefix)
> +static void subn_verify_high_limit(int *high_limit, const char *prefix, int dflt)
>  {
> -	if (*high_limit > 255) {
> -		log_report(" Invalid Cached Option:%s_high_limit=%u:"
> -			   "Using Default:%u\n",
> -			   prefix, *high_limit, OSM_DEFAULT_QOS_HIGH_LIMIT);
> -		*high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT;
> +	if (*high_limit < 0 || *high_limit > 255) {
> +		log_report(" Invalid Cached Option: %s_high_limit=%d: "
> +			   "Using Default: %d\n", prefix, *high_limit, dflt);
> +		*high_limit = dflt;
>  	}
>  }
>  
> -static void subn_verify_vlarb(char *vlarb, const char *prefix,
> -			      const char *suffix)
> +static void subn_verify_vlarb(char **vlarb, const char *prefix,
> +			      const char *suffix, char *dflt)
>  {
> -	if (vlarb) {
> -		char *str, *tok, *end, *ptr;
> -		int count = 0;
> -
> -		str = strdup(vlarb);
> -
> -		tok = strtok_r(str, ",\n", &ptr);
> -		while (tok) {
> -			char *vl_str, *weight_str;
> -
> -			vl_str = tok;
> -			weight_str = strchr(tok, ':');
> -
> -			if (weight_str) {
> -				long vl, weight;
> -
> -				*weight_str = '\0';
> -				weight_str++;
> -
> -				vl = strtol(vl_str, &end, 0);
> -
> -				if (*end)
> -					log_report(" Warning: Cached Option "
> -						   "%s_vlarb_%s:vl=%s "
> -						   "improperly formatted\n",
> -						   prefix, suffix, vl_str);
> -				else if (vl < 0 || vl > 14)
> -					log_report(" Warning: Cached Option "
> -						   "%s_vlarb_%s:vl=%ld out "
> -						   "of range\n",
> -						   prefix, suffix, vl);
> -
> -				weight = strtol(weight_str, &end, 0);
> -
> -				if (*end)
> -					log_report(" Warning: Cached Option "
> -						   "%s_vlarb_%s:weight=%s "
> -						   "improperly formatted\n",
> -						   prefix, suffix, weight_str);
> -				else if (weight < 0 || weight > 255)
> -					log_report(" Warning: Cached Option "
> -						   "%s_vlarb_%s:weight=%ld "
> -						   "out of range\n",
> -						   prefix, suffix, weight);
> -			} else
> -				log_report(" Warning: Cached Option "
> -					   "%s_vlarb_%s:vl:weight=%s "
> -					   "improperly formatted\n",
> -					   prefix, suffix, tok);
> +	char *str, *tok, *end, *ptr;
> +	int count = 0;
> +
> +	if (*vlarb == NULL) {
> +		log_report(" Invalid Cached Option: %s_vlarb_%s: "
> +		"Using Default\n", prefix, suffix);
> +		*vlarb = dflt;
> +		return;
> +	}
>  
> -			count++;
> -			tok = strtok_r(NULL, ",\n", &ptr);
> -		}
> +	str = strdup(*vlarb);
> +
> +	tok = strtok_r(str, ",\n", &ptr);
> +	while (tok) {
> +		char *vl_str, *weight_str;
>  
> -		if (count > 64)
> -			log_report(" Warning: Cached Option %s_vlarb_%s: "
> -				   "> 64 listed: excess vl:weight pairs "
> -				   "will be dropped\n", prefix, suffix);
> +		vl_str = tok;
> +		weight_str = strchr(tok, ':');
>  
> -		free(str);
> +		if (weight_str) {
> +			long vl, weight;
> +
> +			*weight_str = '\0';
> +			weight_str++;
> +
> +			vl = strtol(vl_str, &end, 0);
> +
> +			if (*end)
> +				log_report(" Warning: Cached Option "
> +					   "%s_vlarb_%s:vl=%s"
> +					   " improperly formatted\n",
> +					   prefix, suffix, vl_str);
> +			else if (vl < 0 || vl > 14)
> +				log_report(" Warning: Cached Option "
> +					   "%s_vlarb_%s:vl=%ld out of range\n",
> +					   prefix, suffix, vl);
> +
> +			weight = strtol(weight_str, &end, 0);
> +
> +			if (*end)
> +				log_report(" Warning: Cached Option "
> +					   "%s_vlarb_%s:weight=%s "
> +					   "improperly formatted\n",
> +					   prefix, suffix, weight_str);
> +			else if (weight < 0 || weight > 255)
> +				log_report(" Warning: Cached Option "
> +					   "%s_vlarb_%s:weight=%ld "
> +					   "out of range\n",
> +					   prefix, suffix, weight);
> +		} else
> +			log_report(" Warning: Cached Option "
> +				   "%s_vlarb_%s:vl:weight=%s "
> +				   "improperly formatted\n",
> +				   prefix, suffix, tok);
> +
> +		count++;
> +		tok = strtok_r(NULL, ",\n", &ptr);
>  	}
> +
> +	if (count > 64)
> +		log_report(" Warning: Cached Option %s_vlarb_%s: > 64 listed:"
> +			   " excess vl:weight pairs will be dropped\n",
> +			   prefix, suffix);
> +
> +	free(str);
>  }
>  
> -static void subn_verify_sl2vl(char *sl2vl, const char *prefix)
> +static void subn_verify_sl2vl(char **sl2vl, const char *prefix, char *dflt)
>  {
> -	if (sl2vl) {
> -		char *str, *tok, *end, *ptr;
> -		int count = 0;
> +	char *str, *tok, *end, *ptr;
> +	int count = 0;
> +
> +	if (*sl2vl == NULL) {
> +		log_report(" Invalid Cached Option: %s_sl2vl: Using Default\n",
> +			   prefix);
> +		*sl2vl = dflt;
> +		return;
> +	}
>  
> -		str = strdup(sl2vl);
> +	str = strdup(*sl2vl);
>  
> -		tok = strtok_r(str, ",\n", &ptr);
> -		while (tok) {
> -			long vl = strtol(tok, &end, 0);
> +	tok = strtok_r(str, ",\n", &ptr);
> +	while (tok) {
> +		long vl = strtol(tok, &end, 0);
>  
> -			if (*end)
> -				log_report(" Warning: Cached Option %s_sl2vl:"
> -					   "vl=%s improperly formatted\n",
> -					   prefix, tok);
> -			else if (vl < 0 || vl > 15)
> -				log_report(" Warning: Cached Option %s_sl2vl:"
> -					   "vl=%ld out of range\n",
> -					   prefix, vl);
> -
> -			count++;
> -			tok = strtok_r(NULL, ",\n", &ptr);
> -		}
> +		if (*end)
> +			log_report(" Warning: Cached Option %s_sl2vl:vl=%s "
> +				   "improperly formatted\n", prefix, tok);
> +		else if (vl < 0 || vl > 15)
> +			log_report(" Warning: Cached Option %s_sl2vl:vl=%ld "
> +				   "out of range\n", prefix, vl);
>  
> -		if (count < 16)
> -			log_report(" Warning: Cached Option %s_sl2vl: < 16 VLs "
> -				   "listed\n", prefix);
> +		count++;
> +		tok = strtok_r(NULL, ",\n", &ptr);
> +	}
>  
> -		if (count > 16)
> -			log_report(" Warning: Cached Option %s_sl2vl: "
> -				   "> 16 listed: excess VLs will be dropped\n",
> -				   prefix);
> +	if (count < 16)
> +		log_report(" Warning: Cached Option %s_sl2vl: < 16 VLs "
> +			   "listed\n", prefix);
>  
> -		free(str);
> -	}
> +	if (count > 16)
> +		log_report(" Warning: Cached Option %s_sl2vl: > 16 listed: "
> +			   "excess VLs will be dropped\n", prefix);
> +
> +	free(str);
>  }
>  
> -static void subn_verify_qos_set(osm_qos_options_t *set, const char *prefix)
> +static void subn_verify_qos_set(osm_qos_options_t *set, const char *prefix,
> +				osm_qos_options_t *dflt)
>  {
> -	subn_verify_max_vls(&set->max_vls, prefix);
> -	subn_verify_high_limit(&set->high_limit, prefix);
> -	subn_verify_vlarb(set->vlarb_low, prefix, "low");
> -	subn_verify_vlarb(set->vlarb_high, prefix, "high");
> -	subn_verify_sl2vl(set->sl2vl, prefix);
> +	subn_verify_max_vls(&set->max_vls, prefix, dflt->max_vls);
> +	subn_verify_high_limit(&set->high_limit, prefix, dflt->high_limit);
> +	subn_verify_vlarb(&set->vlarb_low, prefix, "low", dflt->vlarb_low);
> +	subn_verify_vlarb(&set->vlarb_high, prefix, "high", dflt->vlarb_high);
> +	subn_verify_sl2vl(&set->sl2vl, prefix, dflt->sl2vl);
>  }
>  
>  static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
> @@ -957,11 +985,24 @@ static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
>  	}
>  
>  	if (p_opts->qos) {
> -		subn_verify_qos_set(&p_opts->qos_options, "qos");
> -		subn_verify_qos_set(&p_opts->qos_ca_options, "qos_ca");
> -		subn_verify_qos_set(&p_opts->qos_sw0_options, "qos_sw0");
> -		subn_verify_qos_set(&p_opts->qos_swe_options, "qos_swe");
> -		subn_verify_qos_set(&p_opts->qos_rtr_options, "qos_rtr");
> +		osm_qos_options_t dflt;
> +
> +		/* the default options in qos_options must be correct.
> +		 * every other one need not be, b/c those will default
> +		 * back to whatever is in qos_options.
> +		 */
> +
> +		subn_set_default_qos_options(&dflt);
> +
> +		subn_verify_qos_set(&p_opts->qos_options, "qos", &dflt);
> +		subn_verify_qos_set(&p_opts->qos_ca_options, "qos_ca",
> +				    &p_opts->qos_options);
> +		subn_verify_qos_set(&p_opts->qos_sw0_options, "qos_sw0",
> +				    &p_opts->qos_options);
> +		subn_verify_qos_set(&p_opts->qos_swe_options, "qos_swe",
> +				    &p_opts->qos_options);
> +		subn_verify_qos_set(&p_opts->qos_rtr_options, "qos_rtr",
> +				    &p_opts->qos_options);
>  	}
>  
>  #ifdef ENABLE_OSM_PERF_MGR
> @@ -1267,30 +1308,31 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
>  		return -1;
>  	}
>  
> +	subn_init_qos_options(&p_subn->opt.qos_options);
> +	subn_init_qos_options(&p_subn->opt.qos_ca_options);
> +	subn_init_qos_options(&p_subn->opt.qos_sw0_options);
> +	subn_init_qos_options(&p_subn->opt.qos_swe_options);
> +	subn_init_qos_options(&p_subn->opt.qos_rtr_options);
> +
>  	while (fgets(line, 1023, opts_file) != NULL) {
>  		/* get the first token */
>  		p_key = strtok_r(line, " \t\n", &p_last);
>  		if (p_key) {
>  			p_val = strtok_r(NULL, " \t\n", &p_last);
>  
> -			subn_parse_qos_options("qos",
> -					       p_key, p_val,
> +			subn_parse_qos_options("qos", p_key, p_val,
>  					       &p_subn->opt.qos_options);
>  
> -			subn_parse_qos_options("qos_ca",
> -					       p_key, p_val,
> +			subn_parse_qos_options("qos_ca", p_key, p_val,
>  					       &p_subn->opt.qos_ca_options);
>  
> -			subn_parse_qos_options("qos_sw0",
> -					       p_key, p_val,
> +			subn_parse_qos_options("qos_sw0", p_key, p_val,
>  					       &p_subn->opt.qos_sw0_options);
>  
> -			subn_parse_qos_options("qos_swe",
> -					       p_key, p_val,
> +			subn_parse_qos_options("qos_swe", p_key, p_val,
>  					       &p_subn->opt.qos_swe_options);
>  
> -			subn_parse_qos_options("qos_rtr",
> -					       p_key, p_val,
> +			subn_parse_qos_options("qos_rtr", p_key, p_val,
>  					       &p_subn->opt.qos_rtr_options);
>  
>  		}
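
The fallback order the patch sets up can be sketched in isolation. This is only an illustration under assumptions: struct qos_set is a hypothetical one-field stand-in for osm_qos_options_t, verify_chain is not a function in the patch, and treating a negative value as "not set" is inferred from the new range check rather than confirmed.

```c
#include <stdio.h>

/* Hypothetical stand-in for osm_qos_options_t, reduced to one field. */
struct qos_set {
	int high_limit;	/* assumed: a negative value means "not set" */
};

/* Replace an out-of-range value with dflt; return 1 if it was replaced. */
int verify_high_limit(int *high_limit, const char *prefix, int dflt)
{
	if (*high_limit < 0 || *high_limit > 255) {
		fprintf(stderr, " Invalid Cached Option: %s_high_limit=%d: "
			"Using Default: %d\n", prefix, *high_limit, dflt);
		*high_limit = dflt;
		return 1;
	}
	return 0;
}

/*
 * Verify the generic "qos" set against the built-in default first,
 * then use the (now valid) generic set as the default for "qos_ca".
 */
void verify_chain(struct qos_set *qos, struct qos_set *qos_ca,
		  int builtin_dflt)
{
	verify_high_limit(&qos->high_limit, "qos", builtin_dflt);
	verify_high_limit(&qos_ca->high_limit, "qos_ca", qos->high_limit);
}
```

The order of the two calls matters: the generic set has to be repaired before it is handed to the per-type sets as their default, which is what the comment in subn_verify_conf_file is pointing out.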
-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory


From chu11 at llnl.gov  Thu Nov 13 09:47:23 2008
From: chu11 at llnl.gov (Al Chu)
Date: Thu, 13 Nov 2008 09:47:23 -0800
Subject: [ofa-general] [ipoib][patch] handle pkey input to create_child and
	delete_child consistently
Message-ID: <1226598443.7156.52.camel@cardanus.llnl.gov>

I noticed that the pkey is handled differently between ipoib's
create_child and delete_child functions.  So a user can create an
interface with a pkey, but not delete it with the same pkey.  That
makes it confusing for the average person.

# /sys/class/net/ib0 > echo 0x6fff > create_child
# /sys/class/net/ib0 > echo 0x6fff > delete_child
-bash: echo: write error: No such file or directory
# /sys/class/net/ib0 > echo 0xefff > delete_child
# /sys/class/net/ib0 >

The attached patch simply bitwise-ORs the full-membership bit into the
pkey in the delete_child function for consistency.  A check for a valid
full-membership bit in the create_child function would be fine as well,
but IMO this is the less confusing option (and is backwards compatible
with any scripts people have already written).
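
As a rough sketch of the normalization (not the attached patch itself; pkey_full_member and PKEY_FULL_MEMBER_BIT are names made up for this example), setting the high bit maps both spellings of the pkey onto the same value:

```c
#include <stdint.h>

/* Bit 15 of a P_Key is the membership bit: set = full, clear = limited. */
#define PKEY_FULL_MEMBER_BIT 0x8000u

/* Hypothetical helper: map a user-supplied pkey to its full-member form. */
uint16_t pkey_full_member(uint16_t pkey)
{
	return (uint16_t)(pkey | PKEY_FULL_MEMBER_BIT);
}
```

With this, 0x6fff and 0xefff from the session above both become 0xefff, so the same input works for create and delete.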

Al

-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-handle-pkey-in-create_child-and-delete_child-consist.patch
Type: text/x-patch
Size: 922 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081113/20571e13/attachment.bin>

From hal.rosenstock at gmail.com  Thu Nov 13 09:56:46 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 13 Nov 2008 12:56:46 -0500
Subject: [ofa-general] [PATCH] opensm/osm_mcast_tbl.c wrong max
	mcast lid cause the sm to set invalid MFT block.
In-Reply-To: <491C4CFA.8000006@gmail.com>
References: <491C4CFA.8000006@gmail.com>
Message-ID: <f0e08f230811130956w6388bf6ex89fc5dd9b5ac6d77@mail.gmail.com>

Hi Eli,

On Thu, Nov 13, 2008 at 10:51 AM, Eli Dorfman <dorfman.eli at gmail.com> wrote:
> wrong max mcast lid cause the sm to set invalid MFT block.
> when mcmember tries to set mcast lid beyond mcast capability (e.g. 0xc400),
> the sm accepts this and tries to set invalid block.

Good find (and nice test case).

Do the switch SMAs reject those invalid sets? I'm hoping that's the case.
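
For reference, the off-by-one can be sketched on its own. This is a simplified model, not the OpenSM code: MCAST_START_HO, max_mlid_ho() and mlid_in_table() are names for this example only, and the 1024-entry capacity mentioned below is an assumed value chosen to line up with the 0xc400 case.

```c
#include <stdint.h>

/* Multicast LIDs start at 0xC000 (host order). */
#define MCAST_START_HO 0xC000u

/* Highest valid MLID for a table of `capacity` entries (inclusive). */
uint16_t max_mlid_ho(uint16_t capacity)
{
	return (uint16_t)(MCAST_START_HO + capacity - 1);
}

/* Nonzero if mlid falls inside a table of `capacity` entries. */
int mlid_in_table(uint16_t mlid, uint16_t capacity)
{
	return mlid >= MCAST_START_HO && mlid <= max_mlid_ho(capacity);
}
```

With an assumed capacity of 1024, the last valid MLID is 0xc3ff, so 0xc400 is rejected. Without the "- 1" the bound would be 0xc400 itself and that request would slip through.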

See below for minor question on the patch.

> Signed-off-by: Eli Dorfman <elid at voltaire.com>
>
> ---
>  opensm/opensm/osm_mcast_tbl.c          |    6 +++---
>  opensm/opensm/osm_sa_mcmember_record.c |    2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/opensm/opensm/osm_mcast_tbl.c b/opensm/opensm/osm_mcast_tbl.c
> index 92fbb63..17fb69c 100644
> --- a/opensm/opensm/osm_mcast_tbl.c
> +++ b/opensm/opensm/osm_mcast_tbl.c
> @@ -81,7 +81,7 @@ osm_mcast_tbl_init(IN osm_mcast_tbl_t * const p_tbl,
>                                                IB_MCAST_BLOCK_SIZE) /
>                                        IB_MCAST_BLOCK_SIZE) - 1);
>
> -       p_tbl->max_mlid_ho = (uint16_t) (IB_LID_MCAST_START_HO + capacity);
> +       p_tbl->max_mlid_ho = (uint16_t) (IB_LID_MCAST_START_HO + capacity - 1);
>
>        /*
>           The number of bytes needed in the mask table is:
> @@ -216,7 +216,7 @@ osm_mcast_tbl_set_block(IN osm_mcast_tbl_t * const p_tbl,
>
>        mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
>
> -       if (mlid_start_ho + IB_MCAST_BLOCK_SIZE > p_tbl->max_mlid_ho)
> +       if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
>                return (IB_INVALID_PARAMETER);
>
>        for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++)
> @@ -274,7 +274,7 @@ osm_mcast_tbl_get_block(IN osm_mcast_tbl_t * const p_tbl,
>
>        mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
>
> -       if (mlid_start_ho + IB_MCAST_BLOCK_SIZE > p_tbl->max_mlid_ho)
> +       if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
>                return (IB_INVALID_PARAMETER);
>
>        for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++)
> diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
> index 5dd286a..6007b06 100644
> --- a/opensm/opensm/osm_sa_mcmember_record.c
> +++ b/opensm/opensm/osm_sa_mcmember_record.c
> @@ -846,7 +846,7 @@ osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa,
>        mlid = __get_new_mlid(sa, mcm_rec.mlid);
>        if (mlid == 0) {
>                OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B19: "
> -                       "__get_new_mlid failed\n");
> +                       "__get_new_mlid failed request mlid 0x%04x\n", mcm_rec.mlid);

                            ^^^^^^^^^^^^^^^^
Should this be cl_ntoh16(mcm_rec.mlid) ?
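
To illustrate the point (a sketch only: it uses the standard ntohs() as a stand-in for cl_ntoh16(), and format_mlid is a hypothetical helper, not an OpenSM function), a 16-bit field stored in network order should be converted to host order before it is printed:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format an MLID that is stored in network byte order.
 * Returns whatever snprintf() returns.
 */
int format_mlid(char *buf, size_t len, uint16_t mlid_net)
{
	return snprintf(buf, len, "requested mlid 0x%04x",
			(unsigned int)ntohs(mlid_net));
}
```

On a little-endian host, printing the raw field would show the two bytes swapped, which is easy to misread in a log.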

-- Hal

>                status = IB_SA_MAD_STATUS_NO_RESOURCES;
>                goto Exit;
>        }
> --
> 1.5.5
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From rdreier at cisco.com  Thu Nov 13 07:22:44 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 13 Nov 2008 07:22:44 -0800
Subject: [ofa-general][PATCH 1/3]mlx4: Multiple completion vectors support
References: <4907348E.7060508@mellanox.co.il> <490A8FA9.7080802@pobox.com>
	<aday7047jos.fsf@cisco.com> <490DA91A.1030703@pobox.com>
	<adaprlew1wd.fsf@cisco.com> <490DD27C.4070109@pobox.com>
	<491C41F0.3080304@mellanox.co.il>
Message-ID: <adaskpvl3pz.fsf@cisco.com>

 > What is the status of this?
 > I know its in mlx_core but mainly needed for mlnx_en and has minimal
 > impact on the IB side
 > I think Roland is at new baby vacation so what is the resolution?

This is 2.6.29 material, and I should be able to get to it in the next
few weeks.

 - R.


From kliteyn at dev.mellanox.co.il  Thu Nov 13 14:22:35 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Fri, 14 Nov 2008 00:22:35 +0200
Subject: [ofa-general] [PATCH] opensm/osm_lid_mgr.c: ignore and overwrite
	guid2lid (windows)
Message-ID: <491CA8AB.1010801@dev.mellanox.co.il>

Hi Sasha,

When Windows crashes with a BSOD, it might corrupt files that were
previously opened for writing, even if the files have since been closed.
As a result, we might see a corrupted guid2lid file, and OpenSM will exit
on such an error. This patch makes the SM ignore (and later overwrite)
corrupted guid2lid files.

The patch has already been accepted into ofw.

I'm posting it to openib too, so that when WinSM is someday
synchronized with OpenSM, this fix won't be lost.
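
The resulting decision can be sketched as a small function. This is a model of the control flow only, with made-up names (enum restore_action, guid2lid_restore_action); the real patch makes the choice at compile time with #ifndef __WIN__ rather than with a runtime flag.

```c
enum restore_action {
	RESTORE_OK,			/* file loaded fine */
	RESTORE_LOG_AND_CONTINUE,	/* log the error, overwrite later */
	RESTORE_ABORT			/* fatal: stop initialization */
};

/*
 * restore_failed: nonzero if loading the guid2lid file failed.
 * exit_on_fatal:  mirrors the OpenSM option of the same name.
 * on_windows:     nonzero for a Windows build.
 */
enum restore_action guid2lid_restore_action(int restore_failed,
					     int exit_on_fatal,
					     int on_windows)
{
	if (!restore_failed)
		return RESTORE_OK;
	if (!on_windows && exit_on_fatal)
		return RESTORE_ABORT;
	return RESTORE_LOG_AND_CONTINUE;
}
```

On Windows a failed restore is never fatal, regardless of exit_on_fatal. Elsewhere the behaviour is unchanged.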

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_lid_mgr.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index 0c536a8..c135d4a 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -261,6 +261,12 @@ osm_lid_mgr_init(IN osm_lid_mgr_t * const p_mgr, IN osm_sm_t *sm)
 	/* we use the stored guid to lid table if not forced to reassign */
 	if (!p_mgr->p_subn->opt.reassign_lids) {
 		if (osm_db_restore(p_mgr->p_g2l)) {
+#ifndef __WIN__
+			/*
+			 * When Windows is BSODing, it might corrupt files that
+			 * were previously opened for writing, even if the files
+			 * are closed, so we might see corrupted guid2lid file.
+			 */
 			if (p_mgr->p_subn->opt.exit_on_fatal) {
 				osm_log(p_mgr->p_log, OSM_LOG_SYS,
 					"FATAL: Error restoring Guid-to-Lid "
@@ -268,6 +274,7 @@ osm_lid_mgr_init(IN osm_lid_mgr_t * const p_mgr, IN osm_sm_t *sm)
 				status = IB_ERROR;
 				goto Exit;
 			} else
+#endif
 				OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 					"ERR 0317: Error restoring Guid-to-Lid "
 					"persistent database\n");
-- 
1.5.1.4


From vlad at lists.openfabrics.org  Fri Nov 14 03:24:46 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 14 Nov 2008 03:24:46 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081114-0200 daily build status
Message-ID: <20081114112447.24D79E60DCB@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From michael.heinz at qlogic.com  Fri Nov 14 08:27:33 2008
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Fri, 14 Nov 2008 10:27:33 -0600
Subject: [ofa-general] ib_ucm does not start correctly on redhat 4 boxes.
Message-ID: <C07C40DB2364324799506DE8FF12F8D886B9A1@EPEXCH1.qlogic.org>

On my Suse machines, the ib_ucm module loads normally and creates its
/dev/infiniband/ucm0 file correctly - but on the redhat boxes, the
device file is never created, even though the module loads.
 
Does anyone know of a fix? I manually created the file with mknod and
that worked; so obviously the module loaded correctly, it's just the
device file that's not getting initialized.
 
--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081114/9f3b2697/attachment.html>

From jsquyres at cisco.com  Fri Nov 14 08:29:29 2008
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 14 Nov 2008 11:29:29 -0500
Subject: [ofa-general] ib_ucm does not start correctly on redhat 4 boxes.
In-Reply-To: <C07C40DB2364324799506DE8FF12F8D886B9A1@EPEXCH1.qlogic.org>
References: <C07C40DB2364324799506DE8FF12F8D886B9A1@EPEXCH1.qlogic.org>
Message-ID: <897AF2B5-724C-46C8-AB5E-F8559D5B4162@cisco.com>

I filed a ticket about this long ago.  Still hasn't been fixed:

     https://bugs.openfabrics.org/show_bug.cgi?id=963


On Nov 14, 2008, at 11:27 AM, Mike Heinz wrote:

> On my Suse machines, the ib_ucm module loads normally and creates  
> its /dev/infiniband/ucm0 file correctly - but on the redhat boxes,  
> the device file is never created, even though the module loads.
>
> Does anyone know of a fix? I manually created the file with mknod  
> and that worked; so obviously the module loaded correctly, it's just  
> the device file that's not getting initialized.
>
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation
> King of Prussia, Pennsylvania


-- 
Jeff Squyres
Cisco Systems


From michael.heinz at qlogic.com  Fri Nov 14 08:35:57 2008
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Fri, 14 Nov 2008 10:35:57 -0600
Subject: [ofa-general] ib_ucm does not start correctly on redhat 4 boxes.
In-Reply-To: <897AF2B5-724C-46C8-AB5E-F8559D5B4162@cisco.com>
References: <C07C40DB2364324799506DE8FF12F8D886B9A1@EPEXCH1.qlogic.org>
	<897AF2B5-724C-46C8-AB5E-F8559D5B4162@cisco.com>
Message-ID: <C07C40DB2364324799506DE8FF12F8D886B9A2@EPEXCH1.qlogic.org>

It's odd because a quick look at the code doesn't show anything
tremendously weird. I wonder if it's a bug in RHEL.... 


--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-----Original Message-----
From: Jeff Squyres [mailto:jsquyres at cisco.com] 
Sent: Friday, November 14, 2008 11:29 AM
To: Mike Heinz
Cc: general at lists.openfabrics.org
Subject: Re: [ofa-general] ib_ucm does not start correctly on redhat 4
boxes.

I filed a ticket about this long ago.  Still hasn't been fixed:

     https://bugs.openfabrics.org/show_bug.cgi?id=963


On Nov 14, 2008, at 11:27 AM, Mike Heinz wrote:

> On my Suse machines, the ib_ucm module loads normally and creates its 
> /dev/infiniband/ucm0 file correctly - but on the redhat boxes, the 
> device file is never created, even though the module loads.
>
> Does anyone know of a fix? I manually created the file with mknod and 
> that worked; so obviously the module loaded correctly, it's just the 
> device file that's not getting initialized.
>
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania


--
Jeff Squyres
Cisco Systems


From tziporet at mellanox.co.il  Fri Nov 14 09:08:24 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Fri, 14 Nov 2008 19:08:24 +0200
Subject: [ofa-general] OFED 1.4  bugs status and OFED meetings
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD0FE73D@mtlexch01.mtl.com>

Hi,

Here is the current bug status.
Bug owners - please update the status of your bugs (I think I saw some
commits, so some of them may already be fixed) and check whether they
are really critical for the release.

ID    Sev  Owner                         Summary
1323  blo  stefan.roscher at de.ibm.com  IB/ehca: possibility of kernel panic under certain circu...
1242  cri  yannick.cote at qlogic.com    kernel panic while running mpi2007 against ofed1.4 -- ib_...
1289  maj  amirv at mellanox.co.il       Ib and ipoib doesnt respond while running multiple tests ...
1349  maj  amirv at mellanox.co.il       Kernel panic on sdp
1379  maj  vu at mellanox.com            Cannot unload ib_srpt module on SRP target
1377  maj  vu at mellanox.com            Deadlock occurred during HA test
1380  maj  vu at mellanox.com            Cannot unload ib_srpt module on SRP target
1279  min  amirv at mellanox.co.il       ltp_sdp connect "already connected successful" very slow
1331  min  amirv at mellanox.co.il       SDP connect to 0.0.0.0 doesn't work

I don't think we need a meeting on Monday (I personally will not be able
to attend).
If we only have bugs in SDP and SRP, we should go ahead and build RC5 on
Monday.

Reminder to all - please send release notes.

Tziporet

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081114/daccfee5/attachment.html>

From jsquyres at cisco.com  Fri Nov 14 09:15:35 2008
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 14 Nov 2008 12:15:35 -0500
Subject: [ofa-general] Re: [ewg] OFED 1.4  bugs status and OFED meetings
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD0FE73D@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD0FE73D@mtlexch01.mtl.com>
Message-ID: <3EFE4684-EAA2-4A49-B0F5-927962D52A12@cisco.com>

On Nov 14, 2008, at 12:08 PM, Tziporet Koren wrote:

> I don't think we need a meeting on Monday (I personally will not be  
> able to attend)
>


Ok.  Unless I hear differently by COB today (US Eastern time), I'll
cancel the phone bridge for Monday.

-- 
Jeff Squyres
Cisco Systems


From akepner at sgi.com  Fri Nov 14 11:43:17 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Fri, 14 Nov 2008 11:43:17 -0800
Subject: [ofa-general] opensm: bad multicast forwarding table entries
In-Reply-To: <f0e08f230811121527p258e47cft416f09a2b1e9ea14@mail.gmail.com>
References: <20081112221846.GE25248@sgi.com>
	<f0e08f230811121527p258e47cft416f09a2b1e9ea14@mail.gmail.com>
Message-ID: <20081114194317.GM25248@sgi.com>


FWIW, I asked for the additional data that Hal requested. 

But this time there are no occurrences of "Disconnected 
switch|HCA" errors from 'ibdiagnet -r'. 

The entire cluster was recently rebooted (probably the IB 
switches, too), opensm restarted, etc. So that seems to have 
cleared things up, at least for now.

But this is something that we've seen on quite a few occasions, 
so we'll keep looking for it, and grab what debug info we can 
when it crops up again.

-- 
Arthur


From hal.rosenstock at gmail.com  Fri Nov 14 13:35:15 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 14 Nov 2008 16:35:15 -0500
Subject: [ofa-general] Re: rate assignment for path queries
In-Reply-To: <20081113131703.GV27271@sashak.voltaire.com>
References: <Pine.LNX.4.64.0811130912140.27833@zuben.voltaire.com>
	<20081113131703.GV27271@sashak.voltaire.com>
Message-ID: <f0e08f230811141335n796496ban7a43bbc7ae8e5ee1@mail.gmail.com>

On Thu, Nov 13, 2008 at 8:17 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> Hi Or,
>
> On 09:20 Thu 13 Nov     , Or Gerlitz wrote:
>>
>> If opensm doesn't have a match on any qos-assignment rule (eg when there's
>> no qos-config file), when coming to serve sa path query,  my understanding
>> is that the "qos related fields" of the partition would be used.
>>
>> For example, I have set the following partition config file which assigns
>> <sl=1,rate=2> to the 0x8001 partition, and run without any qos file.
>>
>> Default=0x7fff,ipoib : ALL=full;
>> RED=0x8001, ipoib, sl=1, rate=2, defmember=full : ALL=full;
>> RED=0x8002, ipoib, sl=2, rate=3, defmember=full : ALL=full;
>>
>> When a path query is issued, Indeed sl=1 is returned but I see that a
>> rate=6 (20Gbs) is returned where I configured rate=2 (2.5 Gbs).
>
> For my best knowledge rate=2 in partition config file will be related to
> corresponded IPoIB multicast group for this partition, and not to
> PathRecord.

There is a form of PR query that supports returning information on
MGIDs when used as a DGID.

> In PathRecord you get maximum available rate on the
> requested path.

Here you are talking about the current OpenSM implementation.

-- Hal

>> Have I done anything wrong? is it a known issue? what does it means
>> when the SM prints "min rate = 6"
>
> Here "min rate" means minimal common rate on the path.

>
> Sasha
>
>>
>> Or.
>>
>>
>> Nov 13 02:12:49 219374 [42803940] 0x08 -> PathRecord dump:
>>                               service_id..............0x0000000000000000
>>                               dgid....................0xfe80000000000000 : 0x0002c90300026be7
>>                               sgid....................0xfe80000000000000 : 0x0002c90300026be3
>>                               dlid....................0x0
>>                               slid....................0x0
>>                               hop_flow_raw............0x0
>>                               tclass..................0x0
>>                               num_path_revers.........0x1
>>                               pkey....................0x8001
>>                               qos_class...............0x0
>>                               sl......................0x0
>>                               mtu.....................0x3
>>                               rate....................0x0
>>                               pkt_life................0x0
>>                               preference..............0x0
>>                               resv2...................0x0
>>                               resv3...................0x0
>> Nov 13 02:12:49 219386 [42803940] 0x10 -> __osm_pr_rcv_check_mcast_dest: [
>> Nov 13 02:12:49 219390 [42803940] 0x10 -> __osm_pr_rcv_check_mcast_dest: ]
>> Nov 13 02:12:49 219394 [42803940] 0x08 -> osm_pr_rcv_process: Unicast destination requested
>> Nov 13 02:12:49 219398 [42803940] 0x10 -> __osm_pr_rcv_get_end_points: [
>> Nov 13 02:12:49 219403 [42803940] 0x10 -> __osm_pr_rcv_get_end_points: ]
>> Nov 13 02:12:49 219407 [42803940] 0x10 -> __osm_pr_rcv_process_pair: [
>> Nov 13 02:12:49 219411 [42803940] 0x10 -> __osm_pr_rcv_get_port_pair_paths: [
>> Nov 13 02:12:49 219415 [42803940] 0x08 -> __osm_pr_rcv_get_port_pair_paths: Src port 0x0002c90300026be3, Dst port 0x0002c90300026be7
>> Nov 13 02:12:49 219420 [42803940] 0x10 -> osm_port_share_pkey: [
>> Nov 13 02:12:49 219424 [42803940] 0x10 -> osm_port_share_pkey: ]
>> Nov 13 02:12:49 219428 [42803940] 0x10 -> osm_port_share_pkey: [
>> Nov 13 02:12:49 219432 [42803940] 0x10 -> osm_port_share_pkey: ]
>> Nov 13 02:12:49 219436 [42803940] 0x10 -> osm_port_share_pkey: [
>> Nov 13 02:12:49 219440 [42803940] 0x10 -> osm_port_share_pkey: ]
>> Nov 13 02:12:49 219444 [42803940] 0x08 -> __osm_pr_rcv_get_port_pair_paths: Src LIDs [0x7-0x7], Dest LIDs [0x8-0x8]
>> Nov 13 02:12:49 219449 [42803940] 0x10 -> __osm_pr_rcv_get_lid_pair_path: [
>> Nov 13 02:12:49 219453 [42803940] 0x08 -> __osm_pr_rcv_get_lid_pair_path: Src LID 0x7, Dest LID 0x8
>> Nov 13 02:12:49 219458 [42803940] 0x10 -> __osm_pr_rcv_get_path_parms: [
>> Nov 13 02:12:49 219464 [42803940] 0x08 -> __osm_pr_rcv_get_path_parms: Path min MTU = 4, min rate = 6
>> Nov 13 02:12:49 219471 [42803940] 0x08 -> __osm_pr_rcv_get_path_parms: Path params: mtu = 4, rate = 6, packet lifetime = 18, pkey = 0x8001, sl = 1
>> Nov 13 02:12:49 219476 [42803940] 0x10 -> __osm_pr_rcv_get_path_parms: ]
>> Nov 13 02:12:49 219480 [42803940] 0x10 -> __osm_pr_rcv_get_path_parms: [
>> Nov 13 02:12:49 219484 [42803940] 0x08 -> __osm_pr_rcv_get_path_parms: Path min MTU = 4, min rate = 6
>> Nov 13 02:12:49 219489 [42803940] 0x08 -> __osm_pr_rcv_get_path_parms: Path params: mtu = 4, rate = 6, packet lifetime = 18, pkey = 0x8001, sl = 1


From hal.rosenstock at gmail.com  Fri Nov 14 13:39:45 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 14 Nov 2008 16:39:45 -0500
Subject: [ofa-general] Re: rate assignment for path queries
In-Reply-To: <491C2EFE.4060900@voltaire.com>
References: <Pine.LNX.4.64.0811130912140.27833@zuben.voltaire.com>
	<20081113131703.GV27271@sashak.voltaire.com>
	<491C2EFE.4060900@voltaire.com>
Message-ID: <f0e08f230811141339s36d405f7we59b5d6f2a93bd43@mail.gmail.com>

On Thu, Nov 13, 2008 at 8:43 AM, Or Gerlitz <ogerlitz at voltaire.com> wrote:
> Sasha Khapyorsky wrote:
>>>
>>> RED=0x8001, ipoib, sl=1, rate=2, defmember=full : ALL=full;
>>>
>>> When a path query is issued, Indeed sl=1 is returned but I see that a
>>> rate=6 (20Gbs) is returned where I configured rate=2 (2.5 Gbs).
>>
>> For my best knowledge rate=2 in partition config file will be related to
>> corresponded IPoIB multicast group for this partition, and not to
>> PathRecord. In PathRecord you get maximum available rate on the requested
>> path.
>
> I understand your comment about the relation to multicast join and not path
> queries. However,  currently, where  there's no rule in the qos-config file
> (or no file) that matches the path query, the SM does provide the SL
> assigned to the partition (specified in the query) through the pkey file but
> it doesn't do so for the Rate. So you say that for QoS = <SL, Rate>
> assignment one should use the qos-policy file, let it be.

I don't think Sasha is saying "should use qos-policy file". You're
asking about the pre-QoS-annex QoS implementation in OpenSM, and I
think this could be viewed as an omission (bug/feature). I think it
could easily be changed in the SA PR/MPR support.

-- Hal
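
For reference, the rate values in this thread are encoded enumerations
rather than raw speeds: rate=2 is 2.5 Gb/s and rate=6 is 20 Gb/s, as
quoted above. A minimal C sketch of a decoder follows. The function name
is hypothetical (it is not an OpenSM API), and the entries other than 2
and 6 come from my reading of the IBA PathRecord rate encoding, so treat
them as assumptions to check against the spec.

```c
/*
 * Decode an encoded IB static rate into tenths of Gb/s
 * (so 25 means 2.5 Gb/s). Returns -1 for values not in the table.
 * Only 2 and 6 are confirmed by the thread; the rest are assumed.
 */
static int ib_rate_to_tenths_gbps(unsigned rate)
{
	switch (rate) {
	case 2:  return 25;    /* 2.5 Gb/s (1x SDR) */
	case 3:  return 100;   /* 10 Gb/s */
	case 4:  return 300;   /* 30 Gb/s */
	case 5:  return 50;    /* 5 Gb/s */
	case 6:  return 200;   /* 20 Gb/s */
	case 7:  return 400;   /* 40 Gb/s */
	case 8:  return 600;   /* 60 Gb/s */
	case 9:  return 800;   /* 80 Gb/s */
	case 10: return 1200;  /* 120 Gb/s */
	default: return -1;    /* reserved or unknown encoding */
	}
}
```

With this table, the partition file's rate=2 decodes to 2.5 Gb/s and the
PathRecord's rate=6 decodes to 20 Gb/s, which is the mismatch Or
reported.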

> Or.
>


From weiny2 at llnl.gov  Fri Nov 14 14:28:48 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 14 Nov 2008 14:28:48 -0800
Subject: [ofa-general] Re: [PATCH V2] Add check for previous versions of
	plugins.
In-Reply-To: <20081109174733.GA30265@sashak.voltaire.com>
References: <20081104095812.2ff5920c.weiny2@llnl.gov>
	<20081109174733.GA30265@sashak.voltaire.com>
Message-ID: <20081114142848.75c64c94.weiny2@llnl.gov>

I believe this will work.  I incorporated your patch but I made this explicit
so it will hopefully be clear what is going on.

Ira


>From 061822466a157bb425600ee0b63cc80ff038d615 Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Mon, 3 Nov 2008 15:50:15 -0800
Subject: [PATCH] Add check for previous versions of plugins.

   If old interface plugins are available to OpenSM they will cause a crash.
   Check for this old version and error out gracefully.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 opensm/include/opensm/osm_event_plugin.h |    1 +
 opensm/opensm/osm_event_plugin.c         |   11 +++++++++++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/opensm/include/opensm/osm_event_plugin.h b/opensm/include/opensm/osm_event_plugin.h
index b2deeba..0922c65 100644
--- a/opensm/include/opensm/osm_event_plugin.h
+++ b/opensm/include/opensm/osm_event_plugin.h
@@ -148,6 +148,7 @@ typedef struct osm_epi_trap_event {
  * The version should be set to OSM_EVENT_PLUGIN_INTERFACE_VER
  */
 #define OSM_EVENT_PLUGIN_IMPL_NAME "osm_event_plugin"
+#define OSM_ORIG_EVENT_PLUGIN_INTERFACE_VER 1
 #define OSM_EVENT_PLUGIN_INTERFACE_VER 2
 typedef struct osm_event_plugin {
 	const char *osm_version;
diff --git a/opensm/opensm/osm_event_plugin.c b/opensm/opensm/osm_event_plugin.c
index c6999f5..b0dc549 100644
--- a/opensm/opensm/osm_event_plugin.c
+++ b/opensm/opensm/osm_event_plugin.c
@@ -66,6 +66,7 @@
 osm_epi_plugin_t *osm_epi_construct(osm_opensm_t *osm, char *plugin_name)
 {
 	char lib_name[OSM_PATH_MAX];
+	struct old_if { unsigned ver; } *old_impl;
 	osm_epi_plugin_t *rc = NULL;
 
 	if (!plugin_name || !*plugin_name)
@@ -96,6 +97,16 @@ osm_epi_plugin_t *osm_epi_construct(osm_opensm_t *osm, char *plugin_name)
 		goto Exit;
 	}
 
+	/* check for old interface */
+	old_impl = (struct old_if *) rc->impl;
+	if (old_impl->ver == OSM_ORIG_EVENT_PLUGIN_INTERFACE_VER) {
+		OSM_LOG(&osm->log, OSM_LOG_ERROR, "Error loading plugin: "
+			"\'%s\' contains a deprecated interface version %d\n"
+			"   Please recompile with the new interface.\n",
+			plugin_name, old_impl->ver);
+		goto Exit;
+	}
+
 	/* Check the version to make sure this module will work with us */
 	if (strcmp(rc->impl->osm_version, osm->osm_version)) {
 		OSM_LOG(&osm->log, OSM_LOG_ERROR, "Error loading plugin"
-- 
1.5.4.5
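
The check in the patch above relies on the old interface starting with
an integer version field, while the new one starts with a string
pointer. A self-contained sketch of that idea follows. Both struct
layouts and the helper name are hypothetical stand-ins, not the real
OpenSM definitions, and memcpy is used here instead of the cast in the
patch only to sidestep aliasing concerns in a standalone example.

```c
#include <string.h>

#define ORIG_INTERFACE_VER 1u	/* mirrors OSM_ORIG_EVENT_PLUGIN_INTERFACE_VER */

/* Hypothetical old layout: an integer interface version comes first. */
struct old_plugin_impl {
	unsigned interface_version;
	void *(*create)(void);
};

/* Hypothetical new layout: a version string pointer comes first. */
struct new_plugin_impl {
	const char *osm_version;
	void *(*create)(void);
};

/*
 * Read the leading unsigned from an opaque impl block and compare it
 * with the original interface version. A real pointer stored in the
 * same position is very unlikely to have exactly this bit pattern.
 */
static int looks_like_old_interface(const void *impl)
{
	unsigned ver;

	memcpy(&ver, impl, sizeof(ver));
	return ver == ORIG_INTERFACE_VER;
}
```

This is a heuristic: it works because a small integer such as 1 is not
a plausible leading value for the new layout, not because the two
layouts are formally distinguishable.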


From weiny2 at llnl.gov  Fri Nov 14 14:54:06 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 14 Nov 2008 14:54:06 -0800
Subject: [ofa-general] [PATCH] Fix max parameter passed to umad_get_cas_names
Message-ID: <20081114145406.57dff1a7.weiny2@llnl.gov>

>From a9149f4e38081d206d0be0af2194f4e09f944f21 Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Fri, 14 Nov 2008 11:36:01 -0800
Subject: [PATCH] Fix max parameter passed to umad_get_cas_names


Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/src/ibstat.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c
index 6be1302..e2775ca 100644
--- a/infiniband-diags/src/ibstat.c
+++ b/infiniband-diags/src/ibstat.c
@@ -65,6 +65,8 @@
 
 static int debug;
 
+#define MAX_DEVICES 20
+
 char *argv0 = "ibstat";
 
 static char *node_type_str[] = {
@@ -201,7 +203,7 @@ usage(void)
 int
 main(int argc, char *argv[])
 {
-	char names[20][UMAD_CA_NAME_LEN];
+	char names[MAX_DEVICES][UMAD_CA_NAME_LEN];
 	int dev_port = -1;
 	int list_only = 0, short_format = 0, list_ports = 0;
 	int n, i;
@@ -254,7 +256,7 @@ main(int argc, char *argv[])
 	if (umad_init() < 0)
 		IBPANIC("can't init UMAD library");
 
-	if ((n = umad_get_cas_names((void *)names, UMAD_CA_NAME_LEN)) < 0)
+	if ((n = umad_get_cas_names((void *)names, MAX_DEVICES)) < 0)
 		IBPANIC("can't list IB device names");
 
 	if (argc) {
-- 
1.5.4.5
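
The point of the fix above is that the second argument to
umad_get_cas_names() is, as I understand its prototype, the number of
name slots in the array and not the length of one name. The sketch below
shows the same calling convention with a hypothetical stand-in function
(fill_names and NAME_LEN are made up for illustration, so it builds
without libibumad).

```c
#include <string.h>

#define NAME_LEN    32	/* illustrative stand-in for UMAD_CA_NAME_LEN */
#define MAX_DEVICES 20	/* number of name slots, as in the patch */

/*
 * Hypothetical stand-in for a "list device names" call: copies at most
 * `max` names from `src` into `names` and returns how many were copied.
 * `max` counts array rows, not characters per row.
 */
static int fill_names(char names[][NAME_LEN], int max,
		      const char *const *src, int nsrc)
{
	int n = 0;

	while (n < max && n < nsrc) {
		strncpy(names[n], src[n], NAME_LEN - 1);
		names[n][NAME_LEN - 1] = '\0';	/* always terminate */
		n++;
	}
	return n;
}
```

Passing the per-name length as `max` only happens to work when the two
constants are equal; naming the row count explicitly, as the patch does
with MAX_DEVICES, removes that coincidence.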


From arlin.r.davis at intel.com  Fri Nov 14 17:02:51 2008
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Fri, 14 Nov 2008 17:02:51 -0800
Subject: [ofa-general] [PATCH]  uDAPL release notes updated for OFED 1.4
Message-ID: <000a01c946bd$e295c840$4797070a@amr.corp.intel.com>


uDAPL_release_notes.txt updated for OFED 1.4

Signed-off-by: Arlin Davis <ardavis at ichips.intel.com>

Tziporet, please pull into OFED 1.4. Thanks!

diff --git a/uDAPL_release_notes.txt b/uDAPL_release_notes.txt
index 23b3d8b..33bbf0e 100644
--- a/uDAPL_release_notes.txt
+++ b/uDAPL_release_notes.txt
@@ -1,15 +1,70 @@
 		   Release Notes for 
-		OFED 1.3.1 DAPL Release
-		    June 2008
+		OFED 1.4 DAPL Release
+		    November 2008
 
 
-        OFED 1.3.1 RELEASE NOTES
+      OFED 1.4 RELEASE NOTES
 
 	This release of the DAPL reference implementation 
-        is timed to coincide with OFED release 1.3.1 of the 
-        Open Fabrics (www.openfabrics.org) software stack.
+	is timed to coincide with OFED release 1.3.1 of the 
+	Open Fabrics (www.openfabrics.org) software stack.
+
+	NEW SINCE OFED 1.3.1
+	
+        OFED 1.4 includes new versions compat-dapl-1.2.12-1, dapl-2.0.15-1
+
+	Summary of changes since OFED 1.3.1 release:
+
+	* New Features (scalability improvements - socket cm and UD support)
+
+	1. The new socket CM provider, introduced in 1.2.8 and 2.0.11 packages, 
+	   assumes homogeneous cluster and will setup the QP's based on local 
+	   HCA port attributes and exchanges QP information via socket's using 
+	   the hostname of each node. IPoIB and rdma_cm are NOT required for 
+	   this provider. QP attributes can be adjusted via the following 
+	   environment parameters: 
+
+		DAPL_ACK_TIMER (default=16 5 bits, 4.096us*2^ack_timer. 16 =268ms) 
+		DAPL_ACK_RETRY (default=7 3 bits, 7 * 268ms = 1.8 seconds) 
+		DAPL_RNR_TIMER (default=12 5 bits, 12 = 0.64ms, 28 = 163ms, 31 = 491ms) 
+		DAPL_RNR_RETRY (default=7 3 bits, 7 = infinite) 
+		DAPL_IB_MTU (default=1024, limited to active MTU max) 
+
+	 The new socket cm entries in /etc/dat.conf provide a link to the actual 
+         HCA device and port. Example v1 and v2 entries for a Mellanox connectx 
+         device, port 1: 
+	 - OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" "" 
+	 - ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" "" 
+	
+	2. New v2 definitions for IB unreliable datagram extension 
+	   (only supported in v2 scm provider, libdaploscm.so.2) 
+		- Extended EP dat_service_type, with DAT_IB_SERVICE_TYPE_UD 
+		- Add IB extension call dat_ib_post_send_ud(). 
+		- Add address handle definition for UD calls. 
+		- Add IB event definitions to provide remote AH via connect 
+		  and connect requests 
+		- See dtestx (-d) source for example usage model
+	
+	* Bug Fixes
+
+	v1,v2 - allow override of /etc/dat.conf via syscondir option 
+	v1,v2 - fix dapltest transaction test to avoid cleanup before rdma complete 
+	v1    - add ipath, ehca socket cm provider entries for v1.2, sync with v2.0 
+	v1,v2 - iWarp, 1 iov on rdma_reads, reduce iov's in dtest, add dat.conf entry 
+	v1,v2 - add $(DESTDIR) on install/uninstall hooks 
+	v2    - add new options to dtestx for UD testing 
+	v2    - IB UD fixes in common code/socket cm provider to allow multiple EP support 
+	v1,v2 - iWarp, 1 iov on rdma_reads, reduce iov's in dtest, add dat.conf entry 
+	v1,v2 - add $(DESTDIR) on install/uninstall hooks
+	v2    - add new options to dtestx for UD testing 
+ 	v2    - IB UD fixes in common code/socket cm provider to allow multiple	EP support 
+	v2	- fix dtest and dtestx build warnings
+	v1,v2 - socket cm fixes, added DAPL_IB_MTU, 
+		  changed default QP timers, include NULL definition.
+	v1,v2 - Fix compiler warnings: dat, dapl, dtest, and dapltest 
+
+      NEW SINCE OFED 1.3
 
-        NEW SINCE OFED 1.3
 	OFED 1.3.1 includes new versions of uDAPL v1 (1.2.7-1) and v2 (2.0.9-1)
 	
 	Summary of changes since OFED 1.3 release:
@@ -23,7 +78,7 @@
 	v1,v2 - long delay during dat_ia_open when DNS not configured 
 	v1,v2 - use rdma_read_in/out from ep_attr per consumer instead of HCA max 
         
-        NEW SINCE OFED 1.2
+      NEW SINCE OFED 1.2
 
         * New Features
           1. Add v2.0 library support for new 2.0 API Specification
@@ -62,10 +117,10 @@
           - dtest: typo in memset
   
 
-        BUILD: v1 and v2 uDAPL source install/build instructions (redhat example):
+      BUILD: v1 and v2 uDAPL source install/build instructions (redhat example):
 
-        # cd to distribution SRPMS directory
-	cd /tmp/OFED-1.3/SRPMS
+      # cd to distribution SRPMS directory
+	  cd /tmp/OFED-1.3/SRPMS
         rpm -i dapl-1.2*.rpm
         rpm -i dapl-2.0*.rpm
         cd /usr/src/redhat/SOURCES
@@ -110,7 +165,7 @@
 	DAPL_DBG_TYPE_CNTR      = 0x1000
 
 
-        NEW SINCE Gamma 3.2 and OFED 1.1
+      NEW SINCE Gamma 3.2 and OFED 1.1
 
         * New Features
 
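
The DAPL_ACK_TIMER arithmetic quoted in the notes above
(4.096us * 2^ack_timer, with 16 giving about 268ms and 7 retries giving
about 1.8 seconds) can be checked with a few lines of C. This is only a
sketch of the formula as stated in the release notes; the function names
are hypothetical and not part of uDAPL.

```c
/*
 * Timeout in nanoseconds for a 5-bit ACK timer value, using the
 * formula from the release notes: 4.096 us * 2^ack_timer.
 * 4.096 us is 4096 ns, so this is a plain left shift.
 * Returns 0 for values outside 1..31.
 */
static unsigned long long ack_timeout_ns(unsigned ack_timer)
{
	if (ack_timer == 0 || ack_timer > 31)
		return 0;
	return 4096ULL << ack_timer;
}

/* Total time spent before giving up after `retries` timeouts. */
static unsigned long long ack_retry_budget_ns(unsigned ack_timer,
					      unsigned retries)
{
	return ack_timeout_ns(ack_timer) * retries;
}
```

For the defaults, ack_timeout_ns(16) is 268,435,456 ns (about 268ms)
and ack_retry_budget_ns(16, 7) is 1,879,048,192 ns (about 1.8 seconds),
matching the figures in the notes.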

From panda at cse.ohio-state.edu  Fri Nov 14 19:56:36 2008
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri, 14 Nov 2008 22:56:36 -0500 (EST)
Subject: [ofa-general] Announcing the release of MVAPICH 1.1
Message-ID: <Pine.GSO.4.40.0811142255140.26862-100000@xi.cse.ohio-state.edu>

The MVAPICH team is pleased to announce the availability of
MVAPICH-1.1 with the following NEW features:

- New Features for OpenFabrics Gen2-IB Interface
  - eXtended Reliable Connection (XRC) support
  - Lock-free design to provide support for asynchronous
    progress at both sender and receiver to overlap
    computation and communication
  - Optimized MPI_allgather collective
  - Efficient intra-node shared memory communication
    support for diskless clusters
  - Enhanced Totalview Support with the new mpirun_rsh framework

- New OpenFabrics Gen2-Hybrid Interface
  - Replaces the Gen2-UD interface of MVAPICH 1.0 series
  - Targeted for large-scale IB clusters (multi-thousand cores) to
    provide highest performance and minimal memory usage
  - Support for UD, RC and XRC transports
  - Adaptive selection during run-time (based on application and
    systems characteristics) to switch between RC and UD
    (or between XRC and UD) transports
  - Delivers performance and scalability with near constant
    memory footprint for communication contexts
  - Zero-copy protocol with UD for large data transfer
  - Multiple buffer organizations with XRC support
  - Shared memory communication between cores within a node
  - Efficient intra-node shared memory communication
    support for diskless clusters
  - Multi-core optimized collectives
    (MPI_Bcast, MPI_Barrier, MPI_Reduce and MPI_Allreduce)
  - Optimized MPI_Allgather collective
  - Enhanced Totalview Support with the new mpirun_rsh framework

- New Features for MVAPICH-InfiniPath (QLogic) Interface
  - Enhanced Totalview Support with the new mpirun_rsh framework

- New Features for Shared-Memory only Interface
  - Enhanced Totalview Support with the new mpirun_rsh framework

More details on all features and supported platforms can be obtained
by visiting the following URL:

http://mvapich.cse.ohio-state.edu/overview/mvapich/features.shtml

MVAPICH 1.1 is being made available with OFED 1.4. It is also tested
with OFED 1.3. It continues to deliver excellent performance.  Sample
performance numbers include:

  OpenFabrics/Gen2-IB on EM64T quad-core with PCIe2 and ConnectX-QDR:
        - 1.17 microsec one-way latency (4 bytes)
        - 2569 MB/sec unidirectional bandwidth
        - 5025 MB/sec bidirectional bandwidth

  OpenFabrics/Gen2-Hybrid on EM64T quad-core with PCIe2 and ConnectX-QDR:
        - 1.18 microsec one-way latency (4 bytes)
        - 2571 MB/sec unidirectional bandwidth
        - 5027 MB/sec bidirectional bandwidth

  OpenFabrics/Gen2-IB on Opteron quad-core with PCIe and ConnectX-DDR:
        - 1.62 microsec one-way latency (4 bytes)
        - 1628 MB/sec unidirectional bandwidth
        - 2889 MB/sec bidirectional bandwidth

  InfiniPath on EM64T quad-core with PCIe2 and QLogic-DDR:
        - 1.28 microsec one-way latency (4 bytes)
        - 1953 MB/sec unidirectional bandwidth

Performance numbers for several other platforms, system configurations
and operations can be viewed by visiting the `Performance' section of
the project's web page.

For downloading MVAPICH 1.1 package and accessing the anonymous SVN,
please visit the following URL:

http://mvapich.cse.ohio-state.edu/

All feedback, including bug reports, hints for performance tuning,
patches and enhancements, is welcome. Please post it to the
mvapich-discuss mailing list.

Thanks,

The MVAPICH Team


From hal.rosenstock at gmail.com  Sat Nov 15 02:34:38 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 15 Nov 2008 05:34:38 -0500
Subject: [ofa-general] Re: rate assignment for path queries
In-Reply-To: <f0e08f230811141339s36d405f7we59b5d6f2a93bd43@mail.gmail.com>
References: <Pine.LNX.4.64.0811130912140.27833@zuben.voltaire.com>
	<20081113131703.GV27271@sashak.voltaire.com>
	<491C2EFE.4060900@voltaire.com>
	<f0e08f230811141339s36d405f7we59b5d6f2a93bd43@mail.gmail.com>
Message-ID: <f0e08f230811150234u4f7f2622ya6cb3321a3ac4f09@mail.gmail.com>

On Fri, Nov 14, 2008 at 4:39 PM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:
> On Thu, Nov 13, 2008 at 8:43 AM, Or Gerlitz <ogerlitz at voltaire.com> wrote:
>> Sasha Khapyorsky wrote:
>>>>
>>>> RED=0x8001, ipoib, sl=1, rate=2, defmember=full : ALL=full;
>>>>
>>>> When a path query is issued, Indeed sl=1 is returned but I see that a
>>>> rate=6 (20Gbs) is returned where I configured rate=2 (2.5 Gbs).
>>>
>>> For my best knowledge rate=2 in partition config file will be related to
>>> corresponded IPoIB multicast group for this partition, and not to
>>> PathRecord. In PathRecord you get maximum available rate on the requested
>>> path.
>>
>> I understand your comment about the relation to multicast join and not path
>> queries. However,  currently, where  there's no rule in the qos-config file
>> (or no file) that matches the path query, the SM does provide the SL
>> assigned to the partition (specified in the query) through the pkey file but
>> it doesn't do so for the Rate. So you say that for QoS = <SL, Rate>
>> assignment one should use the qos-policy file, let it be.
>
> I think Sasha is not saying "should use qos-policy file". You're
> asking about the pre QoS annex Qos implementation in OpenSM and I
> think this could be viewed as an omission (bug/feature). I think it
> could easily be changed in SA PR/MPR support.

It's the current semantics of rate just applying to the multicast
group in the partition policy file as Sasha pointed out. Unicast
traffic would be disadvantaged if using that rate. So if this were to
be done, it would need another flag for these semantics there.

Is this needed ?

-- Hal

> -- Hal
>
>> Or.
>


From vlad at lists.openfabrics.org  Sat Nov 15 03:17:00 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 15 Nov 2008 03:17:00 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081115-0200 daily build status
Message-ID: <20081115111700.670B4E601B7@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From james_ at catbus.co.uk  Sat Nov 15 02:36:35 2008
From: james_ at catbus.co.uk (James Beal)
Date: Sat, 15 Nov 2008 10:36:35 +0000
Subject: [ofa-general] srp_daemon and partitions.
Message-ID: <774A4005-446E-40D1-A70E-DBCBF12219F0@catbus.co.uk>


We are currently investigating InfiniBand and are so far very impressed
with the ease of use of the OFED stack. However, we seem to have run
into an issue with SRP disc discovery.

We wish to protect the storage from unwanted use. In a Fibre Channel
SAN environment this would be done in two ways: firstly presentation
(configuring the controller as to which LUNs each WWN can access), and
secondly zoning, which is configuring the switches that make up the
fabric as to which ports can communicate. If we can't do this, it would
restrict IB to a single use, e.g. as a replacement for fibre switches.

I can't see how to specify to either srp_daemon or ibsrpdm which pkey
to use when discovering discs, and a quick look at the source code
doesn't inspire confidence, as I can see pkey=ffff as a string in the
code.

I did try the following:

One host with one adapter communicating with a DDN controller, with no
access control (pkeys).

The correct lun information was discovered.

root at isg-dev6:~# ibsrpdm -c
id_ext=50001ff3000501f0,ioc_guid=50001ff3000501f0,dgid=fe8000000000000050001ff4000501f0,pkey=ffff,service_id=f0010500f31f0050


Access control was reasserted, as can be seen from the fact that the
LUN can no longer be discovered.

root at isg-dev6:~# ibsrpdm -c

The device was created by "hand"  with the pkey set to the correct value

echo "id_ext=50001ff3000501f0,ioc_guid=50001ff3000501f0,dgid=fe8000000000000050001ff4000501f0,pkey=1001,service_id=f0010500f31f0050" > /sys/class/infiniband_srp/srp-mthca0-1/add_target

And the device can be seen.

multipath -ll
360001ff001f0dbac01000800000a6a6cdm-0 DDN     ,S2A 9900
[size=5.2T][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][enabled]
  \_ 5:0:0:1 sdb 8:16  [active][ready]


So the issue appears to be that ibsrpdm/srp_daemon do not allow the
pkey to be set.

The following message suggests the same.

user_mad: process ibsrpdm did not enable P_Key index support.
user_mad:   Documentation/infiniband/user_mad.txt has info on the new
ABI.
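
The manual workaround above amounts to taking the target string that
ibsrpdm -c prints and substituting the pkey field before writing it to
add_target. A minimal C sketch of that substitution follows. The helper
name is hypothetical; it only rewrites the string and does not touch
sysfs or change how discovery itself is done.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy `target` into `out`, replacing the value of its "pkey=" field
 * with `pkey` formatted as four hex digits. Returns 0 on success, or
 * -1 if the field is missing or `out` is too small.
 */
static int set_target_pkey(const char *target, unsigned pkey,
			   char *out, size_t outlen)
{
	const char *value = strstr(target, "pkey=");
	const char *rest;
	int n;

	if (!value)
		return -1;
	value += strlen("pkey=");

	/* The old value runs up to the next comma or the end of string. */
	rest = strchr(value, ',');
	if (!rest)
		rest = value + strlen(value);

	n = snprintf(out, outlen, "%.*s%04x%s",
		     (int)(value - target), target, pkey & 0xffffu, rest);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Given the discovered string with pkey=ffff and a pkey argument of
0x1001, the result is the same string with pkey=1001, which is what was
echoed into add_target by hand above.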


From ogerlitz at voltaire.com  Sat Nov 15 22:17:42 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 16 Nov 2008 08:17:42 +0200
Subject: [ofa-general] Re: rate assignment for path queries
In-Reply-To: <f0e08f230811150234u4f7f2622ya6cb3321a3ac4f09@mail.gmail.com>
References: <Pine.LNX.4.64.0811130912140.27833@zuben.voltaire.com>	
	<20081113131703.GV27271@sashak.voltaire.com>	
	<491C2EFE.4060900@voltaire.com>	
	<f0e08f230811141339s36d405f7we59b5d6f2a93bd43@mail.gmail.com>
	<f0e08f230811150234u4f7f2622ya6cb3321a3ac4f09@mail.gmail.com>
Message-ID: <491FBB06.3050704@voltaire.com>

Hal Rosenstock wrote:
> It's the current semantics of rate just applying to the multicast group in the partition policy file as Sasha pointed out. Unicast traffic would be disadvantaged if using that rate. So if this were to be done, it would need another flag for these semantics there.
> Is this needed ?
>   
At this point of time, I don't see any need for a change here.

Or.


From amirv at mellanox.co.il  Sat Nov 15 23:58:48 2008
From: amirv at mellanox.co.il (Amir Vadai)
Date: Sun, 16 Nov 2008 09:58:48 +0200
Subject: [ofa-general] Re: OFED 1.4  bugs status and OFED meetings
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD0FE73D@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD0FE73D@mtlexch01.mtl.com>
Message-ID: <491FD2B8.4060301@mellanox.co.il>

BUG1279 and BUG1331 are very minor bugs and won't be fixed for the release.


Tziporet Koren wrote:
>
> Hi,
>
> This is the bugs status
>
> Bug owners - please update bugs status (I think I saw some commits so
> maybe some of them are already fixed) and see if they are really
> critical for the release
>
> Bug   Sev  Owner                          Summary
> 1323  blo  stefan.roscher at de.ibm.com   IB/ehca: possibility of kernel panic under certain circu...
> 1242  cri  yannick.cote at qlogic.com     kernel panic while running mpi2007 against ofed1.4 -- ib_...
> 1289  maj  amirv at mellanox.co.il        Ib and ipoib doesnt respond while running multiple tests ...
> 1349  maj  amirv at mellanox.co.il        Kernel panic on sdp
> 1379  maj  vu at mellanox.com             Cannot unload ib_srpt module on SRP target
> 1377  maj  vu at mellanox.com             Deadlock occurred during HA test
> 1380  maj  vu at mellanox.com             Cannot unload ib_srpt module on SRP target
> 1279  min  amirv at mellanox.co.il        ltp_sdp connect "already connected successful" very slow
> 1331  min  amirv at mellanox.co.il        SDP connect to 0.0.0.0 doesn't work
>
> I don't think we need a meeting on Monday (I personally will not be
> able to attend)
>
> If we only have bugs in SDP and SRP we should go ahead and build RC5
> on Monday
>
> Reminder to all - please send release notes
>
> Tziporet
>


From vlad at dev.mellanox.co.il  Sun Nov 16 02:01:20 2008
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Sun, 16 Nov 2008 12:01:20 +0200
Subject: [ofa-general] [PATCH]  uDAPL release notes updated for OFED 1.4
In-Reply-To: <000a01c946bd$e295c840$4797070a@amr.corp.intel.com>
References: <000a01c946bd$e295c840$4797070a@amr.corp.intel.com>
Message-ID: <491FEF70.4060601@dev.mellanox.co.il>

Arlin Davis wrote:
> uDAPL_release_notes.txt updated for OFED 1.4
>
> Signed-off-by: Arlin Davis <ardavis at ichips.intel.com>
>
> Tziporet, please pull into OFED 1.4. Thanks!
>   
Applied,

Regards,
Vladimir


From dorfman.eli at gmail.com  Sun Nov 16 02:58:46 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Sun, 16 Nov 2008 12:58:46 +0200
Subject: [ofa-general] [PATCH] opensm/osm_mcast_tbl.c wrong
	max mcast lid cause the sm to set invalid MFT block.
In-Reply-To: <f0e08f230811130956w6388bf6ex89fc5dd9b5ac6d77@mail.gmail.com>
References: <491C4CFA.8000006@gmail.com>
	<f0e08f230811130956w6388bf6ex89fc5dd9b5ac6d77@mail.gmail.com>
Message-ID: <491FFCE6.1070309@gmail.com>

Hal Rosenstock wrote:
> Hi Eli,
> 
> On Thu, Nov 13, 2008 at 10:51 AM, Eli Dorfman <dorfman.eli at gmail.com> wrote:
>> wrong max mcast lid cause the sm to set invalid MFT block.
>> when mcmember tries to set mcast lid beyond mcast capability (e.g. 0xc400),
>> the sm accepts this and tries to set invalid block.
> 
> Good find (and nice test case).
> 
> Do the switch SMA's reject those invalid sets ? I'm hoping that's the case.

yes it is rejected as invalid.

> 
> See below for minor question on the patch.
> 
>> Signed-off-by: Eli Dorfman <elid at voltaire.com>
>>
>> ---
>>  opensm/opensm/osm_mcast_tbl.c          |    6 +++---
>>  opensm/opensm/osm_sa_mcmember_record.c |    2 +-
>>  2 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_mcast_tbl.c b/opensm/opensm/osm_mcast_tbl.c
>> index 92fbb63..17fb69c 100644
>> --- a/opensm/opensm/osm_mcast_tbl.c
>> +++ b/opensm/opensm/osm_mcast_tbl.c
>> @@ -81,7 +81,7 @@ osm_mcast_tbl_init(IN osm_mcast_tbl_t * const p_tbl,
>>                                                IB_MCAST_BLOCK_SIZE) /
>>                                        IB_MCAST_BLOCK_SIZE) - 1);
>>
>> -       p_tbl->max_mlid_ho = (uint16_t) (IB_LID_MCAST_START_HO + capacity);
>> +       p_tbl->max_mlid_ho = (uint16_t) (IB_LID_MCAST_START_HO + capacity - 1);
>>
>>        /*
>>           The number of bytes needed in the mask table is:
>> @@ -216,7 +216,7 @@ osm_mcast_tbl_set_block(IN osm_mcast_tbl_t * const p_tbl,
>>
>>        mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
>>
>> -       if (mlid_start_ho + IB_MCAST_BLOCK_SIZE > p_tbl->max_mlid_ho)
>> +       if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
>>                return (IB_INVALID_PARAMETER);
>>
>>        for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++)
>> @@ -274,7 +274,7 @@ osm_mcast_tbl_get_block(IN osm_mcast_tbl_t * const p_tbl,
>>
>>        mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
>>
>> -       if (mlid_start_ho + IB_MCAST_BLOCK_SIZE > p_tbl->max_mlid_ho)
>> +       if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
>>                return (IB_INVALID_PARAMETER);
>>
>>        for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++)
>> diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
>> index 5dd286a..6007b06 100644
>> --- a/opensm/opensm/osm_sa_mcmember_record.c
>> +++ b/opensm/opensm/osm_sa_mcmember_record.c
>> @@ -846,7 +846,7 @@ osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa,
>>        mlid = __get_new_mlid(sa, mcm_rec.mlid);
>>        if (mlid == 0) {
>>                OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B19: "
>> -                       "__get_new_mlid failed\n");
>> +                       "__get_new_mlid failed request mlid 0x%04x\n", mcm_rec.mlid);
> 
>                             ^^^^^^^^^^^^^^^^
> Should this be cl_ntoh16(mcm_rec.mlid) ?

yes, i'll fix the patch.

Thanks,
Eli


From dorfman.eli at gmail.com  Sun Nov 16 03:06:17 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Sun, 16 Nov 2008 13:06:17 +0200
Subject: Re: [ofa-general] [PATCH v2] opensm/osm_mcast_tbl.c wrong
	max mcast lid cause the sm to set invalid MFT block.
In-Reply-To: <f0e08f230811130956w6388bf6ex89fc5dd9b5ac6d77@mail.gmail.com>
References: <491C4CFA.8000006@gmail.com>
	<f0e08f230811130956w6388bf6ex89fc5dd9b5ac6d77@mail.gmail.com>
Message-ID: <491FFEA9.2090500@gmail.com>

 wrong max mcast lid cause the sm to set invalid MFT block.
 when mcmember tries to set mcast lid beyond mcast capability (e.g. 0xc400),
 the sm accepts this and tries to set invalid block.

 Signed-off-by: Eli Dorfman <elid at voltaire.com>

---
 opensm/opensm/osm_mcast_tbl.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_mcast_tbl.c b/opensm/opensm/osm_mcast_tbl.c
index 92fbb63..17fb69c 100644
--- a/opensm/opensm/osm_mcast_tbl.c
+++ b/opensm/opensm/osm_mcast_tbl.c
@@ -81,7 +81,7 @@ osm_mcast_tbl_init(IN osm_mcast_tbl_t * const p_tbl,
 						IB_MCAST_BLOCK_SIZE) /
 					IB_MCAST_BLOCK_SIZE) - 1);
 
-	p_tbl->max_mlid_ho = (uint16_t) (IB_LID_MCAST_START_HO + capacity);
+	p_tbl->max_mlid_ho = (uint16_t) (IB_LID_MCAST_START_HO + capacity - 1);
 
 	/*
 	   The number of bytes needed in the mask table is:
@@ -216,7 +216,7 @@ osm_mcast_tbl_set_block(IN osm_mcast_tbl_t * const p_tbl,
 
 	mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
 
-	if (mlid_start_ho + IB_MCAST_BLOCK_SIZE > p_tbl->max_mlid_ho)
+	if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
 		return (IB_INVALID_PARAMETER);
 
 	for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++)
@@ -274,7 +274,7 @@ osm_mcast_tbl_get_block(IN osm_mcast_tbl_t * const p_tbl,
 
 	mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE);
 
-	if (mlid_start_ho + IB_MCAST_BLOCK_SIZE > p_tbl->max_mlid_ho)
+	if (mlid_start_ho + IB_MCAST_BLOCK_SIZE - 1 > p_tbl->max_mlid_ho)
 		return (IB_INVALID_PARAMETER);
 
 	for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++)
-- 
1.5.5


From dorfman.eli at gmail.com  Sun Nov 16 03:08:04 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Sun, 16 Nov 2008 13:08:04 +0200
Subject: [ofa-general] [PATCH] opensm/osm_sa_mcmember_record.c
 print multicast lid in error message
Message-ID: <491FFF14.6050006@gmail.com>

 print multicast lid in error message

 Signed-off-by: Eli Dorfman <elid at voltaire.com>

---
 opensm/opensm/osm_sa_mcmember_record.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
index 5dd286a..4e77f06 100644
--- a/opensm/opensm/osm_sa_mcmember_record.c
+++ b/opensm/opensm/osm_sa_mcmember_record.c
@@ -846,7 +846,7 @@ osm_mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa,
 	mlid = __get_new_mlid(sa, mcm_rec.mlid);
 	if (mlid == 0) {
 		OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B19: "
-			"__get_new_mlid failed\n");
+			"__get_new_mlid failed request mlid 0x%04x\n", cl_ntoh16(mcm_rec.mlid));
 		status = IB_SA_MAD_STATUS_NO_RESOURCES;
 		goto Exit;
 	}
-- 
1.5.5
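
The cl_ntoh16() in the patch above matters because the MLID in the
record is in network byte order, so printing it directly would show
swapped bytes on a little-endian host. The sketch below shows the same
idea using the standard ntohs()/htons() as stand-ins for the complib
helpers; the function name is hypothetical.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a network-byte-order MLID as "0x%04x" in host order.
 * Returns the snprintf() result.
 */
static int format_mlid(uint16_t mlid_net, char *buf, size_t len)
{
	return snprintf(buf, len, "0x%04x", ntohs(mlid_net));
}
```

Calling format_mlid(htons(0xc400), ...) yields "0xc400" regardless of
host endianness, whereas formatting the raw field would print 0x00c4 on
a little-endian machine.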


From vlad at lists.openfabrics.org  Sun Nov 16 03:19:53 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 16 Nov 2008 03:19:53 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081116-0200 daily build status
Message-ID: <20081116111953.C7012E608DC@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From sashak at voltaire.com  Sun Nov 16 04:16:25 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Nov 2008 14:16:25 +0200
Subject: [ofa-general] [PATCH v2] opensm/osm_mcast_tbl.c wrong max
	mcast lid cause the sm to set invalid MFT block.
In-Reply-To: <491FFEA9.2090500@gmail.com>
References: <491C4CFA.8000006@gmail.com>
	<f0e08f230811130956w6388bf6ex89fc5dd9b5ac6d77@mail.gmail.com>
	<491FFEA9.2090500@gmail.com>
Message-ID: <20081116121625.GA12418@sashak.voltaire.com>

On 13:06 Sun 16 Nov     , Eli Dorfman wrote:
>  wrong max mcast lid cause the sm to set invalid MFT block.
>  when mcmember tries to set mcast lid beyond mcast capability (e.g. 0xc400),
>  the sm accepts this and tries to set invalid block.
> 
>  Signed-off-by: Eli Dorfman <elid at voltaire.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Nov 16 04:17:01 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Nov 2008 14:17:01 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_sa_mcmember_record.c print
	multicast lid in error message
In-Reply-To: <491FFF14.6050006@gmail.com>
References: <491FFF14.6050006@gmail.com>
Message-ID: <20081116121701.GB12418@sashak.voltaire.com>

On 13:08 Sun 16 Nov     , Eli Dorfman wrote:
>  print multicast lid in error message
> 
>  Signed-off-by: Eli Dorfman <elid at voltaire.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Nov 16 04:19:37 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Nov 2008 14:19:37 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_lid_mgr.c: ignore and overwrite
	guid2lid (windows)
In-Reply-To: <491CA8AB.1010801@dev.mellanox.co.il>
References: <491CA8AB.1010801@dev.mellanox.co.il>
Message-ID: <20081116121937.GC12418@sashak.voltaire.com>

On 00:22 Fri 14 Nov     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> When Windows is crashing with BSOD, it might corrupt files that were
> previously opened for writing, even if the files are closed. As a result,
> we might see corrupted guid2lid file, and OpenSM will exit on such error.
> This patch makes SM ignore (and later overwrite) corrupted guid2lid files.
> 
> The patch has already been accepted into ofw.
> 
> I'm posting it to openib too, so that when some day WinSM will be
> synchronized with OpenSM, this fix won't be lost.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Nov 16 04:24:48 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Nov 2008 14:24:48 +0200
Subject: [ofa-general] Re: [PATCH] Fix max parameter passed to
	umad_get_cas_names
In-Reply-To: <20081114145406.57dff1a7.weiny2@llnl.gov>
References: <20081114145406.57dff1a7.weiny2@llnl.gov>
Message-ID: <20081116122448.GD12418@sashak.voltaire.com>

On 14:54 Fri 14 Nov     , Ira Weiny wrote:
> From a9149f4e38081d206d0be0af2194f4e09f944f21 Mon Sep 17 00:00:00 2001
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Fri, 14 Nov 2008 11:36:01 -0800
> Subject: [PATCH] Fix max parameter passed to umad_get_cas_names
> 
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Nov 16 04:37:00 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Nov 2008 14:37:00 +0200
Subject: [ofa-general] Re: [opensm patch][1/2] fix qos config parsing bugs
In-Reply-To: <1226596802.7156.41.camel@cardanus.llnl.gov>
References: <1225404078.1197.533.camel@cardanus.llnl.gov>
	<20081111191958.GA8894@sashak.voltaire.com>
	<1226447872.6239.2.camel@cardanus.llnl.gov>
	<20081113002403.GI27271@sashak.voltaire.com>
	<1226596802.7156.41.camel@cardanus.llnl.gov>
Message-ID: <20081116123700.GF12418@sashak.voltaire.com>

On 09:20 Thu 13 Nov     , Al Chu wrote:
> Hey Sasha,
> 
> On Thu, 2008-11-13 at 02:24 +0200, Sasha Khapyorsky wrote:
> > Hi Al,
> > 
> > On 15:57 Tue 11 Nov     , Al Chu wrote:
> > > 
> > > Sorry, I may not have explained it well. Let's say I do this in the
> > > config file.
> > > 
> > > qos_vlarb_high FOOBAR
> > > # qos_ca_vlarb_high BLAH
> > > qos_swe_vlarb_high XYZZY
> > > 
> > > I currently expect qos_ca_vlarb_high to use the value of FOOBAR because
> > > I commented out the field.  But it uses OSM_DEFAULT_QOS_HIGH_LIMIT
> > > instead.  The reason is because qos_build_config() checks for NULL to
> > > use default vs. non-default values.
> > > 
> > > p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;
> > > 
> > > Under the above situation where I've commented out several fields,
> > > opt->vlarb_high is always non-NULL b/c it was set to
> > > OSM_DEFAULT_QOS_HIGH_LIMIT. Thus OSM_DEFAULT_QOS_HIGH_LIMIT is used
> > > instead of FOOBAR.
> > > 
> > > > > 2)
> > > > > 
> > > > > In qos_build_config() we load the high_limit like this:
> > > > > 
> > > > > cfg->vl_high_limit = (uint8_t) opt->high_limit;
> > > > > 
> > > > > So there is no way to tell the qos_ca, qos_swe, qos_rtr, etc. high_limit
> > > > > options to "go back to" the default high_limit.  It just assumes that
> > > > > whatever is input (or was set by default) is what you should use.
> > > > 
> > > > Right. What is a limitation here? That an user cannot set this to
> > > > "no value"? But she/he can just skip it.
> > > 
> > > Similar to the above issue, let's say I want to do:
> > > 
> > > qos_high_limit 8
> > > # qos_ca_high_limit 15
> > > # qos_swe_high_limit 15
> > > 
> > > I want qos_ca_high_limit and qos_swe_high_limit to use whatever I set in
> > > qos_high_limit.  But the code doesn't allow for this.
> > > 
> > > > 
> > > > > 3)
> > > > > 
> > > > > Some fields like qos_vlarb_high are assumed to be correctly set and can
> > > > > segfault opensm.
> > > > 
> > > > qos_build_config() assumes that valid parameters are used. And we are
> > > > using it this way (I hope :)) (after all, it is not a library API).
> > > 
> > > I think the issue is the osm_subnet.c code did not properly check all
> > > inputs, and subsequently some inputs used in qos_build_config() were
> > > bad.  I think
> > > 
> > > qos_vlarb_high (null)
> > > 
> > > was something I tried that opensm seg-faulted on.  
> > 
> > Ok. I see now.
> > 
> > Probably it will be simpler just to generate a valid qos parameter sets
> > right after parser (in verification time)?
> 
> Ahh, I see what you did.  It's much cleaner this way.
> 
> > Like in your modified (and
> > rebased against recent patches) patch below?
> 
> Patch looks good to me.

Applied. Thanks.

Sasha
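
The fallback order Al asks for in the quoted discussion (per-type option,
then the generic option, then the built-in default) can be sketched as
below. This is only an illustration and not the applied patch: struct
qos_opts, resolve_vlarb_high() and DEFAULT_VLARB_HIGH are hypothetical
names, and the default string is a made-up placeholder.

```c
#include <stddef.h>

/* Hypothetical stand-in for the built-in default; the value is illustrative. */
#define DEFAULT_VLARB_HIGH "0:4,1:0,2:0,3:0"

/* Hypothetical option block: NULL means "not set in the config file". */
struct qos_opts {
	const char *vlarb_high;
};

/*
 * Pick the per-type value (e.g. qos_ca_*) if it was set, otherwise the
 * generic value (qos_*), otherwise the built-in default.
 */
const char *resolve_vlarb_high(const struct qos_opts *specific,
			       const struct qos_opts *generic)
{
	if (specific && specific->vlarb_high)
		return specific->vlarb_high;
	if (generic && generic->vlarb_high)
		return generic->vlarb_high;
	return DEFAULT_VLARB_HIGH;
}
```

The key point is that an option commented out in the config file has to
stay NULL until this resolution step. If it is pre-filled with the default
at parse time, the generic value can never take effect.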


From sashak at voltaire.com  Sun Nov 16 04:41:15 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Nov 2008 14:41:15 +0200
Subject: [ofa-general] Re: [PATCH V2] Add check for previous versions of
	plugins.
In-Reply-To: <20081114142848.75c64c94.weiny2@llnl.gov>
References: <20081104095812.2ff5920c.weiny2@llnl.gov>
	<20081109174733.GA30265@sashak.voltaire.com>
	<20081114142848.75c64c94.weiny2@llnl.gov>
Message-ID: <20081116124115.GG12418@sashak.voltaire.com>

On 14:28 Fri 14 Nov     , Ira Weiny wrote:
> I believe this will work.  I incorporated your patch but I made this explicit
> so it will hopefully be clear what is going on.
> 
> Ira
> 
> 
> From 061822466a157bb425600ee0b63cc80ff038d615 Mon Sep 17 00:00:00 2001
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Mon, 3 Nov 2008 15:50:15 -0800
> Subject: [PATCH] Add check for previous versions of plugins.
> 
>    If old interface plugins are available to OpenSM they will cause a crash.
>    Check for this old version and error out gracefully.
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Sun Nov 16 06:40:29 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Nov 2008 16:40:29 +0200
Subject: [ofa-general] opensm: bad multicast forwarding table entries
In-Reply-To: <20081114194317.GM25248@sgi.com>
References: <20081112221846.GE25248@sgi.com>
	<f0e08f230811121527p258e47cft416f09a2b1e9ea14@mail.gmail.com>
	<20081114194317.GM25248@sgi.com>
Message-ID: <20081116144029.GD6183@sashak.voltaire.com>

On 11:43 Fri 14 Nov     , akepner at sgi.com wrote:
> 
> But this is something that we've seen on quite a few occasions, 
> so we'll keep looking for it, and grab what debug info we can 
> when it crops up again.

Thanks!

Sasha


From sashak at voltaire.com  Sun Nov 16 06:42:24 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 16 Nov 2008 16:42:24 +0200
Subject: [ofa-general] Re: rate assignment for path queries
In-Reply-To: <f0e08f230811141335n796496ban7a43bbc7ae8e5ee1@mail.gmail.com>
References: <Pine.LNX.4.64.0811130912140.27833@zuben.voltaire.com>
	<20081113131703.GV27271@sashak.voltaire.com>
	<f0e08f230811141335n796496ban7a43bbc7ae8e5ee1@mail.gmail.com>
Message-ID: <20081116144224.GE6183@sashak.voltaire.com>

On 16:35 Fri 14 Nov     , Hal Rosenstock wrote:
> On Thu, Nov 13, 2008 at 8:17 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > Hi Or,
> >
> > On 09:20 Thu 13 Nov     , Or Gerlitz wrote:
> >>
> >> If opensm doesn't have a match on any qos-assignment rule (eg when there's
> >> no qos-config file), when coming to serve sa path query,  my understanding
> >> is that the "qos related fields" of the partition would be used.
> >>
> >> For example, I have set the following partition config file which assigns
> >> <sl=1,rate=2> to the 0x8001 partition, and run without any qos file.
> >>
> >> Default=0x7fff,ipoib : ALL=full;
> >> RED=0x8001, ipoib, sl=1, rate=2, defmember=full : ALL=full;
> >> RED=0x8002, ipoib, sl=2, rate=3, defmember=full : ALL=full;
> >>
> >> When a path query is issued, Indeed sl=1 is returned but I see that a
> >> rate=6 (20Gbs) is returned where I configured rate=2 (2.5 Gbs).
> >
> > To the best of my knowledge, rate=2 in the partition config file applies
> > to the corresponding IPoIB multicast group for this partition, and not to
> > PathRecord.
> 
> There is a form of PR query that supports returning information on
> MGIDs when used as a DGID.
> 
> > In PathRecord you get maximum available rate on the
> > requested path.
> 
> Here you are talking about current OpenSM implementation.

Yes.

Sasha
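
For readers following the rate=2 versus rate=6 numbers in the quoted text:
these are encoded rate values, not Gbps. A small lookup helper, assuming the
static rate encodings defined by the InfiniBand specification of that
period, might look like this. ib_rate_str() is a hypothetical helper name,
not an OpenSM function.

```c
#include <stddef.h>

/*
 * Map an encoded IB static rate to a human-readable string.
 * Returns NULL for values outside the table.
 */
const char *ib_rate_str(unsigned rate)
{
	switch (rate) {
	case 2:  return "2.5 Gbps";
	case 3:  return "10 Gbps";
	case 4:  return "30 Gbps";
	case 5:  return "5 Gbps";
	case 6:  return "20 Gbps";
	case 7:  return "40 Gbps";
	case 8:  return "60 Gbps";
	case 9:  return "80 Gbps";
	case 10: return "120 Gbps";
	default: return NULL;
	}
}
```

With this table, the partition line "rate=2" reads as 2.5 Gbps and the
returned PathRecord rate of 6 reads as 20 Gbps, matching the figures Or
quotes.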


From constantine.gavrilov at gmail.com  Sun Nov 16 07:34:32 2008
From: constantine.gavrilov at gmail.com (Constantine Gavrilov)
Date: Sun, 16 Nov 2008 17:34:32 +0200
Subject: [ofa-general] SDP Fixes
Message-ID: <49203D88.7020103@gmail.com>

While playing with SDP code in OFED 1.3.1 (latest stable), I have 
encountered a number of bugs in the zero-copy send code:

* sdp_bz_setup() code does not handle the case of kernel data segment 
correctly (kernel sockets)
* sdp_bz_setup() does not pass ENOMEM, EFAULT or other errors to 
sendmsg(). In fact, a possible negative return from get_user_pages() is 
not handled.
* the deallocation of bz descriptor in sendmsg() is not handled properly 
-- it is allocated many times, but freed once.
* sdp_bzcopy_get() code does not raise reference count for all  pages in 
the bz descriptor (only the "partial" pages will get the count raised).
   However, the send completion code will call put_page() on all 
entries, leading to a crash for page-aligned transfers.

Attached, please find a patch that solves these problems. With this 
patch, I can use SDP and send page-aligned kernel buffers even for 
zero-copy case.

Still, I do not see any performance benefit when using the zero-copy 
method. I have tried various thresholds (32K, 64K, 128K), and zero-copy 
was always slower.

It seems that the penalty of memcpy() is negligible compared to the 
penalty of reconfiguring the card to use different addresses.

Also, looking at the sendmsg() code, I can say that allocating and 
deallocating a bz descriptor for each iov element is not optimal. 
Instead, an existing bz descriptor can be re-used if it fits.


-- 
----------------------------------------
Constantine Gavrilov
Kernel Developer
Platform Group
XIV, an IBM global brand 
1 Azrieli Center, Tel-Aviv
Phone: +972-3-6074672
Fax:   +972-3-6959749
----------------------------------------


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: sdp_patch.diff.txt
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081116/dfedc9ae/attachment.txt>

From halr at obsidianresearch.com  Sun Nov 16 10:37:28 2008
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Sun, 16 Nov 2008 11:37:28 -0700
Subject: [ofa-general] [PATCH 1/2] libibumad: Add UMAD_MAX_DEVICES define
Message-ID: <49206868.5040303@obsidianresearch.com>

Sasha,

Following Ira's ibstat patch...

-- Hal
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch-umad-maxdevices1
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081116/a875ac38/attachment.ksh>

From halr at obsidianresearch.com  Sun Nov 16 10:37:31 2008
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Sun, 16 Nov 2008 11:37:31 -0700
Subject: [ofa-general] [PATCH 2/2] infiniband-diags/ibstat.c: Use
	UMAD_MAX_DEVICES define
Message-ID: <4920686B.3010804@obsidianresearch.com>

Sasha,

Please see attached patch.

-- Hal
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch-ibstat-maxdevices1
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081116/697a36da/attachment.ksh>

From halr at obsidianresearch.com  Sun Nov 16 10:37:34 2008
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Sun, 16 Nov 2008 11:37:34 -0700
Subject: [ofa-general] [PATCH][TRIVIAL] opensm/osm_trap_rcv.c: Fix typo
Message-ID: <4920686E.9070209@obsidianresearch.com>

Sasha,

Please see attached patch.

-- Hal

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch-osmtrap1
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081116/9a735f4c/attachment.ksh>

From john.donners at sara.nl  Mon Nov 17 02:02:08 2008
From: john.donners at sara.nl (John Donners)
Date: Mon, 17 Nov 2008 11:02:08 +0100
Subject: [ofa-general] [Fwd: EHCA_ERR:ehcau_modify_qp ibv_cmd_modify_qp()
	failed ret=22]
Message-ID: <49214120.2090400@sara.nl>

Dear all,

I work for the support team at the SARA supercomputing center in Amsterdam.
We are debugging an application that uses OpenIB directly. Soon after 
startup the application fails and the error message is:

PID5c2b ehca0 EHCA_ERR:ehcau_modify_qp ibv_cmd_modify_qp() failed ret=22 
qp=0x2aa3f880 qp_num=1c8f
Last System Error Message from Task 0:: Invalid argument
PID10e7 ehca0 EHCA_ERR:ehcau_modify_qp ibv_cmd_modify_qp() failed ret=22 
qp=0x2aa3f070 qp_num=174c
Last System Error Message from Task 32:: Invalid argument
ERROR: 0031-250 task 35: Terminated
ERROR: 0031-250 task 14: Terminated

To be honest, I don't know the code and I haven't used ibverbs myself
before either, but maybe you could shed some light on what this means
and what we can do about it.

We have a Power6 system with Infiniband running Suse Linux Enterprise
Server 10. The system installation includes OFED-1.3 and libibverbs-1.1.1.

With regards,
John

-- 
John Donners          tel (31)20 5923055
SARA, Kruislaan 415   fax (31)20 6683167
1098 SJ Amsterdam     john.donners at sara.nl
The Netherlands


From ogerlitz at voltaire.com  Mon Nov 17 03:17:29 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 17 Nov 2008 13:17:29 +0200
Subject: [ofa-general] [PATCH] perftest: don't attach the sender QP
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EADF6284B@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EADF6284B@mtlexch01.mtl.com>
Message-ID: <492152C9.8010509@voltaire.com>

Oren Meron wrote:
> What about the send_lat test?
Latency tests are typically based on RTT/2 measurements, which means that 
both sides send and receive. If this applies to the send_lat test, 
please don't apply the same patch there. As for the bandwidth test, 
I think my patch has a defect: it also keeps the client from joining in 
the bidirectional test, where it does need to join. Sorry.

Also, I don't think I have ever managed to see the server-side statistics 
printed. This means that I could only see the sender bandwidth, which is 
not necessarily the receiver bandwidth. Seeing the receiver bandwidth 
matters for the unicast UD tests as well.

One nice enhancement you might want to look at would be to take an MGID 
from the command line and attach to it instead of the current 
implementation. This would allow more than one receiver.

Or.


From vlad at lists.openfabrics.org  Mon Nov 17 03:36:17 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 17 Nov 2008 03:36:17 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081117-0200 daily build status
Message-ID: <20081117113617.AAB33E608E5@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From kliteyn at dev.mellanox.co.il  Mon Nov 17 04:56:28 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 17 Nov 2008 14:56:28 +0200
Subject: [ofa-general] [PATCH] opensm/osm_sa_mcmember_record.c: bad return
 state when leaving mcast
Message-ID: <492169FC.7040609@dev.mellanox.co.il>

Hi Sasha,

Re-fixing our recent fix in handling multicast leave.
When updating the state will cause port removal, port
object will be freed, so bad things will happen if we
try using its state.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_sa_mcmember_record.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
index 4e77f06..99aee1b 100644
--- a/opensm/opensm/osm_sa_mcmember_record.c
+++ b/opensm/opensm/osm_sa_mcmember_record.c
@@ -1085,10 +1085,14 @@ __osm_mcmr_rcv_leave_mgrp(IN osm_sa_t * sa,
 		goto Exit;
 	}

+	/* store state - we'll need it if the port is removed */
+	mcmember_rec.scope_state = p_mcm_port->scope_state;
+
 	/* remove port or update join state */
 	removed = osm_mgrp_remove_port(sa->p_subn, sa->p_log, p_mgrp, p_mcm_port,
 				       p_recvd_mcmember_rec->scope_state&0x0F);
-	mcmember_rec.scope_state = p_mcm_port->scope_state;
+	if (!removed)
+		mcmember_rec.scope_state = p_mcm_port->scope_state;

 	CL_PLOCK_RELEASE(sa->p_lock);

-- 
1.5.1.4
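
The pattern in the patch above (save the state before a call that may free
the object, and re-read it only if the object survived) can be sketched in
isolation. This is a toy model under stated assumptions: struct mcm_port,
remove_port() and leave_group() are hypothetical stand-ins, and
remove_port() only imitates the "free when no join-state bits remain"
behaviour described in the patch comment.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy port object: high nibble is scope, low nibble is join state. */
struct mcm_port {
	uint8_t scope_state;
};

/*
 * Clear the requested join-state bits. If none remain, free the port
 * and return true; the caller must not touch the port afterwards.
 */
static bool remove_port(struct mcm_port *port, uint8_t leave_bits)
{
	port->scope_state &= (uint8_t)~(leave_bits & 0x0F);
	if ((port->scope_state & 0x0F) == 0) {
		free(port);
		return true;
	}
	return false;
}

/* Return the state to report back without reading freed memory. */
uint8_t leave_group(struct mcm_port *port, uint8_t leave_bits)
{
	/* Save the state while the port is still guaranteed to be valid. */
	uint8_t reported = port->scope_state;

	/* Re-read only if the port was not freed by the removal. */
	if (!remove_port(port, leave_bits))
		reported = port->scope_state;
	return reported;
}
```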


From kliteyn at dev.mellanox.co.il  Mon Nov 17 04:58:10 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 17 Nov 2008 14:58:10 +0200
Subject: [ofa-general] [PATCH] opensm/osmtest: fixing some comments in mcast
	flow of osmtest
Message-ID: <49216A62.5010300@dev.mellanox.co.il>

Some cosmetics - fixing comments in multicast flow.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/osmtest/osmt_multicast.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/opensm/osmtest/osmt_multicast.c b/opensm/osmtest/osmt_multicast.c
index 57a8772..165457c 100644
--- a/opensm/osmtest/osmt_multicast.c
+++ b/opensm/osmtest/osmt_multicast.c
@@ -2138,7 +2138,7 @@ ib_api_status_t osmt_run_mcast_flow(IN osmtest_t * const p_osmt)
 	comp_mask = IB_MCR_COMPMASK_GID | IB_MCR_COMPMASK_PORT_GID | IB_MCR_COMPMASK_QKEY | IB_MCR_COMPMASK_PKEY | IB_MCR_COMPMASK_SL | IB_MCR_COMPMASK_FLOW | IB_MCR_COMPMASK_JOIN_STATE | IB_MCR_COMPMASK_TCLASS |	/* all above are required */
 	    IB_MCR_COMPMASK_RATE_SEL | IB_MCR_COMPMASK_RATE;
 	/* link-local scope, non member (so we should not be able to delete) */
-	/* but the FullMember bit should be gone */
+	/* but the NonMember bit should be gone */
 	mc_req_rec.scope_state = 0x22;

 	status = osmt_send_mcast_request(p_osmt, 0,
@@ -2155,7 +2155,7 @@ ib_api_status_t osmt_run_mcast_flow(IN osmtest_t * const p_osmt)

 	OSM_LOG(&p_osmt->log, OSM_LOG_INFO,
 		"Validating Join State removal of Non Member bit (o15.0.1.14)...\n");
-	if (p_mc_res->scope_state != 0x25) {	/* scope is MSB - now only the non member & send only member have left */
+	if (p_mc_res->scope_state != 0x25) {	/* scope is MSB - now only the full member & send only member have left */
 		OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, "ERR 02CA: "
 			"Validating JoinState update failed. Expected 0x25 got: 0x%02X\n",
 			p_mc_res->scope_state);
-- 
1.5.1.4
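
To read the 0x22 and 0x25 values in the patch above: scope_state packs the
scope in the high nibble and the JoinState bits in the low nibble. The
helpers below are a sketch, assuming the JoinState bit assignment from the
InfiniBand specification (bit 0 FullMember, bit 1 NonMember, bit 2
SendOnlyNonMember). The macro and function names are hypothetical and are
not taken from the OpenSM headers.

```c
#include <stdint.h>

/* JoinState bits in the low nibble of scope_state (assumed per IBA spec). */
#define JOIN_FULL_MEMBER          0x1
#define JOIN_NON_MEMBER           0x2
#define JOIN_SEND_ONLY_NON_MEMBER 0x4

/* Scope lives in the high nibble; 0x2 is link-local. */
uint8_t mc_scope(uint8_t scope_state)
{
	return scope_state >> 4;
}

/* JoinState lives in the low nibble. */
uint8_t mc_join_state(uint8_t scope_state)
{
	return scope_state & 0x0F;
}
```

Under this reading, 0x22 is a link-local request carrying only the
NonMember bit, and 0x25 is link-local with FullMember and
SendOnlyNonMember left, which is what the corrected comments say.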


From orenmeron at mellanox.co.il  Mon Nov 17 02:30:14 2008
From: orenmeron at mellanox.co.il (Oren Meron)
Date: Mon, 17 Nov 2008 12:30:14 +0200
Subject: [ofa-general] [PATCH] perftest: don't attach the sender QP
In-Reply-To: <491A7BAC.5030708@voltaire.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EADF6284B@mtlexch01.mtl.com>

Hi Or,
Sorry for the late response.
Applied and committed to OFED-1.4.
What about the send_lat test ?
Thanks.

Oren   Meron
Performance

-----Original Message-----
From: Or Gerlitz [mailto:ogerlitz at voltaire.com] 
Sent: Wednesday, November 12, 2008 8:46 AM
To: Oren Meron
Cc: general at lists.openfabrics.org
Subject: Re: [ofa-general] [PATCH] perftest: don't attach the sender QP

Or Gerlitz wrote:
> don't attach the sender QP to the MGID
>   
Oren,

Did you have a chance to look into this patch?

Or.
> Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
>
> Index: perftest-1.2/send_bw.c
> ===================================================================
> --- perftest-1.2.orig/send_bw.c
> +++ perftest-1.2/send_bw.c
> @@ -421,7 +421,7 @@ static struct pingpong_context *pp_init_
>  			return NULL;
>  		}
>
> -		if ((user_parm->connection_type==UD) && (user_parm->use_mcg)) {
> +		if ((user_parm->connection_type==UD) && (user_parm->use_mcg) && !user_parm->servername) {
>  			union ibv_gid gid;
>  			uint8_t mcg_gid[16] = MCG_GID;
>


From hal.rosenstock at gmail.com  Mon Nov 17 07:30:54 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 17 Nov 2008 10:30:54 -0500
Subject: [ofa-general] erroneous ibdiagnet subnet check warning
Message-ID: <f0e08f230811170730m83eec6ajd4dddfa969d7afa3@mail.gmail.com>

Hi Oren,

ibdiagnet (version 1.3.0rc14 source undefined) reports:
-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:20Gbps > group-rate:10Gbps

but there are SDR links internal to the subnet so this warning is
erroneous and should be fixed for this configuration. It's not just
the member rates that need checking to determine this. I've filed bug
1394 for this issue.

Also, ofed_info shows OFED-1.4-rc3 so is version 1.3.0rc14 on
ibdiagnet correct ?

Thanks for your attention to this.

-- Hal
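
The check being asked for can be sketched as follows: the group rate is
only suboptimal if the slowest member port and the slowest link inside the
subnet both exceed it. This is an illustration, not ibdiagnet code. The
function names are hypothetical, rates are plain Gbps integers, and the
caller is assumed to supply the rates of the internal links that multicast
traffic can cross.

```c
#include <stddef.h>

/* Lowest value in an array of rates (Gbps); returns 0 for an empty array. */
static unsigned min_rate(const unsigned *rates, size_t n)
{
	unsigned lowest = n ? rates[0] : 0;

	for (size_t i = 1; i < n; i++)
		if (rates[i] < lowest)
			lowest = rates[i];
	return lowest;
}

/*
 * Return 1 if the group could run faster than group_rate, taking both
 * the member ports and the internal links into account, else 0.
 */
int group_rate_is_suboptimal(const unsigned *member_rates, size_t n_members,
			     const unsigned *link_rates, size_t n_links,
			     unsigned group_rate)
{
	unsigned lowest = min_rate(member_rates, n_members);
	unsigned link_min = min_rate(link_rates, n_links);

	/* An internal link slower than every member caps the usable rate. */
	if (n_links && link_min < lowest)
		lowest = link_min;
	return n_members && lowest > group_rate;
}
```

With 20Gbps members, one SDR (10Gbps) internal link and a 10Gbps group
rate, this returns 0, so no warning would be printed for the configuration
described above.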


From hal.rosenstock at gmail.com  Mon Nov 17 07:59:13 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 17 Nov 2008 10:59:13 -0500
Subject: [ofa-general] OpenSM handling of defunct SMs
Message-ID: <f0e08f230811170759j264fe3e8i2781221b774fa71@mail.gmail.com>

Sasha,

What I observe is that OpenSM 3.2.2 continues to poll/retry SMInfo for
a now defunct SM which spams the OpenSM log.

It looks like SMs are removed from the sm_guid_tbl only when the port
is dropped/removed. Shouldn't an SM also be removed after a trap
144 indicating that the capability mask changed (and the new
capability mask no longer includes IsSM)? I don't see this anywhere in the
code. Am I missing something?

If so, should osm_port_info_rcv.c:__osm_pi_rcv_process_endport remove
these so rather than:

                p_sm_tbl = &sm->p_subn->sm_guid_tbl;
                p_sm = (osm_remote_sm_t *) cl_qmap_get(p_sm_tbl, port_guid);
                if (p_sm != (osm_remote_sm_t *) cl_qmap_end(p_sm_tbl))
                        /* clean it up */
                        p_sm->smi.pri_state = 0xF0 & p_sm->smi.pri_state;

                if (p_pi->capability_mask & IB_PORT_CAP_IS_SM) {

it should be something like:
                p_sm_tbl = &sm->p_subn->sm_guid_tbl;
                if (p_pi->capability_mask & IB_PORT_CAP_IS_SM) {
                    p_sm = (osm_remote_sm_t *) cl_qmap_get(p_sm_tbl, port_guid);
                    if (p_sm != (osm_remote_sm_t *) cl_qmap_end(p_sm_tbl))
                            /* clean it up */
                            p_sm->smi.pri_state = 0xF0 & p_sm->smi.pri_state;
                    ...
                } else
                    p_sm = (osm_remote_sm_t *) cl_qmap_remove(p_sm_tbl, port_guid);

-- Hal


From sashak at voltaire.com  Mon Nov 17 20:21:24 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 06:21:24 +0200
Subject: [ofa-general] Re: [PATCH 1/2] libibumad: Add UMAD_MAX_DEVICES define
In-Reply-To: <49206868.5040303@obsidianresearch.com>
References: <49206868.5040303@obsidianresearch.com>
Message-ID: <20081118042124.GB10251@sashak.voltaire.com>

On 11:37 Sun 16 Nov     , Hal Rosenstock wrote:
> Sasha,
>
> Following Ira's ibstat patch...
>
> -- Hal

> libibumad: Add UMAD_MAX_DEVICES define
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Mon Nov 17 20:21:41 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 06:21:41 +0200
Subject: [ofa-general] Re: [PATCH 2/2] infiniband-diags/ibstat.c: Use
	UMAD_MAX_DEVICES define
In-Reply-To: <4920686B.3010804@obsidianresearch.com>
References: <4920686B.3010804@obsidianresearch.com>
Message-ID: <20081118042141.GC10251@sashak.voltaire.com>

On 11:37 Sun 16 Nov     , Hal Rosenstock wrote:
> Sasha,
>
> Please see attached patch.
>
> -- Hal

> infiniband-diags/ibstat.c: Use UMAD_MAX_DEVICES define
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Mon Nov 17 20:22:08 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 06:22:08 +0200
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_trap_rcv.c: Fix typo
In-Reply-To: <4920686E.9070209@obsidianresearch.com>
References: <4920686E.9070209@obsidianresearch.com>
Message-ID: <20081118042208.GD10251@sashak.voltaire.com>

On 11:37 Sun 16 Nov     , Hal Rosenstock wrote:
> Sasha,
>
> Please see attached patch.
>
> -- Hal
>

> opensm/osm_trap_rcv.c: Fix typo
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Mon Nov 17 20:22:40 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 06:22:40 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_sa_mcmember_record.c: bad
	return state when leaving mcast
In-Reply-To: <492169FC.7040609@dev.mellanox.co.il>
References: <492169FC.7040609@dev.mellanox.co.il>
Message-ID: <20081118042240.GE10251@sashak.voltaire.com>

On 14:56 Mon 17 Nov     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> Re-fixing our recent fix in handling multicast leave.
> When updating the state will cause port removal, port
> object will be freed, so bad things will happen if we
> try using its state.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Mon Nov 17 20:23:14 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 06:23:14 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osmtest: fixing some comments in
	mcast flow of osmtest
In-Reply-To: <49216A62.5010300@dev.mellanox.co.il>
References: <49216A62.5010300@dev.mellanox.co.il>
Message-ID: <20081118042314.GF10251@sashak.voltaire.com>

On 14:58 Mon 17 Nov     , Yevgeny Kliteynik wrote:
> Some cosmetics - fixing comments in multicast flow.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Mon Nov 17 20:41:15 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 06:41:15 +0200
Subject: [ofa-general] [ANNOUNCE] management tarballs release
Message-ID: <20081118044115.GG10251@sashak.voltaire.com>

Hi,

There is a new release of the management (OpenSM and infiniband
diagnostics) tarballs available in:

http://www.openfabrics.org/downloads/management/

md5sum:

89a49b57015524bc3f6ca8667b640b2d  libibumad-1.2.3.tar.gz
bf172da0e70dc4ce6cc625fde8707d00  libibmad-1.2.3.tar.gz
93e14f69ce5004bfdef1009f84a53eb7  opensm-3.2.4.tar.gz
32665d7fb2fe2bf734118b8530d4bbbb  infiniband-diags-1.4.3.tar.gz

All component versions are from recent master branch. Full change log is
below.

Sasha


Al Chu (4):
      opensm: fix manpage typos
      fix documentation typos
      opensm: verify config inputs when config file is rescanned
      fix qos config parsing bugs

Albert Chu (1):
      support dump_conf console command

Doron Shoham (3):
      install QoS_management_in_OpenSM.txt
      change log_max_size to MB
      export osm_log_max in MB

Eli Dorfman (3):
      opensm/osm_sa_path_record.c print port guids in error message
      opensm/osm_mcast_tbl.c wrong max mcast lid cause the sm to set invalid MFT block.
      opensm/osm_sa_mcmember_record.c print multicast lid in error message

Hal Rosenstock (4):
      OpenSM/osm_subnet.c: Fix log_max_size conversion to MB
      libibumad: Add UMAD_MAX_DEVICES define
      infiniband-diags/ibstat.c: Use UMAD_MAX_DEVICES define
      opensm/osm_trap_rcv.c: Fix typo

Ira Weiny (3):
      opensm/opensm/osm_state_mgr.c: Add check for valid physical port before using pointer.
      Fix max parameter passed to umad_get_cas_names
      opensm: Add check for previous versions of plugins.

Or Gerlitz (1):
      opensm: fix iser service-id used for SL assignment

Sasha Khapyorsky (29):
      opensm/osm_ucast_lash: fix extra memory allocations
      opensm/osm_ucast_lash: simplify get_phys_connection() prototype
      opensm/scripts: unify scripts' config
      opensm/osm_inform.c: cosmetic changes
      opensm/opensm.spec: add -D option for logrotate file install command
      opensm: remove update_master_sm_base_lid field in PortInfo madw context
      libibmad/src/mad.c: indentation fix
      libibmad/dump: print more PortInfo:CapabilityMask bits
      opensm: support more PortInfo:CapabilityMask bits
      opensm: osm_send_trap144() function
      opensm: send trap144 to master SM when priority is raised
      opensm: notify master SM with trap 144
      opensm: hide function name with OSM_LOG_MSG_BOX() macro
      opensm: rename sm signal
      opensm: sweep on SIGCONT
      opensm/include/opensm/osm_switch.h: minor simplifications
      opensm/osm_switch.c: minor: shorter flow
      opensm/osm_ucast_cache.[ch]: indentation fixes
      make.dist: don't use ${date}git suffix for release
      opensm/osm_ucast_cache: simplify cached links allocation code
      opensm/osm_subnet.c: consolidate logging code
      opensm/osm_subnet.c: use strdup() function
      opensm/osm_subnet.c: consolidate qos parameters verification code
      opensm/osm_subnet.c: move osm_subn_rescan_conf_files() function
      opensm/osm_sa_mcmember_record: return a real port JoinState on update
      opensm/osm_sa_mcmember_record: simplify query code
      infiniband-diags/ibstat.c: remove casting
      opensm/osm_trap_rcv.c: kill some empty lines
      management: update versions

Tim Meier (1):
      opensm: osm_opensm.c added a method to remove plugins

Yevgeny Kliteynik (16):
      opensm/scripts/opensm.conf: remove obsolete config file
      opensm/opensm/Makefile.am: allow 'make dist' from non-source directory
      opensm: replace switch's fwd_tbl with simple LFT
      opensm: replace switch's fwd_tbl with simple LFT - remove obsolete files
      opensm/osm_ucast_ftree.c: some simplification in LFT handling
      opensm: free lft_buf if it matches switch's lft
      opensm/osm_ucast_cache: fixing coredump
      opensm/osm_sa.c: adding missing include
      opensm/osm_pkey.c: cosmetics in some log message
      opensm/ib_types.h: rename IB_MC_REC_STATE_SEND_ONLY_MEMBER
      opensm/osm_multicast.c: bug with joining/leaving mcast group
      opensm/Makefile.am: install QoS_management_in_OpenSM.txt
      osmtest/osmt_multicast.c: some refinements to the multicast flow
      opensm/osm_lid_mgr.c: ignore and overwrite guid2lid (windows)
      opensm/osm_sa_mcmember_record.c: bad return state when leaving mcast
      opensm/osmtest: fixing some comments in mcast flow of osmtest


From sashak at voltaire.com  Mon Nov 17 22:04:38 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 08:04:38 +0200
Subject: [ofa-general] Re: OpenSM handling of defunct SMs
In-Reply-To: <f0e08f230811170759j264fe3e8i2781221b774fa71@mail.gmail.com>
References: <f0e08f230811170759j264fe3e8i2781221b774fa71@mail.gmail.com>
Message-ID: <20081118060438.GJ10251@sashak.voltaire.com>

Hi Hal,

On 10:59 Mon 17 Nov     , Hal Rosenstock wrote:
> 
> What I observe is that OpenSM 3.2.2 continues to poll/retry SMInfo for
> a now defunct SM which spams the OpenSM log.
> 
> It looks like SMs are removed from the sm_guid_tbl only when the port
> is dropped/removed. Shouldn't it also be removed subsequent to a trap
> 144 which is indicating that the capability mask changed (and the new
> capability no longer includes IsSM)? I don't see this anywhere in the
> code. Am I missing something ?

It looks like a bug.

> 
> If so, should osm_port_info_rcv.c:__osm_pi_rcv_process_endport remove
> these so rather than:
> 
>                 p_sm_tbl = &sm->p_subn->sm_guid_tbl;
>                 p_sm = (osm_remote_sm_t *) cl_qmap_get(p_sm_tbl, port_guid);
>                 if (p_sm != (osm_remote_sm_t *) cl_qmap_end(p_sm_tbl))
>                         /* clean it up */
>                         p_sm->smi.pri_state = 0xF0 & p_sm->smi.pri_state;
> 
>                 if (p_pi->capability_mask & IB_PORT_CAP_IS_SM) {
> 
> it should be something like:
>                 p_sm_tbl = &sm->p_subn->sm_guid_tbl;
>                 if (p_pi->capability_mask & IB_PORT_CAP_IS_SM) {
>                     p_sm = (osm_remote_sm_t *) cl_qmap_get(p_sm_tbl, port_guid);
>                     if (p_sm != (osm_remote_sm_t *) cl_qmap_end(p_sm_tbl))
>                             /* clean it up */
>                             p_sm->smi.pri_state = 0xF0 & p_sm->smi.pri_state;
>                     ...
>                 } else
>                     p_sm = (osm_remote_sm_t *)
> cl_qmap_remove(p_sm_tbl, port_guid);

Yes, I guess it should be something like this. Would you take care of the
patch?
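
The flow proposed above can be sketched in a self-contained way. This is
only an illustration under stated assumptions: a small fixed-size table
stands in for cl_qmap, and the names sm_tbl_t, sm_tbl_find(),
sm_tbl_insert(), process_endport() and the CAP_IS_SM value are
hypothetical, not OpenSM code.

```c
#include <stddef.h>
#include <stdint.h>

#define CAP_IS_SM  0x2u	/* illustrative stand-in for IB_PORT_CAP_IS_SM */
#define SM_TBL_MAX 8

/* Hypothetical remote SM record; the real one is osm_remote_sm_t. */
typedef struct {
	uint64_t port_guid;
	uint8_t pri_state;
	int in_use;
} remote_sm_t;

/* Hypothetical table used here in place of the cl_qmap sm_guid_tbl. */
typedef struct {
	remote_sm_t slot[SM_TBL_MAX];
} sm_tbl_t;

void sm_tbl_init(sm_tbl_t *tbl)
{
	size_t i;

	for (i = 0; i < SM_TBL_MAX; i++)
		tbl->slot[i].in_use = 0;
}

remote_sm_t *sm_tbl_find(sm_tbl_t *tbl, uint64_t guid)
{
	size_t i;

	for (i = 0; i < SM_TBL_MAX; i++)
		if (tbl->slot[i].in_use && tbl->slot[i].port_guid == guid)
			return &tbl->slot[i];
	return NULL;
}

/* Returns 0 on success, -1 when the table is full. */
int sm_tbl_insert(sm_tbl_t *tbl, uint64_t guid, uint8_t pri_state)
{
	size_t i;

	for (i = 0; i < SM_TBL_MAX; i++)
		if (!tbl->slot[i].in_use) {
			tbl->slot[i].port_guid = guid;
			tbl->slot[i].pri_state = pri_state;
			tbl->slot[i].in_use = 1;
			return 0;
		}
	return -1;
}

/*
 * Process a PortInfo capability mask for an endport.
 * If IsSM is set, keep the entry and clear the low (state) nibble;
 * otherwise drop the entry so the SM is no longer polled.
 * Returns 1 when the port still advertises IsSM, 0 otherwise.
 */
int process_endport(sm_tbl_t *tbl, uint64_t guid, uint32_t cap_mask)
{
	remote_sm_t *sm = sm_tbl_find(tbl, guid);

	if (cap_mask & CAP_IS_SM) {
		if (sm)
			sm->pri_state = 0xF0 & sm->pri_state;
		return 1;
	}
	if (sm)
		sm->in_use = 0;	/* defunct SM: remove it from the table */
	return 0;
}
```

The point of the else branch is that a port which drops IsSM disappears
from the table right away, rather than only when the port itself is
dropped.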

Sasha


From james_ at catbus.co.uk  Tue Nov 18 02:04:02 2008
From: james_ at catbus.co.uk (James Beal)
Date: Tue, 18 Nov 2008 10:04:02 +0000
Subject: [ofa-general] srp_daemon and partitions.
In-Reply-To: <774A4005-446E-40D1-A70E-DBCBF12219F0@catbus.co.uk>
References: <774A4005-446E-40D1-A70E-DBCBF12219F0@catbus.co.uk>
Message-ID: <0E7ABECE-3A66-45B6-8C14-02AAC9FBC16F@catbus.co.uk>

If this is not the correct list for questions of this nature, would  
someone be so kind as to tell me where people would be interested in  
such a question ?


On 15 Nov 2008, at 10:36, James Beal wrote:

>
> We are currently investigating infiniband and we are so far very  
> impressed with the ease of use of the OFED stack. However we seem to  
> have run into an issue with the srp disc discovery.
>
> We wish to protect the storage from unwanted use. In a fibre channel  
> san environment this would be done in two ways, firstly presentation  
> ( configuring the controller as to which luns each WWN can access )  
> and secondly zoning which is configuring the switches that make the  
> fabric as to which ports can communicate. If we can't do this it  
> would restrict IB to a single use eg as a replacement for fibre  
> switches.
>
> I can't see how to specify to either srp_daemon or ibsrpdm which  
> pkey to use when discovering discs and a quick look at the source  
> code doesn't inspire confidence as I can see pkey=ffff as a string  
> in the code.
>
> I did try the following:
>
> One host with one adapter communicating with DDN controller, with  
> no  access control ( pkeys )
>
> The correct lun information was discovered.
>
> root at isg-dev6:~# ibsrpdm -c
> id_ext=50001ff3000501f0,ioc_guid=50001ff3000501f0,dgid=fe8000000000000050001ff4000501f0,pkey=ffff,service_id=f0010500f31f0050
>
>
> Access control was reasserted, and can be seen as the lun can no
> longer be discovered.
>
> root at isg-dev6:~# ibsrpdm -c
>
> The device was created by "hand"  with the pkey set to the correct  
> value
>
> echo "id_ext=50001ff3000501f0,ioc_guid=50001ff3000501f0,dgid=fe8000000000000050001ff4000501f0,pkey=1001,service_id=f0010500f31f0050" > /sys/class/infiniband_srp/srp-mthca0-1/add_target
>
> And the device can be seen.
>
> multipath -ll
> 360001ff001f0dbac01000800000a6a6cdm-0 DDN     ,S2A 9900
> [size=5.2T][features=0][hwhandler=0]
> \_ round-robin 0 [prio=1][enabled]
> \_ 5:0:0:1 sdb 8:16  [active][ready]
>
>
> So the issue appears to be with ibsrpdm/srp_daemon not allowing the  
> pkey to be set
>
> The following message suggests the same.
>
> user_mad: process ibsrpdm did not enable P_Key index support.
> user_mad:   Documentation/infiniband/user_mad.txt has info on the new
> ABI.
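
For reference, the target string written by hand to add_target above
could be assembled programmatically with a non-default pkey. This is a
rough sketch only: format_srp_target() is a hypothetical helper, not part
of srp_daemon or ibsrpdm, and it only formats the string; it does not
touch sysfs.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build an SRP add_target string with an explicit
 * pkey. Returns the snprintf() result, i.e. the number of characters
 * that the full string needs (excluding the terminating NUL).
 */
int format_srp_target(char *buf, size_t len,
		      const char *id_ext, const char *ioc_guid,
		      const char *dgid, unsigned pkey,
		      const char *service_id)
{
	return snprintf(buf, len,
			"id_ext=%s,ioc_guid=%s,dgid=%s,pkey=%04x,service_id=%s",
			id_ext, ioc_guid, dgid, pkey, service_id);
}
```

Called with the values from the ibsrpdm output above and pkey 0x1001,
this produces the same string that was echoed into add_target by hand.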


From sashak at voltaire.com  Tue Nov 18 02:43:27 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 12:43:27 +0200
Subject: [ofa-general] [PATCH] opensm/osm_subnet: don't reassign zeroed
	config params
Message-ID: <20081118104327.GN10251@sashak.voltaire.com>


If string config parameter is NULL and input is null_str don't reassign
it again (and don't print useless "Loading Option" message).

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_subnet.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index dc35a04..d787fe8 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -623,7 +623,8 @@ opts_unpack_charp(IN char *p_req_key,
 		  IN char *p_key, IN char *p_val_str, IN char **p_val)
 {
 	if (!strcmp(p_req_key, p_key) && p_val_str) {
-		if ((*p_val == NULL) || strcmp(p_val_str, *p_val)) {
+		const char *current_str = *p_val ? *p_val : null_str ;
+		if (strcmp(p_val_str, current_str)) {
 			log_config_value(p_key, "%s", p_val_str);
 			/* special case the "(null)" string */
 			if (strcmp(null_str, p_val_str) == 0) {
-- 
1.6.0.3.517.g759a
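
The comparison introduced by this patch can be tried in isolation. The
sketch below is a standalone illustration, assuming null_str is the
literal "(null)" as the comment in the patch suggests;
charp_needs_update() is a hypothetical name, not an OpenSM function.

```c
#include <string.h>

static const char null_str[] = "(null)";

/*
 * Returns non-zero when the stored string option would actually change.
 * A NULL stored value is treated as the "(null)" string, so loading
 * "(null)" over an unset option is reported as "no change".
 */
int charp_needs_update(const char *current, const char *input)
{
	const char *current_str = current ? current : null_str;

	return strcmp(input, current_str) != 0;
}
```

With the old condition ((*p_val == NULL) || strcmp(...)), an unset option
was always treated as changed, which is what produced the extra
"Loading Option" message.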


From vlad at lists.openfabrics.org  Tue Nov 18 03:21:20 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 18 Nov 2008 03:21:20 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081118-0200 daily build status
Message-ID: <20081118112120.71977E60C8D@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From monis at Voltaire.COM  Tue Nov 18 03:34:45 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Tue, 18 Nov 2008 13:34:45 +0200
Subject: [ofa-general] [PATCH] ipoib: Fix loss of connectivity after
	bonding failover on both sides
In-Reply-To: <490B448C.5080306@Voltaire.COM>
References: <490B448C.5080306@Voltaire.COM>
Message-ID: <4922A855.2010109@Voltaire.COM>

The original patch assumes that the path query succeeds and therefore copies the HA from
the kernel neighbor structure to ipoib_neigh after the path query is sent. If the path query fails (e.g.
request timeout), the next query won't be triggered, because ipoib_start_xmit() no longer finds that the HA was updated.
This prolongs the time that the destination node remains inaccessible.

The patch below is identical to Yossi's patch but without the copy of HA in neigh_add_path.


diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index fddded7..ec433bf 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -709,26 +709,26 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 		neigh = *to_ipoib_neigh(skb->dst->neighbour);
 
-		if (neigh->ah)
-			if (unlikely((memcmp(&neigh->dgid.raw,
-					    skb->dst->neighbour->ha + 4,
-					    sizeof(union ib_gid))) ||
-					 (neigh->dev != dev))) {
-				spin_lock_irqsave(&priv->lock, flags);
-				/*
-				 * It's safe to call ipoib_put_ah() inside
-				 * priv->lock here, because we know that
-				 * path->ah will always hold one more reference,
-				 * so ipoib_put_ah() will never do more than
-				 * decrement the ref count.
-				 */
+		if (unlikely((memcmp(&neigh->dgid.raw,
+				skb->dst->neighbour->ha + 4,
+				sizeof(union ib_gid))) ||
+				(neigh->dev != dev))) {
+			spin_lock_irqsave(&priv->lock, flags);
+			/*
+			 * It's safe to call ipoib_put_ah() inside
+			 * priv->lock here, because we know that
+			 * path->ah will always hold one more reference,
+			 * so ipoib_put_ah() will never do more than
+			 * decrement the ref count.
+			 */
+			if (neigh->ah)
 				ipoib_put_ah(neigh->ah);
-				list_del(&neigh->list);
-				ipoib_neigh_free(dev, neigh);
-				spin_unlock_irqrestore(&priv->lock, flags);
-				ipoib_path_lookup(skb, dev);
-				return NETDEV_TX_OK;
-			}
+			list_del(&neigh->list);
+			ipoib_neigh_free(dev, neigh);
+			spin_unlock_irqrestore(&priv->lock, flags);
+			ipoib_path_lookup(skb, dev);
+			return NETDEV_TX_OK;
+		}
 
 		if (ipoib_cm_get(neigh)) {
 			if (ipoib_cm_up(neigh)) {


From vlad at dev.mellanox.co.il  Tue Nov 18 03:38:05 2008
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 18 Nov 2008 13:38:05 +0200
Subject: [ofa-general] OFED-1.4-rc5 is available
Message-ID: <4922A91D.8060107@dev.mellanox.co.il>

Hi, 
OFED-1.4-rc5 release is available on 
http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-rc5.tgz 


To get BUILD_ID run ofed_info 

Please report any issues in bugzilla https://bugs.openfabrics.org/ for
OFED 1.4 

Vladimir & Tziporet

======================================================================== 

Release information: 
------------------------------ 
Linux Operating Systems: 
       - RedHat EL4 up4:       2.6.9-42.ELsmp      * 
       - RedHat EL4 up5:       2.6.9-55.ELsmp 
       - RedHat EL4 up6:       2.6.9-67.ELsmp 
       - RedHat EL4 up7:       2.6.9-78.ELsmp 
       - RedHat EL5:           2.6.18-8.el5 
       - RedHat EL5 up1:       2.6.18-53.el5 
       - RedHat EL5 up2:       2.6.18-92.el5 
       - CentOS 5.2:           2.6.18-92.el5 
       - Fedora C9:            2.6.25-14.fc9         * 
       - SLES10:               2.6.16.21-0.8-smp 
       - SLES10 SP1:           2.6.16.46-0.12-smp 
       - SLES10 SP1 up1:       2.6.16.53-0.16-smp 
       - SLES10 SP2:           2.6.16.60-0.21-smp 
       - OpenSuSE 10.3:        2.6.22.5-31          * 
       - kernel.org:           2.6.26 and 2.6.27 

     * Minimal QA for these versions 

Systems: 
       * x86_64 
       * x86 
       * ia64 
       * ppc64 


Main Changes from OFED-1.4-rc4
==============================
- Updated MPI packages: mvapich-1.1.0-3141, mvapich2-1.2p1-1
- Updated bonding package: ib-bonding-0.9.0-34
- Updated qperf: qperf-0.4.2-1
- 8 bugs fixed (see attached for details) 

- Attached kernel git tree changes: 


Tasks that should be completed for the release: 
================================ 
1. High priority bug fixes
2. Documentation update

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed_kernel-1.4-rc4_rc5.log
Type: text/x-log
Size: 13169 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081118/92bb5f6a/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4-rc5-fixed-bugs.csv
Type: text/csv
Size: 1137 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081118/92bb5f6a/attachment.csv>

From sashak at voltaire.com  Tue Nov 18 04:30:00 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 14:30:00 +0200
Subject: [ofa-general] Re: [PATCH] opensm/opensm/osm_state_mgr.c: Add check
	for valid physical port before using pointer.
In-Reply-To: <20081112185457.GD27271@sashak.voltaire.com>
References: <20081104095744.35893d4a.weiny2@llnl.gov>
	<20081110201333.GM313@sashak.voltaire.com>
	<20081110131140.52561f42.weiny2@llnl.gov>
	<20081112185457.GD27271@sashak.voltaire.com>
Message-ID: <20081118123000.GO10251@sashak.voltaire.com>

Hi,

On 20:54 Wed 12 Nov     , Sasha Khapyorsky wrote:
> > 
> > I was wondering if it would return invalid ports ever.  It would be easy for it
> > to return only valid ports but perhaps that should be another function to
> > preserve functionality?

I looked at this. Another problematic place where this function is used is
osm_sa_link_record.c - there, when the "any" port becomes invalid (which is
a possible case), it starts an endless recursion :(. So we will need to fix
the function's behavior.

One option is to scan all ports and return a valid one. Another solution
would be to update the NodeInfo stored locally in OpenSM on each receive
(something like below). Then osm_node_get_any_physp_ptr() will return the
port through which this node was accessed last.

This way it could also catch potential OtherLocalSetting changes (in
NodeInfo, such as SystemImageGUID, etc.).

Can anybody see any downsides to such an approach?

Sasha


diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index 20b16d1..7d41cab 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -785,6 +785,8 @@ __osm_ni_rcv_process_existing(IN osm_sm_t * sm,
 		break;
 	}
 
+	p_node->node_info = *p_ni;
+
 	__osm_ni_rcv_set_links(sm, p_node, port_num, p_ni_context);
 
 	OSM_LOG_EXIT(sm->p_log);
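
For comparison, the first option mentioned above (scan all ports and
return a valid one) might look roughly like this. It is a simplified
model only: physp_t, node_t and node_get_any_valid_physp() are
hypothetical stand-ins for the OpenSM types and accessors.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for osm_physp_t and osm_node_t. */
typedef struct {
	int valid;
	unsigned port_num;
} physp_t;

typedef struct {
	physp_t *physp;		/* array of num_ports entries */
	size_t num_ports;
} node_t;

/*
 * Scan every physical port of the node and return the first valid one,
 * or NULL when the node currently has no valid port at all.
 */
physp_t *node_get_any_valid_physp(node_t *node)
{
	size_t i;

	for (i = 0; i < node->num_ports; i++)
		if (node->physp[i].valid)
			return &node->physp[i];
	return NULL;
}
```

Callers would still have to handle the NULL case, which is the part the
NodeInfo update approach tries to avoid.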


From sashak at voltaire.com  Tue Nov 18 04:51:30 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 14:51:30 +0200
Subject: [ofa-general] [PATCH] opensm/osm_trap_rcv.c: separate port disabling
	code
Message-ID: <20081118125130.GR10251@sashak.voltaire.com>


Separate port disabling code (activated with "babbling_port_policy")
into disable_port() function.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_trap_rcv.c |  108 ++++++++++++++++--------------------------
 1 files changed, 41 insertions(+), 67 deletions(-)

diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index 3b05775..5de283b 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -232,6 +232,44 @@ static int __print_num_received(IN uint32_t num_received)
 		return 0;
 }
 
+static int disable_port(osm_sm_t *sm, osm_physp_t *p)
+{
+	uint8_t payload[IB_SMP_DATA_SIZE];
+	osm_madw_context_t context;
+	ib_port_info_t *pi = (ib_port_info_t *)payload;
+	int ret;
+
+	/* If trap 131, might want to disable peer port if available */
+	/* but peer port has been observed not to respond to SM requests */
+
+	OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 3810: "
+		"Disabling physical port 0x%016" PRIx64 " num:%u\n",
+		cl_ntoh64(osm_physp_get_port_guid(p)), p->port_num);
+
+	memcpy(payload, &p->port_info, sizeof(ib_port_info_t));
+
+	/* Set port to disabled/down */
+	ib_port_info_set_port_state(pi, IB_LINK_DOWN);
+	ib_port_info_set_port_phys_state(IB_PORT_PHYS_STATE_DISABLED, pi);
+
+	/* Issue set of PortInfo */
+	context.pi_context.node_guid = osm_node_get_node_guid(p->p_node);
+	context.pi_context.port_guid = osm_physp_get_port_guid(p);
+	context.pi_context.set_method = TRUE;
+	context.pi_context.light_sweep = FALSE;
+	context.pi_context.active_transition = FALSE;
+
+	ret = osm_req_set(sm, osm_physp_get_dr_path_ptr(p),
+			  payload, sizeof(payload), IB_MAD_ATTR_PORT_INFO,
+			  cl_hton32(osm_physp_get_port_num(p)),
+			  CL_DISP_MSGID_NONE, &context);
+	if (ret)
+		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 3811: "
+			"Request to set PortInfo failed\n");
+
+	return ret;
+}
+
 /**********************************************************************
  **********************************************************************/
 static void
@@ -454,73 +492,9 @@ __osm_trap_rcv_process_request(IN osm_sm_t * sm,
 					   Threshold for disabling a "babbling" port is exceeded */
 					if (sm->p_subn->opt.
 					    babbling_port_policy
-					    && num_received >= 250) {
-						uint8_t
-						    payload[IB_SMP_DATA_SIZE];
-						ib_port_info_t *p_pi =
-						    (ib_port_info_t *) payload;
-						const ib_port_info_t *p_old_pi;
-						osm_madw_context_t context;
-
-						/* If trap 131, might want to disable peer port if available */
-						/* but peer port has been observed not to respond to SM requests */
-
-						OSM_LOG(sm->p_log, OSM_LOG_ERROR,
-							"ERR 3810: "
-							"Disabling physical port lid:%u num:%u\n",
-							cl_ntoh16(p_ntci->
-								  data_details.
-								  ntc_129_131.
-								  lid),
-							p_ntci->data_details.
-							ntc_129_131.port_num);
-
-						p_old_pi = &p_physp->port_info;
-						memcpy(payload, p_old_pi,
-						       sizeof(ib_port_info_t));
-
-						/* Set port to disabled/down */
-						ib_port_info_set_port_state
-						    (p_pi, IB_LINK_DOWN);
-						ib_port_info_set_port_phys_state
-						    (IB_PORT_PHYS_STATE_DISABLED,
-						     p_pi);
-
-						/* Issue set of PortInfo */
-						context.pi_context.node_guid =
-						    osm_node_get_node_guid
-						    (osm_physp_get_node_ptr
-						     (p_physp));
-						context.pi_context.port_guid =
-						    osm_physp_get_port_guid
-						    (p_physp);
-						context.pi_context.set_method =
-						    TRUE;
-						context.pi_context.light_sweep =
-						    FALSE;
-						context.pi_context.
-						    active_transition = FALSE;
-
-						status =
-						    osm_req_set(sm,
-								osm_physp_get_dr_path_ptr
-								(p_physp),
-								payload,
-								sizeof(payload),
-								IB_MAD_ATTR_PORT_INFO,
-								cl_hton32
-								(osm_physp_get_port_num
-								 (p_physp)),
-								CL_DISP_MSGID_NONE,
-								&context);
-
-						if (status == IB_SUCCESS)
-							goto Exit;
-
-						OSM_LOG(sm->p_log,
-							OSM_LOG_ERROR, "ERR 3811: "
-							"Request to set PortInfo failed\n");
-					}
+					    && num_received >= 250
+					    && disable_port(sm, p_physp) == 0)
+						goto Exit;
 
 					OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
 						"Marking unhealthy physical port by lid:%u num:%u\n",
-- 
1.6.0.3.517.g759a


From sashak at voltaire.com  Tue Nov 18 04:53:25 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 14:53:25 +0200
Subject: [ofa-general] [PATCH] opensm: disable switch ports only
In-Reply-To: <20081118125130.GR10251@sashak.voltaire.com>
References: <20081118125130.GR10251@sashak.voltaire.com>
Message-ID: <20081118125325.GS10251@sashak.voltaire.com>


When "babbling port" policy is on disable switch ports even when trap
source is endport. This will allow to handle disable ports remotely
(with ibportstate, etc.).

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_trap_rcv.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index 5de283b..07c5183 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -239,6 +239,10 @@ static int disable_port(osm_sm_t *sm, osm_physp_t *p)
 	ib_port_info_t *pi = (ib_port_info_t *)payload;
 	int ret;
 
+	/* in case of endport - disable switch's peer port */
+	if (osm_node_get_type(p->p_node) != IB_NODE_TYPE_SWITCH)
+		p = p->p_remote_physp;
+
 	/* If trap 131, might want to disable peer port if available */
 	/* but peer port has been observed not to respond to SM requests */
 
-- 
1.6.0.3.517.g759a
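
The port selection done by this patch can be modelled in isolation as
follows. This is a sketch with hypothetical types (physp_t here is not
osm_physp_t, and port_to_disable() does not exist in OpenSM); returning
NULL for an endport without a known peer is an assumption about how a
caller might want to treat that case, not something the patch itself
does.

```c
#include <stddef.h>

enum node_type { NODE_CA = 1, NODE_SWITCH = 2, NODE_ROUTER = 3 };

/* Hypothetical, simplified stand-in for osm_physp_t. */
typedef struct physp {
	enum node_type node_type;
	unsigned port_num;
	struct physp *remote;	/* peer port, NULL when not linked */
} physp_t;

/*
 * Pick the port that should be disabled for a babbling trap source.
 * Switch ports (including port 0) are disabled directly; for any other
 * node type the peer port is chosen, so the link can later be brought
 * back from the switch side. May return NULL when no peer is known.
 */
physp_t *port_to_disable(physp_t *p)
{
	if (p->node_type != NODE_SWITCH)
		return p->remote;
	return p;
}
```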


From hal.rosenstock at gmail.com  Tue Nov 18 05:04:14 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 18 Nov 2008 08:04:14 -0500
Subject: [ofa-general] Re: [PATCH] opensm: disable switch ports only
In-Reply-To: <20081118125325.GS10251@sashak.voltaire.com>
References: <20081118125130.GR10251@sashak.voltaire.com>
	<20081118125325.GS10251@sashak.voltaire.com>
Message-ID: <f0e08f230811180504m5108162dj62693023b5b36b1c@mail.gmail.com>

On Tue, Nov 18, 2008 at 7:53 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> When "babbling port" policy is on disable switch ports even when trap
> source is endport.

So does this disable the peer switch port of an endport which is
babbling? That could be made clearer in the description.

What happens if the end port is switch port 0 ?

> This will allow to handle disable ports remotely
> (with ibportstate, etc.).

I'm not following what you mean by the ibportstate comment here. What
port can ibportstate now disable differently from before ?

-- Hal

> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
>  opensm/opensm/osm_trap_rcv.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
> index 5de283b..07c5183 100644
> --- a/opensm/opensm/osm_trap_rcv.c
> +++ b/opensm/opensm/osm_trap_rcv.c
> @@ -239,6 +239,10 @@ static int disable_port(osm_sm_t *sm, osm_physp_t *p)
>        ib_port_info_t *pi = (ib_port_info_t *)payload;
>        int ret;
>
> +       /* in case of endport - disable switch's peer port */
> +       if (osm_node_get_type(p->p_node) != IB_NODE_TYPE_SWITCH)
> +               p = p->p_remote_physp;
> +
>        /* If trap 131, might want to disable peer port if available */
>        /* but peer port has been observed not to respond to SM requests */
>
> --
> 1.6.0.3.517.g759a
>
>


From halr at obsidianresearch.com  Tue Nov 18 05:05:27 2008
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Tue, 18 Nov 2008 06:05:27 -0700
Subject: [ofa-general] opensm/osm_port_info_rcv.c: Remove SM from sm_guid_tbl
 when IsSM is not present
Message-ID: <4922BD97.403@obsidianresearch.com>

Sasha,

The following patch removes the SM from the sm_guid_tbl when IsSM is 
not present. It is compile tested only, as I no longer have an environment
to recreate this.

-- Hal


From sashak at voltaire.com  Tue Nov 18 05:29:22 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 15:29:22 +0200
Subject: [ofa-general] Re: [PATCH] opensm: disable switch ports only
In-Reply-To: <f0e08f230811180504m5108162dj62693023b5b36b1c@mail.gmail.com>
References: <20081118125130.GR10251@sashak.voltaire.com>
	<20081118125325.GS10251@sashak.voltaire.com>
	<f0e08f230811180504m5108162dj62693023b5b36b1c@mail.gmail.com>
Message-ID: <20081118132922.GT10251@sashak.voltaire.com>

On 08:04 Tue 18 Nov     , Hal Rosenstock wrote:
> On Tue, Nov 18, 2008 at 7:53 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> >
> > When "babbling port" policy is on disable switch ports even when trap
> > source is endport.
> 
> So does disables the peer switch port to an endport which is babbling
> ?

Yes.

> That could be made clearer in the description.

Ok.

> What happens if the end port is switch port 0 ?

Then it should work as usual (it doesn't have remote port).

> > This will allow to handle disable ports remotely
> > (with ibportstate, etc.).
> 
> I'm not following what you mean by the ibportstate comment here. What
> port can ibportstate now disable differently from before ?

It is the same, but when an endport is disabled, how could we re-enable it
remotely via a downed link?

Sasha

> 
> -- Hal
> 
> > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> > ---
> >  opensm/opensm/osm_trap_rcv.c |    4 ++++
> >  1 files changed, 4 insertions(+), 0 deletions(-)
> >
> > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
> > index 5de283b..07c5183 100644
> > --- a/opensm/opensm/osm_trap_rcv.c
> > +++ b/opensm/opensm/osm_trap_rcv.c
> > @@ -239,6 +239,10 @@ static int disable_port(osm_sm_t *sm, osm_physp_t *p)
> >        ib_port_info_t *pi = (ib_port_info_t *)payload;
> >        int ret;
> >
> > +       /* in case of endport - disable switch's peer port */
> > +       if (osm_node_get_type(p->p_node) != IB_NODE_TYPE_SWITCH)
> > +               p = p->p_remote_physp;
> > +
> >        /* If trap 131, might want to disable peer port if available */
> >        /* but peer port has been observed not to respond to SM requests */
> >
> > --
> > 1.6.0.3.517.g759a
> >
> >


From sashak at voltaire.com  Tue Nov 18 05:30:25 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 15:30:25 +0200
Subject: [ofa-general] Re: opensm/osm_port_info_rcv.c: Remove SM from
	sm_guid_tbl when IsSM is not present
In-Reply-To: <4922BD97.403@obsidianresearch.com>
References: <4922BD97.403@obsidianresearch.com>
Message-ID: <20081118133025.GU10251@sashak.voltaire.com>

Hi Hal,

On 06:05 Tue 18 Nov     , Hal Rosenstock wrote:
>
> The following patch removes the SM from the sm_guid_table when IsSM is not 
> present. Compile tested only as I don't have an environment to recreate 
> this anymore.

Did you forget the patch?

Sasha


From rdreier at cisco.com  Tue Nov 18 08:01:58 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 18 Nov 2008 08:01:58 -0800
Subject: [ofa-general] [PATCH] ipoib: Fix loss of connectivity after
	bonding failover on both sides
In-Reply-To: <4922A855.2010109@Voltaire.COM> (Moni Shoua's message of "Tue, 18
	Nov 2008 13:34:45 +0200")
References: <490B448C.5080306@Voltaire.COM> <4922A855.2010109@Voltaire.COM>
Message-ID: <adavdulhvk9.fsf@cisco.com>

 > The patch below is identical to Yossi's patch but without the copy of HA in neigh_add_path.

Why did Yossi include that copy?  Does this patch still fix everything?

 - R.


From hal.rosenstock at gmail.com  Tue Nov 18 08:05:46 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 18 Nov 2008 11:05:46 -0500
Subject: [ofa-general] Re: [PATCH] opensm: disable switch ports only
In-Reply-To: <20081118132922.GT10251@sashak.voltaire.com>
References: <20081118125130.GR10251@sashak.voltaire.com>
	<20081118125325.GS10251@sashak.voltaire.com>
	<f0e08f230811180504m5108162dj62693023b5b36b1c@mail.gmail.com>
	<20081118132922.GT10251@sashak.voltaire.com>
Message-ID: <f0e08f230811180805t1bd30d31v18c7ef2441df66ed@mail.gmail.com>

On Tue, Nov 18, 2008 at 8:29 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 08:04 Tue 18 Nov     , Hal Rosenstock wrote:
>> On Tue, Nov 18, 2008 at 7:53 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>> >
>> > When "babbling port" policy is on disable switch ports even when trap
>> > source is endport.
>>
>> So does disables the peer switch port to an endport which is babbling
>> ?
>
> Yes.
>
>> That could be made clearer in the description.
>
> Ok.
>
>> What happens if the end port is switch port 0 ?
>
> When it should work as usual (it doesn't have remote port).
>
>> > This will allow to handle disable ports remotely
>> > (with ibportstate, etc.).
>>
>> I'm not following what you mean by the ibportstate comment here. What
>> port can ibportstate now disable differently from before ?
>
> It is the same,

OK.

> but when endport is disabled how could we reenable this
> remotely via downed link?

I don't understand what you mean. ibportstate does not allow disabling
of end port.

-- Hal

> Sasha
>
>>
>> -- Hal
>>
>> > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
>> > ---
>> >  opensm/opensm/osm_trap_rcv.c |    4 ++++
>> >  1 files changed, 4 insertions(+), 0 deletions(-)
>> >
>> > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
>> > index 5de283b..07c5183 100644
>> > --- a/opensm/opensm/osm_trap_rcv.c
>> > +++ b/opensm/opensm/osm_trap_rcv.c
>> > @@ -239,6 +239,10 @@ static int disable_port(osm_sm_t *sm, osm_physp_t *p)
>> >        ib_port_info_t *pi = (ib_port_info_t *)payload;
>> >        int ret;
>> >
>> > +       /* in case of endport - disable switch's peer port */
>> > +       if (osm_node_get_type(p->p_node) != IB_NODE_TYPE_SWITCH)
>> > +               p = p->p_remote_physp;
>> > +
>> >        /* If trap 131, might want to disable peer port if available */
>> >        /* but peer port has been observed not to respond to SM requests */
>> >
>> > --
>> > 1.6.0.3.517.g759a
>> >
>> >
>


From monis at Voltaire.COM  Tue Nov 18 08:11:49 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Tue, 18 Nov 2008 18:11:49 +0200
Subject: [ofa-general] [PATCH] ipoib: Fix loss of connectivity
	after bonding failover on both sides
In-Reply-To: <adavdulhvk9.fsf@cisco.com>
References: <490B448C.5080306@Voltaire.COM> <4922A855.2010109@Voltaire.COM>
	<adavdulhvk9.fsf@cisco.com>
Message-ID: <4922E945.3030102@Voltaire.COM>

Roland Dreier wrote:
>  > The patch below is identical to Yossi's patch but without the copy of HA in neigh_add_path.
> 
> Why did Yossi include that copy?  Does this patch still fix everything?
> 
>  - R.
Yossi's intention was to save compares (each one GID-sized) in ipoib_start_xmit().
The thought was that once the condition to start a new path query is met, there
is no need to meet it again (especially when the cost is high).
The only concern is what happens when the path query fails (I explained it above), and this is
why I think it's better to remove the copy.

The new patch still fixes the basic problem that it intends to (as explained by Yossi).

thanks
 MoniS


From halr at obsidianresearch.com  Tue Nov 18 08:12:58 2008
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Tue, 18 Nov 2008 09:12:58 -0700
Subject: [ofa-general] [PATCH] opensm/osm_port_info_rcv.c: Remove SM from
 sm_guid_tbl when IsSM is not present
Message-ID: <4922E98A.7020403@obsidianresearch.com>

Sasha,

The following patch (attached this time:-) removes the SM from the 
sm_guid_table when IsSM is not present. Compile tested only as I don't 
have an environment to recreate this anymore.

-- Hal
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch-pir-issm1
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081118/d8c81f34/attachment.ksh>

From sashak at voltaire.com  Tue Nov 18 08:56:32 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 18 Nov 2008 18:56:32 +0200
Subject: [ofa-general] Re: [PATCH] opensm: disable switch ports only
In-Reply-To: <f0e08f230811180805t1bd30d31v18c7ef2441df66ed@mail.gmail.com>
References: <20081118125130.GR10251@sashak.voltaire.com>
	<20081118125325.GS10251@sashak.voltaire.com>
	<f0e08f230811180504m5108162dj62693023b5b36b1c@mail.gmail.com>
	<20081118132922.GT10251@sashak.voltaire.com>
	<f0e08f230811180805t1bd30d31v18c7ef2441df66ed@mail.gmail.com>
Message-ID: <20081118165632.GW10251@sashak.voltaire.com>

On 11:05 Tue 18 Nov     , Hal Rosenstock wrote:
> 
> > but when endport is disabled how could we reenable this
> > remotely via downed link?
> 
> I don't understand what you mean. ibportstate does not allow disabling
> of end port.

Right. And this is the reason to disable the switch external port. Also,
even if we had (hypothetically) a tool able to disable/enable an
endport, we would not be able to reach that endport via the downed link;
only a local reset/reboot would help.

Sasha
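
The rule discussed in this thread can be sketched as below. This is only a
stand-alone illustration under stated assumptions, not the OpenSM code:
`struct ex_port`, `enum ex_node_type` and `ex_port_to_disable()` are
hypothetical stand-ins for osm_physp_t and the check added to disable_port().

```c
#include <stddef.h>

/* Hypothetical stand-ins for OpenSM's node and port objects. */
enum ex_node_type { EX_NODE_CA, EX_NODE_ROUTER, EX_NODE_SWITCH };

struct ex_port {
	enum ex_node_type node_type;	/* type of the node owning this port */
	struct ex_port *remote;		/* peer port across the link, or NULL */
};

/*
 * Pick the port to disable for a babbling trap source.
 * A switch port is disabled directly. For an endport the switch-side
 * peer is chosen, so the link can still be re-enabled remotely.
 * Returns NULL when an endport has no known peer.
 */
static struct ex_port *ex_port_to_disable(struct ex_port *src)
{
	if (src == NULL)
		return NULL;
	if (src->node_type != EX_NODE_SWITCH)
		return src->remote;
	return src;
}
```

In this sketch a switch port (including a management port without a peer) is
returned unchanged, which matches the "works as usual" answer given earlier in
the thread.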


From hal.rosenstock at gmail.com  Tue Nov 18 09:01:37 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 18 Nov 2008 12:01:37 -0500
Subject: [ofa-general] Re: [PATCH] opensm: disable switch ports
	only
In-Reply-To: <20081118165632.GW10251@sashak.voltaire.com>
References: <20081118125130.GR10251@sashak.voltaire.com>
	<20081118125325.GS10251@sashak.voltaire.com>
	<f0e08f230811180504m5108162dj62693023b5b36b1c@mail.gmail.com>
	<20081118132922.GT10251@sashak.voltaire.com>
	<f0e08f230811180805t1bd30d31v18c7ef2441df66ed@mail.gmail.com>
	<20081118165632.GW10251@sashak.voltaire.com>
Message-ID: <f0e08f230811180901p5c788f17i6f3eee6cfc0a54f8@mail.gmail.com>

On Tue, Nov 18, 2008 at 11:56 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 11:05 Tue 18 Nov     , Hal Rosenstock wrote:
>>
>> > but when endport is disabled how could we reenable this
>> > remotely via downed link?
>>
>> I don't understand what you mean. ibportstate does not allow disabling
>> of end port.
>
> Right. And this is the reason to disable switch external port. And also
> if we have (hypothetically) the tool which is able to disable/enable
> endport, we will not be able to access this endport via downed link,
> only local reset/reboot will help.

Yes, disabling the switch peer port is the alternative to disabling
the end port. The latter is not allowed and I agree that disabling the
switch peer port is a better choice IMO as the admin can't shoot
himself in the foot and have to reset/reboot to reenable.

-- Hal

>
> Sasha
>


From weiny2 at llnl.gov  Tue Nov 18 14:06:08 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 18 Nov 2008 14:06:08 -0800
Subject: [ofa-general] Re: [PATCH] opensm/opensm/osm_state_mgr.c: Add check
 for valid physical port before using pointer.
In-Reply-To: <20081118123000.GO10251@sashak.voltaire.com>
References: <20081104095744.35893d4a.weiny2@llnl.gov>
	<20081110201333.GM313@sashak.voltaire.com>
	<20081110131140.52561f42.weiny2@llnl.gov>
	<20081112185457.GD27271@sashak.voltaire.com>
	<20081118123000.GO10251@sashak.voltaire.com>
Message-ID: <20081118140608.19ac0963.weiny2@llnl.gov>

I am not sure this will fix my bug.

The stack trace in my bug ended with:

   #0  osm_vendor_get (h_bind=0x0, mad_size=256, p_vw=0x69bbe8) at

The h_bind was being extracted from the osm_physp_t object.  Would this fix
ensure that the h_bind pointer was valid in the osm_physp_t object returned?

I used the "osm_physp_is_valid" function because the port_guid in osm_physp_t
object was only set after port_info returned valid data which would also ensure
that h_bind was set up correctly.  That happens through the call path:

   osm_pi_rcv_process->osm_physp_init->osm_dr_path_init

Currently osm_node_t->node_info is set when __osm_ni_rcv_process_new calls
osm_node_new.

osm_node_new calls osm_node_init_physp->osm_physp_init->osm_dr_path_init, but
only for the port number which is in the NodeInfo SMP.  Perhaps osm_physp_init needs to
be called again, as in this patch:


diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index 20b16d1..5749a66 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -785,6 +785,9 @@ __osm_ni_rcv_process_existing(IN osm_sm_t * sm,
                break;
        }
 
+       p_node->node_info = *p_ni;
+       osm_node_init_physp(p_node, p_madw);
+
        __osm_ni_rcv_set_links(sm, p_node, port_num, p_ni_context);
 
        OSM_LOG_EXIT(sm->p_log);


Thoughts?
Ira


On Tue, 18 Nov 2008 14:30:00 +0200
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi,
> 
> On 20:54 Wed 12 Nov     , Sasha Khapyorsky wrote:
> > > 
> > > I was wondering if it would return invalid ports ever.  It would be easy for it
> > > to return only valid ports but perhaps that should be another function to
> > > preserve functionality?
> 
> Looked at this. Another problematic place where this function is used is
> osm_sa_link_record.c - there when "any" port becomes invalid (which is
> possible case) it starts an endless recursion :(. So we will need to fix
> the function behavior.
> 
> One option is to scan all ports and to return valid one. Another solution
> would be to update locally stored in OpenSM NodeInfo on each receive
> (something like below). Then osm_node_get_any_physp_ptr() will return a
> port where this node was accessed last.
> 
> In this way it also could catch potential OtherLocalSetting changes (in
> NodeInfo, such as SystemImageGUID, etc.).
> 
> Could anybody see any downsides with such approach?
> 
> Sasha
> 
> 
> diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
> index 20b16d1..7d41cab 100644
> --- a/opensm/opensm/osm_node_info_rcv.c
> +++ b/opensm/opensm/osm_node_info_rcv.c
> @@ -785,6 +785,8 @@ __osm_ni_rcv_process_existing(IN osm_sm_t * sm,
>  		break;
>  	}
>  
> +	p_node->node_info = *p_ni;
> +
>  	__osm_ni_rcv_set_links(sm, p_node, port_num, p_ni_context);
>  
>  	OSM_LOG_EXIT(sm->p_log);


From meier3 at llnl.gov  Tue Nov 18 17:10:37 2008
From: meier3 at llnl.gov (Timothy A. Meier)
Date: Tue, 18 Nov 2008 17:10:37 -0800
Subject: [ofa-general] [PATCH] Opensm: main exit codes
Message-ID: <4923678D.3080701@llnl.gov>

Hey Sasha,

  I thought it would be useful to define a set of exit codes for opensm.  A quick examination of main.c
showed a few different ways to terminate.  How about this patch?  Obviously this doesn't catch every
possible exit scenario, but it's a start that can be built upon.

>From d38854b804caac77ba7985fdf2314e412420cdad Mon Sep 17 00:00:00 2001
From: Tim Meier <meier3 at llnl.gov>
Date: Tue, 18 Nov 2008 16:51:14 -0800
Subject: [PATCH] Opensm: main exit codes

Defined a set of exit codes and modified main() to use them as much as
possible.

Signed-off-by: Tim Meier <meier3 at llnl.gov>
---
 opensm/include/opensm/osm_opensm.h |   24 ++++++++++++++++++++++++
 opensm/opensm/main.c               |   30 ++++++++++++++++--------------
 2 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h
index c121be4..5e78dba 100644
--- a/opensm/include/opensm/osm_opensm.h
+++ b/opensm/include/opensm/osm_opensm.h
@@ -87,6 +87,30 @@ BEGIN_C_DECLS
 *      Steve King, Intel
 *
 *********/
+/****d* OpenSM: OpenSM/osm_exit_type_t
+* NAME
+*       osm_exit_type_t
+*
+* DESCRIPTION
+*       Enumerates the possible exit codes that
+*       are provided by OpenSM.
+*
+* SYNOPSIS
+*/
+typedef enum _osm_exit_type {
+       OSM_EXIT_TYPE_NORMAL = 0,
+       OSM_EXIT_TYPE_GENERIC_ERR,
+       OSM_EXIT_TYPE_USAGE,
+       OSM_EXIT_TYPE_FORK_ERR,
+       OSM_EXIT_TYPE_DIFFERENT_DEBUG_MODE,
+       OSM_EXIT_TYPE_DUPLICATE_OSM_GUID,
+       OSM_EXIT_TYPE_CONFIG_PARSE_ERR,
+       OSM_EXIT_TYPE_CONF_FILE_WRITE_ERR,
+       OSM_EXIT_TYPE_INVALID_ARG_VAL,
+       OSM_EXIT_TYPE_UNKNOWN_CMDLINE_ARG,
+       OSM_EXIT_TYPE_UNKNOWN
+} osm_exit_type_t;
+/***********/
 /****d* OpenSM: OpenSM/osm_routing_engine_type_t
 * NAME
 *       osm_routing_engine_type_t
diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index 53648d6..d3aa55c 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -347,7 +347,7 @@ static void show_usage(void)
        printf("--help, -h, -?\n"
               "          Display this usage info then exit.\n\n");
        fflush(stdout);
-       exit(2);
+       exit(OSM_EXIT_TYPE_USAGE);
 }

 /**********************************************************************
@@ -451,17 +451,17 @@ static int daemonize(osm_opensm_t * osm)

        if ((pid = fork()) < 0) {
                perror("fork");
-               exit(-1);
+               exit(OSM_EXIT_TYPE_FORK_ERR);
        } else if (pid > 0)
-               exit(0);
+               exit(OSM_EXIT_TYPE_NORMAL);

        setsid();

        if ((pid = fork()) < 0) {
                perror("fork");
-               exit(-1);
+               exit(OSM_EXIT_TYPE_FORK_ERR);
        } else if (pid > 0)
-               exit(0);
+               exit(OSM_EXIT_TYPE_NORMAL);

        close(0);
        close(1);
@@ -516,6 +516,7 @@ int main(int argc, char *argv[])
 {
        osm_opensm_t osm;
        osm_subn_opt_t opt;
+       int exit_code = OSM_EXIT_TYPE_NORMAL;
        ib_net64_t sm_key = 0;
        ib_api_status_t status;
        uint32_t temp, dbg_lvl;
@@ -595,7 +596,7 @@ int main(int argc, char *argv[])
                        "ERROR: OpenSM and Complib were compiled using different modes\n");
                fprintf(stderr, "ERROR: OpenSM debug:%d Complib debug:%d \n",
                        osm_is_debug(), cl_is_debug());
-               exit(1);
+               exit(OSM_EXIT_TYPE_DIFFERENT_DEBUG_MODE);
        }
 #if defined (_DEBUG_) && defined (OSM_VENDOR_INTF_OPENIB)
        enable_stack_dump(1);
@@ -615,7 +616,7 @@ int main(int argc, char *argv[])
                                               long_option, NULL);
                switch (next_option) {
                case 12: /* --version - already printed above */
-                       exit(0);
+                       exit(OSM_EXIT_TYPE_NORMAL);
                        break;
                case 'F':
                        if (config_file_done)
@@ -623,7 +624,7 @@ int main(int argc, char *argv[])
                        printf("Reloading config from `%s`:\n", optarg);
                        if (osm_subn_parse_conf_file(optarg, &opt)) {
                                printf("cannot parse config file.\n");
-                               exit(1);
+                               exit(OSM_EXIT_TYPE_CONFIG_PARSE_ERR);
                        }
                        printf("Rescaning command line:\n");
                        config_file_done = 1;
@@ -755,7 +756,7 @@ int main(int argc, char *argv[])
                        if (temp > 7) {
                                fprintf(stderr,
                                        "ERROR: LMC must be 7 or less.\n");
-                               return (-1);
+                               exit(OSM_EXIT_TYPE_INVALID_ARG_VAL);
                        }
                        opt.lmc = (uint8_t) temp;
                        printf(" LMC = %d\n", temp);
@@ -821,7 +822,7 @@ int main(int argc, char *argv[])
                        if (0 > temp || 15 < temp) {
                                fprintf(stderr,
                                        "ERROR: priority must be between 0 and 15\n");
-                               return (-1);
+                               exit(OSM_EXIT_TYPE_INVALID_ARG_VAL);
                        }
                        opt.sm_priority = (uint8_t) temp;
                        printf(" Priority = %d\n", temp);
@@ -931,7 +932,7 @@ int main(int argc, char *argv[])
                case -1:
                        break;  /* done with option */
                default:        /* something wrong */
-                       abort();
+                       exit(OSM_EXIT_TYPE_UNKNOWN_CMDLINE_ARG);
                }
        }
        while (next_option != -1);
@@ -945,7 +946,7 @@ int main(int argc, char *argv[])
                status = osm_subn_write_conf_file(conf_template, &opt);
                if (status)
                        printf("\nosm_subn_write_conf_file failed!\n");
-               exit(status);
+               exit(status? OSM_EXIT_TYPE_CONF_FILE_WRITE_ERR: OSM_EXIT_TYPE_NORMAL);
        }

        if (vendor_debug)
@@ -967,7 +968,7 @@ int main(int argc, char *argv[])
                /* We will just exit, and not go to Exit, since we don't
                   want the destroy to be called. */
                complib_exit();
-               return (status);
+               exit(status);
        }

        /*
@@ -982,6 +983,7 @@ int main(int argc, char *argv[])
                printf("\nError from osm_opensm_bind (0x%X)\n", status);
                printf
                    ("Perhaps another instance of OpenSM is already running\n");
+               exit_code = OSM_EXIT_TYPE_DUPLICATE_OSM_GUID;
                goto Exit;
        }

@@ -1021,5 +1023,5 @@ Exit:
        osm_opensm_destroy(&osm);
        complib_exit();

-       exit(0);
+       exit(exit_code);
 }
--
1.5.4.5


-- 
Timothy A. Meier
Computer Scientist
ICCD/High Performance Computing
925.422.3341
meier3 at llnl.gov
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 0001-Opensm-main-exit-codes.patch
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081118/52dee4bd/attachment.ksh>
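
A small sketch of how the proposed codes could be turned into readable text,
for example by a wrapper that logs why opensm stopped. The enum is copied from
the patch above; the helper `ex_exit_type_str()` and its message strings are
hypothetical and not part of the patch.

```c
#include <stddef.h>

/* Exit codes as proposed in the patch above. */
typedef enum _osm_exit_type {
	OSM_EXIT_TYPE_NORMAL = 0,
	OSM_EXIT_TYPE_GENERIC_ERR,
	OSM_EXIT_TYPE_USAGE,
	OSM_EXIT_TYPE_FORK_ERR,
	OSM_EXIT_TYPE_DIFFERENT_DEBUG_MODE,
	OSM_EXIT_TYPE_DUPLICATE_OSM_GUID,
	OSM_EXIT_TYPE_CONFIG_PARSE_ERR,
	OSM_EXIT_TYPE_CONF_FILE_WRITE_ERR,
	OSM_EXIT_TYPE_INVALID_ARG_VAL,
	OSM_EXIT_TYPE_UNKNOWN_CMDLINE_ARG,
	OSM_EXIT_TYPE_UNKNOWN
} osm_exit_type_t;

/* Hypothetical helper: map an exit status to a short description. */
static const char *ex_exit_type_str(int code)
{
	static const char *const names[] = {
		[OSM_EXIT_TYPE_NORMAL] = "normal exit",
		[OSM_EXIT_TYPE_GENERIC_ERR] = "generic error",
		[OSM_EXIT_TYPE_USAGE] = "usage shown",
		[OSM_EXIT_TYPE_FORK_ERR] = "fork failed",
		[OSM_EXIT_TYPE_DIFFERENT_DEBUG_MODE] = "debug mode mismatch",
		[OSM_EXIT_TYPE_DUPLICATE_OSM_GUID] = "duplicate OpenSM GUID",
		[OSM_EXIT_TYPE_CONFIG_PARSE_ERR] = "config parse error",
		[OSM_EXIT_TYPE_CONF_FILE_WRITE_ERR] = "config write error",
		[OSM_EXIT_TYPE_INVALID_ARG_VAL] = "invalid argument value",
		[OSM_EXIT_TYPE_UNKNOWN_CMDLINE_ARG] = "unknown command line argument",
		[OSM_EXIT_TYPE_UNKNOWN] = "unknown",
	};

	if (code < 0 || code >= (int)(sizeof(names) / sizeof(names[0])))
		return "unrecognized exit code";
	return names[code];
}
```

One practical effect of replacing the old exit(-1) calls: on POSIX systems
only the low 8 bits of the status reach the parent, so exit(-1) is seen by a
shell as status 255, while the enum values above stay small and distinct.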

From kliteyn at dev.mellanox.co.il  Wed Nov 19 01:43:14 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 19 Nov 2008 11:43:14 +0200
Subject: [ofa-general] [PATCH] opensm/osm_lid_mgr.c: cosmetics in log message
Message-ID: <4923DFB2.8000305@dev.mellanox.co.il>

Sasha,

Small log message cosmetics fix.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_lid_mgr.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index c135d4a..c90292a 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -1042,7 +1042,7 @@ __osm_lid_mgr_set_physp_pi(IN osm_lid_mgr_t * const p_mgr,
 		    (op_vls != ib_port_info_get_op_vls(p_old_pi))) {
 			OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
 				"Sending Link Down to GUID 0x%016"
-				PRIx64 "port %d due to op_vls or "
+				PRIx64 " port %d due to op_vls or "
 				"mtu change. MTU:%u,%u VL_CAP:%u,%u\n",
 				cl_ntoh64(osm_physp_get_port_guid(p_physp)),
 				port_num, mtu,
-- 
1.5.1.4
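
For context, the one-character fix matters because adjacent string literals
are concatenated by the compiler, so the text after PRIx64 runs straight into
the printed GUID unless it starts with a space. A minimal stand-alone sketch
(the function names and the shortened message are made up for illustration):

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Pre-patch layout: no space between the GUID and "port". */
static int ex_format_before(char *buf, size_t len, uint64_t guid, unsigned port)
{
	return snprintf(buf, len, "GUID 0x%016" PRIx64 "port %u", guid, port);
}

/* Post-patch layout: the literal after PRIx64 begins with a space. */
static int ex_format_after(char *buf, size_t len, uint64_t guid, unsigned port)
{
	return snprintf(buf, len, "GUID 0x%016" PRIx64 " port %u", guid, port);
}
```

With guid 0x0002c90300001234 and port 3, the first variant yields
"GUID 0x0002c90300001234port 3" and the second
"GUID 0x0002c90300001234 port 3".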


From kliteyn at dev.mellanox.co.il  Wed Nov 19 01:51:48 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 19 Nov 2008 11:51:48 +0200
Subject: [ofa-general] [PATCH] opensm/osm_state_mgr.c: bug fix in unicast
	cache
Message-ID: <4923E1B4.2030600@dev.mellanox.co.il>

Hi Sasha,

When there are errors during initialization and a new
heavy sweep is forced, the unicast cache might hold a
snapshot of the previous routing, and since there
might be no *topology* changes, the ucast cache will
apply that cached routing, which might be wrong.

This patch invalidates the cache explicitly if there
were initialization errors, in addition to a few other
cases.

This fix addresses bug #1398.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_state_mgr.c |   16 ++++++++++++----
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 841438c..d00e8ff 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1064,6 +1064,18 @@ static void do_sweep(osm_sm_t * sm)
 	}

 	/*
+	 * Unicast cache should be invalidated if:
+	 *  - every sweep is a heavy sweep
+	 *  - there were errors during initialization
+	 *  - subnet re-route is requested
+	 */
+	if (sm->p_subn->opt.use_ucast_cache &&
+	    (sm->p_subn->opt.force_heavy_sweep ||
+	     sm->p_subn->subnet_initialization_error ||
+	     sm->p_subn->force_reroute))
+		osm_ucast_cache_invalidate(&sm->ucast_mgr);
+
+	/*
 	 * If we don't need to do a heavy sweep and we want to do a reroute,
 	 * just reroute only.
 	 */
@@ -1079,10 +1091,6 @@ static void do_sweep(osm_sm_t * sm)
 		/* Re-program the switches fully */
 		sm->p_subn->ignore_existing_lfts = TRUE;

-		/* we want to re-route, so cache should be invalidated */
-		if (sm->p_subn->opt.use_ucast_cache)
-			osm_ucast_cache_invalidate(&sm->ucast_mgr);
-
 		osm_ucast_mgr_process(&sm->ucast_mgr);

 		/* Reset flag */
-- 
1.5.1.4
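
The condition added to do_sweep() can be read as a single predicate. The
sketch below restates it outside OpenSM; `struct ex_sweep_flags` is a
hypothetical flattened view of the option and subnet fields the patch
consults, and `ex_should_invalidate_ucast_cache()` is an illustrative name.

```c
#include <stdbool.h>

/* Hypothetical flattened copy of the flags checked by the patch. */
struct ex_sweep_flags {
	bool use_ucast_cache;			/* opt.use_ucast_cache */
	bool force_heavy_sweep;			/* opt.force_heavy_sweep */
	bool subnet_initialization_error;	/* set after init errors */
	bool force_reroute;			/* re-route was requested */
};

/*
 * The cache is dropped only when caching is enabled and at least one
 * of the three "cached routing may be stale" conditions holds.
 */
static bool ex_should_invalidate_ucast_cache(const struct ex_sweep_flags *f)
{
	return f->use_ucast_cache &&
	       (f->force_heavy_sweep ||
		f->subnet_initialization_error ||
		f->force_reroute);
}
```

The case this thread is about is the third row of the comment in the patch: an
initialization error with no topology change, where previously nothing marked
the cached routing as stale.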


From vlad at lists.openfabrics.org  Wed Nov 19 03:43:01 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 19 Nov 2008 03:43:01 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081119-0200 daily build status
Message-ID: <20081119114302.18511E60E83@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From dorfman.eli at gmail.com  Wed Nov 19 04:08:27 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Wed, 19 Nov 2008 14:08:27 +0200
Subject: [ofa-general] Re: [PATCH] opensm: disable
	switch ports only
In-Reply-To: <f0e08f230811180901p5c788f17i6f3eee6cfc0a54f8@mail.gmail.com>
References: <20081118125130.GR10251@sashak.voltaire.com>	<20081118125325.GS10251@sashak.voltaire.com>	<f0e08f230811180504m5108162dj62693023b5b36b1c@mail.gmail.com>	<20081118132922.GT10251@sashak.voltaire.com>	<f0e08f230811180805t1bd30d31v18c7ef2441df66ed@mail.gmail.com>	<20081118165632.GW10251@sashak.voltaire.com>
	<f0e08f230811180901p5c788f17i6f3eee6cfc0a54f8@mail.gmail.com>
Message-ID: <492401BB.3050808@gmail.com>

Hal Rosenstock wrote:
> On Tue, Nov 18, 2008 at 11:56 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>> On 11:05 Tue 18 Nov     , Hal Rosenstock wrote:
>>>> but when endport is disabled how could we reenable this
>>>> remotely via downed link?
>>> I don't understand what you mean. ibportstate does not allow disabling
>>> of end port.
>> Right. And this is the reason to disable switch external port. And also
>> if we have (hypothetically) the tool which is able to disable/enable
>> endport, we will not be able to access this endport via downed link,
>> only local reset/reboot will help.
> 
> Yes, disabling the switch peer port is the alternative to disabling
> the end port. The latter is not allowed and I agree that disabling the
> switch peer port is a better choice IMO as the admin can't shoot
> himself in the foot and have to reset/reboot to reenable.
> 

A more generic approach would be to disable the port with the least hop count.
This would address the case of an inter-switch link where the more remote port (from OpenSM) is sending traps.
In that case we would like to disable the nearest switch port.

Eli. 
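
A rough sketch of the least-hop-count idea, under assumptions: each port
records how many hops its node is from the SM, and `struct ex_hop_port` and
`ex_nearest_end()` are hypothetical names, not OpenSM structures.

```c
#include <stddef.h>

/* Hypothetical port record carrying the hop count of its node. */
struct ex_hop_port {
	unsigned hops_from_sm;		/* hops from the SM to this port's node */
	struct ex_hop_port *remote;	/* peer across the link, or NULL */
};

/*
 * Of the two ends of the link the trap source sits on, return the
 * one closer to the SM. Ties keep the trap source itself.
 */
static struct ex_hop_port *ex_nearest_end(struct ex_hop_port *src)
{
	if (src == NULL || src->remote == NULL)
		return src;
	if (src->remote->hops_from_sm < src->hops_from_sm)
		return src->remote;
	return src;
}
```

This only picks between the two ends of one link; whether the chosen end is a
port the SM is allowed to disable is a separate check.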
 

From hal.rosenstock at gmail.com  Wed Nov 19 07:00:25 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 19 Nov 2008 10:00:25 -0500
Subject: [ofa-general] Re: [PATCH] opensm: disable switch ports
	only
In-Reply-To: <492401BB.3050808@gmail.com>
References: <20081118125130.GR10251@sashak.voltaire.com>
	<20081118125325.GS10251@sashak.voltaire.com>
	<f0e08f230811180504m5108162dj62693023b5b36b1c@mail.gmail.com>
	<20081118132922.GT10251@sashak.voltaire.com>
	<f0e08f230811180805t1bd30d31v18c7ef2441df66ed@mail.gmail.com>
	<20081118165632.GW10251@sashak.voltaire.com>
	<f0e08f230811180901p5c788f17i6f3eee6cfc0a54f8@mail.gmail.com>
	<492401BB.3050808@gmail.com>
Message-ID: <f0e08f230811190700v35a83c5cl6736ea2f4d734ec5@mail.gmail.com>

On Wed, Nov 19, 2008 at 7:08 AM, Eli Dorfman <dorfman.eli at gmail.com> wrote:

> More generic approach would be to disable the port with the least hop count.
> this will address the case of inter switch link where the most remote port (from opensm) is sending traps.
> in that case we would like to disable the nearest switch port.

Yes, that's another approach and has the potential advantage of
disabling fewer ports when more ports are babbling. It does assume a
closer interswitch link which doesn't affect any other endports.

Anyhow, IMO the trap rate issue has been around long enough to have
been fixed in those SMAs.

-- Hal

> Eli.


From sashak at voltaire.com  Wed Nov 19 09:50:25 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 19 Nov 2008 19:50:25 +0200
Subject: [ofa-general] [PATCH] opensm: fix QoS config bug
Message-ID: <20081119175025.GH6183@sashak.voltaire.com>


When the config file is not given or OpenSM cannot open it, the config
verification procedure does not run, and as a result the QoS parameters keep
wrong values; OpenSM then crashes later when '-Q' is used.

This addresses bug #1401.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_subnet.h |    1 +
 opensm/opensm/main.c               |    2 ++
 opensm/opensm/osm_subnet.c         |    8 +++++---
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 2bcd232..d97d5f4 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -1100,6 +1100,7 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t * const p_opt);
 *	Assumes the conf file is part of the cache dir which defaults to
 *	OSM_DEFAULT_CACHE_DIR or OSM_CACHE_DIR the name is opensm.opts
 *********/
+int osm_subn_verify_config(osm_subn_opt_t * const p_opt);
 
 END_C_DECLS
 #endif				/* _OSM_SUBNET_H_ */
diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index 53648d6..999e92f 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -948,6 +948,8 @@ int main(int argc, char *argv[])
 		exit(status);
 	}
 
+	osm_subn_verify_config(&opt);
+
 	if (vendor_debug)
 		osm_vendor_set_debug(osm.p_vendor, vendor_debug);
 
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index d787fe8..c41962d 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -949,7 +949,7 @@ static void subn_verify_qos_set(osm_qos_options_t *set, const char *prefix,
 	subn_verify_sl2vl(&set->sl2vl, prefix, dflt->sl2vl);
 }
 
-static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
+int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts)
 {
 	if (p_opts->lmc > 7) {
 		log_report(" Invalid Cached Option Value:lmc = %u:"
@@ -1024,6 +1024,8 @@ static void subn_verify_conf_file(IN osm_subn_opt_t * const p_opts)
 		    OSM_PERFMGR_DEFAULT_MAX_OUTSTANDING_QUERIES;
 	}
 #endif
+
+	return 0;
 }
 
 /**********************************************************************
@@ -1285,7 +1287,7 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts)
 	}
 	fclose(opts_file);
 
-	subn_verify_conf_file(p_opts);
+	osm_subn_verify_config(p_opts);
 
 	return 0;
 }
@@ -1340,7 +1342,7 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
 	}
 	fclose(opts_file);
 
-	subn_verify_conf_file(&p_subn->opt);
+	osm_subn_verify_config(&p_subn->opt);
 
 	osm_parse_prefix_routes_file(p_subn);
 
-- 
1.6.0.3.517.g759a
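
The idea of the fix, verification as its own step that runs whether or not a
file was parsed, can be sketched as follows. Everything here is a simplified
stand-in: `struct ex_opts`, `EX_UNSET`, `EX_DEFAULT_MAX_VLS` and
`ex_verify_config()` are hypothetical, and only the lmc <= 7 limit is taken
from the code quoted in this thread.

```c
/* Hypothetical marker for "never configured" and a made-up default. */
#define EX_UNSET		(-1)
#define EX_DEFAULT_MAX_VLS	15

struct ex_opts {
	unsigned lmc;		/* valid range 0..7 */
	int qos_max_vls;	/* EX_UNSET until configured */
};

/*
 * Replace out-of-range or unset values with defaults.
 * Returns the number of fields that had to be corrected.
 */
static int ex_verify_config(struct ex_opts *o)
{
	int fixed = 0;

	if (o->lmc > 7) {
		o->lmc = 0;
		fixed++;
	}
	if (o->qos_max_vls < 1 || o->qos_max_vls > 15) {
		o->qos_max_vls = EX_DEFAULT_MAX_VLS;
		fixed++;
	}
	return fixed;
}
```

If such a step is called only from the file parser, a run without a readable
file leaves the unset marker in place, which is the shape of the crash
described above; calling it once more from main() after option parsing closes
that gap.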


From sashak at voltaire.com  Wed Nov 19 10:30:20 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 19 Nov 2008 20:30:20 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_port_info_rcv.c: Remove SM from
	sm_guid_tbl when IsSM is not present
In-Reply-To: <4922E98A.7020403@obsidianresearch.com>
References: <4922E98A.7020403@obsidianresearch.com>
Message-ID: <20081119183020.GJ6183@sashak.voltaire.com>

Hi Hal,

On 09:12 Tue 18 Nov     , Hal Rosenstock wrote:
> Sasha,
>
> The following patch (attached this time:-) removes the SM from the 
> sm_guid_table when IsSM is not present. Compile tested only as I don't have 
> an environment to recreate this anymore.
>
> -- Hal

> opensm/osm_port_info_rcv.c: Remove SM from sm_guid_tbl when IsSM is 
> not present in PortInfo:CapabilityMask
> 
> SM should be removed from the sm_guid_tbl subsequent to a trap 144
> indicating the capability mask changed (and the new capabilities
> no longer include IsSM).
> 
> As a result of this, move clearing of SM state to be conditionalized on 
> IsSM present rather than regardless of whether IsSM is set
> 
> Prior to this patch, the OpenSM log is spammed with error messages on
> SubnGets of SMInfo attribute.
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> 
> diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
> index 47eb457..97ec5b3 100644
> --- a/opensm/opensm/osm_port_info_rcv.c
> +++ b/opensm/opensm/osm_port_info_rcv.c
> @@ -149,17 +149,17 @@ __osm_pi_rcv_process_endport(IN osm_sm_t * sm,
>  			 */
>  			__osm_pi_rcv_set_sm(sm, p_physp);
>  	} else {
> -		/*
> -		   Before querying the SM - we want to make sure we clean its state, so
> -		   if the querying fails we recognize that this SM is not active.
> -		 */
>  		p_sm_tbl = &sm->p_subn->sm_guid_tbl;
> -		p_sm = (osm_remote_sm_t *) cl_qmap_get(p_sm_tbl, port_guid);
> -		if (p_sm != (osm_remote_sm_t *) cl_qmap_end(p_sm_tbl))
> -			/* clean it up */
> -			p_sm->smi.pri_state = 0xF0 & p_sm->smi.pri_state;
> -
>  		if (p_pi->capability_mask & IB_PORT_CAP_IS_SM) {
> +			/*
> +			 * Before querying the SM - we want to make sure we
> +			 * clean its state, so if the querying fails we
> +			 * recognize that this SM is not active.
> +			 */
> +			p_sm = (osm_remote_sm_t *) cl_qmap_get(p_sm_tbl, port_guid);
> +			if (p_sm != (osm_remote_sm_t *) cl_qmap_end(p_sm_tbl))
> +				/* clean it up */
> +				p_sm->smi.pri_state = 0xF0 & p_sm->smi.pri_state;
>  			if (sm->p_subn->opt.ignore_other_sm)
>  				OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
>  					"Ignoring SM on port 0x%" PRIx64 "\n",
> @@ -171,7 +171,8 @@ __osm_pi_rcv_process_endport(IN osm_sm_t * sm,
>  					cl_ntoh64(port_guid));
>  
>  				/*
> -				   This port indicates it's an SM and it's not our own port.
> +				   This port indicates it's an SM and
> +				   it's not our own port.
>  				   Acquire the SMInfo Attribute.
>  				 */
>  				memset(&context, 0, sizeof(context));
> @@ -190,7 +191,8 @@ __osm_pi_rcv_process_endport(IN osm_sm_t * sm,
>  						"Failure requesting SMInfo (%s)\n",
>  						ib_get_err_str(status));
>  			}
> -		}
> +		} else
> +			cl_qmap_remove(p_sm_tbl, port_guid);

Shouldn't it be freed too? Something like:

	p_sm = cl_qmap_remove(p_sm_tbl, port_guid);
	free(p_sm);

Sasha

>  	}
>  
>  	OSM_LOG_EXIT(sm->p_log);
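
A note on the remove-and-free suggestion above: to my recollection complib's
cl_qmap_remove() returns the map end item, not NULL, when the key is absent,
so an unconditional free() could hit the sentinel. That is an assumption worth
checking against cl_qmap.h. The sketch below shows the guarded pattern on a toy
map; `struct ex_map`, `ex_map_remove()` and `ex_remove_and_free()` are
hypothetical and only mimic the "return end when missing" convention.

```c
#include <stdlib.h>

/* Toy map: singly linked list terminated by an embedded sentinel. */
struct ex_item {
	unsigned long long key;
	struct ex_item *next;
};

struct ex_map {
	struct ex_item end;	/* sentinel, never heap allocated */
	struct ex_item *head;
};

static void ex_map_init(struct ex_map *m)
{
	m->end.key = 0;
	m->end.next = NULL;
	m->head = &m->end;
}

static void ex_map_insert(struct ex_map *m, struct ex_item *it)
{
	it->next = m->head;
	m->head = it;
}

/* Unlink and return the item with this key, or &m->end if not found. */
static struct ex_item *ex_map_remove(struct ex_map *m, unsigned long long key)
{
	struct ex_item **pp = &m->head;

	while (*pp != &m->end) {
		if ((*pp)->key == key) {
			struct ex_item *found = *pp;
			*pp = found->next;
			return found;
		}
		pp = &(*pp)->next;
	}
	return &m->end;
}

/* Remove and free an entry; returns 1 if something was freed. */
static int ex_remove_and_free(struct ex_map *m, unsigned long long key)
{
	struct ex_item *it = ex_map_remove(m, key);

	if (it == &m->end)
		return 0;	/* not present: the sentinel must not be freed */
	free(it);
	return 1;
}
```

In the patch under review the equivalent guard would compare the result of
cl_qmap_remove() against cl_qmap_end(p_sm_tbl) before calling free().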


From sashak at voltaire.com  Wed Nov 19 10:48:36 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 19 Nov 2008 20:48:36 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_lid_mgr.c: cosmetics in log
	message
In-Reply-To: <4923DFB2.8000305@dev.mellanox.co.il>
References: <4923DFB2.8000305@dev.mellanox.co.il>
Message-ID: <20081119184836.GK6183@sashak.voltaire.com>

On 11:43 Wed 19 Nov     , Yevgeny Kliteynik wrote:
> Sasha,
> 
> Small log message cosmetics fix.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From hal.rosenstock at gmail.com  Wed Nov 19 10:52:33 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 19 Nov 2008 13:52:33 -0500
Subject: [ofa-general] Re: [PATCH] opensm/osm_port_info_rcv.c:
	Remove SM from sm_guid_tbl when IsSM is not present
In-Reply-To: <20081119183020.GJ6183@sashak.voltaire.com>
References: <4922E98A.7020403@obsidianresearch.com>
	<20081119183020.GJ6183@sashak.voltaire.com>
Message-ID: <f0e08f230811191052q51f74facra7ebd12faeb010e@mail.gmail.com>

Sasha,

On Wed, Nov 19, 2008 at 1:30 PM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> Hi Hal,
>
> On 09:12 Tue 18 Nov     , Hal Rosenstock wrote:
>> Sasha,
>>
>> The following patch (attached this time:-) removes the SM from the
>> sm_guid_table when IsSM is not present. Compile tested only as I don't have
>> an environment to recreate this anymore.
>>
>> -- Hal
>
>> opensm/osm_port_info_rcv.c: Remove SM from sm_guid_tbl when IsSM is
>> not present in PortInfo:CapabilityMask
>>
>> SM should be removed from the sm_guid_tbl subsequent to a trap 144
>> indicating the capability mask changed (and the new capabilities
>> no longer include IsSM).
>>
>> As a result of this, move clearing of SM state to be conditionalized on
>> IsSM present rather than regardless of whether IsSM is set
>>
>> Prior to this patch, the OpenSM log is spammed with error messages on
>> SubnGets of SMInfo attribute.
>>
>> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
>>
>> diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
>> index 47eb457..97ec5b3 100644
>> --- a/opensm/opensm/osm_port_info_rcv.c
>> +++ b/opensm/opensm/osm_port_info_rcv.c
>> @@ -149,17 +149,17 @@ __osm_pi_rcv_process_endport(IN osm_sm_t * sm,
>>                        */
>>                       __osm_pi_rcv_set_sm(sm, p_physp);
>>       } else {
>> -             /*
>> -                Before querying the SM - we want to make sure we clean its state, so
>> -                if the querying fails we recognize that this SM is not active.
>> -              */
>>               p_sm_tbl = &sm->p_subn->sm_guid_tbl;
>> -             p_sm = (osm_remote_sm_t *) cl_qmap_get(p_sm_tbl, port_guid);
>> -             if (p_sm != (osm_remote_sm_t *) cl_qmap_end(p_sm_tbl))
>> -                     /* clean it up */
>> -                     p_sm->smi.pri_state = 0xF0 & p_sm->smi.pri_state;
>> -
>>               if (p_pi->capability_mask & IB_PORT_CAP_IS_SM) {
>> +                     /*
>> +                      * Before querying the SM - we want to make sure we
>> +                      * clean its state, so if the querying fails we
>> +                      * recognize that this SM is not active.
>> +                      */
>> +                     p_sm = (osm_remote_sm_t *) cl_qmap_get(p_sm_tbl, port_guid);
>> +                     if (p_sm != (osm_remote_sm_t *) cl_qmap_end(p_sm_tbl))
>> +                             /* clean it up */
>> +                             p_sm->smi.pri_state = 0xF0 & p_sm->smi.pri_state;
>>                       if (sm->p_subn->opt.ignore_other_sm)
>>                               OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
>>                                       "Ignoring SM on port 0x%" PRIx64 "\n",
>> @@ -171,7 +171,8 @@ __osm_pi_rcv_process_endport(IN osm_sm_t * sm,
>>                                       cl_ntoh64(port_guid));
>>
>>                               /*
>> -                                This port indicates it's an SM and it's not our own port.
>> +                                This port indicates it's an SM and
>> +                                it's not our own port.
>>                                  Acquire the SMInfo Attribute.
>>                                */
>>                               memset(&context, 0, sizeof(context));
>> @@ -190,7 +191,8 @@ __osm_pi_rcv_process_endport(IN osm_sm_t * sm,
>>                                               "Failure requesting SMInfo (%s)\n",
>>                                               ib_get_err_str(status));
>>                       }
>> -             }
>> +             } else
>> +                     cl_qmap_remove(p_sm_tbl, port_guid);
>
> Shouldn't it be freed too? Something like:
>
>        p_sm = cl_qmap_remove(p_sm_tbl, port_guid);
>        free(p_sm);

Oops; my bad; revised patch shortly.

-- Hal

> Sasha
>
>>       }
>>
>>       OSM_LOG_EXIT(sm->p_log);
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>


From halr at obsidianresearch.com  Wed Nov 19 10:55:07 2008
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Wed, 19 Nov 2008 11:55:07 -0700
Subject: [ofa-general] [PATCHv2] opensm/osm_port_info_rcv.c: Remove SM from
 sm_guid_tbl when IsSM is not
Message-ID: <4924610B.90809@obsidianresearch.com>

Sasha,

The following patch removes the SM from sm_guid_tbl when IsSM is not
present. v2 fixes the memory leak you pointed out in the original
version. Compile tested only, as I no longer have an environment in
which to recreate this.

-- Hal

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch-pir-issm2
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081119/389d76bb/attachment.ksh>

From sashak at voltaire.com  Wed Nov 19 11:00:59 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 19 Nov 2008 21:00:59 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_state_mgr.c: bug fix in unicast
	cache
In-Reply-To: <4923E1B4.2030600@dev.mellanox.co.il>
References: <4923E1B4.2030600@dev.mellanox.co.il>
Message-ID: <20081119190059.GM6183@sashak.voltaire.com>

Hi Yevgeny,

On 11:51 Wed 19 Nov     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> When there are errors during initialization and new
> heavy sweep is forced, unicast cache might hold a
> snapshot of the previous routing, and since there
> might be no *topology* changes, ucast cache will
> apply that cached routing, which might be wrong.
> 
> This patch invalidates cache explicitly if there
> were initialization errors in addition to few other
> cases.
> 
> This fix addresses bug #1398.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  opensm/opensm/osm_state_mgr.c |   16 ++++++++++++----
>  1 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
> index 841438c..d00e8ff 100644
> --- a/opensm/opensm/osm_state_mgr.c
> +++ b/opensm/opensm/osm_state_mgr.c
> @@ -1064,6 +1064,18 @@ static void do_sweep(osm_sm_t * sm)
>  	}
> 
>  	/*
> +	 * Unicast cache should be invalidated if:
> +	 *  - every sweep is a heavy sweep
> +	 *  - there were errors during initialization
> +	 *  - subnet re-route is requested
> +	 */
> +	if (sm->p_subn->opt.use_ucast_cache &&
> +	    (sm->p_subn->opt.force_heavy_sweep ||

Why should 'opt.force_heavy_sweep' be there? It is possible to enforce
a heavy sweep without the routing cache just by using:

opt.force_heavy_sweep TRUE
opt.use_ucast_cache FALSE

Sasha

> +	     sm->p_subn->subnet_initialization_error ||
> +	     sm->p_subn->force_reroute))
> +		osm_ucast_cache_invalidate(&sm->ucast_mgr);
> +
> +	/*
>  	 * If we don't need to do a heavy sweep and we want to do a reroute,
>  	 * just reroute only.
>  	 */
> @@ -1079,10 +1091,6 @@ static void do_sweep(osm_sm_t * sm)
>  		/* Re-program the switches fully */
>  		sm->p_subn->ignore_existing_lfts = TRUE;
> 
> -		/* we want to re-route, so cache should be invalidated */
> -		if (sm->p_subn->opt.use_ucast_cache)
> -			osm_ucast_cache_invalidate(&sm->ucast_mgr);
> -
>  		osm_ucast_mgr_process(&sm->ucast_mgr);
> 
>  		/* Reset flag */
> -- 
> 1.5.1.4
> 


From yossi.openib at gmail.com  Wed Nov 19 11:30:27 2008
From: yossi.openib at gmail.com (Yossi Etigin)
Date: Wed, 19 Nov 2008 21:30:27 +0200
Subject: [ofa-general] [PATCH] ipoib: Fix loss of connectivity
	after	bonding failover on both sides
In-Reply-To: <adavdulhvk9.fsf@cisco.com>
References: <490B448C.5080306@Voltaire.COM> <4922A855.2010109@Voltaire.COM>
	<adavdulhvk9.fsf@cisco.com>
Message-ID: <49246953.4020101@gmail.com>

I included that copy to avoid releasing and re-allocating the ipoib neighbour
for every packet xmit'ed before the path query completes. I thought it was
good enough to do that just once, the first time. So, to make the mgid test
pass for the second xmit, I copied the mgid even if the path query fails.
It turns out that this is not a good thing to do: if a path query fails,
nothing re-triggers it except an ARP refresh, and that takes too long. In case of
SM failover, the first path query does indeed fail.
So the best thing is probably to remove this "optimization". Apart from that, the
patch works the same as before.


Roland Dreier wrote:
>  > The patch below is identical to Yossi's patch but without the copy of HA in neigh_add_path.
> 
> Why did Yossi include that copy?  Does this patch still fix everything?
> 
>  - R.


From sashak at voltaire.com  Wed Nov 19 11:33:32 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 19 Nov 2008 21:33:32 +0200
Subject: [ofa-general] Re: [PATCHv2] opensm/osm_port_info_rcv.c: Remove SM
	from sm_guid_tbl when IsSM is not
In-Reply-To: <4924610B.90809@obsidianresearch.com>
References: <4924610B.90809@obsidianresearch.com>
Message-ID: <20081119193332.GN6183@sashak.voltaire.com>

Hi Hal,

On 11:55 Wed 19 Nov     , Hal Rosenstock wrote:
> Sasha,
>
> The following patch removes the SM from the sm_guid_table when IsSM is not 
> present.
> v2 of this fixes the memory leak you pointed out in the original version. 
> Compile tested only as I don't have an environment to recreate this 
> anymore.
>
> -- Hal
>

> opensm/osm_port_info_rcv.c: Remove SM from sm_guid_tbl when IsSM is 
> not present in PortInfo:CapabilityMask
> 
> SM should be removed from the sm_guid_tbl subsequent to a trap 144
> indicating the capability mask changed (and the new capabilities
> no longer include IsSM).
> 
> As a result of this, move clearing of SM state to be conditionalized on 
> IsSM present rather than regardless of whether IsSM is set.
> 
> Prior to this patch, the OpenSM log is spammed with error messages on
> SubnGets of SMInfo attribute.
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> v2 fixes memory leak pointed out by Sasha.
> 
> diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
> index 47eb457..5988dc3 100644
> --- a/opensm/opensm/osm_port_info_rcv.c
> +++ b/opensm/opensm/osm_port_info_rcv.c
> @@ -149,17 +149,17 @@ __osm_pi_rcv_process_endport(IN osm_sm_t * sm,
>  			 */
>  			__osm_pi_rcv_set_sm(sm, p_physp);
>  	} else {
> -		/*
> -		   Before querying the SM - we want to make sure we clean its state, so
> -		   if the querying fails we recognize that this SM is not active.
> -		 */
>  		p_sm_tbl = &sm->p_subn->sm_guid_tbl;
> -		p_sm = (osm_remote_sm_t *) cl_qmap_get(p_sm_tbl, port_guid);
> -		if (p_sm != (osm_remote_sm_t *) cl_qmap_end(p_sm_tbl))
> -			/* clean it up */
> -			p_sm->smi.pri_state = 0xF0 & p_sm->smi.pri_state;
> -
>  		if (p_pi->capability_mask & IB_PORT_CAP_IS_SM) {
> +			/*
> +			 * Before querying the SM - we want to make sure we
> +			 * clean its state, so if the querying fails we
> +			 * recognize that this SM is not active.
> +			 */
> +			p_sm = (osm_remote_sm_t *) cl_qmap_get(p_sm_tbl, port_guid);
> +			if (p_sm != (osm_remote_sm_t *) cl_qmap_end(p_sm_tbl))
> +				/* clean it up */
> +				p_sm->smi.pri_state = 0xF0 & p_sm->smi.pri_state;
>  			if (sm->p_subn->opt.ignore_other_sm)
>  				OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
>  					"Ignoring SM on port 0x%" PRIx64 "\n",
> @@ -171,7 +171,8 @@ __osm_pi_rcv_process_endport(IN osm_sm_t * sm,
>  					cl_ntoh64(port_guid));
>  
>  				/*
> -				   This port indicates it's an SM and it's not our own port.
> +				   This port indicates it's an SM and
> +				   it's not our own port.
>  				   Acquire the SMInfo Attribute.
>  				 */
>  				memset(&context, 0, sizeof(context));
> @@ -190,6 +191,9 @@ __osm_pi_rcv_process_endport(IN osm_sm_t * sm,
>  						"Failure requesting SMInfo (%s)\n",
>  						ib_get_err_str(status));
>  			}
> +		} else {
> +			p_sm = (osm_remote_sm_t *) cl_qmap_remove(p_sm_tbl, port_guid);
> +			free(p_sm);

Sorry about my simplified example. Actually it should be:

	if (p_sm != (osm_remote_sm_t *) cl_qmap_end(p_sm_tbl))
		free(p_sm);

Since many ports may not have the IsSM bit set, in which case there is
no entry in sm_guid_tbl to free.

The patch is applied with this fix. Thanks.

Sasha
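
The comparison against cl_qmap_end() in the fix above implies that removing
a key which is not in the map yields the map's end sentinel rather than a
heap object, so it must not be passed to free(). A minimal sketch of that
pattern follows. It assumes a toy singly linked map: toy_map, toy_item and
remove_and_free are hypothetical names used only for illustration and are
not the complib cl_qmap API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy map, for illustration only (not complib). */
struct toy_item {
	uint64_t key;
	struct toy_item *next;
};

struct toy_map {
	struct toy_item end;	/* sentinel embedded in the map, never malloc'ed */
	struct toy_item *head;
};

void toy_map_init(struct toy_map *m)
{
	m->end.key = 0;
	m->end.next = NULL;
	m->head = &m->end;
}

struct toy_item *toy_map_end(struct toy_map *m)
{
	return &m->end;
}

void toy_map_insert(struct toy_map *m, struct toy_item *it, uint64_t key)
{
	assert(it != &m->end);
	it->key = key;
	it->next = m->head;
	m->head = it;
}

/* Unlink and return the item for 'key', or the end sentinel if absent. */
struct toy_item *toy_map_remove(struct toy_map *m, uint64_t key)
{
	struct toy_item **pp = &m->head;

	while (*pp != &m->end) {
		if ((*pp)->key == key) {
			struct toy_item *it = *pp;
			*pp = it->next;
			return it;
		}
		pp = &(*pp)->next;
	}
	return &m->end;
}

/* Returns 1 if an entry was removed and freed, 0 if there was none. */
int remove_and_free(struct toy_map *m, uint64_t key)
{
	struct toy_item *it = toy_map_remove(m, key);

	if (it == toy_map_end(m))
		return 0;	/* nothing stored: freeing the sentinel would be a bug */
	free(it);
	return 1;
}
```

Calling remove_and_free() for a port that never had an entry is then a
harmless no-op, which is the case for every port without the IsSM bit.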


From sashak at voltaire.com  Wed Nov 19 11:38:06 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 19 Nov 2008 21:38:06 +0200
Subject: [ofa-general] Re: [PATCH] opensm: disable switch
	ports only
In-Reply-To: <492401BB.3050808@gmail.com>
References: <20081118125130.GR10251@sashak.voltaire.com>
	<20081118125325.GS10251@sashak.voltaire.com>
	<f0e08f230811180504m5108162dj62693023b5b36b1c@mail.gmail.com>
	<20081118132922.GT10251@sashak.voltaire.com>
	<f0e08f230811180805t1bd30d31v18c7ef2441df66ed@mail.gmail.com>
	<20081118165632.GW10251@sashak.voltaire.com>
	<f0e08f230811180901p5c788f17i6f3eee6cfc0a54f8@mail.gmail.com>
	<492401BB.3050808@gmail.com>
Message-ID: <20081119193806.GO6183@sashak.voltaire.com>

Hi Eli,

On 14:08 Wed 19 Nov     , Eli Dorfman wrote:
> 
> More generic approach would be to disable the port with the least hop count.
> this will address the case of inter switch link where the most remote port (from opensm) is sending traps.
> in that case we would like to disable the nearest switch port.

Seems reasonable to me. Would you care to send a patch?

Sasha


From yosefe at Voltaire.COM  Wed Nov 19 11:53:27 2008
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Wed, 19 Nov 2008 21:53:27 +0200
Subject: [ofa-general] [PATCH] ipoib: do not join broadcast group if
 interface is brought down
Message-ID: <49246EB7.3070607@Voltaire.COM>

Because ipoib_workqueue is not flushed when the ipoib interface is brought down,
ipoib_mcast_join() may trigger a join to the broadcast group after priv->broadcast
was set to NULL (during cleanup). This causes ipoib to join the broadcast group
while the interface is down.
As a side effect, this breaks the optimization of setting the qkey only when joining
the broadcast group.

Signed-off-by: Yossi Etigin <yosefe at voltaire.com>

--

Fix bugzilla 1370.

Index: b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2008-11-19 21:33:54.000000000 +0200
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2008-11-19 21:40:12.000000000 +0200
@@ -565,7 +565,8 @@ void ipoib_mcast_join_task(struct work_s
 			ipoib_warn(priv, "ib_query_port failed\n");
 	}
 
-	if (!priv->broadcast) {
+	rtnl_lock();
+	if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags) && !priv->broadcast) {
 		struct ipoib_mcast *broadcast;
 
 		broadcast = ipoib_mcast_alloc(dev, 1);
@@ -576,6 +577,7 @@ void ipoib_mcast_join_task(struct work_s
 				queue_delayed_work(ipoib_workqueue,
 						   &priv->mcast_join_task, HZ);
 			mutex_unlock(&mcast_mutex);
+			rtnl_unlock();
 			return;
 		}
 
@@ -587,6 +589,7 @@ void ipoib_mcast_join_task(struct work_s
 		__ipoib_mcast_add(dev, priv->broadcast);
 		spin_unlock_irq(&priv->lock);
 	}
+	rtnl_unlock();
 
 	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
 		if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
-- 
--Yossi


From kliteyn at dev.mellanox.co.il  Wed Nov 19 14:00:06 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 20 Nov 2008 00:00:06 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_state_mgr.c: bug fix in unicast
	cache
In-Reply-To: <20081119190059.GM6183@sashak.voltaire.com>
References: <4923E1B4.2030600@dev.mellanox.co.il>
	<20081119190059.GM6183@sashak.voltaire.com>
Message-ID: <49248C66.2010403@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 11:51 Wed 19 Nov     , Yevgeny Kliteynik wrote:
>> Hi Sasha,
>>
>> When there are errors during initialization and new
>> heavy sweep is forced, unicast cache might hold a
>> snapshot of the previous routing, and since there
>> might be no *topology* changes, ucast cache will
>> apply that cached routing, which might be wrong.
>>
>> This patch invalidates cache explicitly if there
>> were initialization errors in addition to few other
>> cases.
>>
>> This fix addresses bug #1398.
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
>>  opensm/opensm/osm_state_mgr.c |   16 ++++++++++++----
>>  1 files changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
>> index 841438c..d00e8ff 100644
>> --- a/opensm/opensm/osm_state_mgr.c
>> +++ b/opensm/opensm/osm_state_mgr.c
>> @@ -1064,6 +1064,18 @@ static void do_sweep(osm_sm_t * sm)
>>  	}
>>
>>  	/*
>> +	 * Unicast cache should be invalidated if:
>> +	 *  - every sweep is a heavy sweep
>> +	 *  - there were errors during initialization
>> +	 *  - subnet re-route is requested
>> +	 */
>> +	if (sm->p_subn->opt.use_ucast_cache &&
>> +	    (sm->p_subn->opt.force_heavy_sweep ||
> 
> Why 'opt.force_heavy_sweep' should be there? It is possible to enforce
> heavy sweep without routing cache just by using:
> 
> opt.force_heavy_sweep TRUE
> opt.use_ucast_cache FALSE

Well, it doesn't have to be there.
opt.force_heavy_sweep is a kind of debug mode for opensm,
so I just wanted to disable the cache in that case.
Want me to remove it and repost the patch?

-- Yevgeny

> Sasha
> 
>> +	     sm->p_subn->subnet_initialization_error ||
>> +	     sm->p_subn->force_reroute))
>> +		osm_ucast_cache_invalidate(&sm->ucast_mgr);
>> +
>> +	/*
>>  	 * If we don't need to do a heavy sweep and we want to do a reroute,
>>  	 * just reroute only.
>>  	 */
>> @@ -1079,10 +1091,6 @@ static void do_sweep(osm_sm_t * sm)
>>  		/* Re-program the switches fully */
>>  		sm->p_subn->ignore_existing_lfts = TRUE;
>>
>> -		/* we want to re-route, so cache should be invalidated */
>> -		if (sm->p_subn->opt.use_ucast_cache)
>> -			osm_ucast_cache_invalidate(&sm->ucast_mgr);
>> -
>>  		osm_ucast_mgr_process(&sm->ucast_mgr);
>>
>>  		/* Reset flag */
>> -- 
>> 1.5.1.4
>>
> 


From tziporet at dev.mellanox.co.il  Wed Nov 19 16:02:22 2008
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 19 Nov 2008 18:02:22 -0600
Subject: [ofa-general][PATCH 1/3]mlx4: Multiple completion vectors support
In-Reply-To: <adaskpvl3pz.fsf@cisco.com>
References: <4907348E.7060508@mellanox.co.il>
	<490A8FA9.7080802@pobox.com>	<aday7047jos.fsf@cisco.com>
	<490DA91A.1030703@pobox.com>	<adaprlew1wd.fsf@cisco.com>
	<490DD27C.4070109@pobox.com>	<491C41F0.3080304@mellanox.co.il>
	<adaskpvl3pz.fsf@cisco.com>
Message-ID: <4924A90E.9050205@mellanox.co.il>

Roland Dreier wrote:
> This is 2.6.29 material, and I should be able to get to it in the next
> few weeks.
>   
Great

Tziporet


From sashak at voltaire.com  Wed Nov 19 16:42:59 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 20 Nov 2008 02:42:59 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_state_mgr.c: bug fix in unicast
	cache
In-Reply-To: <49248C66.2010403@dev.mellanox.co.il>
References: <4923E1B4.2030600@dev.mellanox.co.il>
	<20081119190059.GM6183@sashak.voltaire.com>
	<49248C66.2010403@dev.mellanox.co.il>
Message-ID: <20081120004259.GP6486@sashak.voltaire.com>

On 00:00 Thu 20 Nov     , Yevgeny Kliteynik wrote:
> Sasha Khapyorsky wrote:
>> Hi Yevgeny,
>> On 11:51 Wed 19 Nov     , Yevgeny Kliteynik wrote:
>>> Hi Sasha,
>>>
>>> When there are errors during initialization and new
>>> heavy sweep is forced, unicast cache might hold a
>>> snapshot of the previous routing, and since there
>>> might be no *topology* changes, ucast cache will
>>> apply that cached routing, which might be wrong.
>>>
>>> This patch invalidates cache explicitly if there
>>> were initialization errors in addition to few other
>>> cases.
>>>
>>> This fix addresses bug #1398.
>>>
>>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>>> ---
>>>  opensm/opensm/osm_state_mgr.c |   16 ++++++++++++----
>>>  1 files changed, 12 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/opensm/opensm/osm_state_mgr.c 
>>> b/opensm/opensm/osm_state_mgr.c
>>> index 841438c..d00e8ff 100644
>>> --- a/opensm/opensm/osm_state_mgr.c
>>> +++ b/opensm/opensm/osm_state_mgr.c
>>> @@ -1064,6 +1064,18 @@ static void do_sweep(osm_sm_t * sm)
>>>  	}
>>>
>>>  	/*
>>> +	 * Unicast cache should be invalidated if:
>>> +	 *  - every sweep is a heavy sweep
>>> +	 *  - there were errors during initialization
>>> +	 *  - subnet re-route is requested
>>> +	 */
>>> +	if (sm->p_subn->opt.use_ucast_cache &&
>>> +	    (sm->p_subn->opt.force_heavy_sweep ||
>> Why 'opt.force_heavy_sweep' should be there? It is possible to enforce
>> heavy sweep without routing cache just by using:
>> opt.force_heavy_sweep TRUE
>> opt.use_ucast_cache FALSE
>
> Well, it doesn't have to be there.
> The opt.force_heavy_sweep is kind of debug mode of opensm,
> so I just wanted to disable cache in that case.
> Want me to remove it and repost the patch?

Yes, please.

Sasha


From dorfman.eli at gmail.com  Thu Nov 20 00:00:38 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Thu, 20 Nov 2008 10:00:38 +0200
Subject: [ofa-general] [PATCH] opensm/osm_trap_rcv.c disable the
 port with the least hop count
Message-ID: <49251926.9090509@gmail.com>

Disable the port with the least hop count.
This addresses the case of an inter-switch link where the
most remote port (from opensm) is sending traps.
In that case we would like to disable the nearest switch port (from opensm).

Signed-off-by: Eli Dorfman <elid at voltaire.com>
---
 opensm/opensm/osm_trap_rcv.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index 07c5183..d1dfbd4 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -239,8 +239,8 @@ static int disable_port(osm_sm_t *sm, osm_physp_t *p)
 	ib_port_info_t *pi = (ib_port_info_t *)payload;
 	int ret;
 
-	/* in case of endport - disable switch's peer port */
-	if (osm_node_get_type(p->p_node) != IB_NODE_TYPE_SWITCH)
+	/* select the nearest port to master opensm */
+	if (p->dr_path.hop_count > p->p_remote_physp->dr_path.hop_count)
 		p = p->p_remote_physp;
 
 	/* If trap 131, might want to disable peer port if available */
-- 
1.5.5
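
The selection rule in the patch above can be sketched in isolation as
follows. This is only an illustration under stated assumptions: toy_port
and nearest_port are hypothetical names, and the NULL check on the peer
is an extra guard added for the sketch. The patch itself dereferences
p_remote_physp directly, and whether it can be NULL at that point is not
established here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a physical port, for illustration only. */
struct toy_port {
	unsigned hop_count;		/* hops from the SM along the directed route */
	struct toy_port *remote;	/* peer port on the link, or NULL if unknown */
};

/*
 * Of the two ends of a link, return the one closer to the SM.
 * If there is no known peer, or on a tie, keep the port we were given.
 */
struct toy_port *nearest_port(struct toy_port *p)
{
	assert(p != NULL);
	if (p->remote && p->hop_count > p->remote->hop_count)
		return p->remote;
	return p;
}
```

With a switch port at 2 hops linked to a port at 3 hops, nearest_port()
returns the switch port whichever end reported the trap.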


From kliteyn at dev.mellanox.co.il  Thu Nov 20 00:33:19 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 20 Nov 2008 10:33:19 +0200
Subject: [ofa-general] [PATCH v2] opensm/osm_state_mgr.c: bug fix in unicast
	cache
Message-ID: <492520CF.4080001@dev.mellanox.co.il>

Hi Sasha,

When there are errors during initialization and a new
heavy sweep is forced, the unicast cache might hold a
snapshot of the previous routing, and since there
might be no *topology* changes, the unicast cache will
apply that cached routing, which might be wrong.

This patch invalidates the cache explicitly if there
were initialization errors, in addition to a few other
cases.

V2: don't invalidate cache when
    opt.force_heavy_sweep is on.

This fix addresses bug #1398.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/opensm/osm_state_mgr.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 841438c..788da51 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1064,6 +1064,15 @@ static void do_sweep(osm_sm_t * sm)
 	}

 	/*
+	 * Unicast cache should be invalidated if there were errors
+	 * during initialization or if subnet re-route is requested.
+	 */
+	if (sm->p_subn->opt.use_ucast_cache &&
+	    (sm->p_subn->subnet_initialization_error ||
+	     sm->p_subn->force_reroute))
+		osm_ucast_cache_invalidate(&sm->ucast_mgr);
+
+	/*
 	 * If we don't need to do a heavy sweep and we want to do a reroute,
 	 * just reroute only.
 	 */
@@ -1079,10 +1088,6 @@ static void do_sweep(osm_sm_t * sm)
 		/* Re-program the switches fully */
 		sm->p_subn->ignore_existing_lfts = TRUE;

-		/* we want to re-route, so cache should be invalidated */
-		if (sm->p_subn->opt.use_ucast_cache)
-			osm_ucast_cache_invalidate(&sm->ucast_mgr);
-
 		osm_ucast_mgr_process(&sm->ucast_mgr);

 		/* Reset flag */
-- 
1.5.1.4


From jackm at dev.mellanox.co.il  Thu Nov 20 02:11:45 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 20 Nov 2008 12:11:45 +0200
Subject: [ofa-general] Race condition in userspace libraries with
	create/destroy qp
Message-ID: <200811201211.46527.jackm@dev.mellanox.co.il>

Roland,

Mazal Tov again on the birth of your son.  I hope all is well.
Has your latency improved (by some miracle)?

There seems to be a race in libmlx4 (which our regression testing found).

mlx4_create_qp and mlx4_destroy_qp are not atomic WRT each other. If one thread is
destroying a QP while another is creating a QP, there is a race window.  The destroying thread
can lose its timeslice after it has deleted the QP from kernel space, but before it has cleared
it from the userspace store (mlx4_clear_qp).
If the other thread creates a QP during this window, it can get the same QP base number and overwrite
the destroyed QP's entry with mlx4_store_qp().

When the destroying thread resumes, it clears the new QP's entry from the userspace store via
mlx4_clear_qp.

I'm debating between a few options:
1. Move the mlx4_clear_qp call to precede ibv_cmd_destroy_qp. However, what if we're still getting
   completions for this QP? Ouch.

2. Create a mutex for this purpose, and use it to make the create and destroy QP operations
   atomic WRT each other, covering both the ibv_cmd_xxx_qp operations and the store/clear QP operations.

3. Force kernel space to avoid allocating a just-deleted QP number (this is my least favorite option).

My preference is for #2, as being the simplest to implement and having no side-effects.

What do you think?

- Jack

(BTW libmthca has the same issue).

========================================================

From file libmlx4/src/verbs.c:

mlx4_create_qp snippet:

        ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd,
                                &resp, sizeof resp);
        if (ret)
                goto err_rq_db;

        ret = mlx4_store_qp(to_mctx(pd->context), qp->ibv_qp.qp_num, qp);
        if (ret)
                goto err_destroy;

mlx4_destroy_qp snippet:

        ret = ibv_cmd_destroy_qp(ibqp);
        if (ret)
                return ret;
==> CAN LOSE TIME SLICE HERE!!!
        mlx4_lock_cqs(ibqp);

        __mlx4_cq_clean(to_mcq(ibqp->recv_cq), ibqp->qp_num,
                        ibqp->srq ? to_msrq(ibqp->srq) : NULL);
        if (ibqp->send_cq != ibqp->recv_cq)
                __mlx4_cq_clean(to_mcq(ibqp->send_cq), ibqp->qp_num, NULL);

        mlx4_clear_qp(to_mctx(ibqp->context), ibqp->qp_num);

        mlx4_unlock_cqs(ibqp);
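
Option 2 could look roughly like the sketch below. It is a self-contained
illustration, not libmlx4 code: qp_table, qpn_in_use, kernel_create_qp,
kernel_destroy_qp, create_qp, destroy_qp and lookup_qp are all hypothetical
stand-ins, and the "kernel" is modelled as a trivial lowest-free-number
allocator so that QP number reuse happens immediately.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QP_TABLE_SIZE 16

/* Hypothetical stand-in for the per-context userspace QP store. */
static void *qp_table[QP_TABLE_SIZE];
/* Hypothetical stand-in for the kernel's QP number allocator. */
static int qpn_in_use[QP_TABLE_SIZE];
/* One mutex covers the "kernel" call and the store update together. */
static pthread_mutex_t qp_table_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Pretend kernel call: hand out the lowest free QP number, or -1. */
static int kernel_create_qp(void)
{
	for (int i = 0; i < QP_TABLE_SIZE; i++) {
		if (!qpn_in_use[i]) {
			qpn_in_use[i] = 1;
			return i;
		}
	}
	return -1;
}

/* Pretend kernel call: the number becomes reusable right away. */
static void kernel_destroy_qp(int qpn)
{
	qpn_in_use[qpn] = 0;
}

/* Create and store under one lock, so no destroy can interleave. */
int create_qp(void *qp)
{
	int qpn;

	pthread_mutex_lock(&qp_table_mutex);
	qpn = kernel_create_qp();
	if (qpn >= 0) {
		/* The slot must be empty: a destroy always clears it first. */
		assert(qp_table[qpn] == NULL);
		qp_table[qpn] = qp;
	}
	pthread_mutex_unlock(&qp_table_mutex);
	return qpn;
}

/* Destroy and clear under the same lock, closing the race window. */
void destroy_qp(int qpn)
{
	pthread_mutex_lock(&qp_table_mutex);
	kernel_destroy_qp(qpn);
	qp_table[qpn] = NULL;
	pthread_mutex_unlock(&qp_table_mutex);
}

void *lookup_qp(int qpn)
{
	void *qp = NULL;

	pthread_mutex_lock(&qp_table_mutex);
	if (qpn >= 0 && qpn < QP_TABLE_SIZE)
		qp = qp_table[qpn];
	pthread_mutex_unlock(&qp_table_mutex);
	return qp;
}
```

The cost of this design is that QP creation and destruction within one
process are serialized, which seems acceptable for operations that are
not on the data path.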


From vlad at lists.openfabrics.org  Thu Nov 20 03:25:44 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 20 Nov 2008 03:25:44 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081120-0200 daily build status
Message-ID: <20081120112545.0F20CE60DF1@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.27
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From kliteyn at dev.mellanox.co.il  Thu Nov 20 03:58:27 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 20 Nov 2008 13:58:27 +0200
Subject: [ofa-general] [PATCH] opensm/osm_switch.h: use updated LFT for
	routing
Message-ID: <492550E3.90805@dev.mellanox.co.il>

Hi Sasha,

Function osm_switch_get_port_by_lid() was using the switch's
LFT, which might not yet be updated to the most recent routing.

I think that this was also relevant before the LFT simplification.
One immediate outcome of this bug shows up in the opensm.fdbs file: when it
is dumped from the switch LFT (and not from lft_buf), it sometimes
doesn't match the lst file.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 opensm/include/opensm/osm_switch.h |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index caa0bc5..f06931c 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -411,7 +411,11 @@ osm_switch_get_port_by_lid(IN const osm_switch_t * const p_sw,
 {
 	if (lid_ho == 0 || lid_ho > IB_LID_UCAST_END_HO)
 		return OSM_NO_PATH;
-	return p_sw->lft[lid_ho];
+
+	if (p_sw->lft_buf)
+		return p_sw->lft_buf[lid_ho];
+	else
+		return p_sw->lft[lid_ho];
 }
 /*
 * PARAMETERS
-- 
1.5.1.4


From kliteyn at dev.mellanox.co.il  Thu Nov 20 04:46:11 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 20 Nov 2008 14:46:11 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm: free lft_buf if it matches
	switch's lft
In-Reply-To: <20081031043226.GH16455@sashak.voltaire.com>
References: <4909DAC8.4040602@dev.mellanox.co.il>
	<20081030214519.GN7502@sashak.voltaire.com>
	<490A2C5D.4080309@dev.mellanox.co.il>
	<20081031043226.GH16455@sashak.voltaire.com>
Message-ID: <49255C13.5030503@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 23:51 Thu 30 Oct     , Yevgeny Kliteynik wrote:
>> Sure, why not. That way the memory would be freed faster.
> 
> Patch?
> 
> Sasha
> 

I can do something like the following patch, but I have
some strange feeling that I'm missing something...
Can there be some flow that would cause lft_buf to be
freed while not all the lft blocks were received yet,
and then remaining blocks might change switch->lft
(after the switch->lft_buf was already freed)?
I can't think of any particular example, just a general
concern...

-- Yevgeny

Free lft_buf when newly received lft block
makes switch's lft identical to lft_buf.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
  opensm/include/opensm/osm_switch.h |    7 +++++++
  opensm/opensm/osm_ucast_mgr.c      |    7 -------
  2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index f06931c..af8a50e 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -729,6 +729,13 @@ osm_switch_set_lft_block(IN osm_switch_t * const p_sw,
  		return IB_INVALID_PARAMETER;

  	memcpy(&p_sw->lft[lid_start], p_block, IB_SMP_DATA_SIZE);
+
+	if (p_sw->lft_buf &&
+	    !memcmp(p_sw->lft, p_sw->lft_buf, IB_LID_UCAST_END_HO + 1)) {
+		free(p_sw->lft_buf);
+		p_sw->lft_buf = NULL;
+	}
+
  	return IB_SUCCESS;
  }
  /*
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 175817c..7f1a816 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -399,13 +399,6 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr,
  		goto Exit;
  	}

-	if (!p_sw->need_update &&
-	    !memcmp(p_sw->lft, p_sw->lft_buf, IB_LID_UCAST_END_HO + 1)) {
-		free(p_sw->lft_buf);
-		p_sw->lft_buf = NULL;
-		goto Exit;
-	}
-
  	for (block_id_ho = 0;
  	     osm_switch_get_lft_block(p_sw, block_id_ho, block);
  	     block_id_ho++) {
-- 
1.5.1.4
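
The free-on-match behaviour proposed above, and the concern raised about
it, can be tried in isolation with a small model. This is a sketch under
assumptions: the table is shrunk to a toy size, and struct sw and
set_lft_block() are stand-ins for the OpenSM structures, not the real
ones. Only the memcpy/memcmp/free sequence mirrors the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LFT_SIZE   256	/* toy size; the patch compares IB_LID_UCAST_END_HO + 1 bytes */
#define BLOCK_SIZE 64	/* bytes per block, the role IB_SMP_DATA_SIZE plays in the patch */

/* Hypothetical, reduced view of a switch. */
struct sw {
	uint8_t lft[LFT_SIZE];	/* table as reported by the switch */
	uint8_t *lft_buf;	/* table computed by routing, or NULL */
};

/* Copy one received block into lft and drop lft_buf once both tables agree.
 * Returns 0 on success, -1 if the block lies outside the table. */
static int set_lft_block(struct sw *p_sw, const uint8_t *block, unsigned block_id)
{
	size_t start = (size_t)block_id * BLOCK_SIZE;

	assert(p_sw && block);
	if (start + BLOCK_SIZE > LFT_SIZE)
		return -1;

	memcpy(&p_sw->lft[start], block, BLOCK_SIZE);

	if (p_sw->lft_buf && !memcmp(p_sw->lft, p_sw->lft_buf, LFT_SIZE)) {
		free(p_sw->lft_buf);
		p_sw->lft_buf = NULL;
	}

	return 0;
}
```

In this model, if a later block arrives after lft_buf has been freed, lft
simply changes and there is no longer a computed table to compare it
against. Whether such an ordering can occur in OpenSM is exactly the open
question in the message above; the sketch does not answer it.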


From yevgenyp at mellanox.co.il  Thu Nov 20 06:55:18 2008
From: yevgenyp at mellanox.co.il (Yevgeny Petrilin)
Date: Thu, 20 Nov 2008 16:55:18 +0200
Subject: [ofa-general] mlx4_en: Memory leak on completion queue free.
Message-ID: <49257A56.9030609@mellanox.co.il>

If a port is destroyed without ever having been activated,
CQ resources are not freed.

Signed-off-by: Yevgeny Petrilin <yevgenyp at mellanox.co.il>
---
Hello Jeff,
this is a regression fix for 2.6.28.

 drivers/net/mlx4/en_cq.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/mlx4/en_cq.c b/drivers/net/mlx4/en_cq.c
index 1368a80..1a936f4 100644
--- a/drivers/net/mlx4/en_cq.c
+++ b/drivers/net/mlx4/en_cq.c
@@ -68,6 +68,8 @@ int mlx4_en_create_cq(struct mlx4_en_priv *priv,
 	err = mlx4_en_map_buffer(&cq->wqres.buf);
 	if (err)
 		mlx4_free_hwq_res(mdev->dev, &cq->wqres, cq->buf_size);
+	else
+		cq->buf = (struct mlx4_cqe *) cq->wqres.buf.direct.buf;

 	return err;
 }
@@ -82,7 +84,6 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq)
 	cq->mcq.arm_db     = cq->wqres.db.db + 1;
 	*cq->mcq.set_ci_db = 0;
 	*cq->mcq.arm_db    = 0;
-	cq->buf = (struct mlx4_cqe *) cq->wqres.buf.direct.buf;
 	memset(cq->buf, 0, cq->buf_size);

 	err = mlx4_cq_alloc(mdev->dev, cq->size, &cq->wqres.mtt, &mdev->priv_uar,
-- 
1.5.4


From tziporet at mellanox.co.il  Thu Nov 20 07:21:56 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 20 Nov 2008 17:21:56 +0200
Subject: [ofa-general] OFED 1.4 - delay the GA to Dec 4
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD01006654@mtlexch01.mtl.com>

Hi All,

I have just reviewed the bug status with Vlad.
We have 11 major and critical bugs, and we will not be able to fix all
of them in one week.
Thus I am delaying the GA release to Dec 4 (since we have the Thanksgiving
holiday next week).
I also suggest we create RC6 by the end of next week, since most of
the bugs are assigned to people in Israel and we do not have a vacation
next week.

We will review the release status at the EWG meeting next week.
Bug owners: please reply with a status update and also update the bug report.

Bugs list:
1370  blo  vlad at mellanox.co.il      Ping over IPoIB I/F fails after ifconfig down and up
1242  cri  yannick.cote at qlogic.com  kernel panic while running mpi2007 against ofed1.4 -- ib_...
1198  cri  yosefe at voltaire.com      hang during ipoib create_child/ifdown
1348  maj  amirv at mellanox.co.il     Sdp sockets doesnt closed after programs end
1349  maj  amirv at mellanox.co.il     Kernel panic on sdp
1289  maj  jackm at mellanox.co.il     Ib and ipoib doesnt respond while running multiple tests ...
1389  maj  jackm at mellanox.co.il     poll_cq sometimes fail in a multithreaded test
1401  maj  sashak at voltaire.com      segmentation fault when running opensm -Q
1377  maj  vu at mellanox.com          Deadlock occured during HA test
1380  maj  vu at mellanox.com          Cannot unload ib_srpt module on SRP target
1395  maj  vu at mellanox.com          kernel panic during SRP HA test


Tziporet & Vlad

From vst at vlnb.net  Thu Nov 20 07:24:18 2008
From: vst at vlnb.net (Vladislav Bolkhovitin)
Date: Thu, 20 Nov 2008 18:24:18 +0300
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <49189567.1010804@harr.org>
References: <48E386F6.5040502@fusionio.com>	<48E6498A.3070002@mellanox.com>	<48E65FE0.2060602@harr.org>		<48E67ACC.1020903@harr.org>	<48E695F9.80703@harr.org>		<48E9E681.8090600@vlnb.net>	<48EA2F42.80008@harr.org>	<e2e108260810070233q7dbcd377p16b094ea5a6b74a7@mail.gmail.com>	<48EB8CBC.30303@harr.org>
	<48EB96C5.2060202@vlnb.net>	<48EBA581.4040301@mellanox.com>
	<48EBA72B.4000909@harr.org>	<48EBBDB1.1080203@harr.org>
	<48EBE6B6.4060804@mellanox.com>	<48ECEA4D.7080504@harr.org>
	<48ED3489.4030905@harr.org>	<48F79CF8.3010905@vlnb.net>
	<48FE6C84.7030300@harr.org>	<48FEDA26.4080304@vlnb.net>
	<48FF2D1A.8000101@harr.org>	<48FF5F42.2050902@vlnb.net>
	<48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org>
	<490210EE.2070000@vlnb.net> <49022553.1020804@harr.org>
	<490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org>
	<4911D827.10705@vlnb.net> <49121715.4040804@harr.org>
	<4912C684.5000505@vlnb.net> <491307C7.50008@harr.org>
	<49131A85.2010102@vlnb.net> <49189567.1010804@harr.org>
Message-ID: <49258122.6040808@vlnb.net>

Cameron Harr wrote:
> New results, with markers.
> ----
> type=randwrite  bs=512  drives=1 scst_threads=1 srptthread=1 iops=65612.40
> type=randwrite  bs=4k   drives=1 scst_threads=1 srptthread=1 iops=54934.31
> type=randwrite  bs=512  drives=2 scst_threads=1 srptthread=1 iops=82514.57
> type=randwrite  bs=4k   drives=2 scst_threads=1 srptthread=1 iops=79680.42
> type=randwrite  bs=512  drives=1 scst_threads=2 srptthread=1 iops=60439.73
> type=randwrite  bs=4k   drives=1 scst_threads=2 srptthread=1 iops=51510.68
> type=randwrite  bs=512  drives=2 scst_threads=2 srptthread=1 iops=102735.07
> type=randwrite  bs=4k   drives=2 scst_threads=2 srptthread=1 iops=78558.77
> type=randwrite  bs=512  drives=1 scst_threads=3 srptthread=1 iops=62941.35
> type=randwrite  bs=4k   drives=1 scst_threads=3 srptthread=1 iops=51924.17
> type=randwrite  bs=512  drives=2 scst_threads=3 srptthread=1 iops=120961.39
> type=randwrite  bs=4k   drives=2 scst_threads=3 srptthread=1 iops=75411.52
> type=randwrite  bs=512  drives=1 scst_threads=1 srptthread=0 iops=50891.13
> type=randwrite  bs=4k   drives=1 scst_threads=1 srptthread=0 iops=50199.90
> type=randwrite  bs=512  drives=2 scst_threads=1 srptthread=0 iops=58711.87
> type=randwrite  bs=4k   drives=2 scst_threads=1 srptthread=0 iops=74504.65
> type=randwrite  bs=512  drives=1 scst_threads=2 srptthread=0 iops=61043.73
> type=randwrite  bs=4k   drives=1 scst_threads=2 srptthread=0 iops=49951.89
> type=randwrite  bs=512  drives=2 scst_threads=2 srptthread=0 iops=83195.60
> type=randwrite  bs=4k   drives=2 scst_threads=2 srptthread=0 iops=75224.25
> type=randwrite  bs=512  drives=1 scst_threads=3 srptthread=0 iops=60277.98
> type=randwrite  bs=4k   drives=1 scst_threads=3 srptthread=0 iops=49874.57
> type=randwrite  bs=512  drives=2 scst_threads=3 srptthread=0 iops=84851.43
> type=randwrite  bs=4k   drives=2 scst_threads=3 srptthread=0 iops=73238.46

I think srptthread=0 performs worse in this case because, with it, part 
of the processing is done in SIRQ context, and the scheduler seems to run 
it on the same CPU as fct0-worker, the thread that does the data transfers 
to your SSD device. That thread always consumes about 100% CPU, so the 
SIRQ processing gets less CPU time, hence the lower overall performance.

So, try pinning fctX-worker, the SCST threads and SIRQ processing to 
different CPUs and check again. You can set thread affinity with the utility 
from http://www.kernel.org/pub/linux/kernel/people/rml/cpu-affinity/; 
for IRQ affinity, see Documentation/IRQ-affinity.txt in your kernel tree.

Vlad
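
For reference, pinning a thread or process to one CPU can also be done
directly with the Linux sched_setaffinity() call, which is what affinity
utilities use underneath. The sketch below is an illustration, not part
of SCST or srpt: the helper names pin_to_cpu() and first_allowed_cpu()
are made up for this example, and the choice of CPU is left to the
caller. IRQ affinity is set separately, through the smp_affinity files
described in Documentation/IRQ-affinity.txt.

```c
#define _GNU_SOURCE		/* needed for cpu_set_t and the CPU_* macros */
#include <assert.h>
#include <sched.h>
#include <sys/types.h>

/* Pin the task `pid` (0 means the caller) to a single CPU.
 * Returns 0 on success, -1 on error with errno set. */
static int pin_to_cpu(pid_t pid, int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(pid, sizeof(set), &set);
}

/* Return the lowest-numbered CPU the caller may run on, or -1 on error. */
static int first_allowed_cpu(void)
{
	cpu_set_t set;
	int cpu;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;

	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;

	return -1;
}
```

To separate the workers as suggested above, one would call pin_to_cpu()
with the thread ID of each kernel thread and a different CPU number for
each; this requires sufficient privileges for threads owned by other
users.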


From vst at vlnb.net  Thu Nov 20 07:26:15 2008
From: vst at vlnb.net (Vladislav Bolkhovitin)
Date: Thu, 20 Nov 2008 18:26:15 +0300
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <4910A49B.1050004@harr.org>
References: <48E386F6.5040502@fusionio.com>
	<48E38BAF.5000801@harr.org>		<48E6498A.3070002@mellanox.com>
	<48E65FE0.2060602@harr.org>		<48E67ACC.1020903@harr.org>
	<48E695F9.80703@harr.org>		<48E9E681.8090600@vlnb.net>
	<48EA2F42.80008@harr.org>	<e2e108260810070233q7dbcd377p16b094ea5a6b74a7@mail.gmail.com>
	<48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net>
	<48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org>
	<48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com>
	<48ECEA4D.7080504@harr.org> <48F79CA9.8090806@vlnb.net>
	<49022438.9030903@harr.org> <490B45B0.7030208@vlnb.net>
	<4910A49B.1050004@harr.org>
Message-ID: <49258197.3020904@vlnb.net>

Cameron Harr wrote:
> Vladislav Bolkhovitin wrote:
>> Cameron Harr wrote:
>>> Vladislav Bolkhovitin wrote:
>>>>> ** Sometimes the benchmark "zombied" (process doing no work, but 
>>>>> process can't be killed) after running a certain amount of time. 
>>>>> However, it wasn't repeatable in a reliable way, so I mark that 
>>>>> this particular run has zombied before.
>>>> That means that there is a bug somewhere. Usually such bugs are 
>>>> found in few hours of code auditing (srpt driver is pretty simple) 
>>>> or by using kernel debug facilities (example diff to .config 
>>>> attached). I personally always prefer put my effort on fixing real 
>>>> things, not inventing various workarounds, like srpt_thread in this 
>>>> case.
>>>>
>>>> So I would:
>>>>
>>>>   1. Completely remove srpt thread and all related code. It doesn't do
>>>> anything, which can't be done in SIRQ context (tasklet)
>>>>
>>>>   2. Audit the code to check if it does any action, which it 
>>>> shouldn't do on SIRQ and fix it. This step isn't required, but 
>>>> usually it saves a lot of time of puzzled debugging in the future.
>>>>
>>>>   3. Change in srpt_handle_rdma_comp() and  srpt_handle_new_iu()
>>>> SCST_CONTEXT_THREAD to SCST_CONTEXT_DIRECT_ATOMIC.
> 
> I'm assuming you didn't want me to implement this change this time, correct?

It seems I've already done that in the patch you use ;)


From olga.shern at gmail.com  Thu Nov 20 07:51:22 2008
From: olga.shern at gmail.com (Olga Shern (Voltaire))
Date: Thu, 20 Nov 2008 17:51:22 +0200
Subject: [ofa-general] ***SPAM*** Re: [ewg] OFED 1.4 - delay the GA to Dec 4
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD01006654@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD01006654@mtlexch01.mtl.com>
Message-ID: <bc457d660811200751s53f647bfp422ded6d11491de5@mail.gmail.com>

>
> 1370            blo     vlad at mellanox.co.il     Ping over IPoIB I/F fails
> after ifconfig down and up
>

Yossi has sent a patch that fixes this.

> 1198    cri     yosefe at voltaire.com     hang during ipoib
> create_child/ifdown

We sent a patch to Roland some time ago, but it was decided in the EWG
meeting not to add it to OFED 1.4 because:
 1. Users will rarely run such a test.
 2. This is an old bug that wasn't introduced in OFED 1.4.

If you think this is another bug, we should open a new one.


> 1289    maj     jackm at mellanox.co.il    Ib and ipoib doesnt respond while
> running multiple tests ...
>

It seems that this was already fixed; we only need to retest and
verify that it is indeed fixed.


From vuhuong at mellanox.com  Thu Nov 20 11:06:23 2008
From: vuhuong at mellanox.com (Vu Pham)
Date: Thu, 20 Nov 2008 11:06:23 -0800
Subject: [ofa-general] srp_daemon and partitions.
In-Reply-To: <774A4005-446E-40D1-A70E-DBCBF12219F0@catbus.co.uk>
References: <774A4005-446E-40D1-A70E-DBCBF12219F0@catbus.co.uk>
Message-ID: <4925B52F.3030106@mellanox.com>

Hi James,

It's a bug in srp_daemon and ibsrpdm. We'll try to fix it so that zoning 
through pkeys works.

>
> We wish to protect the storage from unwanted use. In a fibre channel 
> san environment this would be done in two ways, firstly presentation ( 
> configuring the controller as to which luns each WWN can access ) and 
> secondly zoning which is configuring the switches that make the fabric 
> as to which ports can communicate. If we can't do this it would 
> restrict IB to a single use eg as a replacement for fibre switches.
>
Does DDN have management software to set the access control list (configuring 
the controller as to which luns each WWN can access)?
OFED's SRP target / SCST mid-layer can provide this.

-vu


From michael.oevermann at tu-berlin.de  Thu Nov 20 11:41:44 2008
From: michael.oevermann at tu-berlin.de (Michael Oevermann)
Date: Thu, 20 Nov 2008 20:41:44 +0100
Subject: [ofa-general] infiniband problem, no NICs
Message-ID: <4925BD78.4030003@tu-berlin.de>

Hi all,

I have "inherited" a small cluster with a head node and four compute
nodes which I have to administer.  The nodes are connected via infiniband (OFED), but the head is not. 
I am a complete novice to the infiniband stuff and here is my problem:

The infiniband configuration seems to be OK. The usual tests suggested in the OFED install guide give 
the expected output, e.g.


ibv_devinfo on the nodes:


************************* oscar_cluster *************************
--------- n01---------
hca_id: mthca0
fw_ver: 1.2.0
node_guid: 0002:c902:0025:930c
sys_image_guid: 0002:c902:0025:930f
vendor_id: 0x02c9
vendor_part_id: 25204
hw_ver: 0xA0
board_id: MT_03B0140001
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 2
port_lid: 1
port_lmc: 0x00

etc. for the other nodes.

sminfo on the nodes:

************************* oscar_cluster *************************
--------- n01---------
sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6881 priority 0 
state 3 SMINFO_MASTER
--------- n02---------
sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6882 priority 0 
state 3 SMINFO_MASTER
--------- n03---------
sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6883 priority 0 
state 3 SMINFO_MASTER
--------- n04---------
sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6884 priority 0 
state 3 SMINFO_MASTER


However, when I directly start a mpi job (without using a scheduler) via:

/usr/mpi/gcc4/openmpi-1.2.2-1/bin/mpirun -np 4 -hostfile 
/home/sysgen/infiniband-mpi-test/machine /usr/mpi/gcc4/openmpi-1.2.2-1/tests/IMB-2.3/IMB-MPI1

I get the error message:

[0,1,0]: uDAPL on host n01 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,2]: uDAPL on host n01 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,3]: uDAPL on host n02 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,1]: uDAPL on host n02 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------

MPI over normal Gigabit Ethernet and IP networking works fine, but 
InfiniBand doesn't. The MPI libs I am using
for the test are definitely compiled with IB support and the tests have 
been run successfully on
the cluster before.

Any suggestions what is going wrong here?

Best regards and thanks for any help!

Michael


From rdreier at cisco.com  Thu Nov 20 14:50:51 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 20 Nov 2008 14:50:51 -0800
Subject: [ofa-general] Re: Race condition in userspace libraries with
	create/destroy qp
In-Reply-To: <200811201211.46527.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Thu, 20 Nov 2008 12:11:45 +0200")
References: <200811201211.46527.jackm@dev.mellanox.co.il>
Message-ID: <adavduiggfo.fsf@cisco.com>

 > mlx4_create_qp and mlx4_destroy_qp are not atomic WRT each other. If one thread is
 > destroying a QP while another is creating a qp, there is a race hole.  The destroying thread
 > can lose its timeslice after it has deleted the QP from kernel space, but before it has cleared
 > it from userspace store (mlx4_clear_qp).
 > If the other thread creates a qp during this break, it gets the same QP base number and overwrites
 > the destroyed QPs entry with mlx4_store_qp().

Yes, looks like a real bug.

 > 2. Create a mutex for this purpose, and use it to force the create and destroy qp operations
 >    to be atomic WRT  the ibv_cmd_xxx_qp operations and the store/clear qp operations.

This looks like the best solution.

I wonder if we should just add this synchronization in libibverbs rather
than individual drivers?  I notice that libcxgb3 seems to have the same
bug AFAICS.  But maybe it's better to just keep the simple rule that
driver libraries are responsible for locking their own data structures.

 - R.
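
The mutex approach discussed above can be modelled with a toy QP table.
This is a sketch under assumptions and not libmlx4 code: struct ctx,
create_qp() and destroy_qp() are hypothetical names, and the kernel_used[]
array merely stands in for the kernel's QP number allocator. It shows
only the locking rule: the kernel-side step and the userspace table
update happen under one lock, so a concurrent create cannot reuse a QP
number while its old table entry is still present.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define QP_TABLE_SIZE 64	/* toy size, chosen for the example */

struct ctx {
	pthread_mutex_t qp_lock;		/* serializes create and destroy */
	int kernel_used[QP_TABLE_SIZE];		/* stand-in for the kernel QPN allocator */
	void *qp_table[QP_TABLE_SIZE];		/* userspace QPN -> QP lookup table */
};

static void ctx_init(struct ctx *c)
{
	memset(c, 0, sizeof(*c));
	pthread_mutex_init(&c->qp_lock, NULL);
}

/* Allocate a QP number and publish the QP in the table as one atomic step.
 * Returns the QP number, or -1 if no number is free. */
static int create_qp(struct ctx *c, void *qp)
{
	int qpn = -1;
	int i;

	pthread_mutex_lock(&c->qp_lock);
	for (i = 0; i < QP_TABLE_SIZE; i++) {
		if (!c->kernel_used[i]) {
			c->kernel_used[i] = 1;	/* models the kernel create call */
			qpn = i;
			break;
		}
	}
	if (qpn >= 0)
		c->qp_table[qpn] = qp;		/* models the userspace store */
	pthread_mutex_unlock(&c->qp_lock);

	return qpn;
}

/* Release the QP number and clear the table entry as one atomic step. */
static void destroy_qp(struct ctx *c, int qpn)
{
	assert(qpn >= 0 && qpn < QP_TABLE_SIZE);

	pthread_mutex_lock(&c->qp_lock);
	c->kernel_used[qpn] = 0;	/* models the kernel destroy call */
	c->qp_table[qpn] = NULL;	/* models the userspace clear */
	pthread_mutex_unlock(&c->qp_lock);
}
```

Without the lock, a thread preempted between the two statements in
destroy_qp() could later clear an entry that another thread had just
stored for the reused number, which is the race described above.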


From weiny2 at llnl.gov  Thu Nov 20 16:38:09 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 20 Nov 2008 16:38:09 -0800
Subject: [ofa-general] [PATCH 0/3] ibnetdiscover library "libibnetdisc"
Message-ID: <20081120163809.26a3c499.weiny2@llnl.gov>

The following 3 patches implement "libibnetdisc" which provides the
functionality of ibnetdiscover in a C library.

I mentioned this to Sasha at the last Sonoma conference and posted the bulk of
this code to the list a few months ago.  This library is still providing the 85%
performance speed up of iblinkinfo.pl on our clusters.

This new series is heavily tested and, for our hardware, preserves the
functionality of ibnetdiscover.  Since I don't have a Xsigo box to test on I
can only verify that it compiles correctly.

Ira


From weiny2 at llnl.gov  Thu Nov 20 16:38:15 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 20 Nov 2008 16:38:15 -0800
Subject: [ofa-general] [PATCH 3/3] Convert ibnetdiscover to use new ibnetdisc
	library.
Message-ID: <20081120163815.5cd110fb.weiny2@llnl.gov>

>From e2b8bac5d651c2278719d511dee2ab2e8ad05706 Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 20 Nov 2008 09:29:57 -0800
Subject: [PATCH] Convert ibnetdiscover to use new ibnetdisc library.

   Removed -e and -v since they were somewhat redundant with the -d option.

   All other functionality is preserved

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/Makefile.am         |    4 +-
 infiniband-diags/man/ibnetdiscover.8 |   10 +-
 infiniband-diags/src/ibnetdiscover.c |  910 ++++++++++------------------------
 3 files changed, 254 insertions(+), 670 deletions(-)

diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am
index 8f26749..420c69e 100644
--- a/infiniband-diags/Makefile.am
+++ b/infiniband-diags/Makefile.am
@@ -35,9 +35,9 @@ sbin_SCRIPTS = scripts/ibcheckerrs scripts/ibchecknet scripts/ibchecknode \
 src_ibaddr_SOURCES = src/ibaddr.c src/ibdiag_common.c
 src_ibaddr_CFLAGS = -Wall $(DBGFLAGS)
 
-src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c src/ibdiag_common.c
+src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/ibdiag_common.c
 src_ibnetdiscover_CFLAGS = -Wall $(DBGFLAGS)
-src_ibnetdiscover_LDFLAGS = -Wl,--rpath -Wl,$(libdir)
+src_ibnetdiscover_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -libnetdisc
 
 src_iblinkinfo_pl_SOURCES = src/iblinkinfo.c
 src_iblinkinfo_pl_CFLAGS = -Wall $(DBGFLAGS)
diff --git a/infiniband-diags/man/ibnetdiscover.8 b/infiniband-diags/man/ibnetdiscover.8
index 958efa9..768d392 100644
--- a/infiniband-diags/man/ibnetdiscover.8
+++ b/infiniband-diags/man/ibnetdiscover.8
@@ -5,7 +5,7 @@ ibnetdiscover \- discover InfiniBand topology
 
 .SH SYNOPSIS
 .B ibnetdiscover
-[\-d(ebug)] [\-e(rr_show)] [\-v(erbose)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map <node-name-map>] [\-p(orts)] [\-h(elp)] [<topology-file>]
+[\-d(ebug)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map <node-name-map>] [\-p(orts)] [\-h(elp)] [<topology-file>]
 
 .SH DESCRIPTION
 .PP
@@ -37,7 +37,7 @@ List of connected switches
 List of connected routers
 .TP
 \fB\-s\fR, \fB\-\-show\fR
-Show more information
+Show progress information during discovery.
 .TP
 \fB\-\-node\-name\-map\fR <node-name-map>
 Specify a node name map.  The node name map file maps GUIDs to more user friendly
@@ -57,15 +57,9 @@ using the util_name -h syntax.
 # Debugging flags
 .PP
 \-d      raise the IB debugging level.
-        May be used several times (-ddd or -d -d -d).
-.PP
-\-e      show send and receive errors (timeouts and others)
 .PP
 \-h      show the usage message
 .PP
-\-v      increase the application verbosity level.
-        May be used several times (-vv or -v -v -v)
-.PP
 \-V      show the version info.
 
 # Other common flags:
diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 2cfaa8a..d8ead48 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2004-2008 Voltaire Inc.  All rights reserved.
  * Copyright (c) 2007 Xsigo Systems Inc.  All rights reserved.
+ * Copyright (c) 2008 Lawrence Livermore National Lab.  All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -47,483 +48,108 @@
 #include <errno.h>
 #include <inttypes.h>
 
-#include <infiniband/common.h>
-#include <infiniband/umad.h>
-#include <infiniband/mad.h>
 #include <infiniband/complib/cl_nodenamemap.h>
+#include <infiniband/ibnetdisc.h>
+#include <infiniband/common.h>
 
-#include "ibnetdiscover.h"
-#include "grouping.h"
 #include "ibdiag_common.h"
 
-static char *node_type_str[] = {
-	"???",
-	"ca",
-	"switch",
-	"router",
-	"iwarp rnic"
-};
-
-static char *linkwidth_str[] = {
-	"??",
-	"1x",
-	"4x",
-	"??",
-	"8x",
-	"??",
-	"??",
-	"??",
-	"12x"
-};
-
-static char *linkspeed_str[] = {
-	"???",
-	"SDR",
-	"DDR",
-	"???",
-	"QDR"
-};
-
-static int timeout = 2000;		/* ms */
-static int dumplevel = 0;
+static int debug;
 static int verbose;
-static FILE *f;
+#define LIST_CA_NODE	 (1 << IBND_CA_NODE)
+#define LIST_SWITCH_NODE (1 << IBND_SWITCH_NODE)
+#define LIST_ROUTER_NODE (1 << IBND_ROUTER_NODE)
 
 char *argv0 = "ibnetdiscover";
+static FILE *f;
 
 static char *node_name_map_file = NULL;
 static nn_map_t *node_name_map = NULL;
 
-Node *nodesdist[MAXHOPS+1];     /* last is Ca list */
-Node *mynode;
-int maxhops_discovered = 0;
-
-struct ChassisList *chassis = NULL;
-
-static char *
-get_linkwidth_str(int linkwidth)
-{
-	if (linkwidth > 8)
-		return linkwidth_str[0];
-	else
-		return linkwidth_str[linkwidth];
-}
-
-static char *
-get_linkspeed_str(int linkspeed)
-{
-	if (linkspeed > 4)
-		return linkspeed_str[0];
-	else
-		return linkspeed_str[linkspeed];
-}
-
-static inline const char*
-node_type_str2(Node *node)
-{
-	switch(node->type) {
-	case SWITCH_NODE: return "SW";
-	case CA_NODE:     return "CA";
-	case ROUTER_NODE: return "RT";
-	}
-	return "??";
-}
-
-void
-decode_port_info(void *pi, Port *port)
-{
-	mad_decode_field(pi, IB_PORT_LID_F, &port->lid);
-	mad_decode_field(pi, IB_PORT_LMC_F, &port->lmc);
-	mad_decode_field(pi, IB_PORT_STATE_F, &port->state);
-	mad_decode_field(pi, IB_PORT_PHYS_STATE_F, &port->physstate);
-	mad_decode_field(pi, IB_PORT_LINK_WIDTH_ACTIVE_F, &port->linkwidth);
-	mad_decode_field(pi, IB_PORT_LINK_SPEED_ACTIVE_F, &port->linkspeed);
-}
-
-
-int
-get_port(Port *port, int portnum, ib_portid_t *portid)
-{
-	char portinfo[64];
-	void *pi = portinfo;
-
-	port->portnum = portnum;
-
-	if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout))
-		return -1;
-	decode_port_info(pi, port);
-
-	DEBUG("portid %s portnum %d: lid %d state %d physstate %d %s %s",
-		portid2str(portid), portnum, port->lid, port->state, port->physstate, get_linkwidth_str(port->linkwidth), get_linkspeed_str(port->linkspeed));
-	return 1;
-}
-/*
- * Returns 0 if non switch node is found, 1 if switch is found, -1 if error.
- */
-int
-get_node(Node *node, Port *port, ib_portid_t *portid)
-{
-	char portinfo[64];
-	char switchinfo[64];
-	void *pi = portinfo, *ni = node->nodeinfo, *nd = node->nodedesc;
-	void *si = switchinfo;
-
-	if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, timeout))
-		return -1;
-
-	mad_decode_field(ni, IB_NODE_GUID_F, &node->nodeguid);
-	mad_decode_field(ni, IB_NODE_TYPE_F, &node->type);
-	mad_decode_field(ni, IB_NODE_NPORTS_F, &node->numports);
-	mad_decode_field(ni, IB_NODE_DEVID_F, &node->devid);
-	mad_decode_field(ni, IB_NODE_VENDORID_F, &node->vendid);
-	mad_decode_field(ni, IB_NODE_SYSTEM_GUID_F, &node->sysimgguid);
-	mad_decode_field(ni, IB_NODE_PORT_GUID_F, &node->portguid);
-	mad_decode_field(ni, IB_NODE_LOCAL_PORT_F, &node->localport);
-	port->portnum = node->localport;
-	port->portguid = node->portguid;
-
-	if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, timeout))
-		return -1;
-
-	if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, 0, timeout))
-		return -1;
-	decode_port_info(pi, port);
-
-	if (node->type != SWITCH_NODE)
-		return 0;
-
-	node->smalid = port->lid;
-	node->smalmc = port->lmc;
-
-	/* after we have the sma information find out the real PortInfo for this port */
-	if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, node->localport, timeout))
-	        return -1;
-	decode_port_info(pi, port);
-
-        if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout))
-                node->smaenhsp0 = 0;	/* assume base SP0 */
-	else
-        	mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0);
-
-	DEBUG("portid %s: got switch node %" PRIx64 " '%s'",
-	      portid2str(portid), node->nodeguid, node->nodedesc);
-	return 1;
-}
-
-static int
-extend_dpath(ib_dr_path_t *path, int nextport)
-{
-	if (path->cnt+2 >= sizeof(path->p))
-		return -1;
-	++path->cnt;
-	if (path->cnt > maxhops_discovered)
-		maxhops_discovered = path->cnt;
-	path->p[path->cnt] = nextport;
-	return path->cnt;
-}
-
-static void
-dump_endnode(ib_portid_t *path, char *prompt, Node *node, Port *port)
-{
-	if (!dumplevel)
-		return;
-
-	fprintf(f, "%s -> %s %s {%016" PRIx64 "} portnum %d lid %d-%d\"%s\"\n",
-		portid2str(path), prompt,
-		(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
-		node->nodeguid, node->type == SWITCH_NODE ? 0 : port->portnum,
-		port->lid, port->lid + (1 << port->lmc) - 1,
-		clean_nodedesc(node->nodedesc));
-}
-
-#define HASHGUID(guid)		((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
-#define HTSZ 137
-
-static Node *nodestbl[HTSZ];
-
-static Node *
-find_node(Node *new)
-{
-	int hash = HASHGUID(new->nodeguid) % HTSZ;
-	Node *node;
-
-	for (node = nodestbl[hash]; node; node = node->htnext)
-		if (node->nodeguid == new->nodeguid)
-			return node;
-
-	return NULL;
-}
-
-static Node *
-create_node(Node *temp, ib_portid_t *path, int dist)
-{
-	Node *node;
-	int hash = HASHGUID(temp->nodeguid) % HTSZ;
-
-	node = malloc(sizeof(*node));
-	if (!node)
-		return NULL;
-
-	memcpy(node, temp, sizeof(*node));
-	node->dist = dist;
-	node->path = *path;
-
-	node->htnext = nodestbl[hash];
-	nodestbl[hash] = node;
-
-	if (node->type != SWITCH_NODE)
-		dist = MAXHOPS; 	/* special Ca list */
-
-	node->dnext = nodesdist[dist];
-	nodesdist[dist] = node;
-
-	return node;
-}
-
-static Port *
-find_port(Node *node, Port *port)
-{
-	Port *old;
-
-	for (old = node->ports; old; old = old->next)
-		if (old->portnum == port->portnum)
-			return old;
-
-	return NULL;
-}
-
-static Port *
-create_port(Node *node, Port *temp)
-{
-	Port *port;
-
-	port = malloc(sizeof(*port));
-	if (!port)
-		return NULL;
-
-	memcpy(port, temp, sizeof(*port));
-	port->node = node;
-	port->next = node->ports;
-	node->ports = port;
-
-	return port;
-}
-
-static void
-link_ports(Node *node, Port *port, Node *remotenode, Port *remoteport)
-{
-	DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64 " %p->%p:%u",
-		node->nodeguid, node, port, port->portnum,
-		remotenode->nodeguid, remotenode, remoteport, remoteport->portnum);
-	if (port->remoteport)
-		port->remoteport->remoteport = NULL;
-	if (remoteport->remoteport)
-		remoteport->remoteport->remoteport = NULL;
-	port->remoteport = remoteport;
-	remoteport->remoteport = port;
-}
-
-static int
-handle_port(Node *node, Port *port, ib_portid_t *path, int portnum, int dist)
-{
-	Node node_buf;
-	Port port_buf;
-	Node *remotenode, *oldnode;
-	Port *remoteport, *oldport;
-
-	memset(&node_buf, 0, sizeof(node_buf));
-	memset(&port_buf, 0, sizeof(port_buf));
-
-	DEBUG("handle node %p port %p:%d dist %d", node, port, portnum, dist);
-	if (port->physstate != 5)	/* LinkUp */
-		return -1;
-
-	if (extend_dpath(&path->drpath, portnum) < 0)
-		return -1;
-
-	if (get_node(&node_buf, &port_buf, path) < 0) {
-		IBWARN("NodeInfo on %s failed, skipping port",
-			portid2str(path));
-		path->drpath.cnt--;	/* restore path */
-		return -1;
-	}
-
-	oldnode = find_node(&node_buf);
-	if (oldnode)
-		remotenode = oldnode;
-	else if (!(remotenode = create_node(&node_buf, path, dist + 1)))
-		IBERROR("no memory");
-
-	oldport = find_port(remotenode, &port_buf);
-	if (oldport) {
-		remoteport = oldport;
-		if (node != remotenode || port != remoteport)
-			IBWARN("port moving...");
-	} else if (!(remoteport = create_port(remotenode, &port_buf)))
-		IBERROR("no memory");
-
-	dump_endnode(path, oldnode ? "known remote" : "new remote",
-		     remotenode, remoteport);
-
-	link_ports(node, port, remotenode, remoteport);
-
-	path->drpath.cnt--;	/* restore path */
-	return 0;
-}
-
-/*
- * Return 1 if found, 0 if not, -1 on errors.
- */
-static int
-discover(ib_portid_t *from)
-{
-	Node node_buf;
-	Port port_buf;
-	Node *node;
-	Port *port;
-	int i;
-	int dist = 0;
-	ib_portid_t *path;
-
-	DEBUG("from %s", portid2str(from));
-
-	memset(&node_buf, 0, sizeof(node_buf));
-	memset(&port_buf, 0, sizeof(port_buf));
-
-	if (get_node(&node_buf, &port_buf, from) < 0) {
-		IBWARN("can't reach node %s", portid2str(from));
-		return -1;
-	}
-
-	node = create_node(&node_buf, from, 0);
-	if (!node)
-		IBERROR("out of memory");
-
-	mynode = node;
-
-	port = create_port(node, &port_buf);
-	if (!port)
-		IBERROR("out of memory");
-
-	if (node->type != SWITCH_NODE &&
-	    handle_port(node, port, from, node->localport, 0) < 0)
-		return 0;
-
-	for (dist = 0; dist < MAXHOPS; dist++) {
-
-		for (node = nodesdist[dist]; node; node = node->dnext) {
-
-			path = &node->path;
-
-			DEBUG("dist %d node %p", dist, node);
-			dump_endnode(path, "processing", node, port);
-
-			for (i = 1; i <= node->numports; i++) {
-				if (i == node->localport)
-					continue;
-
-				if (get_port(&port_buf, i, path) < 0) {
-					IBWARN("can't reach node %s port %d", portid2str(path), i);
-					continue;
-				}
-
-				port = find_port(node, &port_buf);
-				if (port)
-					continue;
-
-				port = create_port(node, &port_buf);
-				if (!port)
-					IBERROR("out of memory");
-
-				/* If switch, set port GUID to node GUID */
-				if (node->type == SWITCH_NODE)
-					port->portguid = node->portguid;
-
-				handle_port(node, port, path, i, dist);
-			}
-		}
-	}
+static int timeout_ms = 2000;
+static int dumplevel = 0;
 
-	return 0;
-}
 
 char *
-node_name(Node *node)
+node_name(ibnd_node_t *node)
 {
 	static char buf[256];
 
-	switch(node->type) {
-	case SWITCH_NODE:
-		sprintf(buf, "\"%s", "S");
-		break;
-	case CA_NODE:
+	switch(node->info.type) {
+	case IBND_CA_NODE:
 		sprintf(buf, "\"%s", "H");
 		break;
-	case ROUTER_NODE:
+	case IBND_SWITCH_NODE:
+		sprintf(buf, "\"%s", "S");
+		break;
+	case IBND_ROUTER_NODE:
 		sprintf(buf, "\"%s", "R");
 		break;
 	default:
 		sprintf(buf, "\"%s", "?");
 		break;
 	}
-	sprintf(buf+2, "-%016" PRIx64 "\"", node->nodeguid);
+	sprintf(buf+2, "-%016" PRIx64 "\"", node->info.nodeguid);
 
 	return buf;
 }
 
 void
-list_node(Node *node)
+list_node(ibnd_node_t *node, void *user_data)
 {
-	char *node_type;
-	char *nodename = remap_node_name(node_name_map, node->nodeguid,
+	char *nodename = remap_node_name(node_name_map, node->info.nodeguid,
 					      node->nodedesc);
 
-	switch(node->type) {
-	case SWITCH_NODE:
-		node_type = "Switch";
-		break;
-	case CA_NODE:
-		node_type = "Ca";
-		break;
-	case ROUTER_NODE:
-		node_type = "Router";
-		break;
-	default:
-		node_type = "???";
-		break;
-	}
 	fprintf(f, "%s\t : 0x%016" PRIx64 " ports %d devid 0x%x vendid 0x%x \"%s\"\n",
-		node_type,
-		node->nodeguid, node->numports, node->devid, node->vendid,
+		ibnd_node_type_str(node),
+		node->info.nodeguid, node->info.numports, node->info.devid,
+		node->info.vendid,
 		nodename);
 
 	free(nodename);
 }
 
 void
-out_ids(Node *node, int group, char *chname)
+list_nodes(ibnd_fabric_t *fabric, int list)
+{
+	if (list & LIST_CA_NODE) {
+		ibnd_iter_nodes_type(fabric, list_node, IBND_CA_NODE, NULL);
+	}
+	if (list & LIST_SWITCH_NODE) {
+		ibnd_iter_nodes_type(fabric, list_node, IBND_SWITCH_NODE, NULL);
+	}
+	if (list & LIST_ROUTER_NODE) {
+		ibnd_iter_nodes_type(fabric, list_node, IBND_ROUTER_NODE, NULL);
+	}
+}
+
+void
+out_ids(ibnd_node_t *node, int group, char *chname)
 {
-	fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid);
-	if (node->sysimgguid)
-		fprintf(f, "sysimgguid=0x%" PRIx64, node->sysimgguid);
+	fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->info.vendid, node->info.devid);
+	if (node->info.sysimgguid)
+		fprintf(f, "sysimgguid=0x%" PRIx64, node->info.sysimgguid);
 	if (group
 	    && node->chrecord && node->chrecord->chassisnum) {
 		fprintf(f, "\t\t# Chassis %d", node->chrecord->chassisnum);
 		if (chname)
-			fprintf(f, " (%s)", chname);
-		if (is_xsigo_tca(node->nodeguid) && node->ports->remoteport)
-			fprintf(f, " slot %d", node->ports->remoteport->portnum);
+			fprintf(f, " (%s)", clean_nodedesc(chname));
+		if (ibnd_is_xsigo_tca(node->info.nodeguid)
+				&& node->ports[1]
+				&& node->ports[1]->remoteport)
+			fprintf(f, " slot %d", node->ports[1]->remoteport->portnum);
 	}
 	fprintf(f, "\n");
 }
 
+
 uint64_t
-out_chassis(int chassisnum)
+out_chassis(ibnd_fabric_t *fabric, int chassisnum)
 {
 	uint64_t guid;
 
 	fprintf(f, "\nChassis %d", chassisnum);
-	guid = get_chassis_guid(chassisnum);
+	guid = ibnd_get_chassis_guid(fabric, chassisnum);
 	if (guid)
 		fprintf(f, " (guid 0x%" PRIx64 ")", guid);
 	fprintf(f, "\n");
@@ -531,54 +157,49 @@ out_chassis(int chassisnum)
 }
 
 void
-out_switch(Node *node, int group, char *chname)
+out_switch(ibnd_node_t *node, int group, char *chname)
 {
 	char *str;
+	char  str2[256];
 	char *nodename = NULL;
 
 	out_ids(node, group, chname);
-	fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid);
-	fprintf(f, "(%" PRIx64 ")", node->portguid);
-	/* Currently, only if Voltaire chassis */
-	if (group
-	    && node->chrecord && node->chrecord->chassisnum
-	    && node->vendid == VTR_VENDOR_ID) {
-		str = get_chassis_type(node->chrecord->chassistype);
+	fprintf(f, "switchguid=0x%" PRIx64, node->info.nodeguid);
+	fprintf(f, "(%" PRIx64 ")", node->info.nodeportguid);
+	if (group) {
+		str = ibnd_get_chassis_type(node);
 		if (str)
 			fprintf(f, "%s ", str);
-		str = get_chassis_slot(node->chrecord->chassisslot);
+		str = ibnd_get_chassis_slot_str(node, str2, 256);
 		if (str)
-			fprintf(f, "%s ", str);
-		fprintf(f, "%d Chip %d", node->chrecord->slotnum, node->chrecord->anafanum);
+			fprintf(f, "%s", str);
 	}
 
-	nodename = remap_node_name(node_name_map, node->nodeguid,
+	nodename = remap_node_name(node_name_map, node->info.nodeguid,
 				node->nodedesc);
 
 	fprintf(f, "\nSwitch\t%d %s\t\t# \"%s\" %s port 0 lid %d lmc %d\n",
-		node->numports, node_name(node),
+		node->info.numports, node_name(node),
 		nodename,
-		node->smaenhsp0 ? "enhanced" : "base",
+		node->sw_info.smaenhsp0 ? "enhanced" : "base",
 		node->smalid, node->smalmc);
 
 	free(nodename);
 }
 
 void
-out_ca(Node *node, int group, char *chname)
+out_ca(ibnd_node_t *node, int group, char *chname)
 {
 	char *node_type;
 	char *node_type2;
-	char *nodename = remap_node_name(node_name_map, node->nodeguid,
-					      node->nodedesc);
 
 	out_ids(node, group, chname);
-	switch(node->type) {
-	case CA_NODE:
+	switch(node->info.type) {
+	case IBND_CA_NODE:
 		node_type = "ca";
 		node_type2 = "Ca";
 		break;
-	case ROUTER_NODE:
+	case IBND_ROUTER_NODE:
 		node_type = "rt";
 		node_type2 = "Rt";
 		break;
@@ -588,37 +209,37 @@ out_ca(Node *node, int group, char *chname)
 		break;
 	}
 
-	fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->nodeguid);
+	fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->info.nodeguid);
 	fprintf(f, "%s\t%d %s\t\t# \"%s\"",
-		node_type2, node->numports, node_name(node),
-		nodename);
-	if (group && is_xsigo_hca(node->nodeguid))
+		node_type2, node->info.numports, node_name(node),
+		clean_nodedesc(node->nodedesc));
+	if (group && ibnd_is_xsigo_hca(node->info.nodeguid))
 		fprintf(f, " (scp)");
 	fprintf(f, "\n");
-
-	free(nodename);
 }
 
+#define OUT_BUFFER_SIZE 16
 static char *
-out_ext_port(Port *port, int group)
+out_ext_port(ibnd_port_t *port, int group)
 {
-	char *str = NULL;
+	static char mapping[OUT_BUFFER_SIZE];
 
-	/* Currently, only if Voltaire chassis */
-	if (group
-	    && port->node->chrecord && port->node->vendid == VTR_VENDOR_ID)
-		str = portmapstring(port);
+	if (group && port->ext_portnum != 0) {
+		snprintf(mapping, OUT_BUFFER_SIZE,
+			"[ext %d]", port->ext_portnum);
+		return (mapping);
+	}
 
-	return (str);
+	return (NULL);
 }
 
 void
-out_switch_port(Port *port, int group)
+out_switch_port(ibnd_port_t *port, int group)
 {
 	char *ext_port_str = NULL;
 	char *rem_nodename = NULL;
 
-	DEBUG("port %p:%d remoteport %p", port, port->portnum, port->remoteport);
+	DEBUG("port %p:%d remoteport %p\n", port, port->portnum, port->remoteport);
 	fprintf(f, "[%d]", port->portnum);
 
 	ext_port_str = out_ext_port(port, group);
@@ -626,7 +247,7 @@ out_switch_port(Port *port, int group)
 		fprintf(f, "%s", ext_port_str);
 
 	rem_nodename = remap_node_name(node_name_map,
-				port->remoteport->node->nodeguid,
+				port->remoteport->node->info.nodeguid,
 				port->remoteport->node->nodedesc);
 
 	ext_port_str = out_ext_port(port->remoteport, group);
@@ -634,17 +255,17 @@ out_switch_port(Port *port, int group)
 		node_name(port->remoteport->node),
 		port->remoteport->portnum,
 		ext_port_str ? ext_port_str : "");
-	if (port->remoteport->node->type != SWITCH_NODE)
-		fprintf(f, "(%" PRIx64 ") ", port->remoteport->portguid);
+	if (port->remoteport->node->info.type != IBND_SWITCH_NODE)
+		fprintf(f, "(%" PRIx64 ") ", port->remoteport->guid);
 	fprintf(f, "\t\t# \"%s\" lid %d %s%s",
 		rem_nodename,
-		port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid,
-		get_linkwidth_str(port->linkwidth),
-		get_linkspeed_str(port->linkspeed));
+		port->remoteport->node->info.type == IBND_SWITCH_NODE ?  port->remoteport->node->smalid : port->remoteport->info.lid,
+		ibnd_linkwidth_str(port->info.link_width_active),
+		ibnd_linkspeed_str(port->info.link_speed_active));
 
-	if (is_xsigo_tca(port->remoteport->portguid))
+	if (ibnd_is_xsigo_tca(port->remoteport->guid))
 		fprintf(f, " slot %d", port->portnum);
-	else if (is_xsigo_hca(port->remoteport->portguid))
+	else if (ibnd_is_xsigo_hca(port->remoteport->guid))
 		fprintf(f, " (scp)");
 	fprintf(f, "\n");
 
@@ -652,68 +273,80 @@ out_switch_port(Port *port, int group)
 }
 
 void
-out_ca_port(Port *port, int group)
+out_ca_port(ibnd_port_t *port, int group)
 {
 	char *str = NULL;
 	char *rem_nodename = NULL;
 
 	fprintf(f, "[%d]", port->portnum);
-	if (port->node->type != SWITCH_NODE)
-		fprintf(f, "(%" PRIx64 ") ", port->portguid);
+	if (port->node->info.type != IBND_SWITCH_NODE)
+		fprintf(f, "(%" PRIx64 ") ", port->guid);
 	fprintf(f, "\t%s[%d]",
 		node_name(port->remoteport->node),
 		port->remoteport->portnum);
 	str = out_ext_port(port->remoteport, group);
 	if (str)
 		fprintf(f, "%s", str);
-	if (port->remoteport->node->type != SWITCH_NODE)
-		fprintf(f, " (%" PRIx64 ") ", port->remoteport->portguid);
+	if (port->remoteport->node->info.type != IBND_SWITCH_NODE)
+		fprintf(f, " (%" PRIx64 ") ", port->remoteport->guid);
 
 	rem_nodename = remap_node_name(node_name_map,
-				port->remoteport->node->nodeguid,
+				port->remoteport->node->info.nodeguid,
 				port->remoteport->node->nodedesc);
 
 	fprintf(f, "\t\t# lid %d lmc %d \"%s\" lid %d %s%s\n",
-		port->lid, port->lmc, rem_nodename,
-		port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid,
-		get_linkwidth_str(port->linkwidth),
-		get_linkspeed_str(port->linkspeed));
+		port->info.lid, port->info.lmc, rem_nodename,
+		port->remoteport->node->info.type == IBND_SWITCH_NODE ?  port->remoteport->node->smalid : port->remoteport->info.lid,
+		ibnd_linkwidth_str(port->info.link_width_active),
+		ibnd_linkspeed_str(port->info.link_speed_active));
 
 	free(rem_nodename);
 }
 
 int
-dump_topology(int listtype, int group)
+dump_topology(int group, ibnd_fabric_t *fabric)
 {
-	Node *node;
-	Port *port;
-	int i = 0, dist = 0;
+	ibnd_node_t *node;
+	ibnd_port_t *port;
+	int i = 0, dist = 0, p = 0;
 	time_t t = time(0);
 	uint64_t chguid;
 	char *chname = NULL;
 
-	if (!listtype) {
-		fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t));
-		fprintf(f, "# Max of %d hops discovered\n", maxhops_discovered);
-		fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", mynode->nodeguid, mynode->portguid);
-	}
+	fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t));
+	fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered);
+	fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n",
+		fabric->from_node->info.nodeguid, fabric->from_node->info.nodeportguid);
 
 	/* Make pass on switches */
-	if (group && !listtype) {
-		ChassisList *ch = NULL;
+	if (group) {
+		ibnd_chassis_list_t *ch = NULL;
 
 		/* Chassis based switches first */
-		for (ch = chassis; ch; ch = ch->next) {
+		for (ch = fabric->chassis; ch; ch = ch->next) {
 			int n = 0;
 
 			if (!ch->chassisnum)
 				continue;
-			chguid = out_chassis(ch->chassisnum);
-			if (chname)
-				free(chname);
+			chguid = out_chassis(fabric, ch->chassisnum);
+
 			chname = NULL;
-			if (is_xsigo_guid(chguid)) {
-				for (node = nodesdist[MAXHOPS]; node; node = node->dnext) {
+/**
+ * Hal will this work for Xsigo?
+ */
+			if (ibnd_is_xsigo_guid(chguid)) {
+				for (node = ch->nodes; node; node = node->chassis_next) {
+					if (ibnd_is_xsigo_hca(node->info.nodeguid)) {
+						chname = node->nodedesc;
+						fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc));
+					}
+				}
+
+#if 0
+/**
+ * vs. this?
+ */
+				for (node = fabric->nodesdist[MAXHOPS]; node; node = node->dnext) {
 					if (!node->chrecord ||
 					    !node->chrecord->chassisnum)
 						continue;
@@ -721,209 +354,171 @@ dump_topology(int listtype, int group)
 					if (node->chrecord->chassisnum != ch->chassisnum)
 						continue;
 
-					if (is_xsigo_hca(node->nodeguid)) {
-						chname = remap_node_name(node_name_map,
-								node->nodeguid,
-								node->nodedesc);
-						fprintf(f, "Hostname: %s\n", chname);
+					if (ibnd_is_xsigo_hca(node->nodeguid)) {
+						chname = node->nodedesc;
+						fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc));
 					}
 				}
+#endif
 			}
 
 			fprintf(f, "\n# Spine Nodes");
-			for (n = 1; n <= (SPINES_MAX_NUM+1); n++) {
+			for (n = 1; n <= SPINES_MAX_NUM; n++) {
 				if (ch->spinenode[n]) {
 					out_switch(ch->spinenode[n], group, chname);
-					for (port = ch->spinenode[n]->ports; port; port = port->next, i++)
-						if (port->remoteport)
+					for (p = 1; p <= ch->spinenode[n]->info.numports; p++) {
+						port = ch->spinenode[n]->ports[p];
+						if (port && port->remoteport)
 							out_switch_port(port, group);
+					}
 				}
 			}
 			fprintf(f, "\n# Line Nodes");
-			for (n = 1; n <= (LINES_MAX_NUM+1); n++) {
+			for (n = 1; n <= LINES_MAX_NUM; n++) {
 				if (ch->linenode[n]) {
 					out_switch(ch->linenode[n], group, chname);
-					for (port = ch->linenode[n]->ports; port; port = port->next, i++)
-						if (port->remoteport)
+					for (p = 1; p <= ch->linenode[n]->info.numports; p++) {
+						port = ch->linenode[n]->ports[p];
+						if (port && port->remoteport)
 							out_switch_port(port, group);
+					}
 				}
 			}
 
 			fprintf(f, "\n# Chassis Switches");
-			for (dist = 0; dist <= maxhops_discovered; dist++) {
-
-				for (node = nodesdist[dist]; node; node = node->dnext) {
-
-					/* Non Voltaire chassis */
-					if (node->vendid == VTR_VENDOR_ID)
-						continue;
-					if (!node->chrecord ||
-					    !node->chrecord->chassisnum)
-						continue;
-
-					if (node->chrecord->chassisnum != ch->chassisnum)
-						continue;
-
+			for (node = ch->nodes; node; node = node->chassis_next) {
+				if (node->info.type == IBND_SWITCH_NODE) {
 					out_switch(node, group, chname);
-					for (port = node->ports; port; port = port->next, i++)
-						if (port->remoteport)
+					for (p = 1; p <= node->info.numports; p++) {
+						port = node->ports[p];
+						if (port && port->remoteport)
 							out_switch_port(port, group);
-
+					}
 				}
-
 			}
 
 			fprintf(f, "\n# Chassis CAs");
-			for (node = nodesdist[MAXHOPS]; node; node = node->dnext) {
-				if (!node->chrecord ||
-				    !node->chrecord->chassisnum)
-					continue;
-
-				if (node->chrecord->chassisnum != ch->chassisnum)
-					continue;
-
-				out_ca(node, group, chname);
-				for (port = node->ports; port; port = port->next, i++)
-					if (port->remoteport)
-						out_ca_port(port, group);
-
+			for (node = ch->nodes; node; node = node->chassis_next) {
+				if (node->info.type == IBND_CA_NODE) {
+					out_ca(node, group, chname);
+					for (p = 1; p <= node->info.numports; p++) {
+						port = node->ports[p];
+						if (port && port->remoteport)
+							out_ca_port(port, group);
+					}
+				}
 			}
 
 		}
 
-	} else {
-		for (dist = 0; dist <= maxhops_discovered; dist++) {
-
-			for (node = nodesdist[dist]; node; node = node->dnext) {
-
-				DEBUG("SWITCH: dist %d node %p", dist, node);
-				if (!listtype)
-					out_switch(node, group, chname);
-				else {
-					if (listtype & LIST_SWITCH_NODE)
-						list_node(node);
-					continue;
-				}
-
-				for (port = node->ports; port; port = port->next, i++)
-					if (port->remoteport)
+	} else { /* !group */
+		for (node = fabric->switches; node; node = node->type_next) {
+				DEBUG("SWITCH: dist %d node %p\n", dist, node);
+				out_switch(node, group, chname);
+				for (p = 1; p <= node->info.numports; p++) {
+					port = node->ports[p];
+					if (port && port->remoteport)
 						out_switch_port(port, group);
-			}
+				}
 		}
 	}
 
-	if (chname)
-		free(chname);
 	chname = NULL;
-	if (group && !listtype) {
-
+	if (group) {
 		fprintf(f, "\nNon-Chassis Nodes\n");
-
-		for (dist = 0; dist <= maxhops_discovered; dist++) {
-
-			for (node = nodesdist[dist]; node; node = node->dnext) {
-
-				DEBUG("SWITCH: dist %d node %p", dist, node);
+		for (node = fabric->switches; node; node = node->type_next) {
+				DEBUG("SWITCH: dist %d node %p\n", dist, node);
 				/* Now, skip chassis based switches */
 				if (node->chrecord &&
 				    node->chrecord->chassisnum)
 					continue;
 				out_switch(node, group, chname);
 
-				for (port = node->ports; port; port = port->next, i++)
-					if (port->remoteport)
+				for (p = 1; p <= node->info.numports; p++) {
+					port = node->ports[p];
+					if (port && port->remoteport)
 						out_switch_port(port, group);
-			}
-
+				}
 		}
 
 	}
 
 	/* Make pass on CAs */
-	for (node = nodesdist[MAXHOPS]; node; node = node->dnext) {
-
-		DEBUG("CA: dist %d node %p", dist, node);
-		if (!listtype) {
-			/* Now, skip chassis based CAs */
-			if (group && node->chrecord &&
-			    node->chrecord->chassisnum)
-				continue;
-			out_ca(node, group, chname);
-		} else {
-			if (((listtype & LIST_CA_NODE) && (node->type == CA_NODE)) ||
-			    ((listtype & LIST_ROUTER_NODE) && (node->type == ROUTER_NODE)))
-				list_node(node);
+	for (node = fabric->ch_adapters; node; node = node->type_next) {
+		DEBUG("CA: dist %d node %p\n", dist, node);
+		/* Now, skip chassis based CAs */
+		if (group && node->chrecord &&
+		    node->chrecord->chassisnum)
 			continue;
-		}
+		out_ca(node, group, chname);
 
-		for (port = node->ports; port; port = port->next, i++)
-			if (port->remoteport)
+		for (p = 1; p <= node->info.numports; p++) {
+			port = node->ports[p];
+			if (port && port->remoteport)
 				out_ca_port(port, group);
+		}
 	}
 
-	if (chname)
-		free(chname);
+	/* make pass on routers */
+	for (node = fabric->routers; node; node = node->type_next) {
+		DEBUG("RT: dist %d node %p\n", dist, node);
+		/* Now, skip chassis based routers */
+		if (group && node->chrecord &&
+		    node->chrecord->chassisnum)
+			continue;
+		out_ca(node, group, chname);
+		for (p = 1; p <= node->info.numports; p++) {
+			port = node->ports[p];
+			if (port && port->remoteport)
+				out_ca_port(port, group);
+		}
+	}
 
 	return i;
 }
 
-void dump_ports_report ()
+
+void dump_ports_report (ibnd_node_t *node, void *user_data)
 {
-	int b, n = 0, p;
-	Node *node;
-	Port *port;
-
-	// If switch and LID == 0, search of other switch ports with
-	// valid LID and assign it to all ports of that switch
-	for (b = 0; b <= MAXHOPS; b++)
-		for (node = nodesdist[b]; node; node = node->dnext)
-			if (node->type == SWITCH_NODE) {
-				int swlid = 0;
-				for (p = 0, port = node->ports;
-				     p < node->numports && port && !swlid;
-				     port = port->next)
-					if (port->lid != 0)
-						swlid = port->lid;
-				for (p = 0, port = node->ports;
-				     p < node->numports && port;
-				     port = port->next)
-					port->lid = swlid;
-			}
+	int p = 0;
+	ibnd_port_t *port = NULL;
+
+	/* for each port */
+	for (p = node->info.numports, port = node->ports[p];
+	     p > 0;
+	     port = node->ports[--p]) {
+		if (port == NULL)
+			continue;
 
-	for (b = 0; b <= MAXHOPS; b++)
-		for (node = nodesdist[b]; node; node = node->dnext) {
-			for (p = 0, port = node->ports;
-			     p < node->numports && port;
-			     p++, port = port->next) {
-				fprintf(stdout,
-					"%2s %5d %2d 0x%016" PRIx64 " %s %s",
-					node_type_str2(port->node), port->lid,
-					port->portnum,
-					port->portguid,
-					get_linkwidth_str(port->linkwidth),
-					get_linkspeed_str(port->linkspeed));
-				if (port->remoteport)
-					fprintf(stdout,
-						" - %2s %5d %2d 0x%016" PRIx64
-						" ( '%s' - '%s' )\n",
-						node_type_str2(port->remoteport->node),
-						port->remoteport->lid,
-						port->remoteport->portnum,
-						port->remoteport->portguid,
-						port->node->nodedesc,
-						port->remoteport->node->nodedesc);
-				else
-					fprintf(stdout, "%36s'%s'\n", "",
-						port->node->nodedesc);
-			}
-			n++;
-		}
+		fprintf(stdout,
+			"%2s %5d %2d 0x%016" PRIx64 " %s %s",
+			ibnd_node_type_str_short(node),
+			node->info.type == IBND_SWITCH_NODE ? node->smalid : port->info.lid,
+			port->portnum,
+			port->guid,
+			ibnd_linkwidth_str(port->info.link_width_active),
+			ibnd_linkspeed_str(port->info.link_speed_active));
+		if (port->remoteport)
+			fprintf(stdout,
+				" - %2s %5d %2d 0x%016" PRIx64
+				" ( '%s' - '%s' )\n",
+				ibnd_node_type_str_short(port->remoteport->node),
+				port->remoteport->node->info.type == IBND_SWITCH_NODE ?
+					port->remoteport->node->smalid : port->remoteport->info.lid,
+				port->remoteport->portnum,
+				port->remoteport->guid,
+				port->node->nodedesc,
+				port->remoteport->node->nodedesc);
+		else
+			fprintf(stdout, "%36s'%s'\n", "",
+				port->node->nodedesc);
+	}
 }
 
 void
 usage(void)
 {
-	fprintf(stderr, "Usage: %s [-d(ebug)] -e(rr_show) -v(erbose) -s(how) -l(ist) -g(rouping) -H(ca_list) -S(witch_list) -R(outer_list) -V(ersion) -C ca_name -P ca_port "
+	fprintf(stderr, "Usage: %s [-d(ebug)] -s(how) -l(ist) -g(rouping) -H(ca_list) -S(witch_list) -R(outer_list) -V(ersion) -C ca_name -P ca_port "
 			"-t(imeout) timeout_ms --node-name-map node-name-map] -p(orts) [<topology-file>]\n",
 			argv0);
 	fprintf(stderr, "       --node-name-map <node-name-map> specify a node name map file\n");
@@ -933,20 +528,18 @@ usage(void)
 int
 main(int argc, char **argv)
 {
-	int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS};
-	ib_portid_t my_portid = {0};
-	int udebug = 0, list = 0;
+	int list = 0;
 	char *ca = 0;
 	int ca_port = 0;
 	int group = 0;
 	int ports_report = 0;
+	ibnd_fabric_t *fabric = NULL;
 
 	static char const str_opts[] = "C:P:t:devslgHSRpVhu";
 	static const struct option long_opts[] = {
 		{ "C", 1, 0, 'C'},
 		{ "P", 1, 0, 'P'},
 		{ "debug", 0, 0, 'd'},
-		{ "err_show", 0, 0, 'e'},
 		{ "verbose", 0, 0, 'v'},
 		{ "show", 0, 0, 's'},
 		{ "list", 0, 0, 'l'},
@@ -982,23 +575,17 @@ main(int argc, char **argv)
 			ca_port = strtoul(optarg, 0, 0);
 			break;
 		case 'd':
-			ibdebug++;
-			madrpc_show_errors(1);
-			umad_debug(udebug);
-			udebug++;
+			debug = 1;
+			ibnd_debug(1);
 			break;
 		case 't':
-			timeout = strtoul(optarg, 0, 0);
+			timeout_ms = strtoul(optarg, 0, 0);
 			break;
 		case 'v':
 			verbose++;
-			dumplevel++;
 			break;
 		case 's':
-			dumplevel = 1;
-			break;
-		case 'e':
-			madrpc_show_errors(1);
+			ibnd_show_progress(1);
 			break;
 		case 'l':
 			list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE;
@@ -1007,13 +594,13 @@ main(int argc, char **argv)
 			group = 1;
 			break;
 		case 'S':
-			list = LIST_SWITCH_NODE;
+			list |= LIST_SWITCH_NODE;
 			break;
 		case 'H':
-			list = LIST_CA_NODE;
+			list |= LIST_CA_NODE;
 			break;
 		case 'R':
-			list = LIST_ROUTER_NODE;
+			list |= LIST_ROUTER_NODE;
 			break;
 		case 'V':
 			fprintf(stderr, "%s %s\n", argv0, get_build_version() );
@@ -1030,22 +617,25 @@ main(int argc, char **argv)
 	argv += optind;
 
 	if (argc && !(f = fopen(argv[0], "w")))
-		IBERROR("can't open file %s for writing", argv[0]);
+		fprintf(stderr, "can't open file %s for writing\n", argv[0]);
 
-	madrpc_init(ca, ca_port, mgmt_classes, 2);
 	node_name_map = open_node_name_map(node_name_map_file);
 
-	if (discover(&my_portid) < 0)
-		IBERROR("discover");
-
-	if (group)
-		chassis = group_nodes();
+	if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) {
+		fprintf(stderr, "discover failed\n");
+		exit(1);
+	}
 
 	if (ports_report)
-		dump_ports_report();
+		ibnd_iter_nodes(fabric,
+				dump_ports_report,
+				NULL);
+	else if (list)
+		list_nodes(fabric, list);
 	else
-		dump_topology(list, group);
+		dump_topology(group, fabric);
 
+	ibnd_destroy_fabric(fabric);
 	close_node_name_map(node_name_map);
 	exit(0);
 }
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Nov 20 16:38:12 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 20 Nov 2008 16:38:12 -0800
Subject: [ofa-general] [PATCH 1/3] Create a new library libibnetdisc
Message-ID: <20081120163812.6230375d.weiny2@llnl.gov>

>From 663b13de4253c4d87c73e8d2f50c9b798fa3a4d8 Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Fri, 14 Nov 2008 15:36:03 -0800
Subject: [PATCH] Create a new library libibnetdisc

This encompasses the functionality of ibnetdiscover in a C library.  Discovery
returns a single "ibnd_fabric_t" object which represents the data found during
the scan.  The NodeInfo, PortInfo, and SwitchInfo returned by the fabric queries
are preserved so that the caller can use them as it sees fit.
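
As a rough illustration of the calling style, here is a minimal sketch of
callback-based iteration over the nodes of a discovered fabric, filtered by
node type.  Every sketch_* name is a hypothetical stand-in invented for this
example and is NOT the libibnetdisc API; the real declarations are in
include/infiniband/ibnetdisc.h in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not the library's types. */
enum sketch_node_type { SKETCH_CA = 1, SKETCH_SWITCH = 2, SKETCH_ROUTER = 3 };

struct sketch_node {
	struct sketch_node *next;	/* next node found during the scan */
	enum sketch_node_type type;
	unsigned long long guid;
};

struct sketch_fabric {
	struct sketch_node *nodes;	/* all nodes found during the scan */
};

typedef void (*sketch_iter_fn)(struct sketch_node *node, void *user_data);

/* Call fn once for every node of the requested type. */
void sketch_iter_nodes_type(struct sketch_fabric *fabric, sketch_iter_fn fn,
			    enum sketch_node_type type, void *user_data)
{
	struct sketch_node *node;

	assert(fabric != NULL && fn != NULL);
	for (node = fabric->nodes; node; node = node->next)
		if (node->type == type)
			fn(node, user_data);
}

/* Example callback: user_data points at an int counter. */
void sketch_count_node(struct sketch_node *node, void *user_data)
{
	(void)node;
	(*(int *)user_data)++;
}
```

The user_data pointer lets the caller carry its own state (a counter here)
through the iteration without the library needing to know about it.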

This greatly benefits some diags like iblinkinfo.pl.  This diag in particular
was rewritten in C using this library and runs in about 85% less wall-clock
time (roughly 6.5x faster) on a ~1000 node cluster.

Previous iblinkinfo.pl
   real    3m35.876s
   user    0m13.210s
   sys     1m1.046s

New iblinkinfotest
   real    0m32.869s
   user    0m0.067s
   sys     0m0.140s
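
For reference, the arithmetic behind the comparison of the two "real" times
above can be sketched as follows.  The helper names are made up for this
example and are not part of the patch.

```c
#include <assert.h>

/* Convert a time(1)-style "XmY.YYYs" reading to seconds. */
double to_seconds(int minutes, double seconds)
{
	assert(minutes >= 0 && seconds >= 0.0);
	return minutes * 60.0 + seconds;
}

/* Fraction of wall-clock time saved going from old_s to new_s. */
double time_saved(double old_s, double new_s)
{
	assert(old_s > 0.0);
	return (old_s - new_s) / old_s;
}

/* How many times faster the new run is than the old one. */
double speedup_factor(double old_s, double new_s)
{
	assert(new_s > 0.0);
	return old_s / new_s;
}
```

With old = to_seconds(3, 35.876) and new = to_seconds(0, 32.869), time_saved()
comes out just under 0.85 and speedup_factor() just over 6.5.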

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 libibnetdisc/AUTHORS                        |    1 +
 libibnetdisc/COPYING                        |  384 ++++++++++++
 libibnetdisc/ChangeLog                      |    4 +
 libibnetdisc/Makefile.am                    |   73 +++
 libibnetdisc/autogen.sh                     |   11 +
 libibnetdisc/configure.in                   |   68 +++
 libibnetdisc/include/infiniband/ibnetdisc.h |  306 ++++++++++
 libibnetdisc/libibnetdisc.spec.in           |   94 +++
 libibnetdisc/libibnetdisc.ver               |    9 +
 libibnetdisc/man/ibnd_debug.3               |    2 +
 libibnetdisc/man/ibnd_destroy_fabric.3      |    2 +
 libibnetdisc/man/ibnd_discover_fabric.3     |   43 ++
 libibnetdisc/man/ibnd_find_node_dr.3        |    2 +
 libibnetdisc/man/ibnd_find_node_guid.3      |   25 +
 libibnetdisc/man/ibnd_iter_nodes.3          |   24 +
 libibnetdisc/man/ibnd_iter_nodes_type.3     |    2 +
 libibnetdisc/man/ibnd_linkspeed_str.3       |    2 +
 libibnetdisc/man/ibnd_linkstate_str.3       |    2 +
 libibnetdisc/man/ibnd_linkwidth_str.3       |   26 +
 libibnetdisc/man/ibnd_node_type_str.3       |    2 +
 libibnetdisc/man/ibnd_node_type_str_short.3 |    2 +
 libibnetdisc/man/ibnd_physstate_str.3       |    2 +
 libibnetdisc/man/ibnd_update_node.3         |   21 +
 libibnetdisc/src/chassis.c                  |  820 +++++++++++++++++++++++++
 libibnetdisc/src/chassis.h                  |   82 +++
 libibnetdisc/src/ibnetdisc.c                |  863 +++++++++++++++++++++++++++
 libibnetdisc/src/libibnetdisc.map           |   27 +
 libibnetdisc/test/iblinkinfotest.c          |  395 ++++++++++++
 libibnetdisc/test/ibnetdisctest.c           |  588 ++++++++++++++++++
 libibnetdisc/test/testleaks.c               |  261 ++++++++
 30 files changed, 4143 insertions(+), 0 deletions(-)
 create mode 100644 libibnetdisc/AUTHORS
 create mode 100644 libibnetdisc/COPYING
 create mode 100644 libibnetdisc/ChangeLog
 create mode 100644 libibnetdisc/Makefile.am
 create mode 100755 libibnetdisc/autogen.sh
 create mode 100644 libibnetdisc/configure.in
 create mode 100644 libibnetdisc/include/infiniband/ibnetdisc.h
 create mode 100644 libibnetdisc/libibnetdisc.spec.in
 create mode 100644 libibnetdisc/libibnetdisc.ver
 create mode 100644 libibnetdisc/man/ibnd_debug.3
 create mode 100644 libibnetdisc/man/ibnd_destroy_fabric.3
 create mode 100644 libibnetdisc/man/ibnd_discover_fabric.3
 create mode 100644 libibnetdisc/man/ibnd_find_node_dr.3
 create mode 100644 libibnetdisc/man/ibnd_find_node_guid.3
 create mode 100644 libibnetdisc/man/ibnd_iter_nodes.3
 create mode 100644 libibnetdisc/man/ibnd_iter_nodes_type.3
 create mode 100644 libibnetdisc/man/ibnd_linkspeed_str.3
 create mode 100644 libibnetdisc/man/ibnd_linkstate_str.3
 create mode 100644 libibnetdisc/man/ibnd_linkwidth_str.3
 create mode 100644 libibnetdisc/man/ibnd_node_type_str.3
 create mode 100644 libibnetdisc/man/ibnd_node_type_str_short.3
 create mode 100644 libibnetdisc/man/ibnd_physstate_str.3
 create mode 100644 libibnetdisc/man/ibnd_update_node.3
 create mode 100644 libibnetdisc/src/chassis.c
 create mode 100644 libibnetdisc/src/chassis.h
 create mode 100644 libibnetdisc/src/ibnetdisc.c
 create mode 100644 libibnetdisc/src/libibnetdisc.map
 create mode 100644 libibnetdisc/test/iblinkinfotest.c
 create mode 100644 libibnetdisc/test/ibnetdisctest.c
 create mode 100644 libibnetdisc/test/testleaks.c

diff --git a/libibnetdisc/AUTHORS b/libibnetdisc/AUTHORS
new file mode 100644
index 0000000..d7211f9
--- /dev/null
+++ b/libibnetdisc/AUTHORS
@@ -0,0 +1 @@
+Ira Weiny <weiny2 at llnl.gov>
diff --git a/libibnetdisc/COPYING b/libibnetdisc/COPYING
new file mode 100644
index 0000000..a017728
--- /dev/null
+++ b/libibnetdisc/COPYING
@@ -0,0 +1,384 @@
+This software with the exception of OpenSM is available to you
+under a choice of one of two licenses. You may choose to be
+licensed under the terms of the OpenIB.org BSD license or
+the GNU General Public License (GPL) Version 2, both included
+below.
+
+OpenSM is licensed under either GNU General Public License (GPL)
+Version 2, or Intel BSD + Patent license. See OpenSM for the
+specific language for the latter licensing terms.
+
+
+Copyright (c) 2004, 2005 Voltaire, Inc.  All rights reserved.
+
+==================================================================
+
+		       OpenIB.org BSD license
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions
+are met:
+
+  * Redistributions of source code must retain the above copyright
+    notice, this list of conditions and the following disclaimer.
+
+  * Redistributions in binary form must reproduce the above
+    copyright notice, this list of conditions and the following
+    disclaimer in the documentation and/or other materials provided
+    with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
+ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
+
+==================================================================
+
+                    GNU GENERAL PUBLIC LICENSE
+                       Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
+                       59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+                            Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it.  By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users.  This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it.  (Some other Free Software Foundation software is covered by
+the GNU Library General Public License instead.)  You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price.  Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+  To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+  For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have.  You must make sure that they, too, receive or can get the
+source code.  And you must show them these terms so they know their
+rights.
+
+  We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+  Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software.  If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+  Finally, any free program is threatened constantly by software
+patents.  We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary.  To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+                    GNU GENERAL PUBLIC LICENSE
+   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+  0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License.  The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language.  (Hereinafter, translation is included without limitation in
+the term "modification".)  Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope.  The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+  1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+  2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+    a) You must cause the modified files to carry prominent notices
+    stating that you changed the files and the date of any change.
+
+    b) You must cause any work that you distribute or publish, that in
+    whole or in part contains or is derived from the Program or any
+    part thereof, to be licensed as a whole at no charge to all third
+    parties under the terms of this License.
+
+    c) If the modified program normally reads commands interactively
+    when run, you must cause it, when started running for such
+    interactive use in the most ordinary way, to print or display an
+    announcement including an appropriate copyright notice and a
+    notice that there is no warranty (or else, saying that you provide
+    a warranty) and that users may redistribute the program under
+    these conditions, and telling the user how to view a copy of this
+    License.  (Exception: if the Program itself is interactive but
+    does not normally print such an announcement, your work based on
+    the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole.  If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works.  But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+  3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+    a) Accompany it with the complete corresponding machine-readable
+    source code, which must be distributed under the terms of Sections
+    1 and 2 above on a medium customarily used for software interchange; or,
+
+    b) Accompany it with a written offer, valid for at least three
+    years, to give any third party, for a charge no more than your
+    cost of physically performing source distribution, a complete
+    machine-readable copy of the corresponding source code, to be
+    distributed under the terms of Sections 1 and 2 above on a medium
+    customarily used for software interchange; or,
+
+    c) Accompany it with the information you received as to the offer
+    to distribute corresponding source code.  (This alternative is
+    allowed only for noncommercial distribution and only if you
+    received the program in object code or executable form with such
+    an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it.  For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable.  However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+  4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License.  Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+  5. You are not required to accept this License, since you have not
+signed it.  However, nothing else grants you permission to modify or
+distribute the Program or its derivative works.  These actions are
+prohibited by law if you do not accept this License.  Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+  6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions.  You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+  7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all.  For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices.  Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+  8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded.  In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+  9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time.  Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number.  If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation.  If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+  10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission.  For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this.  Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+                            NO WARRANTY
+
+  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+                     END OF TERMS AND CONDITIONS
+
+            How to Apply These Terms to Your New Programs
+
+  If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+  To do so, attach the following notices to the program.  It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+    Gnomovision version 69, Copyright (C) year name of author
+    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License.  Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary.  Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+  `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+  <signature of Ty Coon>, 1 April 1989
+  Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs.  If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library.  If this is what you want to do, use the GNU Library General
+Public License instead of this License.
diff --git a/libibnetdisc/ChangeLog b/libibnetdisc/ChangeLog
new file mode 100644
index 0000000..d74037e
--- /dev/null
+++ b/libibnetdisc/ChangeLog
@@ -0,0 +1,4 @@
+
+2008-04-09  Ira Weiny <weiny2 at llnl.gov>
+
+	* Added to git tree
diff --git a/libibnetdisc/Makefile.am b/libibnetdisc/Makefile.am
new file mode 100644
index 0000000..b5c0dd0
--- /dev/null
+++ b/libibnetdisc/Makefile.am
@@ -0,0 +1,73 @@
+
+SUBDIRS = .
+
+INCLUDES = -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband
+
+lib_LTLIBRARIES = libibnetdisc.la
+sbin_PROGRAMS =
+
+if ENABLE_TEST_UTILS
+sbin_PROGRAMS += test/ibnetdisctest \
+                 test/iblinkinfotest \
+                 test/testleaks
+endif
+
+DBGFLAGS = -g
+
+if HAVE_LD_VERSION_SCRIPT
+libibnetdisc_version_script = -Wl,--version-script=$(srcdir)/src/libibnetdisc.map
+else
+libibnetdisc_version_script =
+endif
+
+libibnetdisc_la_SOURCES = src/ibnetdisc.c src/chassis.c src/chassis.h
+libibnetdisc_la_CFLAGS = -Wall $(DBGFLAGS)
+libibnetdisc_la_LDFLAGS = -version-info $(ibnetdisc_api_version) \
+	-export-dynamic $(libibnetdisc_version_script) \
+	-losmcomp -libmad
+libibnetdisc_la_DEPENDENCIES = $(srcdir)/src/libibnetdisc.map
+
+libibnetdiscincludedir = $(includedir)/infiniband
+
+test_ibnetdisctest_SOURCES = test/ibnetdisctest.c
+test_ibnetdisctest_CFLAGS = -Wall $(DBGFLAGS)
+test_ibnetdisctest_LDFLAGS = -Wl,--rpath -Wl,$(libdir) \
+			-libcommon -libnetdisc
+
+test_iblinkinfotest_SOURCES = test/iblinkinfotest.c
+test_iblinkinfotest_CFLAGS = -Wall $(DBGFLAGS)
+test_iblinkinfotest_LDFLAGS = -Wl,--rpath -Wl,$(libdir) \
+			-libcommon -libnetdisc
+
+test_testleaks_SOURCES = test/testleaks.c
+test_testleaks_CFLAGS = -Wall $(DBGFLAGS)
+test_testleaks_LDFLAGS = -Wl,--rpath -Wl,$(libdir) \
+			-libcommon -libnetdisc
+
+libibnetdiscinclude_HEADERS = $(srcdir)/include/infiniband/ibnetdisc.h
+
+man_MANS = man/ibnd_debug.3 \
+	man/ibnd_destroy_fabric.3 \
+	man/ibnd_discover_fabric.3 \
+	man/ibnd_find_node_dr.3 \
+	man/ibnd_find_node_guid.3 \
+	man/ibnd_iter_nodes.3 \
+	man/ibnd_iter_nodes_type.3 \
+	man/ibnd_linkspeed_str.3 \
+	man/ibnd_linkstate_str.3 \
+	man/ibnd_linkwidth_str.3 \
+	man/ibnd_node_type_str.3 \
+	man/ibnd_physstate_str.3 \
+	man/ibnd_update_node.3
+
+EXTRA_DIST = libibnetdisc.spec.in libibnetdisc.spec \
+	$(srcdir)/src/libibnetdisc.map libibnetdisc.ver autogen.sh
+
+dist-hook:
+	if [ -x $(top_srcdir)/../gen_chlog.sh ] ; then \
+		$(top_srcdir)/../gen_chlog.sh $(PACKAGE) > $(distdir)/ChangeLog ; \
+	fi
+	if [ -x $(top_srcdir)/../gen_ver.sh ] ; then \
+		ver=`$(top_srcdir)/../gen_ver.sh $(PACKAGE)` ; \
+		sed -e '/AC_INIT/s/$(PACKAGE), .*,/$(PACKAGE), '$$ver',/' $(top_srcdir)/configure.in > $(distdir)/configure.in ; \
+	fi
diff --git a/libibnetdisc/autogen.sh b/libibnetdisc/autogen.sh
new file mode 100755
index 0000000..4827884
--- /dev/null
+++ b/libibnetdisc/autogen.sh
@@ -0,0 +1,11 @@
+#! /bin/sh
+
+# create the config dir if it does not exist
+test -d config || mkdir config
+
+set -x
+aclocal -I config
+libtoolize --force --copy
+autoheader
+automake --foreign --add-missing --copy
+autoconf
diff --git a/libibnetdisc/configure.in b/libibnetdisc/configure.in
new file mode 100644
index 0000000..e5bb0f9
--- /dev/null
+++ b/libibnetdisc/configure.in
@@ -0,0 +1,68 @@
+dnl Process this file with autoconf to produce a configure script.
+
+AC_PREREQ(2.57)
+AC_INIT(libibnetdisc, 0.0.1, general at lists.openfabrics.org)
+dnl AC_CONFIG_SRCDIR([src/stack.c])
+AC_CONFIG_AUX_DIR(config)
+AM_CONFIG_HEADER(config.h)
+AM_INIT_AUTOMAKE
+
+AC_SUBST(RELEASE, ${RELEASE:-unknown})
+AC_SUBST(TARBALL, ${TARBALL:-${PACKAGE}-${VERSION}.tar.gz})
+
+dnl the library version info is available in the file: libibnetdisc.ver
+ibnetdisc_api_version=`grep LIBVERSION $srcdir/libibnetdisc.ver | sed 's/LIBVERSION=//'`
+if test -z "$ibnetdisc_api_version"; then
+   echo "FAILED to find $srcdir/libibnetdisc.ver"
+   exit 1
+fi
+AC_SUBST(ibnetdisc_api_version)
+AC_DEFINE_UNQUOTED(API_VERSION,
+	["$ibnetdisc_api_version"],
+	[The API version of this library])
+
+dnl Checks for programs
+AC_PROG_CC
+AC_PROG_CPP
+AC_PROG_INSTALL
+AC_PROG_LN_S
+AC_PROG_MAKE_SET
+AM_PROG_LIBTOOL
+
+dnl Checks for header files.
+AC_HEADER_STDC
+AC_CHECK_HEADERS([stdint.h stdlib.h string.h syslog.h unistd.h])
+
+dnl Checks for library functions
+AC_TYPE_SIGNAL
+AC_FUNC_VPRINTF
+AC_CHECK_FUNCS([strrchr strtoul strtoull])
+
+dnl Checks for typedefs, structures, and compiler characteristics.
+AC_C_CONST
+AC_C_INLINE
+AC_STRUCT_TM
+
+AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script,
+    if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then
+        ac_cv_version_script=yes
+    else
+        ac_cv_version_script=no
+    fi)
+
+AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes")
+
+dnl Check if we should include test utilities
+AC_MSG_CHECKING(for --enable-test-utils)
+AC_ARG_ENABLE(test-utils,
+[  --enable-test-utils build additional test utilities (default=no)],
+[case "${enableval}" in
+  yes) tutils=yes ;;
+  no)  tutils=no ;;
+  *) AC_MSG_ERROR(bad value ${enableval} for --enable-test-utils) ;;
+esac],[tutils=no])
+AM_CONDITIONAL(ENABLE_TEST_UTILS, test x$tutils = xyes)
+AC_MSG_RESULT(${tutils=no})
+
+AC_CONFIG_FILES([Makefile libibnetdisc.spec])
+AC_OUTPUT
diff --git a/libibnetdisc/include/infiniband/ibnetdisc.h b/libibnetdisc/include/infiniband/ibnetdisc.h
new file mode 100644
index 0000000..92fa8c4
--- /dev/null
+++ b/libibnetdisc/include/infiniband/ibnetdisc.h
@@ -0,0 +1,306 @@
+/*
+ * Copyright (c) 2008 Lawrence Livermore National Lab.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef _IBNETDISC_H_
+#define _IBNETDISC_H_
+
+#include <stdio.h>
+#include <infiniband/mad.h>
+
+#define MAXHOPS		63
+
+/* HASH table defines */
+#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
+#define HTSZ 137
+
+#define	IBND_DEBUG(str, args...) \
+	do { if (ibdebug) printf("%s:%d; "str, __FILE__, __LINE__, ##args); } while (0)
+#define	IBND_ERROR(str, args...) \
+	fprintf(stderr, "%s:%d; "str, __FILE__, __LINE__, ##args)
+
+/** =========================================================================
+ * ENUM definitions
+ */
+typedef enum {
+	IBND_CA_NODE	= 1,
+	IBND_SWITCH_NODE = 2,
+	IBND_ROUTER_NODE = 3
+} ibnd_node_type_t;
+
+typedef enum {
+	IBND_LINK_DOWN = 1,
+	IBND_LINK_INIT = 2,
+	IBND_LINK_ARMED = 3,
+	IBND_LINK_ACTIVE = 4
+} ibnd_link_state_t;
+
+/** =========================================================================
+ * Node
+ */
+typedef struct switch_info {
+	int smaenhsp0;
+} ibnd_switch_info_t;
+
+typedef struct node_info {
+	int base_ver;
+	int class_ver;
+	int type;
+	int numports;
+	uint64_t sysimgguid;
+	uint64_t nodeguid;
+	uint64_t nodeportguid;
+	uint16_t partition_cap;
+	uint32_t devid;
+	uint32_t revision;
+	int localport;
+	uint32_t vendid;
+} ibnd_node_info_t;
+
+struct port;
+struct ib_fabric;
+struct chassis_record;
+typedef struct node {
+	struct node *next; /* all node list in fabric */
+	struct node *htnext; /* store node in guid hash table */
+	struct node *dnext; /* store node in nodesdist table */
+	struct node *type_next; /* store node in "type" list (ca|switch|router) */
+
+	struct ib_fabric *fabric; /* the fabric node belongs to */
+
+	ib_portid_t path_portid; /* path from "from_node" */
+	int dist; /* num of hops from "from_node" */
+
+	int smalid;
+	int smalmc;
+	ibnd_switch_info_t sw_info;
+	ibnd_node_info_t info;
+
+	char nodedesc[64];
+
+	struct port **ports; /* in order array of port pointers */
+				/* the size of this array is info.numports + 1 */
+				/* items MAY BE NULL!  (e.g. entry 0 is set for switches only) */
+
+	/* chassis info */
+	struct node *chassis_next; /* store node in "chassis" list */
+	struct chassis_record *chrecord;
+
+	void *user_data; /* users can store data here */
+} ibnd_node_t;
+
+/** =========================================================================
+ * Port
+ */
+typedef struct port_info {
+	int lid;
+	int smlid;
+	int link_speed_supported;
+	int link_speed_enabled;
+	int link_speed_active;
+	int link_state;
+	int phys_state;
+	int link_down_def_state;
+	int mkey_prot_bits;
+	int lmc;
+	int neighbor_mtu;
+	int smsl;
+	int init_type;
+	int vl_capability;
+	int vl_high_limit;
+	int vl_arb_high_cap;
+	int vl_arb_low_cap;
+	int init_reply;
+	int mtu_cap;
+	int vl_stall_count;
+	int hoq_lifetime;
+	int oper_vls;
+	int partition_enforce_in;
+	int partition_enforce_out;
+	int filter_raw_in;
+	int filter_raw_out;
+	int mkey_violations;
+	int pkey_violations;
+	int qkey_violations;
+	int guid_capabilities;
+	int client_rereg;
+	int subnet_timeout;
+	int response_time_val;
+	int local_phys_error;
+	int overrun_error;
+	int max_credit_hint;
+	uint32_t link_round_trip;
+	int local_port;
+	int link_width_supported;
+	int link_width_enabled;
+	int link_width_active;
+	int diag_code;
+	int mkey_lease;
+	uint32_t capability_mask;
+	uint64_t mkey;
+	uint64_t gid_prefix;
+} ibnd_port_info_t;
+
+typedef struct port {
+	struct port *htnext;
+	uint64_t guid;
+	int portnum;
+	int ext_portnum; /* optional (!= 0) external port num */
+	ibnd_node_t *node;
+	struct port *remoteport; /* null if SMA, or does not exist */
+	ibnd_port_info_t info;
+	void *user_data; /* users can store data here */
+} ibnd_port_t;
+
+
+/** =========================================================================
+ * Chassis data
+ */
+typedef struct chassis_record {
+	struct chassis_record *next;
+	unsigned char chassisnum;
+	unsigned char chassistype;
+	unsigned char anafanum;
+	unsigned char slotnum;
+	unsigned char chassisslot;
+} ibnd_chassis_record_t;
+
+#define SPINES_MAX_NUM 12
+#define LINES_MAX_NUM 36
+
+typedef struct chassis_list {
+	struct chassis_list *next;
+	uint64_t chassisguid;
+	int chassisnum;
+	int chassistype;
+
+	/* generic grouping by SystemImageGUID */
+	int nodecount;
+	ibnd_node_t *nodes;
+
+	/* specific to voltaire type nodes */
+	ibnd_node_t *spinenode[SPINES_MAX_NUM + 1];
+	ibnd_node_t *linenode[LINES_MAX_NUM + 1];
+} ibnd_chassis_list_t;
+
+/** =========================================================================
+ * Fabric
+ * Main fabric object which is returned and represents the data discovered
+ */
+typedef struct ib_fabric {
+	/* the node which you requested to start on
+	 * "from" parameter in ibnd_discover_fabric
+	 */
+	ibnd_node_t *from_node;
+
+	/* list of all nodes in the system */
+	ibnd_node_t *nodes;
+
+	/* NULL terminated lists of node types */
+	ibnd_node_t *switches;
+	ibnd_node_t *ch_adapters;
+	ibnd_node_t *routers;
+
+	/* list of all chassis found in the fabric */
+	ibnd_chassis_list_t *chassis;
+
+	/* the following are for internal use */
+	void *ibmad_port;
+	ibnd_node_t *nodestbl[HTSZ];
+	ibnd_port_t *portstbl[HTSZ];
+	int maxhops_discovered;
+	ibnd_node_t *nodesdist[MAXHOPS+1];
+	ibnd_chassis_list_t *first_chassis;
+	ibnd_chassis_list_t *current_chassis;
+} ibnd_fabric_t;
+
+
+/** =========================================================================
+ * Initialization (fabric operations)
+ */
+void           ibnd_debug(int i);
+void           ibnd_show_progress(int i);
+
+ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port,
+			int timeout_ms, ib_portid_t *from, int hops);
+	/**
+	 * dev_name: (required) local device name to use to access the fabric
+	 * dev_port: (required) local device port to use to access the fabric
+	 * timeout_ms: (required) gives the timeout for a _SINGLE_ query on
+	 *             the fabric.  So if there are multiple nodes not
+	 *             responding this may result in a lengthy delay.
+	 * from: (optional) specify the node to start scanning from.
+	 *       If NULL start from the node we are running on.
+	 * hops: (optional) Specify how much of the fabric to traverse.
+	 *       negative value == scan entire fabric
+	 */
+void           ibnd_destroy_fabric(ibnd_fabric_t *fabric);
+
+/** =========================================================================
+ * Node operations
+ */
+typedef void (*ibnd_iter_func_t)(ibnd_node_t *node, void *user_data);
+
+ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid);
+ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str);
+ibnd_node_t *ibnd_update_node(ibnd_node_t *node);
+
+void         ibnd_iter_nodes(ibnd_fabric_t *fabric,
+				ibnd_iter_func_t func,
+				void *user_data);
+void         ibnd_iter_nodes_type(ibnd_fabric_t *fabric,
+				ibnd_iter_func_t func,
+				ibnd_node_type_t node_type,
+				void *user_data);
+
+/** =========================================================================
+ * Str convert functions
+ */
+char          *ibnd_linkwidth_str(int link_width);
+char          *ibnd_linkspeed_str(int link_speed);
+char          *ibnd_linkstate_str(int link_state);
+char          *ibnd_physstate_str(int phys_state);
+const char    *ibnd_node_type_str(ibnd_node_t *node);
+const char    *ibnd_node_type_str_short(ibnd_node_t *node);
+
+/** =========================================================================
+ * Chassis queries
+ */
+uint64_t  ibnd_get_chassis_guid(ibnd_fabric_t *fabric, unsigned char chassisnum);
+char     *ibnd_get_chassis_type(ibnd_node_t *node);
+char     *ibnd_get_chassis_slot_str(ibnd_node_t *node, char *str, size_t size);
+
+int       ibnd_is_xsigo_guid(uint64_t guid);
+int       ibnd_is_xsigo_tca(uint64_t guid);
+int       ibnd_is_xsigo_hca(uint64_t guid);
+
+#endif	/* _IBNETDISC_H_ */
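The Xsigo predicates declared at the end of this header are plain prefix tests on the 64-bit GUID, as the chassis.c part of this patch shows. Below is a minimal standalone sketch of that masking, assuming only the constants and NodeType comments visible in chassis.c; the demo_* names are hypothetical and are not part of libibnetdisc.

```c
#include <stdint.h>

/* Top 24 bits of the GUID: the vendor prefix that chassis.c treats as Xsigo. */
#define DEMO_XSIGO_OUI_MASK 0xffffff0000000000ULL
#define DEMO_XSIGO_OUI      0x0013970000000000ULL

/* Hypothetical stand-in for ibnd_is_xsigo_guid(): match the vendor prefix. */
static int demo_is_xsigo_guid(uint64_t guid)
{
	return (guid & DEMO_XSIGO_OUI_MASK) == DEMO_XSIGO_OUI;
}

/*
 * The byte after the vendor prefix is the NodeType. The chassis.c comments
 * give 1 = switch, 2 = HCA, 3 = TCA. Returns -1 for a non-Xsigo GUID.
 */
static int demo_xsigo_node_type(uint64_t guid)
{
	if (!demo_is_xsigo_guid(guid))
		return -1;
	return (int)((guid >> 32) & 0xff);
}

/* Equivalent to comparing the top 32 bits against 0x00139702 / 0x00139703. */
static int demo_is_xsigo_hca(uint64_t guid)
{
	return demo_xsigo_node_type(guid) == 2;
}

static int demo_is_xsigo_tca(uint64_t guid)
{
	return demo_xsigo_node_type(guid) == 3;
}
```

Extracting the NodeType byte once keeps the HCA and TCA checks from repeating the 32-bit mask; the result should agree with the mask-and-compare form used in the patch.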
diff --git a/libibnetdisc/libibnetdisc.spec.in b/libibnetdisc/libibnetdisc.spec.in
new file mode 100644
index 0000000..015cd24
--- /dev/null
+++ b/libibnetdisc/libibnetdisc.spec.in
@@ -0,0 +1,94 @@
+
+%define RELEASE @RELEASE@
+%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE}
+
+%if %{?_with_test_utils:1}%{!?_with_test_utils:0}
+%define _enable_test_utils --enable-test-utils
+%endif
+%if %{?_without_test_utils:1}%{!?_without_test_utils:0}
+%define _disable_test_utils --disable-test-utils
+%endif
+
+Summary: OpenFabrics Alliance InfiniBand network discovery library
+Name: libibnetdisc
+Version: @VERSION@
+Release: %rel%{?dist}
+License: GPLv2 or BSD
+Group: System Environment/Libraries
+BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
+Source: http://www.openfabrics.org/downloads/management/@TARBALL@
+Url: http://openfabrics.org/
+BuildRequires: opensm-libs, libtool, libibcommon, libibumad
+Requires(post): /sbin/ldconfig
+Requires(postun): /sbin/ldconfig
+
+%description
+libibnetdisc provides a higher-level C interface for scanning an IB fabric.
+
+%package devel
+Summary: Development files for the libibnetdisc library
+Group: System Environment/Libraries
+Requires: %{name} = %{version}-%{release}, opensm-devel, libibcommon-devel, libibumad-devel
+Requires(post): /sbin/ldconfig
+Requires(postun): /sbin/ldconfig
+
+%description devel
+Development files for the libibnetdisc library.
+
+%package static
+Summary: Static version of the libibnetdisc library
+Group: System Environment/Libraries
+Requires: %{name} = %{version}-%{release}
+
+%description static
+Static version of the libibnetdisc library
+
+%if %{?_with_test_utils:1}%{!?_with_test_utils:0}
+%package utils
+Summary: Debug utilities built against libibnetdisc
+Group: System Environment/Libraries
+Requires: %{name} = %{version}-%{release}
+
+%description utils
+Debug utilities built against libibnetdisc
+
+%files utils
+%defattr(-,root,root)
+%{_sbindir}/*
+%endif
+
+%prep
+%setup -q
+
+%build
+%configure \
+   %{?_enable_test_utils} \
+   %{?_disable_test_utils}
+make
+
+%install
+make DESTDIR=${RPM_BUILD_ROOT} install
+# remove unpackaged files from the buildroot
+rm -f $RPM_BUILD_ROOT%{_libdir}/*.la
+
+%clean
+rm -rf $RPM_BUILD_ROOT
+
+%post -p /sbin/ldconfig
+%postun -p /sbin/ldconfig
+%post devel -p /sbin/ldconfig
+%postun devel -p /sbin/ldconfig
+
+%files
+%defattr(-,root,root)
+%{_libdir}/libibnetdisc*.so.*
+%doc AUTHORS COPYING ChangeLog
+
+%files devel
+%defattr(-,root,root)
+%{_libdir}/libibnetdisc.so
+%{_includedir}/infiniband/*.h
+
+%files static
+%defattr(-,root,root)
+%{_libdir}/libibnetdisc.a
diff --git a/libibnetdisc/libibnetdisc.ver b/libibnetdisc/libibnetdisc.ver
new file mode 100644
index 0000000..a0a5f3c
--- /dev/null
+++ b/libibnetdisc/libibnetdisc.ver
@@ -0,0 +1,9 @@
+# In this file we track the current API version
+# of the IB net discover interface (and libraries)
+# The version is built of the following
+# three numbers:
+# API_REV:RUNNING_REV:AGE
+# API_REV - advance on any added API
+# RUNNING_REV - advance on any change to the vendor files
+# AGE - number of backward versions the API still supports
+LIBVERSION=1:0:0
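For readers unfamiliar with the triple above: if LIBVERSION is handed to libtool as -version-info current:revision:age (an assumption, since the Makefile is not part of this excerpt), then on Linux the installed file name ends in (current - age).age.revision. A small sketch of that arithmetic follows; the function names are hypothetical helpers, not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse "current:revision:age". Returns 0 on success, -1 on malformed
 * input, negative fields, or age > current (which libtool rejects).
 */
static int demo_parse_libversion(const char *s, int *current,
				 int *revision, int *age)
{
	char trailing;

	/* A 4th conversion succeeding means there was trailing junk. */
	if (sscanf(s, "%d:%d:%d%c", current, revision, age, &trailing) != 3)
		return -1;
	if (*current < 0 || *revision < 0 || *age < 0 || *age > *current)
		return -1;
	return 0;
}

/*
 * Write the Linux-style suffix "major.age.revision" into buf, where
 * major = current - age. Returns 0 on success, -1 on bad input.
 */
static int demo_linux_so_suffix(const char *libversion, char *buf, size_t size)
{
	int current, revision, age;

	assert(buf != NULL && size > 0);
	if (demo_parse_libversion(libversion, &current, &revision, &age) < 0)
		return -1;
	snprintf(buf, size, "%d.%d.%d", current - age, age, revision);
	return 0;
}
```

Under that assumption, the 1:0:0 value in this file would give a shared object suffix of 1.0.0, and a later 3:2:1 would give 2.1.2.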
diff --git a/libibnetdisc/man/ibnd_debug.3 b/libibnetdisc/man/ibnd_debug.3
new file mode 100644
index 0000000..a4076fc
--- /dev/null
+++ b/libibnetdisc/man/ibnd_debug.3
@@ -0,0 +1,2 @@
+.\".TH IBND_DEBUG 3  "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.so man3/ibnd_discover_fabric.3
diff --git a/libibnetdisc/man/ibnd_destroy_fabric.3 b/libibnetdisc/man/ibnd_destroy_fabric.3
new file mode 100644
index 0000000..8fe20ae
--- /dev/null
+++ b/libibnetdisc/man/ibnd_destroy_fabric.3
@@ -0,0 +1,2 @@
+.\".TH IBND_DESTROY_FABRIC 3  "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.so man3/ibnd_discover_fabric.3
diff --git a/libibnetdisc/man/ibnd_discover_fabric.3 b/libibnetdisc/man/ibnd_discover_fabric.3
new file mode 100644
index 0000000..0db23f4
--- /dev/null
+++ b/libibnetdisc/man/ibnd_discover_fabric.3
@@ -0,0 +1,43 @@
+.TH IBND_DISCOVER_FABRIC 3  "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.SH "NAME"
+ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug \- initialize ibnetdiscover library.
+.SH "SYNOPSIS"
+.nf
+.B #include <infiniband/ibnetdisc.h>
+.sp
+.BI "ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, ib_portid_t *from, int hops)"
+.BI "void ibnd_destroy_fabric(ibnd_fabric_t *fabric)"
+.BI "void ibnd_debug(int i)"
+
+.SH "DESCRIPTION"
+.B ibnd_discover_fabric()
+Discover the fabric connected to the port specified by dev_name and dev_port, using the specified timeout.  The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops".  This gives the user a "sub-fabric" which is "centered" anywhere they choose.
+
+.B ibnd_destroy_fabric()
+free all memory and resources associated with the fabric.
+
+.B ibnd_debug()
+Set the debug level to be printed as library operations take place.
+
+.SH "RETURN VALUE"
+.B ibnd_discover_fabric()
+return NULL on failure, otherwise a valid ibnd_fabric_t object.
+
+.B ibnd_destroy_fabric(), ibnd_debug()
+NONE
+
+.SH "EXAMPLES"
+
+.B Discover the entire fabric connected to device "mthca0", port 1.
+
+	ibnd_discover_fabric("mthca0", 1, 100, NULL, -1);
+
+.B Discover only a single node and those nodes connected to it.
+
+	str2drpath(&(port_id.drpath), from, 0, 0);
+
+	ibnd_discover_fabric("mthca0", 1, 100, &port_id, 1);
+
+.SH "AUTHORS"
+.TP
+Ira Weiny <weiny2 at llnl.gov>
diff --git a/libibnetdisc/man/ibnd_find_node_dr.3 b/libibnetdisc/man/ibnd_find_node_dr.3
new file mode 100644
index 0000000..612e501
--- /dev/null
+++ b/libibnetdisc/man/ibnd_find_node_dr.3
@@ -0,0 +1,2 @@
+.\".TH IBND_FIND_NODE_DR 3  "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.so man3/ibnd_find_node_guid.3
diff --git a/libibnetdisc/man/ibnd_find_node_guid.3 b/libibnetdisc/man/ibnd_find_node_guid.3
new file mode 100644
index 0000000..676b528
--- /dev/null
+++ b/libibnetdisc/man/ibnd_find_node_guid.3
@@ -0,0 +1,25 @@
+.TH IBND_FIND_NODE_GUID 3  "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.SH "NAME"
+ibnd_find_node_guid, ibnd_find_node_dr \- given a fabric object find the node object within it which matches the guid or directed route specified.
+
+.SH "SYNOPSIS"
+.nf
+.B #include <infiniband/ibnetdisc.h>
+.sp
+.BI "ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid)"
+.BI "ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str)"
+
+.SH "DESCRIPTION"
+.B ibnd_find_node_guid()
+Given a fabric object and a guid, return the ibnd_node_t object with that node guid.
+.B ibnd_find_node_dr()
+Given a fabric object and a directed route, return the ibnd_node_t object with
+that directed route.
+
+.SH "RETURN VALUE"
+.B ibnd_find_node_guid(), ibnd_find_node_dr()
+return NULL on failure, otherwise a valid ibnd_node_t object.
+
+.SH "AUTHORS"
+.TP
+Ira Weiny <weiny2 at llnl.gov>
diff --git a/libibnetdisc/man/ibnd_iter_nodes.3 b/libibnetdisc/man/ibnd_iter_nodes.3
new file mode 100644
index 0000000..7199dfb
--- /dev/null
+++ b/libibnetdisc/man/ibnd_iter_nodes.3
@@ -0,0 +1,24 @@
+.TH IBND_ITER_NODES 3  "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.SH "NAME"
+ibnd_iter_nodes, ibnd_iter_nodes_type \- given a fabric object and a function, iterate over the nodes in the fabric.
+
+.SH "SYNOPSIS"
+.nf
+.B #include <infiniband/ibnetdisc.h>
+.sp
+.BI "void ibnd_iter_nodes(ibnd_fabric_t *fabric, ibnd_iter_func_t func, void *user_data)"
+.BI "void ibnd_iter_nodes_type(ibnd_fabric_t *fabric, ibnd_iter_func_t func, ibnd_node_type_t type, void *user_data)"
+
+.SH "DESCRIPTION"
+.B ibnd_iter_nodes()
+Iterate through all the nodes in the fabric and call "func" on each of them.
+.B ibnd_iter_nodes_type()
+Same as ibnd_iter_nodes(), but limits the iteration to nodes of the specified type.
+
+.SH "RETURN VALUE"
+.B ibnd_iter_nodes(), ibnd_iter_nodes_type()
+NONE
+
+.SH "AUTHORS"
+.TP
+Ira Weiny <weiny2 at llnl.gov>
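The callback convention described in this man page (a function taking the node plus an opaque user_data pointer) can be sketched without the library. The types below are stand-ins invented for illustration, not the real ibnd_node_t or ibnd_fabric_t, and the list walk is only an assumption about how such an iterator might be structured.

```c
#include <stddef.h>

/* Stand-in types: NOT the real libibnetdisc structures. */
typedef enum { DEMO_CA_NODE = 1, DEMO_SWITCH_NODE = 2 } demo_node_type_t;

typedef struct demo_node {
	struct demo_node *next;
	demo_node_type_t type;
} demo_node_t;

/* Same shape as ibnd_iter_func_t: node plus caller-supplied context. */
typedef void (*demo_iter_func_t)(demo_node_t *node, void *user_data);

/* Call func on every node in the list. */
static void demo_iter_nodes(demo_node_t *head, demo_iter_func_t func,
			    void *user_data)
{
	demo_node_t *n;

	for (n = head; n; n = n->next)
		func(n, user_data);
}

/* Call func only on nodes whose type matches. */
static void demo_iter_nodes_type(demo_node_t *head, demo_iter_func_t func,
				 demo_node_type_t type, void *user_data)
{
	demo_node_t *n;

	for (n = head; n; n = n->next)
		if (n->type == type)
			func(n, user_data);
}

/* Example callback: user_data points at an int counter. */
static void demo_count_cb(demo_node_t *node, void *user_data)
{
	(void)node;
	++*(int *)user_data;
}

/* Build a three-node list (CA, switch, switch) and count the visits. */
static int demo_count(int only_switches)
{
	demo_node_t c = { NULL, DEMO_SWITCH_NODE };
	demo_node_t b = { &c, DEMO_SWITCH_NODE };
	demo_node_t a = { &b, DEMO_CA_NODE };
	int count = 0;

	if (only_switches)
		demo_iter_nodes_type(&a, demo_count_cb, DEMO_SWITCH_NODE, &count);
	else
		demo_iter_nodes(&a, demo_count_cb, &count);
	return count;
}
```

Passing state through user_data rather than a global keeps the callback reentrant, which is presumably why the API carries that extra pointer.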
diff --git a/libibnetdisc/man/ibnd_iter_nodes_type.3 b/libibnetdisc/man/ibnd_iter_nodes_type.3
new file mode 100644
index 0000000..878547c
--- /dev/null
+++ b/libibnetdisc/man/ibnd_iter_nodes_type.3
@@ -0,0 +1,2 @@
+.\".TH IBND_ITER_NODES_TYPE 3  "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.so man3/ibnd_iter_nodes.3
diff --git a/libibnetdisc/man/ibnd_linkspeed_str.3 b/libibnetdisc/man/ibnd_linkspeed_str.3
new file mode 100644
index 0000000..128cd3e
--- /dev/null
+++ b/libibnetdisc/man/ibnd_linkspeed_str.3
@@ -0,0 +1,2 @@
+.\".TH IBND_LINKSPEED_STR 3  "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.so man3/ibnd_linkwidth_str.3
diff --git a/libibnetdisc/man/ibnd_linkstate_str.3 b/libibnetdisc/man/ibnd_linkstate_str.3
new file mode 100644
index 0000000..2fa9189
--- /dev/null
+++ b/libibnetdisc/man/ibnd_linkstate_str.3
@@ -0,0 +1,2 @@
+.\".TH IBND_LINKSTATE_STR 3  "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.so man3/ibnd_linkwidth_str.3
diff --git a/libibnetdisc/man/ibnd_linkwidth_str.3 b/libibnetdisc/man/ibnd_linkwidth_str.3
new file mode 100644
index 0000000..2cd4f0a
--- /dev/null
+++ b/libibnetdisc/man/ibnd_linkwidth_str.3
@@ -0,0 +1,26 @@
+.TH IBND_LINKWIDTH_STR 3  "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.SH "NAME"
+ibnd_linkwidth_str, ibnd_linkspeed_str, ibnd_linkstate_str, ibnd_physstate_str, ibnd_node_type_str \- pretty string functions.
+
+.SH "SYNOPSIS"
+.nf
+.B #include <infiniband/ibnetdisc.h>
+.sp
+.BI
+.BI "char          *ibnd_linkwidth_str(int link_width)"
+.BI "char          *ibnd_linkspeed_str(int link_speed)"
+.BI "char          *ibnd_linkstate_str(int link_state)"
+.BI "char          *ibnd_physstate_str(int phys_state)"
+.BI "const char    *ibnd_node_type_str(ibnd_node_t *node)"
+.BI "const char    *ibnd_node_type_str_short(ibnd_node_t *node)"
+
+.SH "DESCRIPTION"
+Return user readable strings for the values given.
+
+.BI "const char    *ibnd_node_type_str_short(ibnd_node_t *node)"
+Returns a shorter abbreviated version of the string.
+
+
+.SH "AUTHORS"
+.TP
+Ira Weiny <weiny2 at llnl.gov>
diff --git a/libibnetdisc/man/ibnd_node_type_str.3 b/libibnetdisc/man/ibnd_node_type_str.3
new file mode 100644
index 0000000..77dbf07
--- /dev/null
+++ b/libibnetdisc/man/ibnd_node_type_str.3
@@ -0,0 +1,2 @@
+.\".TH IBND_NODE_TYPE_STR 3  "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.so man3/ibnd_linkwidth_str.3
diff --git a/libibnetdisc/man/ibnd_node_type_str_short.3 b/libibnetdisc/man/ibnd_node_type_str_short.3
new file mode 100644
index 0000000..62feb6e
--- /dev/null
+++ b/libibnetdisc/man/ibnd_node_type_str_short.3
@@ -0,0 +1,2 @@
+.\".TH IBND_NODE_TYPE_STR_SHORT 3  "Aug 05, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.so man3/ibnd_linkwidth_str.3
diff --git a/libibnetdisc/man/ibnd_physstate_str.3 b/libibnetdisc/man/ibnd_physstate_str.3
new file mode 100644
index 0000000..aeeaeb7
--- /dev/null
+++ b/libibnetdisc/man/ibnd_physstate_str.3
@@ -0,0 +1,2 @@
+.\".TH IBND_PHYSSTATE_STR 3  "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.so man3/ibnd_linkwidth_str.3
diff --git a/libibnetdisc/man/ibnd_update_node.3 b/libibnetdisc/man/ibnd_update_node.3
new file mode 100644
index 0000000..d3aa206
--- /dev/null
+++ b/libibnetdisc/man/ibnd_update_node.3
@@ -0,0 +1,21 @@
+.TH IBND_UPDATE_NODE 3  "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual"
+.SH "NAME"
+ibnd_update_node \- Update the node specified with new data from the fabric.
+
+.SH "SYNOPSIS"
+.nf
+.B #include <infiniband/ibnetdisc.h>
+.sp
+.BI "ibnd_node_t *ibnd_update_node(ibnd_node_t *node)"
+
+.SH "DESCRIPTION"
+.B ibnd_update_node()
+Update the node info, port info, and node description of the node specified.
+
+.SH "RETURN VALUE"
+.B ibnd_update_node()
+Return NULL on failure, otherwise a valid ibnd_node_t object which is part of the fabric object.
+
+.SH "AUTHORS"
+.TP
+Ira Weiny <weiny2 at llnl.gov>
diff --git a/libibnetdisc/src/chassis.c b/libibnetdisc/src/chassis.c
new file mode 100644
index 0000000..5f9c073
--- /dev/null
+++ b/libibnetdisc/src/chassis.c
@@ -0,0 +1,820 @@
+/*
+ * Copyright (c) 2004-2007 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2007 Xsigo Systems Inc.  All rights reserved.
+ * Copyright (c) 2008 Lawrence Livermore National Lab.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*========================================================*/
+/*               FABRIC SCANNER SPECIFIC DATA             */
+/*========================================================*/
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#include <stdint.h>
+#include <stdlib.h>
+#include <inttypes.h>
+
+#include <infiniband/common.h>
+#include <infiniband/mad.h>
+
+#include "chassis.h"
+
+static char *ChassisTypeStr[5] = { "", "ISR9288", "ISR9096", "ISR2012", "ISR2004" };
+static char *ChassisSlotTypeStr[4] = { "", "Line", "Spine", "SRBD" };
+
+char *ibnd_get_chassis_type(ibnd_node_t *node)
+{
+	/* Currently, only if Voltaire chassis */
+	if (node->info.vendid != VTR_VENDOR_ID)
+		return (NULL);
+	if (!node->chrecord)
+		return (NULL);
+	if (node->chrecord->chassistype == UNRESOLVED_CT
+		|| node->chrecord->chassistype > ISR2004_CT)
+		return (NULL);
+	return ChassisTypeStr[node->chrecord->chassistype];
+}
+
+char *ibnd_get_chassis_slot_str(ibnd_node_t *node, char *str, size_t size)
+{
+	/* Currently, only if Voltaire chassis */
+	if (node->info.vendid != VTR_VENDOR_ID)
+		return (NULL);
+	if (!node->chrecord)
+		return (NULL);
+	if (node->chrecord->chassisslot == UNRESOLVED_CS
+		|| node->chrecord->chassisslot > SRBD_CS)
+		return (NULL);
+	if (!str)
+		return (NULL);
+	snprintf(str, size, "%s %d Chip %d",
+			ChassisSlotTypeStr[node->chrecord->chassisslot],
+			node->chrecord->slotnum,
+			node->chrecord->anafanum);
+	return (str);
+}
+
+static ibnd_chassis_list_t *find_chassisnum(ibnd_fabric_t *fabric, unsigned char chassisnum)
+{
+	ibnd_chassis_list_t *current;
+
+	for (current = fabric->first_chassis; current; current = current->next) {
+		if (current->chassisnum == chassisnum)
+			return current;
+	}
+
+	return NULL;
+}
+
+static uint64_t topspin_chassisguid(uint64_t guid)
+{
+	/* Byte 3 in system image GUID is chassis type, and */
+	/* Byte 4 is location ID (slot) so just mask off byte 4 */
+	return guid & 0xffffffff00ffffffULL;
+}
+
+int ibnd_is_xsigo_guid(uint64_t guid)
+{
+	if ((guid & 0xffffff0000000000ULL) == 0x0013970000000000ULL)
+		return 1;
+	else
+		return 0;
+}
+
+static int is_xsigo_leafone(uint64_t guid)
+{
+	if ((guid & 0xffffffffff000000ULL) == 0x0013970102000000ULL)
+		return 1;
+	else
+		return 0;
+}
+
+int ibnd_is_xsigo_hca(uint64_t guid)
+{
+	/* NodeType 2 is HCA */
+	if ((guid & 0xffffffff00000000ULL) == 0x0013970200000000ULL)
+		return 1;
+	else
+		return 0;
+}
+
+int ibnd_is_xsigo_tca(uint64_t guid)
+{
+	/* NodeType 3 is TCA */
+	if ((guid & 0xffffffff00000000ULL) == 0x0013970300000000ULL)
+		return 1;
+	else
+		return 0;
+}
+
+static int is_xsigo_ca(uint64_t guid)
+{
+	if (ibnd_is_xsigo_hca(guid) || ibnd_is_xsigo_tca(guid))
+		return 1;
+	else
+		return 0;
+}
+
+static int is_xsigo_switch(uint64_t guid)
+{
+	if ((guid & 0xffffffff00000000ULL) == 0x0013970100000000ULL)
+		return 1;
+	else
+		return 0;
+}
+
+static uint64_t xsigo_chassisguid(ibnd_node_t *node)
+{
+	if (!is_xsigo_ca(node->info.sysimgguid)) {
+		/* Byte 3 is NodeType and byte 4 is PortType */
+		/* If NodeType is 1 (switch), PortType is masked */
+		if (is_xsigo_switch(node->info.sysimgguid))
+			return node->info.sysimgguid & 0xffffffff00ffffffULL;
+		else
+			return node->info.sysimgguid;
+	} else {
+		if (!node->ports || !node->ports[1])
+			return (0);
+
+		/* Is there a peer port ? */
+		if (!node->ports[1]->remoteport)
+			return node->info.sysimgguid;
+
+		/* If peer port is Leaf 1, use its chassis GUID */
+		if (is_xsigo_leafone(node->ports[1]->remoteport->node->info.sysimgguid))
+			return node->ports[1]->remoteport->node->info.sysimgguid &
+			       0xffffffff00ffffffULL;
+		else
+			return node->info.sysimgguid;
+	}
+}
+
+static uint64_t get_chassisguid(ibnd_node_t *node)
+{
+	if (node->info.vendid == TS_VENDOR_ID || node->info.vendid == SS_VENDOR_ID)
+		return topspin_chassisguid(node->info.sysimgguid);
+	else if (node->info.vendid == XS_VENDOR_ID || ibnd_is_xsigo_guid(node->info.sysimgguid))
+		return xsigo_chassisguid(node);
+	else
+		return node->info.sysimgguid;
+}
+
+static ibnd_chassis_list_t *find_chassisguid(ibnd_node_t *node)
+{
+	ibnd_chassis_list_t *current;
+	uint64_t chguid;
+
+	chguid = get_chassisguid(node);
+	for (current = node->fabric->first_chassis; current; current = current->next) {
+		if (current->chassisguid == chguid)
+			return current;
+	}
+
+	return NULL;
+}
+
+uint64_t ibnd_get_chassis_guid(ibnd_fabric_t *fabric, unsigned char chassisnum)
+{
+	ibnd_chassis_list_t *chassis;
+
+	chassis = find_chassisnum(fabric, chassisnum);
+	if (chassis)
+		return chassis->chassisguid;
+	else
+		return 0;
+}
+
+static int is_router(ibnd_node_t *node)
+{
+	return (node->info.devid == VTR_DEVID_IB_FC_ROUTER ||
+		node->info.devid == VTR_DEVID_IB_IP_ROUTER);
+}
+
+static int is_spine_9096(ibnd_node_t *node)
+{
+	return (node->info.devid == VTR_DEVID_SFB4 ||
+		node->info.devid == VTR_DEVID_SFB4_DDR);
+}
+
+static int is_spine_9288(ibnd_node_t *node)
+{
+	return (node->info.devid == VTR_DEVID_SFB12 ||
+		node->info.devid == VTR_DEVID_SFB12_DDR);
+}
+
+static int is_spine_2004(ibnd_node_t *node)
+{
+	return (node->info.devid == VTR_DEVID_SFB2004);
+}
+
+static int is_spine_2012(ibnd_node_t *node)
+{
+	return (node->info.devid == VTR_DEVID_SFB2012);
+}
+
+static int is_spine(ibnd_node_t *node)
+{
+	return (is_spine_9096(node) || is_spine_9288(node) ||
+		is_spine_2004(node) || is_spine_2012(node));
+}
+
+static int is_line_24(ibnd_node_t *node)
+{
+	return (node->info.devid == VTR_DEVID_SLB24 ||
+		node->info.devid == VTR_DEVID_SLB24_DDR);
+}
+
+static int is_line_8(ibnd_node_t *node)
+{
+	return (node->info.devid == VTR_DEVID_SLB8);
+}
+
+static int is_line_2024(ibnd_node_t *node)
+{
+	return (node->info.devid == VTR_DEVID_SLB2024);
+}
+
+static int is_line(ibnd_node_t *node)
+{
+	return (is_line_24(node) || is_line_8(node) || is_line_2024(node));
+}
+
+int is_chassis_switch(ibnd_node_t *node)
+{
+    return (is_spine(node) || is_line(node));
+}
+
+/* these structs help find Line (Anafa) slot number while using spine portnum */
+int line_slot_2_sfb4[25]        = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 };
+int anafa_line_slot_2_sfb4[25]  = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 };
+int line_slot_2_sfb12[25]       = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 };
+int anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
+
+/* IPR FCR modules connectivity while using sFB4 port as reference */
+int ipr_slot_2_sfb4_port[25]    = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 };
+
+/* these structs help find Spine (Anafa) slot number while using spine portnum */
+int spine12_slot_2_slb[25]      = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+int anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+int spine4_slot_2_slb[25]       = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+int anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
+/*	reference                     { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */
+
+static void get_sfb_slot(ibnd_node_t *node, ibnd_port_t *lineport)
+{
+	ibnd_chassis_record_t *ch = node->chrecord;
+
+	ch->chassisslot = SPINE_CS;
+	if (is_spine_9096(node)) {
+		ch->chassistype = ISR9096_CT;
+		ch->slotnum = spine4_slot_2_slb[lineport->portnum];
+		ch->anafanum = anafa_spine4_slot_2_slb[lineport->portnum];
+	} else if (is_spine_9288(node)) {
+		ch->chassistype = ISR9288_CT;
+		ch->slotnum = spine12_slot_2_slb[lineport->portnum];
+		ch->anafanum = anafa_spine12_slot_2_slb[lineport->portnum];
+	} else if (is_spine_2012(node)) {
+		ch->chassistype = ISR2012_CT;
+		ch->slotnum = spine12_slot_2_slb[lineport->portnum];
+		ch->anafanum = anafa_spine12_slot_2_slb[lineport->portnum];
+	} else if (is_spine_2004(node)) {
+		ch->chassistype = ISR2004_CT;
+		ch->slotnum = spine4_slot_2_slb[lineport->portnum];
+		ch->anafanum = anafa_spine4_slot_2_slb[lineport->portnum];
+	} else {
+		IBPANIC("Unexpected node found: guid 0x%016" PRIx64,
+		node->info.nodeguid);
+	}
+}
+
+static void get_router_slot(ibnd_node_t *node, ibnd_port_t *spineport)
+{
+	ibnd_chassis_record_t *ch = node->chrecord;
+	int guessnum = 0;
+
+	if (!ch) {
+		if (!(node->chrecord = calloc(1, sizeof(ibnd_chassis_record_t))))
+			IBPANIC("out of mem");
+		ch = node->chrecord;
+	}
+
+	ch->chassisslot = SRBD_CS;
+	if (is_spine_9096(spineport->node)) {
+		ch->chassistype = ISR9096_CT;
+		ch->slotnum = line_slot_2_sfb4[spineport->portnum];
+		ch->anafanum = ipr_slot_2_sfb4_port[spineport->portnum];
+	} else if (is_spine_9288(spineport->node)) {
+		ch->chassistype = ISR9288_CT;
+		ch->slotnum = line_slot_2_sfb12[spineport->portnum];
+		/* this is a smart guess based on nodeguids order on sFB-12 module */
+		guessnum = spineport->node->info.nodeguid % 4;
+		/* module 1 <--> remote anafa 3 */
+		/* module 2 <--> remote anafa 2 */
+		/* module 3 <--> remote anafa 1 */
+		ch->anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 3 : 2));
+	} else if (is_spine_2012(spineport->node)) {
+		ch->chassistype = ISR2012_CT;
+		ch->slotnum = line_slot_2_sfb12[spineport->portnum];
+		/* this is a smart guess based on nodeguids order on sFB-12 module */
+		guessnum = spineport->node->info.nodeguid % 4;
+		/* module 1 <--> remote anafa 3 */
+		/* module 2 <--> remote anafa 2 */
+		/* module 3 <--> remote anafa 1 */
+		ch->anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 3 : 2));
+	} else if (is_spine_2004(spineport->node)) {
+		ch->chassistype = ISR2004_CT;
+		ch->slotnum = line_slot_2_sfb4[spineport->portnum];
+		ch->anafanum = ipr_slot_2_sfb4_port[spineport->portnum];
+	} else {
+		IBPANIC("Unexpected node found: guid 0x%016" PRIx64,
+		spineport->node->info.nodeguid);
+	}
+}
+
+static void get_slb_slot(ibnd_chassis_record_t *ch, ibnd_port_t *spineport)
+{
+	ch->chassisslot = LINE_CS;
+	if (is_spine_9096(spineport->node)) {
+		ch->chassistype = ISR9096_CT;
+		ch->slotnum = line_slot_2_sfb4[spineport->portnum];
+		ch->anafanum = anafa_line_slot_2_sfb4[spineport->portnum];
+	} else if (is_spine_9288(spineport->node)) {
+		ch->chassistype = ISR9288_CT;
+		ch->slotnum = line_slot_2_sfb12[spineport->portnum];
+		ch->anafanum = anafa_line_slot_2_sfb12[spineport->portnum];
+	} else if (is_spine_2012(spineport->node)) {
+		ch->chassistype = ISR2012_CT;
+		ch->slotnum = line_slot_2_sfb12[spineport->portnum];
+		ch->anafanum = anafa_line_slot_2_sfb12[spineport->portnum];
+	} else if (is_spine_2004(spineport->node)) {
+		ch->chassistype = ISR2004_CT;
+		ch->slotnum = line_slot_2_sfb4[spineport->portnum];
+		ch->anafanum = anafa_line_slot_2_sfb4[spineport->portnum];
+	} else {
+		IBPANIC("Unexpected node found: guid 0x%016" PRIx64,
+		spineport->node->info.nodeguid);
+	}
+}
+
+/* forward declare this */
+static void voltaire_portmap(ibnd_port_t *port);
+/*
+	This function is called for every Voltaire node in the fabric.
+	It could be optimized, but the time overhead is very small
+	and this is only a diagnostic utility.
+*/
+static void fill_voltaire_chassis_record(ibnd_node_t *node)
+{
+	int p = 0;
+	ibnd_port_t *port;
+	ibnd_node_t *remnode = 0;
+	ibnd_chassis_record_t *ch = 0;
+
+	if (node->chrecord) /* somehow this node has already been passed */
+		return;
+
+	if (!(node->chrecord = calloc(1, sizeof(ibnd_chassis_record_t))))
+		IBPANIC("out of mem");
+
+	ch = node->chrecord;
+
+	/* node is router only in case of using unique lid */
+	/* (which is lid of chassis router port) */
+	/* in such case node->ports is actually a requested port... */
+	if (is_router(node)) {
+		/* find the remote node */
+		for (p = 1; p <= node->info.numports; p++) {
+			port = node->ports[p];
+			if (port && port->remoteport && is_spine(port->remoteport->node))
+				get_router_slot(node, port->remoteport);
+		}
+	} else if (is_spine(node)) {
+		for (p = 1; p <= node->info.numports; p++) {
+			port = node->ports[p];
+			if (!port || !port->remoteport)
+				continue;
+			remnode = port->remoteport->node;
+			if (remnode->info.type != IBND_SWITCH_NODE) {
+				if (!remnode->chrecord)
+					get_router_slot(remnode, port);
+				continue;
+			}
+			if (!ch->chassistype)
+				/* we assume here that remoteport belongs to line */
+				get_sfb_slot(node, port->remoteport);
+
+				/* we could break here, but need to find if more routers connected */
+		}
+
+	} else if (is_line(node)) {
+		for (p = 1; p <= node->info.numports; p++) {
+			port = node->ports[p];
+			if (!port || port->portnum > 12 || !port->remoteport)
+				continue;
+			/* we assume here that remoteport belongs to spine */
+			get_slb_slot(ch, port->remoteport);
+			break;
+		}
+	}
+
+	/* for each port of this node, map external ports */
+	for (p = 1; p <= node->info.numports; p++) {
+		port = node->ports[p];
+		if (!port)
+			continue;
+		voltaire_portmap(port);
+	}
+
+	return;
+}
+
+static int get_line_index(ibnd_node_t *node)
+{
+	int retval = 3 * (node->chrecord->slotnum - 1) + node->chrecord->anafanum;
+
+	if (retval > LINES_MAX_NUM || retval < 1)
+		IBPANIC("Internal error");
+	return retval;
+}
+
+static int get_spine_index(ibnd_node_t *node)
+{
+	int retval;
+
+	if (is_spine_9288(node) || is_spine_2012(node))
+		retval = 3 * (node->chrecord->slotnum - 1) + node->chrecord->anafanum;
+	else
+		retval = node->chrecord->slotnum;
+
+	if (retval > SPINES_MAX_NUM || retval < 1)
+		IBPANIC("Internal error");
+	return retval;
+}
+
+static void insert_line_router(ibnd_node_t *node, ibnd_chassis_list_t *chassislist)
+{
+	int i = get_line_index(node);
+
+	if (chassislist->linenode[i])
+		return;		/* already filled slot */
+
+	chassislist->linenode[i] = node;
+	node->chrecord->chassisnum = chassislist->chassisnum;
+}
+
+static void insert_spine(ibnd_node_t *node, ibnd_chassis_list_t *chassislist)
+{
+	int i = get_spine_index(node);
+
+	if (chassislist->spinenode[i])
+		return;		/* already filled slot */
+
+	chassislist->spinenode[i] = node;
+	node->chrecord->chassisnum = chassislist->chassisnum;
+}
+
+static void pass_on_lines_catch_spines(ibnd_chassis_list_t *chassislist)
+{
+	ibnd_node_t *node, *remnode;
+	ibnd_port_t *port;
+	int i, p;
+
+	for (i = 1; i <= LINES_MAX_NUM; i++) {
+		node = chassislist->linenode[i];
+
+		if (!(node && is_line(node)))
+			continue;	/* empty slot or router */
+
+		for (p = 1; p <= node->info.numports; p++) {
+			port = node->ports[p];
+			if (!port || port->portnum > 12 || !port->remoteport)
+				continue;
+
+			remnode = port->remoteport->node;
+
+			if (!remnode->chrecord)
+				continue;	/* some error - spine not initialized ? FIXME */
+			insert_spine(remnode, chassislist);
+		}
+	}
+}
+
+static void pass_on_spines_catch_lines(ibnd_chassis_list_t *chassislist)
+{
+	ibnd_node_t *node, *remnode;
+	ibnd_port_t *port;
+	int i, p;
+
+	for (i = 1; i <= SPINES_MAX_NUM; i++) {
+		node = chassislist->spinenode[i];
+		if (!node)
+			continue;	/* empty slot */
+		for (p = 1; p <= node->info.numports; p++) {
+			port = node->ports[p];
+			if (!port || !port->remoteport)
+				continue;
+			remnode = port->remoteport->node;
+
+			if (!remnode->chrecord)
+				continue;	/* some error - line/router not initialized ? FIXME */
+			insert_line_router(remnode, chassislist);
+		}
+	}
+}
+
+/*
+	Stupid interpolation algorithm...
+	But nothing to do - have to be compliant with VoltaireSM/NMS
+*/
+static void pass_on_spines_interpolate_chguid(ibnd_chassis_list_t *chassislist)
+{
+	ibnd_node_t *node;
+	int i;
+
+	for (i = 1; i <= SPINES_MAX_NUM; i++) {
+		node = chassislist->spinenode[i];
+		if (!node)
+			continue;	/* skip the empty slots */
+
+		/* take first guid minus one to be consistent with SM */
+		chassislist->chassisguid = node->info.nodeguid - 1;
+		break;
+	}
+}
+
+/*
+	This function fills chassislist structure with all nodes
+	in that chassis
+	chassislist structure = structure of one standalone chassis
+*/
+static void build_chassis(ibnd_node_t *node, ibnd_chassis_list_t *chassislist)
+{
+	int p = 0;
+	ibnd_node_t *remnode = 0;
+	ibnd_port_t *port = 0;
+
+	/* we get here with node = chassis_spine */
+	chassislist->chassistype = node->chrecord->chassistype;
+	insert_spine(node, chassislist);
+
+	/* loop: pass on all ports of node */
+	for (p = 1; p <= node->info.numports; p++ ) {
+		port = node->ports[p];
+		if (!port || !port->remoteport)
+			continue;
+		remnode = port->remoteport->node;
+
+		if (!remnode->chrecord)
+			continue; /* some error - line or router not initialized ? FIXME */
+
+		insert_line_router(remnode, chassislist);
+	}
+
+	pass_on_lines_catch_spines(chassislist);
+	/* this pass is needed to catch routers, since routers are connected only */
+	/* to spines in slot 1 or 4 and we could miss them the first time */
+	pass_on_spines_catch_lines(chassislist);
+
+	/* 2 additional passes are needed to overcome the problem of pure "in-chassis" */
+	/* connectivity - an extra pass ensures that all related chips/modules */
+	/* are inserted into the chassislist */
+	pass_on_lines_catch_spines(chassislist);
+	pass_on_spines_catch_lines(chassislist);
+	pass_on_spines_interpolate_chguid(chassislist);
+}
+
+/*========================================================*/
+/*                INTERNAL TO EXTERNAL PORT MAPPING       */
+/*========================================================*/
+
+/*
+Description : On ISR9288/9096 the external port numbering
+              does not match the internal (anafa) port
+              numbering. Use this MAP to translate the data you get from
+              the OpenIB diagnostics (smpquery, ibroute, ibtracert, etc.)
+
+
+Module : sLB-24
+                anafa 1             anafa 2
+ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24
+int port | 22 23 24 18 17 16 | 22 23 24 18 17 16
+ext port | 1  2  3  4  5  6  | 7  8  9  10 11 12
+int port | 19 20 21 15 14 13 | 19 20 21 15 14 13
+------------------------------------------------
+
+Module : sLB-8
+                anafa 1             anafa 2
+ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24
+int port | 24 23 22 18 17 16 | 24 23 22 18 17 16
+ext port | 1  2  3  4  5  6  | 7  8  9  10 11 12
+int port | 21 20 19 15 14 13 | 21 20 19 15 14 13
+
+----------->
+                anafa 1             anafa 2
+ext port | -  -  5  -  -  6  | -  -  7  -  -  8
+int port | 24 23 22 18 17 16 | 24 23 22 18 17 16
+ext port | -  -  1  -  -  2  | -  -  3  -  -  4
+int port | 21 20 19 15 14 13 | 21 20 19 15 14 13
+------------------------------------------------
+
+Module : sLB-2024
+
+ext port | 13 14 15 16 17 18 19 20 21 22 23 24
+A1 int port| 13 14 15 16 17 18 19 20 21 22 23 24
+ext port | 1 2 3 4 5 6 7 8 9 10 11 12
+A2 int port| 13 14 15 16 17 18 19 20 21 22 23 24
+---------------------------------------------------
+
+*/
+
+int int2ext_map_slb24[2][25] = {
+					{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 5, 4, 18, 17, 16, 1, 2, 3, 13, 14, 15 },
+					{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 11, 10, 24, 23, 22, 7, 8, 9, 19, 20, 21 }
+				};
+int int2ext_map_slb8[2][25] = {
+					{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 6, 6, 6, 1, 1, 1, 5, 5, 5 },
+					{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 8, 8, 8, 3, 3, 3, 7, 7, 7 }
+				};
+int int2ext_map_slb2024[2][25] = {
+					{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 },
+					{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 }
+				};
+/*	reference			{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */
+
+/* map internal ports to external ports if appropriate */
+static void
+voltaire_portmap(ibnd_port_t *port)
+{
+	ibnd_chassis_record_t *ch = port->node->chrecord;
+	int portnum = port->portnum;
+	int chipnum = 0;
+	ibnd_node_t *node = port->node;
+
+	if (!ch || !is_line(node) || (portnum < 13 || portnum > 24)) {
+		port->ext_portnum = 0;
+		return;
+	}
+
+	if (ch->anafanum < 1 || ch->anafanum > 2) {
+		port->ext_portnum = 0;
+		return;
+	}
+
+	chipnum = ch->anafanum - 1;
+
+	if (is_line_24(node))
+		port->ext_portnum = int2ext_map_slb24[chipnum][portnum];
+	else if (is_line_2024(node))
+		port->ext_portnum = int2ext_map_slb2024[chipnum][portnum];
+	else
+		port->ext_portnum = int2ext_map_slb8[chipnum][portnum];
+}
+
+static void add_chassislist(ibnd_fabric_t *fabric)
+{
+	ibnd_chassis_list_t **tail = &fabric->first_chassis;
+
+	if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_list_t))))
+		IBPANIC("out of mem");
+
+	while (*tail)	/* append at the tail so the new chassis is linked in */
+		tail = &(*tail)->next;
+	*tail = fabric->current_chassis;
+}
+
+static void
+add_node_to_chassis(ibnd_chassis_list_t *chassis, ibnd_node_t *node)
+{
+	/*
+	 * Push the node onto the head of the chassis node list.
+	 */
+	node->chassis_next = chassis->nodes;
+	chassis->nodes = node;
+}
+
+/*
+	Main grouping function
+	Algorithm:
+	1. pass on every Voltaire node
+	2. catch spine chip for every Voltaire node
+		2.1 build/interpolate chassis around this chip
+		2.2 go to 1.
+	3. pass on non Voltaire nodes (SystemImageGUID based grouping)
+	4. now group non Voltaire nodes by SystemImageGUID
+*/
+ibnd_chassis_list_t *group_nodes(ibnd_fabric_t *fabric)
+{
+	ibnd_node_t *node;
+	int dist;
+	int chassisnum = 0;
+	ibnd_chassis_list_t *chassis;
+
+	fabric->first_chassis = NULL;
+	fabric->current_chassis = NULL;
+
+	/* first pass on switches and build for every Voltaire node */
+	/* an appropriate chassis record (slotnum and position) */
+	/* according to internal connectivity */
+	/* not very efficient but clear code so... */
+	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
+		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+			if (node->info.vendid == VTR_VENDOR_ID)
+				fill_voltaire_chassis_record(node);
+		}
+	}
+
+	/* separate every Voltaire chassis from each other and build linked list of them */
+	/* algorithm: catch spine and find all surrounding nodes */
+	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
+		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+			if (node->info.vendid != VTR_VENDOR_ID)
+				continue;
+			if (!node->chrecord || node->chrecord->chassisnum || !is_spine(node))
+				continue;
+			add_chassislist(fabric);
+			fabric->current_chassis->chassisnum = ++chassisnum;
+			build_chassis(node, fabric->current_chassis);
+		}
+	}
+
+	/* now make pass on nodes for chassis which are not Voltaire */
+	/* grouped by common SystemImageGUID */
+	for (dist = 0; dist <= fabric->maxhops_discovered; dist++) {
+		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+			if (node->info.vendid == VTR_VENDOR_ID)
+				continue;
+			if (node->info.sysimgguid) {
+				chassis = find_chassisguid(node);
+				if (chassis)
+					chassis->nodecount++;
+				else {
+					/* Possible new chassis */
+					add_chassislist(fabric);
+					fabric->current_chassis->chassisguid = get_chassisguid(node);
+					fabric->current_chassis->nodecount = 1;
+				}
+			}
+		}
+	}
+
+	/* now, make another pass to see which nodes are part of chassis */
+	/* (defined as chassis->nodecount > 1) */
+	for (dist = 0; dist <= MAXHOPS; ) {
+		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+			if (node->info.vendid == VTR_VENDOR_ID)
+				continue;
+			if (node->info.sysimgguid) {
+				chassis = find_chassisguid(node);
+				if (chassis && chassis->nodecount > 1) {
+					if (!chassis->chassisnum)
+						chassis->chassisnum = ++chassisnum;
+					if (!node->chrecord) {
+						if (!(node->chrecord =
+						calloc(1,
+						sizeof(ibnd_chassis_record_t))))
+							IBPANIC("out of mem");
+						node->chrecord->chassisnum = chassis->chassisnum;
+						add_node_to_chassis(chassis, node);
+					}
+				}
+			}
+		}
+		if (dist == fabric->maxhops_discovered)
+			dist = MAXHOPS;	/* skip to CAs */
+		else
+			dist++;
+	}
+
+	return (fabric->first_chassis);
+}
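
Reviewer note: the internal-to-external port tables in chassis.c above can
be checked in isolation. The following is only a minimal sketch, assuming
the sLB-24 table is copied verbatim from the patch; the helper name
slb24_ext_port() is hypothetical and is not part of libibnetdisc. It applies
the same range checks as voltaire_portmap() (anafa number 1 or 2, internal
port 13 to 24) and returns 0 when no external port applies.

```c
/* Internal-to-external map for one sLB-24 line board, copied from the
 * int2ext_map_slb24 table in the patch.
 * Row = anafa chip (0 or 1), column = internal port number. */
static const int slb24_map[2][25] = {
	{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
	  6, 5, 4, 18, 17, 16, 1, 2, 3, 13, 14, 15 },
	{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
	  12, 11, 10, 24, 23, 22, 7, 8, 9, 19, 20, 21 }
};

/* Hypothetical helper (not in the patch): translate an internal port of
 * anafa chip 1 or 2 to the external port, or return 0 if out of range. */
int slb24_ext_port(int anafanum, int int_port)
{
	if (anafanum < 1 || anafanum > 2)
		return 0;
	if (int_port < 13 || int_port > 24)
		return 0;
	return slb24_map[anafanum - 1][int_port];
}
```

For example, internal port 19 on anafa 1 is external port 1, which matches
the "ext port / int port" diagram in the comment block of chassis.c.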
diff --git a/libibnetdisc/src/chassis.h b/libibnetdisc/src/chassis.h
new file mode 100644
index 0000000..ea271d0
--- /dev/null
+++ b/libibnetdisc/src/chassis.h
@@ -0,0 +1,82 @@
+/*
+ * Copyright (c) 2004-2007 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2007 Xsigo Systems Inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef _CHASSIS_H_
+#define _CHASSIS_H_
+
+#include <infiniband/ibnetdisc.h>
+
+/*========================================================*/
+/*                CHASSIS RECOGNITION SPECIFIC DATA       */
+/*========================================================*/
+
+/* Device IDs */
+#define VTR_DEVID_IB_FC_ROUTER		0x5a00
+#define VTR_DEVID_IB_IP_ROUTER		0x5a01
+#define VTR_DEVID_ISR9600_SPINE		0x5a02
+#define VTR_DEVID_ISR9600_LEAF		0x5a03
+#define VTR_DEVID_HCA1			0x5a04
+#define VTR_DEVID_HCA2			0x5a44
+#define VTR_DEVID_HCA3			0x6278
+#define VTR_DEVID_SW_6IB4		0x5a05
+#define VTR_DEVID_ISR9024		0x5a06
+#define VTR_DEVID_ISR9288		0x5a07
+#define VTR_DEVID_SLB24			0x5a09
+#define VTR_DEVID_SFB12			0x5a08
+#define VTR_DEVID_SFB4			0x5a0b
+#define VTR_DEVID_ISR9024_12		0x5a0c
+#define VTR_DEVID_SLB8			0x5a0d
+#define VTR_DEVID_RLX_SWITCH_BLADE	0x5a20
+#define VTR_DEVID_ISR9024_DDR		0x5a31
+#define VTR_DEVID_SFB12_DDR		0x5a32
+#define VTR_DEVID_SFB4_DDR		0x5a33
+#define VTR_DEVID_SLB24_DDR		0x5a34
+#define VTR_DEVID_SFB2012		0x5a37
+#define VTR_DEVID_SLB2024		0x5a38
+#define VTR_DEVID_ISR2012		0x5a39
+#define VTR_DEVID_SFB2004		0x5a40
+#define VTR_DEVID_ISR2004		0x5a41
+
+/* Vendor IDs (for chassis based systems) */
+#define VTR_VENDOR_ID			0x8f1	/* Voltaire */
+#define TS_VENDOR_ID			0x5ad	/* Cisco */
+#define SS_VENDOR_ID			0x66a	/* InfiniCon */
+#define XS_VENDOR_ID			0x1397	/* Xsigo */
+
+enum ibnd_chassis_type { UNRESOLVED_CT, ISR9288_CT, ISR9096_CT, ISR2012_CT, ISR2004_CT };
+enum ibnd_chassis_slot_type { UNRESOLVED_CS, LINE_CS, SPINE_CS, SRBD_CS };
+
+ibnd_chassis_list_t *group_nodes(ibnd_fabric_t *fabric);
+
+#endif	/* _CHASSIS_H_ */
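
Reviewer note: group_nodes(), declared in chassis.h above, groups
non-Voltaire nodes by SystemImageGUID and treats a GUID as a chassis only
when more than one node reports it. The sketch below illustrates that rule
only. It is not the patch's implementation: the struct and function names
are hypothetical, it works on a flat array instead of the fabric's distance
lists, it compares the raw GUID rather than the value returned by
get_chassisguid(), and it recounts on each step for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a discovered node. */
struct sk_node {
	uint64_t sysimgguid;	/* 0 means "not reported" */
	int chassisnum;		/* 0 means "not part of a chassis" */
};

/* Count how many nodes share the given SystemImageGUID. */
static size_t sk_count(const struct sk_node *nodes, size_t n, uint64_t guid)
{
	size_t i, cnt = 0;

	for (i = 0; i < n; i++)
		if (nodes[i].sysimgguid == guid)
			cnt++;
	return cnt;
}

/* Assign a chassis number to every node whose SystemImageGUID is shared
 * by at least two nodes. Returns the number of chassis found. */
int sk_group(struct sk_node *nodes, size_t n)
{
	int chassisnum = 0;
	size_t i, j;

	for (i = 0; i < n; i++)
		nodes[i].chassisnum = 0;

	for (i = 0; i < n; i++) {
		if (!nodes[i].sysimgguid || nodes[i].chassisnum)
			continue;	/* no GUID, or already grouped */
		if (sk_count(nodes, n, nodes[i].sysimgguid) < 2)
			continue;	/* a lone node is not a chassis */
		chassisnum++;
		for (j = i; j < n; j++)
			if (nodes[j].sysimgguid == nodes[i].sysimgguid)
				nodes[j].chassisnum = chassisnum;
	}
	return chassisnum;
}
```

The patch reaches the same result with a per-chassis nodecount kept in the
chassis list, which avoids the repeated counting done here.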
diff --git a/libibnetdisc/src/ibnetdisc.c b/libibnetdisc/src/ibnetdisc.c
new file mode 100644
index 0000000..3f4901a
--- /dev/null
+++ b/libibnetdisc/src/ibnetdisc.c
@@ -0,0 +1,863 @@
+/*
+ * Copyright (c) 2004-2007 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2007 Xsigo Systems Inc.  All rights reserved.
+ * Copyright (c) 2008 Lawrence Livermore National Laboratory
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <time.h>
+#include <string.h>
+#include <getopt.h>
+#include <errno.h>
+#include <inttypes.h>
+
+#include <infiniband/common.h>
+#include <infiniband/umad.h>
+#include <infiniband/mad.h>
+
+#include <infiniband/ibnetdisc.h>
+#include <complib/cl_nodenamemap.h>
+
+#include "chassis.h"
+
+static int timeout_ms = 2000;
+static int show_progress = 0;
+
+static char *linkwidth_str[] = {
+	"??",
+	"1x",
+	"4x",
+	"??",
+	"8x",
+	"??",
+	"??",
+	"??",
+	"12x"
+};
+
+static char *linkspeed_str[] = {
+	"???",
+	"SDR",
+	"DDR",
+	"???",
+	"QDR"
+};
+
+static char *linkstate_str[] = {
+	"No State",
+	"Down",
+	"Init",
+	"Armed",
+	"Active"
+};
+
+static char *physstate_str[] = {
+	"No State",
+	"Sleep",
+	"Polling",
+	"Disabled",
+	"PortConfigTraining",
+	"LinkUp",
+	"LinkErrorRecovery",
+	"Phy Test"
+};
+
+char *
+ibnd_linkwidth_str(int link_width)
+{
+	if (link_width > 8)
+		return linkwidth_str[0];
+	else
+		return linkwidth_str[link_width];
+}
+
+char *
+ibnd_linkspeed_str(int link_speed)
+{
+	if (link_speed > 4)
+		return linkspeed_str[0];
+	else
+		return linkspeed_str[link_speed];
+}
+char *
+ibnd_linkstate_str(int link_state)
+{
+	if (link_state > 4)
+		return linkstate_str[0];
+	else
+		return linkstate_str[link_state];
+}
+
+char *
+ibnd_physstate_str(int phys_state)
+{
+	if (phys_state > 7)
+		return physstate_str[0];
+	else
+		return physstate_str[phys_state];
+}
+
+void
+decode_port_info(void * rcv_buf, ibnd_port_info_t *pi)
+{
+	mad_decode_field(rcv_buf, IB_PORT_LID_F, &pi->lid);
+	mad_decode_field(rcv_buf, IB_PORT_SMLID_F, &pi->smlid);
+
+	mad_decode_field(rcv_buf, IB_PORT_LINK_SPEED_SUPPORTED_F, &pi->link_speed_supported);
+	mad_decode_field(rcv_buf, IB_PORT_LINK_SPEED_ENABLED_F, &pi->link_speed_enabled);
+	mad_decode_field(rcv_buf, IB_PORT_LINK_SPEED_ACTIVE_F, &pi->link_speed_active);
+
+	mad_decode_field(rcv_buf, IB_PORT_LOCAL_PORT_F, &pi->local_port);
+	mad_decode_field(rcv_buf, IB_PORT_LINK_WIDTH_SUPPORTED_F, &pi->link_width_supported);
+	mad_decode_field(rcv_buf, IB_PORT_LINK_WIDTH_ENABLED_F, &pi->link_width_enabled);
+
+	mad_decode_field(rcv_buf, IB_PORT_LINK_WIDTH_ACTIVE_F, &pi->link_width_active);
+
+	mad_decode_field(rcv_buf, IB_PORT_DIAG_F, &pi->diag_code);
+	mad_decode_field(rcv_buf, IB_PORT_MKEY_LEASE_F, &pi->mkey_lease);
+	mad_decode_field(rcv_buf, IB_PORT_CAPMASK_F, &pi->capability_mask);
+	mad_decode_field(rcv_buf, IB_PORT_MKEY_F, &pi->mkey);
+	mad_decode_field(rcv_buf, IB_PORT_GID_PREFIX_F, &pi->gid_prefix);
+
+	mad_decode_field(rcv_buf, IB_PORT_STATE_F, &pi->link_state);
+	mad_decode_field(rcv_buf, IB_PORT_PHYS_STATE_F, &pi->phys_state);
+
+	mad_decode_field(rcv_buf, IB_PORT_LINK_DOWN_DEF_F, &pi->link_down_def_state);
+	mad_decode_field(rcv_buf, IB_PORT_MKEY_PROT_BITS_F, &pi->mkey_prot_bits);
+
+	mad_decode_field(rcv_buf, IB_PORT_LMC_F, &pi->lmc);
+	mad_decode_field(rcv_buf, IB_PORT_NEIGHBOR_MTU_F, &pi->neighbor_mtu);
+	mad_decode_field(rcv_buf, IB_PORT_SMSL_F, &pi->smsl);
+	mad_decode_field(rcv_buf, IB_PORT_INIT_TYPE_F, &pi->init_type);
+
+	mad_decode_field(rcv_buf, IB_PORT_VL_CAP_F, &pi->vl_capability);
+	mad_decode_field(rcv_buf, IB_PORT_VL_HIGH_LIMIT_F, &pi->vl_high_limit);
+	mad_decode_field(rcv_buf, IB_PORT_VL_ARBITRATION_HIGH_CAP_F, &pi->vl_arb_high_cap);
+	mad_decode_field(rcv_buf, IB_PORT_VL_ARBITRATION_LOW_CAP_F, &pi->vl_arb_low_cap);
+
+	mad_decode_field(rcv_buf, IB_PORT_INIT_TYPE_REPLY_F, &pi->init_reply);
+	mad_decode_field(rcv_buf, IB_PORT_MTU_CAP_F, &pi->mtu_cap);
+	mad_decode_field(rcv_buf, IB_PORT_VL_STALL_COUNT_F, &pi->vl_stall_count);
+	mad_decode_field(rcv_buf, IB_PORT_HOQ_LIFE_F, &pi->hoq_lifetime);
+	mad_decode_field(rcv_buf, IB_PORT_OPER_VLS_F, &pi->oper_vls);
+	mad_decode_field(rcv_buf, IB_PORT_PART_EN_INB_F, &pi->partition_enforce_in);
+	mad_decode_field(rcv_buf, IB_PORT_PART_EN_OUTB_F, &pi->partition_enforce_out);
+	mad_decode_field(rcv_buf, IB_PORT_FILTER_RAW_INB_F, &pi->filter_raw_in);
+	mad_decode_field(rcv_buf, IB_PORT_FILTER_RAW_OUTB_F, &pi->filter_raw_out);
+	mad_decode_field(rcv_buf, IB_PORT_MKEY_VIOL_F, &pi->mkey_violations);
+	mad_decode_field(rcv_buf, IB_PORT_PKEY_VIOL_F, &pi->pkey_violations);
+	mad_decode_field(rcv_buf, IB_PORT_QKEY_VIOL_F, &pi->qkey_violations);
+
+	mad_decode_field(rcv_buf, IB_PORT_GUID_CAP_F, &pi->guid_capabilities);
+
+	mad_decode_field(rcv_buf, IB_PORT_CLIENT_REREG_F, &pi->client_rereg);
+	mad_decode_field(rcv_buf, IB_PORT_SUBN_TIMEOUT_F, &pi->subnet_timeout);
+	mad_decode_field(rcv_buf, IB_PORT_RESP_TIME_VAL_F, &pi->response_time_val);
+	mad_decode_field(rcv_buf, IB_PORT_LOCAL_PHYS_ERR_F, &pi->local_phys_error);
+	mad_decode_field(rcv_buf, IB_PORT_OVERRUN_ERR_F, &pi->overrun_error);
+	mad_decode_field(rcv_buf, IB_PORT_MAX_CREDIT_HINT_F, &pi->max_credit_hint);
+	mad_decode_field(rcv_buf, IB_PORT_LINK_ROUND_TRIP_F, &pi->link_round_trip);
+}
+
+static int
+get_port_info(ibnd_fabric_t *fabric, ibnd_port_t *port, int portnum, ib_portid_t *portid)
+{
+	char portinfo[64];
+	void *pi = portinfo;
+
+	port->portnum = portnum;
+
+	if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout_ms,
+			fabric->ibmad_port))
+		return -1;
+
+	decode_port_info(pi, &port->info);
+
+	IBND_DEBUG("portid %s portnum %d: lid %d state %d physstate %d %s %s\n",
+		portid2str(portid), portnum, port->info.lid, port->info.link_state,
+		port->info.phys_state, ibnd_linkwidth_str(port->info.link_width_active),
+		ibnd_linkspeed_str(port->info.link_speed_active));
+	return 1;
+}
+
+static void
+decode_node_info(void * rcv_buf, ibnd_node_info_t *ni)
+{
+	mad_decode_field(rcv_buf, IB_NODE_BASE_VERS_F, &ni->base_ver);
+	mad_decode_field(rcv_buf, IB_NODE_CLASS_VERS_F, &ni->class_ver);
+	mad_decode_field(rcv_buf, IB_NODE_TYPE_F, &ni->type);
+	mad_decode_field(rcv_buf, IB_NODE_NPORTS_F, &ni->numports);
+	mad_decode_field(rcv_buf, IB_NODE_SYSTEM_GUID_F, &ni->sysimgguid);
+	mad_decode_field(rcv_buf, IB_NODE_GUID_F, &ni->nodeguid);
+	mad_decode_field(rcv_buf, IB_NODE_PORT_GUID_F, &ni->nodeportguid);
+	mad_decode_field(rcv_buf, IB_NODE_PARTITION_CAP_F, &ni->partition_cap);
+	mad_decode_field(rcv_buf, IB_NODE_DEVID_F, &ni->devid);
+	mad_decode_field(rcv_buf, IB_NODE_REVISION_F, &ni->revision);
+	mad_decode_field(rcv_buf, IB_NODE_LOCAL_PORT_F, &ni->localport);
+	mad_decode_field(rcv_buf, IB_NODE_VENDORID_F, &ni->vendid);
+}
+
+/*
+ * Returns -1 if error.
+ */
+static int
+query_node_info(ibnd_fabric_t *fabric, ibnd_node_t *node, ib_portid_t *portid)
+{
+	char nodeinfo[64];
+	void *ni = nodeinfo;
+	if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, timeout_ms,
+			fabric->ibmad_port))
+		return -1;
+	decode_node_info(ni, &(node->info));
+	return (0);
+}
+
+/*
+ * Returns 0 if non switch node is found, 1 if switch is found, -1 if error.
+ */
+static int
+query_node(ibnd_fabric_t *fabric, ibnd_node_t *node, ibnd_port_t *port, ib_portid_t *portid)
+{
+	char portinfo[64];
+	void *pi = portinfo;
+	char switchinfo[64];
+	void *si = switchinfo;
+	void *nd = node->nodedesc;
+
+	if (query_node_info(fabric, node, portid))
+		return -1;
+
+	port->portnum = node->info.localport;
+	port->guid = node->info.nodeportguid;
+
+	if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout_ms,
+			fabric->ibmad_port))
+		return -1;
+
+	if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, 0, timeout_ms,
+			fabric->ibmad_port))
+		return -1;
+	decode_port_info(pi, &port->info);
+
+	if (node->info.type != IBND_SWITCH_NODE)
+		return 0;
+
+	node->smalid = port->info.lid;
+	node->smalmc = port->info.lmc;
+
+	/* after we have the sma information find out the real PortInfo for this port */
+	if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, node->info.localport, timeout_ms,
+			fabric->ibmad_port))
+		return -1;
+	decode_port_info(pi, &port->info);
+
+	if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout_ms,
+			fabric->ibmad_port))
+		node->sw_info.smaenhsp0 = 0;	/* assume base SP0 */
+	else
+		mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->sw_info.smaenhsp0);
+
+	IBND_DEBUG("portid %s: got switch node %" PRIx64 " '%s'\n",
+	      portid2str(portid), node->info.nodeguid, node->nodedesc);
+	return 1;
+}
+
+static int
+add_port_to_dpath(ib_dr_path_t *path, int nextport)
+{
+	if (path->cnt+2 >= sizeof(path->p))
+		return -1;
+	++path->cnt;
+	path->p[path->cnt] = nextport;
+	return path->cnt;
+}
+
+static int
+extend_dpath(ibnd_fabric_t *fabric, ib_dr_path_t *path, int nextport)
+{
+	int rc = add_port_to_dpath(path, nextport);
+	if ((rc != -1) && (path->cnt > fabric->maxhops_discovered))
+		fabric->maxhops_discovered = path->cnt;
+	return (rc);
+}
+
+static void
+dump_endnode(ib_portid_t *path, char *prompt, ibnd_node_t *node, ibnd_port_t *port)
+{
+	if (!show_progress)
+		return;
+
+	printf("%s -> %s %s {%016" PRIx64 "} portnum %d lid %d-%d \"%s\"\n",
+		portid2str(path), prompt,
+		ibnd_node_type_str(node),
+		node->info.nodeguid, node->info.type == IBND_SWITCH_NODE ? 0 : port->portnum,
+		port->info.lid, port->info.lid + (1 << port->info.lmc) - 1,
+		node->nodedesc);
+}
+
+static ibnd_node_t *
+find_existing_node(ibnd_fabric_t *fabric, ibnd_node_t *new)
+{
+	int hash = HASHGUID(new->info.nodeguid) % HTSZ;
+	ibnd_node_t *node;
+
+	for (node = fabric->nodestbl[hash]; node; node = node->htnext)
+		if (node->info.nodeguid == new->info.nodeguid)
+			return node;
+
+	return NULL;
+}
+
+ibnd_node_t *
+ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid)
+{
+	int hash = HASHGUID(guid) % HTSZ;
+	ibnd_node_t *node;
+
+	for (node = fabric->nodestbl[hash]; node; node = node->htnext)
+		if (node->info.nodeguid == guid)
+			return node;
+
+	return NULL;
+}
+
+ibnd_node_t *
+ibnd_update_node(ibnd_node_t *node)
+{
+	char portinfo[64];
+	void *pi = portinfo;
+	ibnd_port_info_t port0_info;
+	char switchinfo[64];
+	void *si = switchinfo;
+	void *nd = node->nodedesc;
+	int p = 0;
+
+	if (query_node_info(node->fabric, node, &(node->path_portid)))
+		return (NULL);
+
+	if (!smp_query_via(nd, &(node->path_portid), IB_ATTR_NODE_DESC, 0, timeout_ms,
+			node->fabric->ibmad_port))
+		return (NULL);
+
+	/* update all the port info's */
+	for (p = 1; p <= node->info.numports; p++) {
+		get_port_info(node->fabric, node->ports[p], p, &(node->path_portid));
+	}
+
+	if (node->info.type != IBND_SWITCH_NODE)
+		goto done;
+
+	if (!smp_query_via(pi, &(node->path_portid), IB_ATTR_PORT_INFO, 0, timeout_ms,
+			node->fabric->ibmad_port))
+		return (NULL);
+	decode_port_info(pi, &port0_info);
+
+	node->smalid = port0_info.lid;
+	node->smalmc = port0_info.lmc;
+
+	if (!smp_query_via(si, &(node->path_portid), IB_ATTR_SWITCH_INFO, 0, timeout_ms,
+			node->fabric->ibmad_port))
+		node->sw_info.smaenhsp0 = 0;	/* assume base SP0 */
+	else
+		mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->sw_info.smaenhsp0);
+
+done:
+	return (node);
+}
+
+ibnd_node_t *
+ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str)
+{
+	int i = 0;
+	ibnd_node_t *rc = fabric->from_node;
+	ib_dr_path_t path;
+
+	if (str2drpath(&path, dr_str, 0, 0) == -1) {
+		return (NULL);
+	}
+
+	for (i = 0; i <= path.cnt; i++) {
+		ibnd_port_t *remote_port = NULL;
+		if (path.p[i] == 0)
+			continue;
+		if (!rc->ports)
+			return (NULL);
+
+		remote_port = rc->ports[path.p[i]]->remoteport;
+		if (!remote_port)
+			return (NULL);
+
+		rc = remote_port->node;
+	}
+
+	return (rc);
+}
+
+void
+add_to_nodeguid_hash(ibnd_node_t *node, ibnd_node_t *hash[])
+{
+	int hash_idx = HASHGUID(node->info.nodeguid) % HTSZ;
+
+	node->htnext = hash[hash_idx];
+	hash[hash_idx] = node;
+}
+
+void
+add_to_portguid_hash(ibnd_port_t *port, ibnd_port_t *hash[])
+{
+	int hash_idx = HASHGUID(port->guid) % HTSZ;
+
+	port->htnext = hash[hash_idx];
+	hash[hash_idx] = port;
+}
+
+ibnd_port_t *
+find_existing_port_fabric(ibnd_fabric_t *fabric, uint64_t guid)
+{
+	int hash = HASHGUID(guid) % HTSZ;
+	ibnd_port_t *port;
+
+	for (port = fabric->portstbl[hash]; port; port = port->htnext)
+		if (port->guid == guid)
+			return port;
+
+	return NULL;
+}
+
+void
+add_to_type_list(ibnd_node_t *node, ibnd_fabric_t *fabric)
+{
+	switch (node->info.type) {
+		case IBND_CA_NODE:
+			node->type_next = fabric->ch_adapters;
+			fabric->ch_adapters = node;
+			break;
+		case IBND_SWITCH_NODE:
+			node->type_next = fabric->switches;
+			fabric->switches = node;
+			break;
+		case IBND_ROUTER_NODE:
+			node->type_next = fabric->routers;
+			fabric->routers = node;
+			break;
+	}
+}
+
+void
+add_to_nodedist(ibnd_node_t *node, ibnd_fabric_t *fabric)
+{
+	int dist = node->dist;
+	if (node->info.type != IBND_SWITCH_NODE)
+		dist = MAXHOPS;	/* special CA list */
+
+	node->dnext = fabric->nodesdist[dist];
+	fabric->nodesdist[dist] = node;
+}
+
+
+static ibnd_node_t *
+create_node(ibnd_fabric_t *fabric, ibnd_node_t *temp, ib_portid_t *path, int dist)
+{
+	ibnd_node_t *node;
+
+	node = malloc(sizeof(*node));
+	if (!node) {
+		IBPANIC("OOM: node creation failed\n");
+		return NULL;
+	}
+
+	memcpy(node, temp, sizeof(*node));
+	node->dist = dist;
+	node->path_portid = *path;
+	node->fabric = fabric;
+
+	add_to_nodeguid_hash(node, fabric->nodestbl);
+
+	/* add this to the all nodes list */
+	node->next = fabric->nodes;
+	fabric->nodes = node;
+
+	add_to_type_list(node, fabric);
+	add_to_nodedist(node, fabric);
+
+	return node;
+}
+
+static ibnd_port_t *
+find_existing_port_node(ibnd_node_t *node, ibnd_port_t *port)
+{
+	if (port->portnum > node->info.numports || node->ports == NULL )
+		return (NULL);
+
+	return (node->ports[port->portnum]);
+}
+
+static ibnd_port_t *
+add_port_to_node(ibnd_fabric_t *fabric, ibnd_node_t *node, ibnd_port_t *temp)
+{
+	ibnd_port_t *port;
+
+	port = malloc(sizeof(*port));
+	if (!port)
+		return NULL;
+
+	memcpy(port, temp, sizeof(*port));
+	port->node = node;
+	port->ext_portnum = 0;
+
+	if (node->ports == NULL) {
+		node->ports = calloc(sizeof(*node->ports), node->info.numports + 1);
+		if (!node->ports) {
+			IBND_ERROR("Failed to allocate the ports array\n");
+			return (NULL);
+		}
+	}
+
+	node->ports[temp->portnum] = port;
+
+	add_to_portguid_hash(port, fabric->portstbl);
+	return port;
+}
+
+void
+link_ports(ibnd_node_t *node, ibnd_port_t *port, ibnd_node_t *remotenode, ibnd_port_t *remoteport)
+{
+	IBND_DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64 " %p->%p:%u\n",
+		node->info.nodeguid, node, port, port->portnum,
+		remotenode->info.nodeguid, remotenode, remoteport, remoteport->portnum);
+	if (port->remoteport)
+		port->remoteport->remoteport = NULL;
+	if (remoteport->remoteport)
+		remoteport->remoteport->remoteport = NULL;
+	port->remoteport = remoteport;
+	remoteport->remoteport = port;
+}
+
+static int
+get_remote_node(ibnd_fabric_t *fabric, ibnd_node_t *node, ibnd_port_t *port, ib_portid_t *path,
+		int portnum, int dist)
+{
+	ibnd_node_t node_buf;
+	ibnd_port_t port_buf;
+	ibnd_node_t *remotenode, *oldnode;
+	ibnd_port_t *remoteport, *oldport;
+
+	memset(&node_buf, 0, sizeof(node_buf));
+	memset(&port_buf, 0, sizeof(port_buf));
+
+	IBND_DEBUG("handle node %p port %p:%d dist %d\n", node, port, portnum, dist);
+	if (port->info.phys_state != 5)	/* LinkUp */
+		return -1;
+
+	if (extend_dpath(fabric, &path->drpath, portnum) < 0)
+		return -1;
+
+	if (query_node(fabric, &node_buf, &port_buf, path) < 0) {
+		IBWARN("NodeInfo on %s failed, skipping port",
+			portid2str(path));
+		path->drpath.cnt--;	/* restore path */
+		return -1;
+	}
+
+	oldnode = find_existing_node(fabric, &node_buf);
+	if (oldnode)
+		remotenode = oldnode;
+	else if (!(remotenode = create_node(fabric, &node_buf, path, dist + 1)))
+		IBPANIC("no memory");
+
+	oldport = find_existing_port_node(remotenode, &port_buf);
+	if (oldport) {
+		remoteport = oldport;
+	} else if (!(remoteport = add_port_to_node(fabric, remotenode, &port_buf)))
+		IBPANIC("no memory");
+
+	dump_endnode(path, oldnode ? "known remote" : "new remote",
+			remotenode, remoteport);
+
+	link_ports(node, port, remotenode, remoteport);
+
+	path->drpath.cnt--;	/* restore path */
+	return 0;
+}
+
+static void *
+ibnd_init_port(char *dev_name, int dev_port)
+{
+	int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS};
+
+	/* Crank up the mad lib */
+	return (mad_rpc_open_port(dev_name, dev_port, mgmt_classes, 2));
+}
+
+ibnd_fabric_t *
+ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms,
+			ib_portid_t *from, int hops)
+{
+	ibnd_fabric_t *fabric = NULL;
+	ib_portid_t my_portid = {0};
+	ibnd_node_t node_buf;
+	ibnd_port_t port_buf;
+	ibnd_node_t *node;
+	ibnd_port_t *port;
+	int i;
+	int dist = 0;
+	ib_portid_t *path;
+	int max_hops = MAXHOPS-1; /* default find everything */
+
+	/* if not everything how much? */
+	if (hops >= 0) {
+		max_hops = hops;
+	}
+
+	/* If not specified start from "my" port */
+	if (!from) {
+		from = &my_portid;
+	}
+
+	fabric = malloc(sizeof(*fabric));
+
+	if (!fabric) {
+		IBPANIC("OOM: failed to malloc ibnd_fabric_t\n");
+		return (NULL);
+	}
+
+	memset(fabric, 0, sizeof(*fabric));
+
+	fabric->ibmad_port = ibnd_init_port(dev_name, dev_port);
+	if (!fabric->ibmad_port) {
+		IBPANIC("failed to open \"%s\" port %d\n",
+			dev_name, dev_port);
+		goto error;
+	}
+
+	IBND_DEBUG("from %s\n", portid2str(from));
+
+	memset(&node_buf, 0, sizeof(node_buf));
+	memset(&port_buf, 0, sizeof(port_buf));
+
+	if (query_node(fabric, &node_buf, &port_buf, from) < 0) {
+		IBWARN("can't reach node %s\n", portid2str(from));
+		goto error;
+	}
+
+	node = create_node(fabric, &node_buf, from, 0);
+	if (!node)
+		goto error;
+
+	fabric->from_node = node;
+
+	port = add_port_to_node(fabric, node, &port_buf);
+	if (!port)
+		IBPANIC("out of memory");
+
+	if (node->info.type != IBND_SWITCH_NODE &&
+	    get_remote_node(fabric, node, port, from, node->info.localport, 0) < 0)
+		return fabric;
+
+	for (dist = 0; dist <= max_hops; dist++) {
+
+		for (node = fabric->nodesdist[dist]; node; node = node->dnext) {
+
+			path = &node->path_portid;
+
+			IBND_DEBUG("dist %d node %p\n", dist, node);
+			dump_endnode(path, "processing", node, port);
+
+			for (i = 1; i <= node->info.numports; i++) {
+				if (i == node->info.localport)
+					continue;
+
+				if (get_port_info(fabric, &port_buf, i, path) < 0) {
+					IBWARN("can't reach node %s port %d", portid2str(path), i);
+					continue;
+				}
+
+				port = find_existing_port_node(node, &port_buf);
+				if (port)
+					continue;
+
+				port = add_port_to_node(fabric, node, &port_buf);
+				if (!port)
+					IBPANIC("out of memory");
+
+				/* If switch, set port GUID to node port GUID */
+				if (node->info.type == IBND_SWITCH_NODE)
+					port->guid = node->info.nodeportguid;
+
+				get_remote_node(fabric, node, port, path, i, dist);
+			}
+		}
+	}
+
+	fabric->chassis = group_nodes(fabric);
+
+	return fabric;
+error:
+	free(fabric);
+	return (NULL);
+}
+
+static void
+destroy_node(ibnd_node_t *node)
+{
+	int p = 0;
+
+	for (p = 0; p <= node->info.numports; p++) {
+		free(node->ports[p]);
+	}
+	free(node->ports);
+
+	if (node->chrecord)
+		free(node->chrecord);
+	free(node);
+}
+
+void
+ibnd_destroy_fabric(ibnd_fabric_t *fabric)
+{
+	int dist = 0;
+	ibnd_node_t *node = NULL;
+	ibnd_node_t *next = NULL;
+	ibnd_chassis_list_t *ch, *ch_next;
+
+	for (dist = 0; dist <= MAXHOPS; dist++) {
+		node = fabric->nodesdist[dist];
+		while (node) {
+			next = node->dnext;
+			destroy_node(node);
+			node = next;
+		}
+	}
+	ch = fabric->first_chassis;
+	while (ch) {
+		ch_next = ch->next;
+		free(ch);
+		ch = ch_next;
+	}
+	if (fabric->ibmad_port)
+		mad_rpc_close_port(fabric->ibmad_port);
+	free(fabric);
+}
+
+void
+ibnd_debug(int i)
+{
+	if (i) {
+		ibdebug++;
+		madrpc_show_errors(1);
+		umad_debug(i);
+	} else {
+		ibdebug = 0;
+		madrpc_show_errors(0);
+		umad_debug(0);
+	}
+}
+
+void
+ibnd_show_progress(int i)
+{
+	show_progress = i;
+}
+
+const char*
+ibnd_node_type_str(ibnd_node_t *node)
+{
+	switch(node->info.type) {
+	case IBND_CA_NODE:     return "Ca";
+	case IBND_SWITCH_NODE: return "Switch";
+	case IBND_ROUTER_NODE: return "Router";
+	}
+	return "??";
+}
+
+const char*
+ibnd_node_type_str_short(ibnd_node_t *node)
+{
+	switch(node->info.type) {
+	case IBND_SWITCH_NODE: return "SW";
+	case IBND_CA_NODE:     return "CA";
+	case IBND_ROUTER_NODE: return "RT";
+	}
+	return "??";
+}
+
+
+void
+ibnd_iter_nodes(ibnd_fabric_t *fabric,
+		ibnd_iter_func_t func,
+		void *user_data)
+{
+	ibnd_node_t *cur = NULL;
+
+	for (cur = fabric->nodes; cur; cur = cur->next) {
+		func(cur, user_data);
+	}
+}
+
+
+void
+ibnd_iter_nodes_type(ibnd_fabric_t *fabric,
+		ibnd_iter_func_t func,
+		ibnd_node_type_t node_type,
+		void *user_data)
+{
+	ibnd_node_t *list = NULL;
+	ibnd_node_t *cur = NULL;
+
+	switch (node_type) {
+		case IBND_SWITCH_NODE:
+			list = fabric->switches;
+			break;
+		case IBND_CA_NODE:
+			list = fabric->ch_adapters;
+			break;
+		case IBND_ROUTER_NODE:
+			list = fabric->routers;
+			break;
+		default:
+			IBND_DEBUG("Invalid node_type specified %d\n", node_type);
+			break;
+	}
+
+	for (cur = list; cur; cur = cur->type_next) {
+		func(cur, user_data);
+	}
+}
+
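Note on the iterator API above: ibnd_iter_nodes() and ibnd_iter_nodes_type() walk a linked list and pass each node, plus an opaque user_data pointer, to a caller-supplied callback. The following is only a minimal stand-alone sketch of that pattern. The demo_* types and functions are hypothetical stand-ins, not part of libibnetdisc, and the field layout is an assumption made for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for ibnd_node_t (assumed layout, illustration only). */
typedef struct demo_node {
	int type;			/* illustrative node type value */
	int numports;
	struct demo_node *next;
} demo_node_t;

/* Hypothetical stand-in for ibnd_iter_func_t. */
typedef void (*demo_iter_func_t)(demo_node_t *node, void *user_data);

/* Walk the list and hand every node to the callback. */
void demo_iter_nodes(demo_node_t *head, demo_iter_func_t func, void *user_data)
{
	demo_node_t *cur;

	for (cur = head; cur; cur = cur->next)
		func(cur, user_data);
}

/* Example callback: accumulate the port count through user_data. */
void demo_sum_ports(demo_node_t *node, void *user_data)
{
	int *total = user_data;

	*total += node->numports;
}
```

Passing state through user_data keeps the callback free of globals, which is the same reason the test programs in this patch can pass NULL when they need no per-call state.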
diff --git a/libibnetdisc/src/libibnetdisc.map b/libibnetdisc/src/libibnetdisc.map
new file mode 100644
index 0000000..5e8c315
--- /dev/null
+++ b/libibnetdisc/src/libibnetdisc.map
@@ -0,0 +1,27 @@
+IBNETDISC_1.0 {
+	global:
+		ibnd_debug;
+		ibnd_show_progress;
+		ibnd_discover_fabric;
+		ibnd_cache_fabric;
+		ibnd_read_fabric;
+		ibnd_destroy_fabric;
+		ibnd_find_node_guid;
+		ibnd_update_node;
+		ibnd_find_node_dr;
+		ibnd_linkwidth_str;
+		ibnd_linkspeed_str;
+		ibnd_node_type_str;
+		ibnd_node_type_str_short;
+		ibnd_is_xsigo_guid;
+		ibnd_is_xsigo_tca;
+		ibnd_is_xsigo_hca;
+		ibnd_get_chassis_guid;
+		ibnd_get_chassis_type;
+		ibnd_get_chassis_slot_str;
+		ibnd_linkstate_str;
+		ibnd_physstate_str;
+		ibnd_iter_nodes;
+		ibnd_iter_nodes_type;
+	local: *;
+};
diff --git a/libibnetdisc/test/iblinkinfotest.c b/libibnetdisc/test/iblinkinfotest.c
new file mode 100644
index 0000000..7c52a0b
--- /dev/null
+++ b/libibnetdisc/test/iblinkinfotest.c
@@ -0,0 +1,395 @@
+/*
+ * Copyright (c) 2004-2007 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2007 Xsigo Systems Inc.  All rights reserved.
+ * Copyright (c) 2008 Lawrence Livermore National Lab.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <time.h>
+#include <string.h>
+#include <getopt.h>
+#include <errno.h>
+#include <inttypes.h>
+
+#include <infiniband/complib/cl_nodenamemap.h>
+#include <infiniband/ibnetdisc.h>
+
+char *argv0 = "iblinkinfotest";
+static FILE *f;
+
+static char *node_name_map_file = NULL;
+static nn_map_t *node_name_map = NULL;
+
+static int timeout_ms = 500;
+
+static int debug = 0;
+#define	DEBUG(str, args...) \
+	do { if (debug) fprintf(stderr, str, ##args); } while (0)
+
+static int down_links_only = 0;
+static int line_mode = 0;
+static int add_sw_settings = 0;
+static int print_port_guids = 0;
+
+static unsigned int
+get_max(unsigned int num)
+{
+	unsigned int v = num; // 32-bit word to find the log base 2 of
+	unsigned r = 0; // r will be lg(v)
+
+	while (v >>= 1) // unroll for more speed...
+	{
+		r++;
+	}
+
+	return (1 << r);
+}
+
+void
+get_msg(char *width_msg, char *speed_msg, int msg_size, ibnd_port_t *port)
+{
+	int max_speed = 0;
+
+	int max_width = get_max(port->info.link_width_supported
+				& port->remoteport->info.link_width_supported);
+	if ((max_width & port->info.link_width_active) == 0) {
+		// we are not at the max supported width
+		// print what we could be at.
+		snprintf(width_msg, msg_size, "Could be %s",
+			ibnd_linkwidth_str(max_width));
+	}
+
+	max_speed = get_max(port->info.link_speed_supported
+				& port->remoteport->info.link_speed_supported);
+	if ((max_speed & port->info.link_speed_active) == 0) {
+		// we are not at the max supported speed
+		// print what we could be at.
+		snprintf(speed_msg, msg_size, "Could be %s",
+			ibnd_linkspeed_str(max_speed));
+	}
+}
+
+void
+print_port(ibnd_node_t *node, ibnd_port_t *port)
+{
+	char remote_guid_str[256];
+	char remote_str[256];
+	char link_str[256];
+	char width_msg[256];
+	char speed_msg[256];
+	char ext_port_str[256];
+
+	if (!port)
+		return;
+
+	remote_guid_str[0] = '\0';
+	remote_str[0] = '\0';
+	link_str[0] = '\0';
+	width_msg[0] = '\0';
+	speed_msg[0] = '\0';
+
+	if (port->remoteport) {
+		char  remote_name_buf[256];
+		strncpy(remote_name_buf, port->remoteport->node->nodedesc, 256);
+
+		if (port->remoteport->ext_portnum)
+			snprintf(ext_port_str, 256, "%d", port->remoteport->ext_portnum);
+		else
+			ext_port_str[0] = '\0';
+
+		get_msg(width_msg, speed_msg, 256, port);
+		if (line_mode) {
+			if (print_port_guids) {
+				snprintf(remote_guid_str, 256,
+					"0x%016" PRIx64 " ",
+					port->remoteport->guid);
+			} else {
+				snprintf(remote_guid_str, 256,
+					"0x%016" PRIx64 " ",
+					port->remoteport->node->info.nodeguid);
+			}
+		}
+
+		snprintf(remote_str, 256,
+			"%s%6d %4d[%2s] \"%s\" (%s %s)\n",
+			remote_guid_str,
+			port->remoteport->info.lid ?
+				port->remoteport->info.lid :
+				port->remoteport->node->smalid,
+			port->remoteport->portnum,
+			ext_port_str,
+			remap_node_name(node_name_map,
+				port->remoteport->node->info.nodeguid,
+				remote_name_buf),
+			width_msg,
+			speed_msg
+			);
+	} else {
+		snprintf(remote_str, 256,
+			"%6s %4s[%2s] \"\" ( )\n", "", "", "");
+	}
+
+	if (add_sw_settings) {
+		snprintf(link_str, 256,
+			"(%3s %s %6s/%8s) (HOQ:%d VL_Stall:%d)",
+			ibnd_linkwidth_str(port->info.link_width_active),
+			ibnd_linkspeed_str(port->info.link_speed_active),
+			ibnd_linkstate_str(port->info.link_state),
+			ibnd_physstate_str(port->info.phys_state),
+			port->info.hoq_lifetime,
+			port->info.vl_stall_count
+			);
+	} else {
+		snprintf(link_str, 256,
+			"(%3s %s %6s/%8s)",
+			ibnd_linkwidth_str(port->info.link_width_active),
+			ibnd_linkspeed_str(port->info.link_speed_active),
+			ibnd_linkstate_str(port->info.link_state),
+			ibnd_physstate_str(port->info.phys_state)
+			);
+	}
+
+	if (port->ext_portnum)
+		snprintf(ext_port_str, 256, "%d", port->ext_portnum);
+	else
+		ext_port_str[0] = '\0';
+
+	if (line_mode) {
+		char  name_buf[256];
+		strncpy(name_buf, node->nodedesc, 256);
+		printf("0x%016" PRIx64 " \"%30s\" %6d %4d[%2s] ==%s==>  %s",
+			node->info.nodeguid,
+			remap_node_name(node_name_map,
+				node->info.nodeguid,
+				name_buf),
+			node->smalid, port->portnum,
+			ext_port_str,
+			link_str,
+			remote_str
+			);
+	} else {
+		printf("      %6d %4d[%2s] ==%s==>  %s",
+			node->smalid, port->portnum,
+			ext_port_str,
+			link_str,
+			remote_str
+			);
+	}
+}
+
+void
+print_switch(ibnd_node_t *node, void *user_data)
+{
+	int i = 0;
+
+	if (!line_mode) {
+		char  name_buf[256];
+		strncpy(name_buf, node->nodedesc, 256);
+		printf("Switch 0x%016" PRIx64 " %s:\n",
+			node->info.nodeguid,
+			remap_node_name(node_name_map,
+				node->info.nodeguid,
+				name_buf));
+	}
+
+	for (i = 1; i <= node->info.numports; i++) {
+		ibnd_port_t *port = node->ports[i];
+		if (!port)
+			continue;
+		if (!down_links_only || port->info.link_state == IBND_LINK_DOWN) {
+			print_port(node, port);
+		}
+	}
+}
+
+void
+usage(void)
+{
+	fprintf(stderr,
+		"Usage: %s [-hdlpg -S <guid> -D <direct route> -C <ca_name> -P <ca_port>]\n"
+		"   Report link speed and connection for each port of each switch which is active\n"
+		"   -h This help message\n"
+		"   -S <guid> output only the node specified by guid\n"
+		"   -D <direct route> print only node specified by <direct route>\n"
+		"   -f <dr_path> specify node to start \"from\"\n"
+		"   -n <hops> Number of hops to include away from specified node\n"
+		"   -d print only down links\n"
+		"   -l (line mode) print all information for each link on each line\n"
+		"   -p print additional switch settings (HoqLife,VLStallCount)\n"
+
+
+		"   -t <timeout_ms> timeout for any single fabric query\n"
+		"   -s show errors\n"
+		"   --node-name-map <map_file> use specified node name map\n"
+
+		"   -C <ca_name> use selected Channel Adapter name for queries\n"
+		"   -P <ca_port> use selected Channel Adapter port for queries\n"
+		"   -g print port guids instead of node guids\n"
+		"   --debug print debug messages\n"
+		,
+			argv0);
+	exit(-1);
+}
+
+int
+main(int argc, char **argv)
+{
+	char *ca = 0;
+	int ca_port = 0;
+	ibnd_fabric_t *fabric = NULL;
+	uint64_t guid = 0;
+	char *dr_path = NULL;
+	char *from = NULL;
+	int hops = 0;
+	ib_portid_t port_id;
+
+	static char const str_opts[] = "S:D:n:C:P:t:sldgphuf:";
+	static const struct option long_opts[] = {
+		{ "S", 1, 0, 'S'},
+		{ "D", 1, 0, 'D'},
+		{ "num-hops", 1, 0, 'n'},
+		{ "down-links-only", 0, 0, 'd'},
+		{ "line-mode", 0, 0, 'l'},
+		{ "ca-name", 1, 0, 'C'},
+		{ "ca-port", 1, 0, 'P'},
+		{ "timeout", 1, 0, 't'},
+		{ "show", 0, 0, 's'},
+		{ "print-port-guids", 0, 0, 'g'},
+		{ "print-additional", 0, 0, 'p'},
+		{ "help", 0, 0, 'h'},
+		{ "usage", 0, 0, 'u'},
+		{ "node-name-map", 1, 0, 1},
+		{ "debug", 0, 0, 2},
+		{ "from", 1, 0, 'f'},
+		{ }
+	};
+
+	f = stdout;
+
+	argv0 = argv[0];
+
+	while (1) {
+		int ch = getopt_long(argc, argv, str_opts, long_opts, NULL);
+		if ( ch == -1 )
+			break;
+		switch(ch) {
+		case 1:
+			node_name_map_file = strdup(optarg);
+			break;
+		case 2:
+			debug = 1;
+			ibnd_debug(1);
+			break;
+		case 'f':
+			from = strdup(optarg);
+			break;
+		case 'C':
+			ca = strdup(optarg);
+			break;
+		case 'P':
+			ca_port = strtoul(optarg, 0, 0);
+			break;
+		case 'D':
+			dr_path = strdup(optarg);
+			break;
+		case 'n':
+			hops = (int)strtol(optarg, NULL, 0);
+			break;
+		case 'd':
+			down_links_only = 1;
+			break;
+		case 'l':
+			line_mode = 1;
+			break;
+		case 't':
+			timeout_ms = strtoul(optarg, 0, 0);
+			break;
+		case 'g':
+			print_port_guids = 1;
+			break;
+		case 'S':
+			guid = (uint64_t)strtoull(optarg, 0, 0);
+			break;
+		case 'p':
+			add_sw_settings = 1;
+			break;
+		default:
+			usage();
+			break;
+		}
+	}
+	argc -= optind;
+	argv += optind;
+
+	if (argc && !(f = fopen(argv[0], "w")))
+		fprintf(stderr, "can't open file %s for writing\n", argv[0]);
+
+	node_name_map = open_node_name_map(node_name_map_file);
+
+	if (from) {
+		/* only scan part of the fabric */
+		str2drpath(&(port_id.drpath), from, 0, 0);
+		if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, &port_id, hops)) == NULL) {
+			fprintf(stderr, "discover failed\n");
+			exit(1);
+		}
+		guid = 0;
+	} else {
+		if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) {
+			fprintf(stderr, "discover failed\n");
+			exit(1);
+		}
+	}
+
+	if (guid) {
+		ibnd_node_t *sw = ibnd_find_node_guid(fabric, guid);
+		if (sw) print_switch(sw, NULL);	/* skip if the lookup found nothing */
+	} else if (dr_path) {
+		ibnd_node_t *sw = ibnd_find_node_dr(fabric, dr_path);
+		if (sw) print_switch(sw, NULL);	/* skip if the lookup found nothing */
+	} else {
+		ibnd_iter_nodes_type(fabric, print_switch, IBND_SWITCH_NODE, NULL);
+	}
+
+	ibnd_destroy_fabric(fabric);
+
+	close_node_name_map(node_name_map);
+	exit(0);
+}
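Note on get_max() and get_msg() in the test above: get_max() returns the highest set bit of its argument as a power of two, and get_msg() applies it to the AND of the local and remote "supported" masks, then checks whether the active value contains that bit. The sketch below restates that logic stand-alone. The demo_* names are hypothetical, and the mask values used with it are illustrative rather than taken from a real port.

```c
/* Restatement of get_max() under a hypothetical name: returns the highest
 * set bit of a capability mask as a power of two.  Like the original, it
 * returns 1 for an empty mask. */
unsigned int demo_highest_bit(unsigned int mask)
{
	unsigned int r = 0;

	while (mask >>= 1)
		r++;

	return 1u << r;
}

/* Hypothetical helper mirroring the check in get_msg(): nonzero when the
 * active value already is the best value both link ends support. */
int demo_at_best(unsigned int local_sup, unsigned int remote_sup,
		 unsigned int active)
{
	return (demo_highest_bit(local_sup & remote_sup) & active) != 0;
}
```

One consequence worth keeping in mind: when the two masks share no bits, the helper still reports bit 0 as the best common value, so a caller that cares about that case would need to test the AND result for zero first.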
diff --git a/libibnetdisc/test/ibnetdisctest.c b/libibnetdisc/test/ibnetdisctest.c
new file mode 100644
index 0000000..e4088da
--- /dev/null
+++ b/libibnetdisc/test/ibnetdisctest.c
@@ -0,0 +1,588 @@
+/*
+ * Copyright (c) 2004-2007 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2007 Xsigo Systems Inc.  All rights reserved.
+ * Copyright (c) 2008 Lawrence Livermore National Lab.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <time.h>
+#include <string.h>
+#include <getopt.h>
+#include <errno.h>
+#include <inttypes.h>
+
+#include <infiniband/complib/cl_nodenamemap.h>
+#include <infiniband/ibnetdisc.h>
+
+#define LIST_CA_NODE	 (1 << IBND_CA_NODE)
+#define LIST_SWITCH_NODE (1 << IBND_SWITCH_NODE)
+#define LIST_ROUTER_NODE (1 << IBND_ROUTER_NODE)
+
+char *argv0 = "ibnetdiscover";
+static FILE *f;
+
+static char *node_name_map_file = NULL;
+static nn_map_t *node_name_map = NULL;
+
+static int timeout_ms = 2000;
+static int dumplevel = 0;
+
+static int debug = 0;
+#define	DEBUG(str, args...) \
+	do { if (debug) fprintf(stderr, str, ##args); } while (0)
+
+char *
+node_name(ibnd_node_t *node)
+{
+	static char buf[256];
+
+	switch(node->info.type) {
+	case IBND_CA_NODE:
+		sprintf(buf, "\"%s", "H");
+		break;
+	case IBND_SWITCH_NODE:
+		sprintf(buf, "\"%s", "S");
+		break;
+	case IBND_ROUTER_NODE:
+		sprintf(buf, "\"%s", "R");
+		break;
+	default:
+		sprintf(buf, "\"%s", "?");
+		break;
+	}
+	sprintf(buf+2, "-%016" PRIx64 "\"", node->info.nodeguid);
+
+	return buf;
+}
+
+void
+list_node(ibnd_node_t *node, void *user_data)
+{
+	char *nodename = remap_node_name(node_name_map, node->info.nodeguid,
+					      node->nodedesc);
+
+	fprintf(f, "%s\t : 0x%016" PRIx64 " ports %d devid 0x%x vendid 0x%x \"%s\"\n",
+		ibnd_node_type_str(node),
+		node->info.nodeguid, node->info.numports, node->info.devid,
+		node->info.vendid,
+		nodename);
+
+	free(nodename);
+}
+
+void
+list_nodes(ibnd_fabric_t *fabric, int list)
+{
+	if (list & LIST_CA_NODE) {
+		ibnd_iter_nodes_type(fabric, list_node, IBND_CA_NODE, NULL);
+	}
+	if (list & LIST_SWITCH_NODE) {
+		ibnd_iter_nodes_type(fabric, list_node, IBND_SWITCH_NODE, NULL);
+	}
+	if (list & LIST_ROUTER_NODE) {
+		ibnd_iter_nodes_type(fabric, list_node, IBND_ROUTER_NODE, NULL);
+	}
+}
+
+void
+out_ids(ibnd_node_t *node, int group, char *chname)
+{
+	fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->info.vendid, node->info.devid);
+	if (node->info.sysimgguid)
+		fprintf(f, "sysimgguid=0x%" PRIx64, node->info.sysimgguid);
+	if (group
+	    && node->chrecord && node->chrecord->chassisnum) {
+		fprintf(f, "\t\t# Chassis %d", node->chrecord->chassisnum);
+		if (chname)
+			fprintf(f, " (%s)", clean_nodedesc(chname));
+		if (ibnd_is_xsigo_tca(node->info.nodeguid)
+				&& node->ports[1]
+				&& node->ports[1]->remoteport)
+			fprintf(f, " slot %d", node->ports[1]->remoteport->portnum);
+	}
+	fprintf(f, "\n");
+}
+
+
+uint64_t
+out_chassis(ibnd_fabric_t *fabric, int chassisnum)
+{
+	uint64_t guid;
+
+	fprintf(f, "\nChassis %d", chassisnum);
+	guid = ibnd_get_chassis_guid(fabric, chassisnum);
+	if (guid)
+		fprintf(f, " (guid 0x%" PRIx64 ")", guid);
+	fprintf(f, "\n");
+	return guid;
+}
+
+void
+out_switch(ibnd_node_t *node, int group, char *chname)
+{
+	char *str;
+	char  str2[256];
+	char *nodename = NULL;
+
+	out_ids(node, group, chname);
+	fprintf(f, "switchguid=0x%" PRIx64, node->info.nodeguid);
+	fprintf(f, "(%" PRIx64 ")", node->info.nodeportguid);
+	if (group) {
+		str = ibnd_get_chassis_type(node);
+		if (str)
+			fprintf(f, "%s ", str);
+		str = ibnd_get_chassis_slot_str(node, str2, 256);
+		if (str)
+			fprintf(f, "%s ", str);
+	}
+
+	nodename = remap_node_name(node_name_map, node->info.nodeguid,
+				node->nodedesc);
+
+	fprintf(f, "\nSwitch\t%d %s\t\t# \"%s\" %s port 0 lid %d lmc %d\n",
+		node->info.numports, node_name(node),
+		nodename,
+		node->sw_info.smaenhsp0 ? "enhanced" : "base",
+		node->smalid, node->smalmc);
+
+	free(nodename);
+}
+
+void
+out_ca(ibnd_node_t *node, int group, char *chname)
+{
+	char *node_type;
+	char *node_type2;
+
+	out_ids(node, group, chname);
+	switch(node->info.type) {
+	case IBND_CA_NODE:
+		node_type = "ca";
+		node_type2 = "Ca";
+		break;
+	case IBND_ROUTER_NODE:
+		node_type = "rt";
+		node_type2 = "Rt";
+		break;
+	default:
+		node_type = "???";
+		node_type2 = "???";
+		break;
+	}
+
+	fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->info.nodeguid);
+	fprintf(f, "%s\t%d %s\t\t# \"%s\"",
+		node_type2, node->info.numports, node_name(node),
+		clean_nodedesc(node->nodedesc));
+	if (group && ibnd_is_xsigo_hca(node->info.nodeguid))
+		fprintf(f, " (scp)");
+	fprintf(f, "\n");
+}
+
+#define OUT_BUFFER_SIZE 16
+static char *
+out_ext_port(ibnd_port_t *port, int group)
+{
+	static char mapping[OUT_BUFFER_SIZE];
+
+	if (!group || port->ext_portnum == 0)
+		return (NULL);
+
+	snprintf(mapping, OUT_BUFFER_SIZE, "[ext %d]", port->ext_portnum);
+
+	return (mapping);
+}
+
+void
+out_switch_port(ibnd_port_t *port, int group)
+{
+	char *ext_port_str = NULL;
+	char *rem_nodename = NULL;
+
+	DEBUG("port %p:%d remoteport %p\n", port, port->portnum, port->remoteport);
+	fprintf(f, "[%d]", port->portnum);
+
+	ext_port_str = out_ext_port(port, group);
+	if (ext_port_str)
+		fprintf(f, "%s", ext_port_str);
+
+	rem_nodename = remap_node_name(node_name_map,
+				port->remoteport->node->info.nodeguid,
+				port->remoteport->node->nodedesc);
+
+	ext_port_str = out_ext_port(port->remoteport, group);
+	fprintf(f, "\t%s[%d]%s",
+		node_name(port->remoteport->node),
+		port->remoteport->portnum,
+		ext_port_str ? ext_port_str : "");
+	if (port->remoteport->node->info.type != IBND_SWITCH_NODE)
+		fprintf(f, "(%" PRIx64 ") ", port->remoteport->guid);
+	fprintf(f, "\t\t# \"%s\" lid %d %s%s",
+		rem_nodename,
+		port->remoteport->node->info.type == IBND_SWITCH_NODE ?  port->remoteport->node->smalid : port->remoteport->info.lid,
+		ibnd_linkwidth_str(port->info.link_width_active),
+		ibnd_linkspeed_str(port->info.link_speed_active));
+
+	if (ibnd_is_xsigo_tca(port->remoteport->guid))
+		fprintf(f, " slot %d", port->portnum);
+	else if (ibnd_is_xsigo_hca(port->remoteport->guid))
+		fprintf(f, " (scp)");
+	fprintf(f, "\n");
+
+	free(rem_nodename);
+}
+
+void
+out_ca_port(ibnd_port_t *port, int group)
+{
+	char *str = NULL;
+	char *rem_nodename = NULL;
+
+	fprintf(f, "[%d]", port->portnum);
+	if (port->node->info.type != IBND_SWITCH_NODE)
+		fprintf(f, "(%" PRIx64 ") ", port->guid);
+	fprintf(f, "\t%s[%d]",
+		node_name(port->remoteport->node),
+		port->remoteport->portnum);
+	str = out_ext_port(port->remoteport, group);
+	if (str)
+		fprintf(f, "%s", str);
+	if (port->remoteport->node->info.type != IBND_SWITCH_NODE)
+		fprintf(f, " (%" PRIx64 ") ", port->remoteport->guid);
+
+	rem_nodename = remap_node_name(node_name_map,
+				port->remoteport->node->info.nodeguid,
+				port->remoteport->node->nodedesc);
+
+	fprintf(f, "\t\t# lid %d lmc %d \"%s\" lid %d %s%s\n",
+		port->info.lid, port->info.lmc, rem_nodename,
+		port->remoteport->node->info.type == IBND_SWITCH_NODE ?  port->remoteport->node->smalid : port->remoteport->info.lid,
+		ibnd_linkwidth_str(port->info.link_width_active),
+		ibnd_linkspeed_str(port->info.link_speed_active));
+
+	free(rem_nodename);
+}
+
+int
+dump_topology(int group, ibnd_fabric_t *fabric)
+{
+	ibnd_node_t *node;
+	ibnd_port_t *port;
+	int i = 0, dist = 0, p = 0;
+	time_t t = time(0);
+	uint64_t chguid;
+	char *chname = NULL;
+
+	fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t));
+	fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered);
+	fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n",
+		fabric->from_node->info.nodeguid, fabric->from_node->info.nodeportguid);
+
+	/* Make pass on switches */
+	if (group) {
+		ibnd_chassis_list_t *ch = NULL;
+
+		/* Chassis based switches first */
+		for (ch = fabric->chassis; ch; ch = ch->next) {
+			int n = 0;
+
+			if (!ch->chassisnum)
+				continue;
+			chguid = out_chassis(fabric, ch->chassisnum);
+
+			chname = NULL;
+/**
+ * Hal will this work for Xsigo?
+ */
+			if (ibnd_is_xsigo_guid(chguid)) {
+				for (node = ch->nodes; node; node = node->chassis_next) {
+					if (ibnd_is_xsigo_hca(node->info.nodeguid)) {
+						chname = node->nodedesc;
+						fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc));
+					}
+				}
+
+#if 0
+/**
+ * vs. this?
+ */
+				for (node = fabric->nodesdist[MAXHOPS]; node; node = node->dnext) {
+					if (!node->chrecord ||
+					    !node->chrecord->chassisnum)
+						continue;
+
+					if (node->chrecord->chassisnum != ch->chassisnum)
+						continue;
+
+					if (ibnd_is_xsigo_hca(node->info.nodeguid)) {
+						chname = node->nodedesc;
+						fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc));
+					}
+				}
+#endif
+			}
+
+			fprintf(f, "\n# Spine Nodes");
+			for (n = 1; n <= (SPINES_MAX_NUM+1); n++) {
+				if (ch->spinenode[n]) {
+					out_switch(ch->spinenode[n], group, chname);
+					for (p = 1; p <= ch->spinenode[n]->info.numports; p++) {
+						port = ch->spinenode[n]->ports[p];
+						if (port && port->remoteport)
+							out_switch_port(port, group);
+					}
+				}
+			}
+			fprintf(f, "\n# Line Nodes");
+			for (n = 1; n <= (LINES_MAX_NUM+1); n++) {
+				if (ch->linenode[n]) {
+					out_switch(ch->linenode[n], group, chname);
+					for (p = 1; p <= ch->linenode[n]->info.numports; p++) {
+						port = ch->linenode[n]->ports[p];
+						if (port && port->remoteport)
+							out_switch_port(port, group);
+					}
+				}
+			}
+
+			fprintf(f, "\n# Chassis Switches");
+			for (node = ch->nodes; node; node = node->chassis_next) {
+				if (node->info.type == IBND_SWITCH_NODE) {
+					out_switch(node, group, chname);
+					for (p = 1; p <= node->info.numports; p++) {
+						port = node->ports[p];
+						if (port && port->remoteport)
+							out_switch_port(port, group);
+					}
+				}
+			}
+
+			fprintf(f, "\n# Chassis CAs");
+			for (node = ch->nodes; node; node = node->chassis_next) {
+				if (node->info.type == IBND_CA_NODE) {
+					out_ca(node, group, chname);
+					for (p = 1; p <= node->info.numports; p++) {
+						port = node->ports[p];
+						if (port && port->remoteport)
+							out_ca_port(port, group);
+					}
+				}
+			}
+
+		}
+
+	} else { /* !group */
+		for (node = fabric->switches; node; node = node->type_next) {
+				DEBUG("SWITCH: dist %d node %p\n", dist, node);
+				out_switch(node, group, chname);
+				for (p = 1; p <= node->info.numports; p++) {
+					port = node->ports[p];
+					if (port && port->remoteport)
+						out_switch_port(port, group);
+				}
+		}
+	}
+
+	chname = NULL;
+	if (group) {
+		fprintf(f, "\nNon-Chassis Nodes\n");
+		for (node = fabric->switches; node; node = node->type_next) {
+				DEBUG("SWITCH: dist %d node %p\n", dist, node);
+				/* Now, skip chassis based switches */
+				if (node->chrecord &&
+				    node->chrecord->chassisnum)
+					continue;
+				out_switch(node, group, chname);
+
+				for (p = 1; p <= node->info.numports; p++) {
+					port = node->ports[p];
+					if (port && port->remoteport)
+						out_switch_port(port, group);
+				}
+		}
+
+	}
+
+	/* Make pass on CAs */
+	for (node = fabric->ch_adapters; node; node = node->type_next) {
+		DEBUG("CA: dist %d node %p\n", dist, node);
+		/* Now, skip chassis based CAs */
+		if (group && node->chrecord &&
+		    node->chrecord->chassisnum)
+			continue;
+		out_ca(node, group, chname);
+
+		for (p = 1; p <= node->info.numports; p++) {
+			port = node->ports[p];
+			if (port && port->remoteport)
+				out_ca_port(port, group);
+		}
+	}
+
+	/* make pass on routers */
+	for (node = fabric->routers; node; node = node->type_next) {
+		DEBUG("RT: dist %d node %p\n", dist, node);
+		/* Now, skip chassis based routers */
+		if (group && node->chrecord &&
+		    node->chrecord->chassisnum)
+			continue;
+		out_ca(node, group, chname);
+		for (p = 1; p <= node->info.numports; p++) {
+			port = node->ports[p];
+			if (port && port->remoteport)
+				out_ca_port(port, group);
+		}
+	}
+
+	return i;
+}
+
+void
+usage(void)
+{
+	fprintf(stderr, "Usage: %s [-d(ebug) -s(how) -l(ist) -g(rouping) -H(ca_list) -S(witch_list) -R(outer_list) -V(ersion) -C ca_name -P ca_port "
+			"-t(imeout) timeout_ms --node-name-map node-name-map -p(orts)] [<topology-file>]\n",
+			argv0);
+	fprintf(stderr, "       --node-name-map <node-name-map> specify a node name map file\n");
+	exit(-1);
+}
+
+int
+main(int argc, char **argv)
+{
+	int list = 0;
+	char *ca = 0;
+	int ca_port = 0;
+	int group = 0;
+	int ports_report = 0;
+	ibnd_fabric_t *fabric = NULL;
+
+	static char const str_opts[] = "C:P:t:devslgHSRpVhu";
+	static const struct option long_opts[] = {
+		{ "C", 1, 0, 'C'},
+		{ "P", 1, 0, 'P'},
+		{ "debug", 0, 0, 'd'},
+		{ "show", 0, 0, 's'},
+		{ "list", 0, 0, 'l'},
+		{ "grouping", 0, 0, 'g'},
+		{ "Hca_list", 0, 0, 'H'},
+		{ "Switch_list", 0, 0, 'S'},
+		{ "Router_list", 0, 0, 'R'},
+		{ "timeout", 1, 0, 't'},
+		{ "node-name-map", 1, 0, 1},
+		{ "ports", 0, 0, 'p'},
+		{ "help", 0, 0, 'h'},
+		{ "usage", 0, 0, 'u'},
+		{ }
+	};
+
+	f = stdout;
+
+	argv0 = argv[0];
+
+	while (1) {
+		int ch = getopt_long(argc, argv, str_opts, long_opts, NULL);
+		if ( ch == -1 )
+			break;
+		switch(ch) {
+		case 1:
+			node_name_map_file = strdup(optarg);
+			break;
+		case 'C':
+			ca = optarg;
+			break;
+		case 'P':
+			ca_port = strtoul(optarg, 0, 0);
+			break;
+		case 'd':
+			debug = 1;
+			ibnd_debug(1);
+			break;
+		case 't':
+			timeout_ms = strtoul(optarg, 0, 0);
+			break;
+		case 's':
+			dumplevel = 1;
+			break;
+		case 'l':
+			list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE;
+			break;
+		case 'g':
+			group = 1;
+			break;
+		case 'S':
+			list |= LIST_SWITCH_NODE;
+			break;
+		case 'H':
+			list |= LIST_CA_NODE;
+			break;
+		case 'R':
+			list |= LIST_ROUTER_NODE;
+			break;
+		case 'p':
+			ports_report = 1;
+			break;
+		default:
+			usage();
+			break;
+		}
+	}
+	argc -= optind;
+	argv += optind;
+
+	if (argc && !(f = fopen(argv[0], "w")))
+		fprintf(stderr, "can't open file %s for writing\n", argv[0]);
+
+	node_name_map = open_node_name_map(node_name_map_file);
+
+	if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) {
+		fprintf(stderr, "discover failed\n");
+		exit(1);
+	}
+
+	if (list)
+		list_nodes(fabric, list);
+	else
+		dump_topology(group, fabric);
+
+	ibnd_destroy_fabric(fabric);
+	close_node_name_map(node_name_map);
+	exit(0);
+}
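Note on node_name() in the test above: it formats the node label into a function-local static buffer and returns a pointer to it. That works here because each fprintf() call uses node_name() once, but two calls inside one expression would share the same storage. A possible alternative, sketched under the assumption that callers can supply their own buffer, is shown below. demo_node_name() is a hypothetical helper and not part of this patch or of libibnetdisc.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical reentrant variant of node_name(): the caller owns the buffer.
 * type_letter would be 'H', 'S', 'R' or '?' as in the original switch. */
char *demo_node_name(char *buf, size_t len, char type_letter, uint64_t guid)
{
	/* Produces the same quoted "<letter>-<16 hex digits>" label. */
	snprintf(buf, len, "\"%c-%016" PRIx64 "\"", type_letter, guid);
	return buf;
}
```

With separate buffers, two labels can be printed by a single fprintf() call without one overwriting the other; the cost is that each caller has to size and pass a buffer (24 bytes is enough for this format, including the terminating NUL).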
diff --git a/libibnetdisc/test/testleaks.c b/libibnetdisc/test/testleaks.c
new file mode 100644
index 0000000..4c10afb
--- /dev/null
+++ b/libibnetdisc/test/testleaks.c
@@ -0,0 +1,261 @@
+/*
+ * Copyright (c) 2004-2007 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2007 Xsigo Systems Inc.  All rights reserved.
+ * Copyright (c) 2008 Lawrence Livermore National Lab.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <time.h>
+#include <string.h>
+#include <getopt.h>
+#include <errno.h>
+#include <inttypes.h>
+
+#include <infiniband/complib/cl_nodenamemap.h>
+#include <infiniband/ibnetdisc.h>
+
+char *argv0 = "testleaks";
+static FILE *f;
+
+static int timeout_ms = 500;
+
+void
+print_port(ibnd_node_t *node, ibnd_port_t *port)
+{
+	char remote_guid_str[256];
+	char remote_str[256];
+	char link_str[256];
+	char speed_msg[256];
+	char ext_port_str[256];
+
+	if (!port)
+		return;
+
+	remote_guid_str[0] = '\0';
+	remote_str[0] = '\0';
+	link_str[0] = '\0';
+	speed_msg[0] = '\0';
+
+	if (port->remoteport) {
+		char  remote_name_buf[256];
+		strncpy(remote_name_buf, port->remoteport->node->nodedesc, 256);
+
+		if (port->remoteport->ext_portnum)
+			snprintf(ext_port_str, 256, "%d", port->remoteport->ext_portnum);
+		else
+			ext_port_str[0] = '\0';
+
+		snprintf(remote_str, 256,
+			"%s%6d %4d[%2s] \"%s\" (%s)\n",
+			remote_guid_str,
+			port->remoteport->info.lid ?
+				port->remoteport->info.lid :
+				port->remoteport->node->smalid,
+			port->remoteport->portnum,
+			ext_port_str,
+			port->remoteport->node->nodedesc,
+			speed_msg
+			);
+	} else {
+		snprintf(remote_str, 256,
+			"%6s %4s[%2s] \"\" ( )\n", "", "", "");
+	}
+
+	snprintf(link_str, 256,
+		"(%3s %s %6s/%8s)",
+		ibnd_linkwidth_str(port->info.link_width_active),
+		ibnd_linkspeed_str(port->info.link_speed_active),
+		ibnd_linkstate_str(port->info.link_state),
+		ibnd_physstate_str(port->info.phys_state)
+		);
+
+	if (port->ext_portnum)
+		snprintf(ext_port_str, 256, "%d", port->ext_portnum);
+	else
+		ext_port_str[0] = '\0';
+
+	printf("      %6d %4d[%2s] ==%s==>  %s",
+		node->smalid, port->portnum,
+		ext_port_str,
+		link_str,
+		remote_str
+		);
+}
+
+void
+print_switch(ibnd_node_t *node, void *user_data)
+{
+	int i = 0;
+
+	for (i = 1; i <= node->info.numports; i++) {
+		ibnd_port_t *port = node->ports[i];
+		if (!port)
+			continue;
+		if (port->info.link_state == IBND_LINK_DOWN) {
+			print_port(node, port);
+		}
+	}
+}
+
+void
+usage(void)
+{
+	fprintf(stderr,
+		"Usage: %s [-hclp -S <guid> -D <direct route> -C <ca_name> -P <ca_port>]\n"
+		"   Report link speed and connection for each port of each switch which is active\n"
+		"   -h This help message\n"
+		"   -S <guid> output only the node specified by guid\n"
+		"   -D <direct route> print only node specified by <direct route>\n"
+		"   -f <dr_path> specify node to start \"from\"\n"
+		"   -n <hops> Number of hops to include away from specified node\n"
+
+		"   -t <timeout_ms> timeout for any single fabric query\n"
+		"   -s show errors\n"
+
+		"   -C <ca_name> use selected Channel Adaptor name for queries\n"
+		"   -P <ca_port> use selected channel adaptor port for queries\n"
+		"   --debug print debug messages\n"
+		,
+			argv0);
+	exit(-1);
+}
+
+int
+main(int argc, char **argv)
+{
+	char *ca = 0;
+	int ca_port = 0;
+	ibnd_fabric_t *fabric = NULL;
+	uint64_t guid = 0;
+	char *dr_path = NULL;
+	char *from = NULL;
+	int hops = 0;
+	ib_portid_t port_id;
+
+	static char const str_opts[] = "S:D:n:C:P:t:shuf:";
+	static const struct option long_opts[] = {
+		{ "S", 1, 0, 'S'},
+		{ "D", 1, 0, 'D'},
+		{ "num-hops", 1, 0, 'n'},
+		{ "ca-name", 1, 0, 'C'},
+		{ "ca-port", 1, 0, 'P'},
+		{ "timeout", 1, 0, 't'},
+		{ "show", 0, 0, 's'},
+		{ "help", 0, 0, 'h'},
+		{ "usage", 0, 0, 'u'},
+		{ "debug", 0, 0, 2},
+		{ "from", 1, 0, 'f'},
+		{ }
+	};
+
+	f = stdout;
+
+	argv0 = argv[0];
+
+	while (1) {
+		int ch = getopt_long(argc, argv, str_opts, long_opts, NULL);
+		if ( ch == -1 )
+			break;
+		switch(ch) {
+		case 2:
+			ibnd_debug(1);
+			break;
+		case 'f':
+			from = strdup(optarg);
+			break;
+		case 'C':
+			ca = strdup(optarg);
+			break;
+		case 'P':
+			ca_port = strtoul(optarg, 0, 0);
+			break;
+		case 'D':
+			dr_path = strdup(optarg);
+			break;
+		case 'n':
+			hops = (int)strtol(optarg, NULL, 0);
+			break;
+		case 't':
+			timeout_ms = strtoul(optarg, 0, 0);
+			break;
+		case 'S':
+			guid = (uint64_t)strtoull(optarg, 0, 0);
+			break;
+		default:
+			usage();
+			break;
+		}
+	}
+	argc -= optind;
+	argv += optind;
+
+	while (1) {
+		if (from) {
+			/* only scan part of the fabric */
+			str2drpath(&(port_id.drpath), from, 0, 0);
+			if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, &port_id, hops)) == NULL) {
+				fprintf(stderr, "discover failed\n");
+				exit(1);
+			}
+			guid = 0;
+		} else {
+			if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) {
+				fprintf(stderr, "discover failed\n");
+				exit(1);
+			}
+		}
+
+#if 0
+		if (guid) {
+			ibnd_node_t *sw = ibnd_find_node_guid(fabric, guid);
+			print_switch(sw, NULL);
+		} else if (dr_path) {
+			ibnd_node_t *sw = ibnd_find_node_dr(fabric, dr_path);
+			print_switch(sw, NULL);
+		} else {
+			ibnd_iter_nodes_type(fabric, print_switch, IBND_SWITCH_NODE, NULL);
+		}
+#endif
+
+		ibnd_destroy_fabric(fabric);
+	}
+
+	exit(0);
+}
-- 
1.5.4.5


From weiny2 at llnl.gov  Thu Nov 20 16:38:14 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 20 Nov 2008 16:38:14 -0800
Subject: [ofa-general] [PATCH 2/3] Convert iblinkinfo.pl to C and use new
 ibnetdisc library.
Message-ID: <20081120163814.3e7c0c78.weiny2@llnl.gov>

>From b1c2cc8f96a3d88f2ef341ebee0b550fd5bd2a7b Mon Sep 17 00:00:00 2001
From: Ira Weiny <weiny2 at llnl.gov>
Date: Thu, 20 Nov 2008 08:45:00 -0800
Subject: [PATCH] Convert iblinkinfo.pl to C and use new ibnetdisc library.

Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
---
 infiniband-diags/Makefile.am           |    9 +-
 infiniband-diags/configure.in          |    2 +
 infiniband-diags/scripts/iblinkinfo.pl |  327 --------------------------
 infiniband-diags/src/iblinkinfo.c      |  393 ++++++++++++++++++++++++++++++++
 4 files changed, 402 insertions(+), 329 deletions(-)
 delete mode 100755 infiniband-diags/scripts/iblinkinfo.pl
 create mode 100644 infiniband-diags/src/iblinkinfo.c

diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am
index c22ba5e..8f26749 100644
--- a/infiniband-diags/Makefile.am
+++ b/infiniband-diags/Makefile.am
@@ -10,7 +10,7 @@ endif
 sbin_PROGRAMS = src/ibaddr src/ibnetdiscover src/ibping src/ibportstate \
 	        src/ibroute src/ibstat src/ibsysstat src/ibtracert \
 	        src/perfquery src/sminfo src/smpdump src/smpquery \
-	        src/saquery src/vendstat
+	        src/saquery src/vendstat src/iblinkinfo.pl
 
 if ENABLE_TEST_UTILS
 sbin_PROGRAMS += src/ibsendtrap src/mcm_rereg_test
@@ -27,7 +27,7 @@ sbin_SCRIPTS = scripts/ibcheckerrs scripts/ibchecknet scripts/ibchecknode \
 	       scripts/dump_lfts.sh scripts/dump_mfts.sh \
 	       scripts/set_nodedesc.sh \
 	       scripts/ibqueryerrors.pl scripts/ibswportwatch.pl \
-	       scripts/iblinkinfo.pl scripts/ibprintswitch.pl \
+	       scripts/ibprintswitch.pl \
 	       scripts/ibprintca.pl scripts/ibprintrt.pl \
 	       scripts/ibfindnodesusing.pl scripts/ibidsverify.pl \
 	       scripts/check_lft_balance.pl
@@ -39,6 +39,11 @@ src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c src/ibdiag_common
 src_ibnetdiscover_CFLAGS = -Wall $(DBGFLAGS)
 src_ibnetdiscover_LDFLAGS = -Wl,--rpath -Wl,$(libdir)
 
+src_iblinkinfo_pl_SOURCES = src/iblinkinfo.c
+src_iblinkinfo_pl_CFLAGS = -Wall $(DBGFLAGS)
+src_iblinkinfo_pl_LDFLAGS = -Wl,--rpath -Wl,$(libdir) \
+			-libcommon -libnetdisc
+
 src_ibping_SOURCES = src/ibping.c src/ibdiag_common.c
 src_ibping_CFLAGS = -Wall $(DBGFLAGS)
 
diff --git a/infiniband-diags/configure.in b/infiniband-diags/configure.in
index d227219..46021d6 100644
--- a/infiniband-diags/configure.in
+++ b/infiniband-diags/configure.in
@@ -46,6 +46,8 @@ AC_CHECK_LIB(osmvendor, osmv_query_sa, [],
 	AC_MSG_ERROR([osmv_query_sa() not found. diags require libosmvendor.]), [-lopensm])
 AC_CHECK_LIB(opensm, osm_log_init_v2, [],
 	AC_MSG_ERROR([osm_log_init_v2() not found. diags require libopensm.]))
+AC_CHECK_LIB(ibnetdisc, ibnd_discover_fabric, [],
+	AC_MSG_ERROR([ibnd_discover_fabric() not found. diags require libibnetdisc.]))
 fi
 
 dnl Checks for header files.
diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl
deleted file mode 100755
index b6b27ce..0000000
--- a/infiniband-diags/scripts/iblinkinfo.pl
+++ /dev/null
@@ -1,327 +0,0 @@
-#!/usr/bin/perl
-#
-# Copyright (c) 2006 The Regents of the University of California.
-# Copyright (c) 2007-2008 Voltaire, Inc. All rights reserved.
-#
-# Produced at Lawrence Livermore National Laboratory.
-# Written by Ira Weiny <weiny2 at llnl.gov>.
-#
-# This software is available to you under a choice of one of two
-# licenses.  You may choose to be licensed under the terms of the GNU
-# General Public License (GPL) Version 2, available from the file
-# COPYING in the main directory of this source tree, or the
-# OpenIB.org BSD license below:
-#
-#     Redistribution and use in source and binary forms, with or
-#     without modification, are permitted provided that the following
-#     conditions are met:
-#
-#      - Redistributions of source code must retain the above
-#        copyright notice, this list of conditions and the following
-#        disclaimer.
-#
-#      - Redistributions in binary form must reproduce the above
-#        copyright notice, this list of conditions and the following
-#        disclaimer in the documentation and/or other materials
-#        provided with the distribution.
-#
-# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
-# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
-# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
-# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
-# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
-# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
-# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-# SOFTWARE.
-#
-
-use strict;
-
-use Getopt::Std;
-use IBswcountlimits;
-
-sub usage_and_exit
-{
-	my $prog = $_[0];
-	print
-"Usage: $prog [-Rhclp -S <guid> -D <direct route> -C <ca_name> -P <ca_port>]\n";
-	print
-"   Report link speed and connection for each port of each switch which is active\n";
-	print "   -h This help message\n";
-	print
-"   -R Recalculate ibnetdiscover information (Default is to reuse ibnetdiscover output)\n";
-	print
-"   -D <direct route> output only the switch specified by direct route path\n";
-	print "   -S <guid> output only the switch specified by <guid> (hex format)\n";
-	print "   -d print only down links\n";
-	print
-	  "   -l (line mode) print all information for each link on each line\n";
-	print
-"   -p print additional switch settings (PktLifeTime,HoqLife,VLStallCount)\n";
-	print "   -c print port capabilities (enabled/supported values)\n";
-	print "   -C <ca_name> use selected Channel Adaptor name for queries\n";
-	print "   -P <ca_port> use selected channel adaptor port for queries\n";
-	print "   -g print port guids instead of node guids\n";
-	exit 2;
-}
-
-my $argv0              = `basename $0`;
-my $regenerate_map     = undef;
-my $single_switch      = undef;
-my $direct_route       = undef;
-my $line_mode          = undef;
-my $print_add_switch   = undef;
-my $print_extended_cap = undef;
-my $only_down_links    = undef;
-my $ca_name            = "";
-my $ca_port            = "";
-my $print_port_guids   = undef;
-my $switch_found       = "no";
-chomp $argv0;
-
-if (!getopts("hcpldRS:D:C:P:g")) { usage_and_exit $argv0; }
-if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; }
-if (defined $Getopt::Std::opt_D) { $direct_route   = $Getopt::Std::opt_D; }
-if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; }
-if (defined $Getopt::Std::opt_S) {
-	$single_switch = format_guid($Getopt::Std::opt_S);
-}
-if (defined $Getopt::Std::opt_d) { $only_down_links    = $Getopt::Std::opt_d; }
-if (defined $Getopt::Std::opt_l) { $line_mode          = $Getopt::Std::opt_l; }
-if (defined $Getopt::Std::opt_p) { $print_add_switch   = $Getopt::Std::opt_p; }
-if (defined $Getopt::Std::opt_c) { $print_extended_cap = $Getopt::Std::opt_c; }
-if (defined $Getopt::Std::opt_C) { $ca_name            = $Getopt::Std::opt_C; }
-if (defined $Getopt::Std::opt_P) { $ca_port            = $Getopt::Std::opt_P; }
-if (defined $Getopt::Std::opt_g) { $print_port_guids   = $Getopt::Std::opt_g; }
-
-my $extra_smpquery_params = get_ca_name_port_param_string($ca_name, $ca_port);
-
-sub main
-{
-	get_link_ends($regenerate_map, $ca_name, $ca_port);
-	if (defined($direct_route)) {
-		# convert DR to guid, then use original single_switch option
-		$single_switch = convert_dr_to_guid($direct_route);
-		if (!defined($single_switch) || !is_switch($single_switch)) {
-			printf("The direct route (%s) does not map to a switch.\n",
-				$direct_route);
-			return;
-		}
-	}
-	foreach my $switch (sort (keys(%IBswcountlimits::link_ends))) {
-		if ($single_switch && $switch ne $single_switch) {
-			next;
-		} else {
-			$switch_found = "yes";
-		}
-		my $switch_prompt = "no";
-		my $num_ports = get_num_ports($switch, $ca_name, $ca_port);
-		if ($num_ports == 0) {
-			printf("ERROR: switch $switch has 0 ports???\n");
-		}
-		my @output_lines    = undef;
-		my $pkt_lifetime    = "";
-		my $pkt_life_prompt = "";
-		my $port_timeouts   = "";
-		my $print_switch    = "yes";
-		if ($only_down_links) { $print_switch = "no"; }
-		if ($print_add_switch) {
-			my $data = `smpquery $extra_smpquery_params -G switchinfo $switch`;
-			if ($data eq "") {
-				printf("ERROR: failed to get switchinfo for $switch\n");
-			}
-			my @lines = split("\n", $data);
-			foreach my $line (@lines) {
-				if ($line =~ /^LifeTime:\.+(.*)/) { $pkt_lifetime = $1; }
-			}
-			$pkt_life_prompt = sprintf(" (LT: %2s)", $pkt_lifetime);
-		}
-		foreach my $port (1 .. $num_ports) {
-			my $hr = $IBswcountlimits::link_ends{$switch}{$port};
-			if ($switch_prompt eq "no" && !$line_mode) {
-				my $switch_name = "";
-				my $tmp_port = $port;
-				while ($switch_name eq "" && $tmp_port <= $num_ports) {
-					# the first port is down find switch name with up port
-					my $hr = $IBswcountlimits::link_ends{$switch}{$tmp_port};
-					$switch_name = $hr->{loc_desc};
-					$tmp_port++;
-				}
-				if ($switch_name eq "") {
-					printf(
-						"WARNING: Switch Name not found for $switch\n");
-				}
-				push(
-					@output_lines,
-					sprintf(
-						"Switch %18s %s%s:\n",
-						$switch, $switch_name, $pkt_life_prompt
-					)
-				);
-				$switch_prompt = "yes";
-			}
-			my $data =
-			  `smpquery $extra_smpquery_params -G portinfo $switch $port`;
-			if ($data eq "") {
-				printf(
-					"ERROR: failed to get portinfo for $switch port $port\n");
-			}
-			my @lines          = split("\n", $data);
-			my $speed          = "";
-			my $speed_sup      = "";
-			my $speed_enable   = "";
-			my $width          = "";
-			my $width_sup      = "";
-			my $width_enable   = "";
-			my $state          = "";
-			my $hoq_life       = "";
-			my $vl_stall       = "";
-			my $phy_link_state = "";
-
-			foreach my $line (@lines) {
-				if ($line =~ /^LinkSpeedActive:\.+(.*)/) { $speed = $1; }
-				if ($line =~ /^LinkSpeedEnabled:\.+(.*)/) {
-					$speed_enable = $1;
-				}
-				if ($line =~ /^LinkSpeedSupported:\.+(.*)/) { $speed_sup = $1; }
-				if ($line =~ /^LinkWidthActive:\.+(.*)/)    { $width     = $1; }
-				if ($line =~ /^LinkWidthEnabled:\.+(.*)/) {
-					$width_enable = $1;
-				}
-				if ($line =~ /^LinkWidthSupported:\.+(.*)/) { $width_sup = $1; }
-				if ($line =~ /^LinkState:\.+(.*)/)          { $state     = $1; }
-				if ($line =~ /^HoqLife:\.+(.*)/)            { $hoq_life  = $1; }
-				if ($line =~ /^VLStallCount:\.+(.*)/)       { $vl_stall  = $1; }
-				if ($line =~ /^PhysLinkState:\.+(.*)/) { $phy_link_state = $1; }
-			}
-			my $rem_port         = $hr->{rem_port};
-			my $rem_lid          = $hr->{rem_lid};
-			my $rem_speed_sup    = "";
-			my $rem_speed_enable = "";
-			my $rem_width_sup    = "";
-			my $rem_width_enable = "";
-			if ($rem_lid ne "" && $rem_port ne "") {
-				$data =
-				  `smpquery $extra_smpquery_params portinfo $rem_lid $rem_port`;
-				if ($data eq "") {
-					printf(
-						"ERROR: failed to get portinfo for $switch port $port\n"
-					);
-				}
-				my @lines = split("\n", $data);
-				foreach my $line (@lines) {
-					if ($line =~ /^LinkSpeedEnabled:\.+(.*)/) {
-						$rem_speed_enable = $1;
-					}
-					if ($line =~ /^LinkSpeedSupported:\.+(.*)/) {
-						$rem_speed_sup = $1;
-					}
-					if ($line =~ /^LinkWidthEnabled:\.+(.*)/) {
-						$rem_width_enable = $1;
-					}
-					if ($line =~ /^LinkWidthSupported:\.+(.*)/) {
-						$rem_width_sup = $1;
-					}
-				}
-			}
-			my $capabilities = "";
-			if ($print_extended_cap) {
-				$capabilities = sprintf("(%3s %s %6s / %8s [%s/%s][%s/%s])",
-					$width, $speed, $state, $phy_link_state, $width_enable,
-					$width_sup, $speed_enable, $speed_sup);
-			} else {
-				$capabilities = sprintf("(%3s %s %6s / %8s)",
-					$width, $speed, $state, $phy_link_state);
-			}
-			if ($print_add_switch) {
-				$port_timeouts =
-				  sprintf(" (HOQ:%s VL_Stall:%s)", $hoq_life, $vl_stall);
-			}
-			if (!$only_down_links || ($only_down_links && $state eq "Down")) {
-				my $width_msg = "";
-				my $speed_msg = "";
-				if ($rem_width_enable ne "" && $rem_width_sup ne "") {
-					if (   $width_enable =~ /12X/
-						&& $rem_width_enable =~ /12X/
-						&& $width !~ /12X/)
-					{
-						$width_msg = "Could be 12X";
-					} else {
-						if (   $width_enable =~ /8X/
-							&& $rem_width_enable =~ /8X/
-							&& $width !~ /8X/)
-						{
-							$width_msg = "Could be 8X";
-						} else {
-							if (   $width_enable =~ /4X/
-								&& $rem_width_enable =~ /4X/
-								&& $width !~ /4X/)
-							{
-								$width_msg = "Could be 4X";
-							}
-						}
-					}
-				}
-				if ($rem_speed_enable ne "" && $rem_speed_sup ne "") {
-					if (   $speed_enable =~ /10\.0/
-						&& $rem_speed_enable =~ /10\.0/
-						&& $speed !~ /10\.0/)
-					{
-						$speed_msg = "Could be 10.0 Gbps";
-					} else {
-						if (   $speed_enable =~ /5\.0/
-							&& $rem_speed_enable =~ /5\.0/
-							&& $speed !~ /5\.0/)
-						{
-							$speed_msg = "Could be 5.0 Gbps";
-						}
-					}
-				}
-
-				if ($line_mode) {
-					my $line_begin = sprintf("%18s \"%30s\"%s",
-						$switch, $hr->{loc_desc}, $pkt_life_prompt);
-					my $ext_guid = sprintf("%18s", $hr->{rem_guid});
-					if ($print_port_guids && $hr->{rem_port_guid} ne "") {
-						$ext_guid = sprintf("0x%016s", $hr->{rem_port_guid});
-					}
-					push(
-						@output_lines,
-						sprintf(
-"%s %6s %4s[%2s]  ==%s%s==>  %18s %6s %4s[%2s] \"%s\" ( %s %s)\n",
-							$line_begin,     $hr->{loc_sw_lid},
-							$port,           $hr->{loc_ext_port},
-							$capabilities,   $port_timeouts,
-							$ext_guid,       $hr->{rem_lid},
-							$hr->{rem_port}, $hr->{rem_ext_port},
-							$hr->{rem_desc}, $width_msg,
-							$speed_msg
-						)
-					);
-				} else {
-					push(
-						@output_lines,
-						sprintf(
-" %6s %4s[%2s]  ==%s%s==>  %6s %4s[%2s] \"%s\" ( %s %s)\n",
-							$hr->{loc_sw_lid},   $port,
-							$hr->{loc_ext_port}, $capabilities,
-							$port_timeouts,      $hr->{rem_lid},
-							$hr->{rem_port},     $hr->{rem_ext_port},
-							$hr->{rem_desc},     $width_msg,
-							$speed_msg
-						)
-					);
-				}
-				$print_switch = "yes";
-			}
-		}
-		if ($print_switch eq "yes") {
-			foreach my $line (@output_lines) { print $line; }
-		}
-	}
-	if ($single_switch && $switch_found ne "yes") {
-		printf("Switch \"%s\" not found.\n", $single_switch);
-	}
-}
-main;
-
diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c
new file mode 100644
index 0000000..1d503bb
--- /dev/null
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -0,0 +1,393 @@
+/*
+ * Copyright (c) 2004-2007 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2007 Xsigo Systems Inc.  All rights reserved.
+ * Copyright (c) 2008 Lawrence Livermore National Lab.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <time.h>
+#include <string.h>
+#include <getopt.h>
+#include <errno.h>
+#include <inttypes.h>
+
+#include <infiniband/complib/cl_nodenamemap.h>
+#include <infiniband/ibnetdisc.h>
+
+char *argv0 = "iblinkinfotest";
+static FILE *f;
+
+static char *node_name_map_file = NULL;
+static nn_map_t *node_name_map = NULL;
+
+static int timeout_ms = 500;
+
+static int down_links_only = 0;
+static int line_mode = 0;
+static int add_sw_settings = 0;
+static int print_port_guids = 0;
+
+static unsigned int
+get_max(unsigned int num)
+{
+	unsigned int v = num; // 32-bit word to find the log base 2 of
+	unsigned r = 0; // r will be lg(v)
+
+	while (v >>= 1) // unroll for more speed...
+	{
+		r++;
+	}
+
+	return (1 << r);
+}
+
+void
+get_msg(char *width_msg, char *speed_msg, int msg_size, ibnd_port_t *port)
+{
+	int max_speed = 0;
+
+	int max_width = get_max(port->info.link_width_supported
+				& port->remoteport->info.link_width_supported);
+	if ((max_width & port->info.link_width_active) == 0) {
+		// we are not at the max supported width
+		// print what we could be at.
+		snprintf(width_msg, msg_size, "Could be %s",
+			ibnd_linkwidth_str(max_width));
+	}
+
+	max_speed = get_max(port->info.link_speed_supported
+				& port->remoteport->info.link_speed_supported);
+	if ((max_speed & port->info.link_speed_active) == 0) {
+		// we are not at the max supported speed
+		// print what we could be at.
+		snprintf(speed_msg, msg_size, "Could be %s",
+			ibnd_linkspeed_str(max_speed));
+	}
+}
+
+void
+print_port(ibnd_node_t *node, ibnd_port_t *port)
+{
+	char remote_guid_str[256];
+	char remote_str[256];
+	char link_str[256];
+	char width_msg[256];
+	char speed_msg[256];
+	char ext_port_str[256];
+
+	if (!port)
+		return;
+
+	remote_guid_str[0] = '\0';
+	remote_str[0] = '\0';
+	link_str[0] = '\0';
+	width_msg[0] = '\0';
+	speed_msg[0] = '\0';
+
+	if (port->remoteport) {
+		char  remote_name_buf[256];
+		strncpy(remote_name_buf, port->remoteport->node->nodedesc, 256);
+
+		if (port->remoteport->ext_portnum)
+			snprintf(ext_port_str, 256, "%d", port->remoteport->ext_portnum);
+		else
+			ext_port_str[0] = '\0';
+
+		get_msg(width_msg, speed_msg, 256, port);
+		if (line_mode) {
+			if (print_port_guids) {
+				snprintf(remote_guid_str, 256,
+					"0x%016" PRIx64 " ",
+					port->remoteport->guid);
+			} else {
+				snprintf(remote_guid_str, 256,
+					"0x%016" PRIx64 " ",
+					port->remoteport->node->info.nodeguid);
+			}
+		}
+
+		snprintf(remote_str, 256,
+			"%s%6d %4d[%2s] \"%s\" (%s %s)\n",
+			remote_guid_str,
+			port->remoteport->info.lid ?
+				port->remoteport->info.lid :
+				port->remoteport->node->smalid,
+			port->remoteport->portnum,
+			ext_port_str,
+			remap_node_name(node_name_map,
+				port->remoteport->node->info.nodeguid,
+				remote_name_buf),
+			width_msg,
+			speed_msg
+			);
+	} else {
+		snprintf(remote_str, 256,
+			"%6s %4s[%2s] \"\" ( )\n", "", "", "");
+	}
+
+	if (add_sw_settings) {
+		snprintf(link_str, 256,
+			"(%3s %s %6s/%8s) (HOQ:%d VL_Stall:%d)",
+			ibnd_linkwidth_str(port->info.link_width_active),
+			ibnd_linkspeed_str(port->info.link_speed_active),
+			ibnd_linkstate_str(port->info.link_state),
+			ibnd_physstate_str(port->info.phys_state),
+			port->info.hoq_lifetime,
+			port->info.vl_stall_count
+			);
+	} else {
+		snprintf(link_str, 256,
+			"(%3s %s %6s/%8s)",
+			ibnd_linkwidth_str(port->info.link_width_active),
+			ibnd_linkspeed_str(port->info.link_speed_active),
+			ibnd_linkstate_str(port->info.link_state),
+			ibnd_physstate_str(port->info.phys_state)
+			);
+	}
+
+	if (port->ext_portnum)
+		snprintf(ext_port_str, 256, "%d", port->ext_portnum);
+	else
+		ext_port_str[0] = '\0';
+
+	if (line_mode) {
+		char  name_buf[256];
+		strncpy(name_buf, node->nodedesc, 256);
+		printf("0x%016" PRIx64 " \"%30s\" %6d %4d[%2s] ==%s==>  %s",
+			node->info.nodeguid,
+			remap_node_name(node_name_map,
+				node->info.nodeguid,
+				name_buf),
+			node->smalid, port->portnum,
+			ext_port_str,
+			link_str,
+			remote_str
+			);
+	} else {
+		printf("      %6d %4d[%2s] ==%s==>  %s",
+			node->smalid, port->portnum,
+			ext_port_str,
+			link_str,
+			remote_str
+			);
+	}
+}
+
+void
+print_switch(ibnd_node_t *node, void *user_data)
+{
+	int i = 0;
+
+	if (!line_mode) {
+		char  name_buf[256];
+		strncpy(name_buf, node->nodedesc, 256);
+		printf("Switch 0x%016" PRIx64 " %s:\n",
+			node->info.nodeguid,
+			remap_node_name(node_name_map,
+				node->info.nodeguid,
+				name_buf));
+	}
+
+	for (i = 1; i <= node->info.numports; i++) {
+		ibnd_port_t *port = node->ports[i];
+		if (!port)
+			continue;
+		if (!down_links_only || port->info.link_state == IBND_LINK_DOWN) {
+			print_port(node, port);
+		}
+	}
+}
+
+void
+usage(void)
+{
+	fprintf(stderr,
+		"Usage: %s [-hclp -S <guid> -D <direct route> -C <ca_name> -P <ca_port>]\n"
+		"   Report link speed and connection for each port of each switch which is active\n"
+		"   -h This help message\n"
+		"   -S <guid> output only the node specified by guid\n"
+		"   -D <direct route> print only node specified by <direct route>\n"
+		"   -f <dr_path> specify node to start \"from\"\n"
+		"   -n <hops> Number of hops to include away from specified node\n"
+		"   -d print only down links\n"
+		"   -l (line mode) print all information for each link on each line\n"
+		"   -p print additional switch settings (PktLifeTime,HoqLife,VLStallCount)\n"
+
+
+		"   -t <timeout_ms> timeout for any single fabric query\n"
+		"   -s show progress during scan\n"
+		"   --node-name-map <map_file> use specified node name map\n"
+
+		"   -C <ca_name> use selected Channel Adaptor name for queries\n"
+		"   -P <ca_port> use selected channel adaptor port for queries\n"
+		"   -g print port guids instead of node guids\n"
+		"   --debug print debug messages\n"
+		,
+			argv0);
+	exit(-1);
+}
+
+int
+main(int argc, char **argv)
+{
+	char *ca = 0;
+	int ca_port = 0;
+	ibnd_fabric_t *fabric = NULL;
+	uint64_t guid = 0;
+	char *dr_path = NULL;
+	char *from = NULL;
+	int hops = 0;
+	ib_portid_t port_id;
+
+	static char const str_opts[] = "S:D:n:C:P:t:sldgphuf:";
+	static const struct option long_opts[] = {
+		{ "S", 1, 0, 'S'},
+		{ "D", 1, 0, 'D'},
+		{ "num-hops", 1, 0, 'n'},
+		{ "down-links-only", 0, 0, 'd'},
+		{ "line-mode", 0, 0, 'l'},
+		{ "ca-name", 1, 0, 'C'},
+		{ "ca-port", 1, 0, 'P'},
+		{ "timeout", 1, 0, 't'},
+		{ "show", 0, 0, 's'},
+		{ "print-port-guids", 0, 0, 'g'},
+		{ "print-additional", 0, 0, 'p'},
+		{ "help", 0, 0, 'h'},
+		{ "usage", 0, 0, 'u'},
+		{ "node-name-map", 1, 0, 1},
+		{ "debug", 0, 0, 2},
+		{ "from", 1, 0, 'f'},
+		{ }
+	};
+
+	f = stdout;
+
+	argv0 = argv[0];
+
+	while (1) {
+		int ch = getopt_long(argc, argv, str_opts, long_opts, NULL);
+		if ( ch == -1 )
+			break;
+		switch(ch) {
+		case 1:
+			node_name_map_file = strdup(optarg);
+			break;
+		case 2:
+			ibnd_debug(1);
+			break;
+		case 'f':
+			from = strdup(optarg);
+			break;
+		case 'C':
+			ca = strdup(optarg);
+			break;
+		case 'P':
+			ca_port = strtoul(optarg, 0, 0);
+			break;
+		case 'D':
+			dr_path = strdup(optarg);
+			break;
+		case 'n':
+			hops = (int)strtol(optarg, NULL, 0);
+			break;
+		case 'd':
+			down_links_only = 1;
+			break;
+		case 'l':
+			line_mode = 1;
+			break;
+		case 't':
+			timeout_ms = strtoul(optarg, 0, 0);
+			break;
+		case 's':
+			ibnd_show_progress(1);
+			break;
+		case 'g':
+			print_port_guids = 1;
+			break;
+		case 'S':
+			guid = (uint64_t)strtoull(optarg, 0, 0);
+			break;
+		case 'p':
+			add_sw_settings = 1;
+			break;
+		default:
+			usage();
+			break;
+		}
+	}
+	argc -= optind;
+	argv += optind;
+
+	if (argc && !(f = fopen(argv[0], "w")))
+		fprintf(stderr, "can't open file %s for writing\n", argv[0]);
+
+	node_name_map = open_node_name_map(node_name_map_file);
+
+	if (from) {
+		/* only scan part of the fabric */
+		str2drpath(&(port_id.drpath), from, 0, 0);
+		if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, &port_id, hops)) == NULL) {
+			fprintf(stderr, "discover failed\n");
+			exit(1);
+		}
+		guid = 0;
+	} else {
+		if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) {
+			fprintf(stderr, "discover failed\n");
+			exit(1);
+		}
+	}
+
+	if (guid) {
+		ibnd_node_t *sw = ibnd_find_node_guid(fabric, guid);
+		print_switch(sw, NULL);
+	} else if (dr_path) {
+		ibnd_node_t *sw = ibnd_find_node_dr(fabric, dr_path);
+		print_switch(sw, NULL);
+	} else {
+		ibnd_iter_nodes_type(fabric, print_switch, IBND_SWITCH_NODE, NULL);
+	}
+
+	ibnd_destroy_fabric(fabric);
+
+	close_node_name_map(node_name_map);
+	exit(0);
+}
-- 
1.5.4.5


From ddiss at sgi.com  Thu Nov 20 22:10:45 2008
From: ddiss at sgi.com (David Disseldorp)
Date: Fri, 21 Nov 2008 17:10:45 +1100
Subject: [ofa-general] [PATCH] iser: avoid recv buf exhaustion
Message-ID: <1227247845-16023-1-git-send-email-ddiss@sgi.com>

iSCSI/iSER targets may send PDUs without a prior request from the initiator.
RFC 5046 refers to these PDUs as "unexpected". NOP-In PDUs with itt=RESERVED
and Asynchronous Message PDUs fall into this category.

The number of outstanding "unexpected" PDUs an iSER target may have at any time
is governed by the MaxOutstandingUnexpectedPDUs key, which is not yet supported.

Currently, when an iSER target sends an "unexpected" PDU, the initiator's recv
buffer consumed by the PDU is not replaced. If more than
initial_post_recv_bufs_num "unexpected" PDUs are received, the receive queue
will run out of receive work requests.

This patch ensures recv buffers consumed by "unexpected" PDUs are replaced
prior to sending the next control-type PDU.
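
The counting scheme can be sketched in userspace roughly as follows. This is
only an illustration under stated assumptions: it uses C11 <stdatomic.h> in
place of the kernel atomic_t API, and struct sketch_conn, sketch_post_recv()
and the post_budget field are hypothetical stand-ins, not iSER driver code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the connection state; not struct iser_conn. */
struct sketch_conn {
	atomic_int unexpected_pdu_count; /* buffers consumed, not yet replaced */
	int posted_recv_bufs;            /* buffers currently on the recv queue */
	int post_budget;                 /* posts that succeed before a simulated failure */
};

/* Hypothetical stand-in for iser_post_receive_control(): 0 on success. */
static int sketch_post_recv(struct sketch_conn *c)
{
	if (c->post_budget <= 0)
		return -1;
	c->post_budget--;
	c->posted_recv_bufs++;
	return 0;
}

/* Receive path: an unexpected PDU consumed one buffer, so record the debt. */
static void sketch_unexpected_pdu_received(struct sketch_conn *c)
{
	c->posted_recv_bufs--;
	atomic_fetch_add(&c->unexpected_pdu_count, 1);
}

/*
 * Send path: claim the whole debt in one atomic exchange, repost one buffer
 * per consumed one, and hand back the unpaid remainder if a post fails so
 * that the next send retries it.
 */
static int sketch_post_unexpected_recvs(struct sketch_conn *c)
{
	int outstanding = atomic_exchange(&c->unexpected_pdu_count, 0);

	while (outstanding > 0) {
		if (sketch_post_recv(c) != 0) {
			atomic_fetch_add(&c->unexpected_pdu_count, outstanding);
			return -1;
		}
		outstanding--;
	}
	return 0;
}
```

A caller would invoke sketch_unexpected_pdu_received() from its receive
completion path and sketch_post_unexpected_recvs() just before each send. The
exchange-then-add-back shape means PDUs that arrive while reposting is in
progress are counted separately and picked up by the following send.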

Signed-off-by: David Disseldorp <ddiss at sgi.com>
Signed-off-by: Ken Sandars <ksandars at sgi.com>
---
 drivers/infiniband/ulp/iser/iscsi_iser.h     |    3 +
 drivers/infiniband/ulp/iser/iser_initiator.c |   76 ++++++++++++++++++++++++--
 drivers/infiniband/ulp/iser/iser_verbs.c     |    1 +
 3 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 81a8262..8611195 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -252,6 +252,9 @@ struct iser_conn {
 	wait_queue_head_t	     wait;          /* waitq for conn/disconn  */
 	atomic_t                     post_recv_buf_count; /* posted rx count   */
 	atomic_t                     post_send_buf_count; /* posted tx count   */
+	atomic_t                     unexpected_pdu_count;/* count of received *
+							   * unexpected pdus   *
+							   * not yet retired   */
 	char 			     name[ISER_OBJECT_NAME_SIZE];
 	struct iser_page_vec         *page_vec;     /* represents SG to fmr maps*
 						     * maps serialized as tx is*/
diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index cdd2831..9f8cffb 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -274,8 +274,10 @@ int iser_conn_set_full_featured_mode(struct iscsi_conn *conn)
 	struct iscsi_iser_conn *iser_conn = conn->dd_data;
 
 	int i;
-	/* no need to keep it in a var, we are after login so if this should
-	 * be negotiated, by now the result should be available here */
+	/*
+	 * FIXME this value should be declared to the target during login with
+	 * the MaxOutstandingUnexpectedPDUs key when supported
+	 */
 	int initial_post_recv_bufs_num = ISER_MAX_RX_MISC_PDUS;
 
 	iser_dbg("Initially post: %d\n", initial_post_recv_bufs_num);
@@ -310,6 +312,33 @@ iser_check_xmit(struct iscsi_conn *conn, void *task)
 	return 0;
 }
 
+static inline int
+iser_post_unexpected_recvs(struct iscsi_conn *conn)
+{
+	struct iscsi_iser_conn *iser_conn = conn->dd_data;
+	int outstanding_unexp_pdus;
+	int err = 0;
+
+	if (atomic_read(&iser_conn->ib_conn->unexpected_pdu_count) == 0)
+		goto out;
+
+	outstanding_unexp_pdus =
+		atomic_xchg(&iser_conn->ib_conn->unexpected_pdu_count, 0);
+
+	while (outstanding_unexp_pdus > 0) {
+		if (iser_post_receive_control(conn) != 0) {
+			iser_err("post_rcv failed\n");
+			err = -ENOMEM;
+			atomic_add(outstanding_unexp_pdus,
+				   &iser_conn->ib_conn->unexpected_pdu_count);
+			goto out;
+		}
+		outstanding_unexp_pdus--;
+	}
+
+out:
+	return err;
+}
 
 /**
  * iser_send_command - send command PDU
@@ -372,6 +401,7 @@ int iser_send_command(struct iscsi_conn *conn,
 	iser_reg_single(iser_conn->ib_conn->device,
 			send_dto->regd[0], DMA_TO_DEVICE);
 
+	/* post recv buffer for SCSI response */
 	if (iser_post_receive_control(conn) != 0) {
 		iser_err("post_recv failed!\n");
 		err = -ENOMEM;
@@ -380,6 +410,12 @@ int iser_send_command(struct iscsi_conn *conn,
 
 	iser_task->status = ISER_TASK_STATUS_STARTED;
 
+	/*
+	 * post recv bufs for those consumed by unexpected pdus from target
+	 * errors are ignored, as retry occurs on next send
+	 */
+	iser_post_unexpected_recvs(conn);
+
 	err = iser_post_send(&iser_task->desc);
 	if (!err)
 		return 0;
@@ -478,6 +514,7 @@ int iser_send_control(struct iscsi_conn *conn,
 	int err = 0;
 	struct iser_regd_buf *regd_buf;
 	struct iser_device *device;
+	unsigned char opcode;
 
 	if (!iser_conn_state_comp(iser_conn->ib_conn, ISER_CONN_UP)) {
 		iser_err("Failed to send, conn: 0x%p is not up\n", iser_conn->ib_conn);
@@ -512,12 +549,24 @@ int iser_send_control(struct iscsi_conn *conn,
 				       data_seg_len);
 	}
 
-	if (iser_post_receive_control(conn) != 0) {
-		iser_err("post_rcv_buff failed!\n");
-		err = -ENOMEM;
-		goto send_control_error;
+	opcode = task->hdr->opcode & ISCSI_OPCODE_MASK;
+
+	/* post recv buffer for response if one is expected */
+	if (!((opcode == ISCSI_OP_NOOP_OUT)
+	 && (task->hdr->itt == RESERVED_ITT))) {
+		if (iser_post_receive_control(conn) != 0) {
+			iser_err("post_rcv_buff failed!\n");
+			err = -ENOMEM;
+			goto send_control_error;
+		}
 	}
 
+	/*
+	 * post recv bufs for those consumed by unexpected pdus from target
+	 * errors are ignored, as retry occurs on next send
+	 */
+	iser_post_unexpected_recvs(conn);
+
 	err = iser_post_send(mdesc);
 	if (!err)
 		return 0;
@@ -586,6 +635,21 @@ void iser_rcv_completion(struct iser_desc *rx_desc,
 	 * parallel to the execution of iser_conn_term. So the code that waits *
 	 * for the posted rx bufs refcount to become zero handles everything   */
 	atomic_dec(&conn->ib_conn->post_recv_buf_count);
+
+	/*
+	 * if an unexpected PDU was received then the recv wr consumed must
+	 * be replaced, this is done in the next send of a control-type PDU
+	 */
+	if ((opcode == ISCSI_OP_NOOP_IN)
+	 && (hdr->itt == RESERVED_ITT)) {
+		/* nop-in with itt = 0xffffffff */
+		atomic_inc(&conn->ib_conn->unexpected_pdu_count);
+	}
+	else if (opcode == ISCSI_OP_ASYNC_EVENT) {
+		/* asynchronous message */
+		atomic_inc(&conn->ib_conn->unexpected_pdu_count);
+	}
+	/* a reject PDU consumes the recv buf posted for the response */
 }
 
 void iser_snd_completion(struct iser_desc *tx_desc)
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 26ff621..6dc6b17 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -498,6 +498,7 @@ void iser_conn_init(struct iser_conn *ib_conn)
 	init_waitqueue_head(&ib_conn->wait);
 	atomic_set(&ib_conn->post_recv_buf_count, 0);
 	atomic_set(&ib_conn->post_send_buf_count, 0);
+	atomic_set(&ib_conn->unexpected_pdu_count, 0);
 	atomic_set(&ib_conn->refcount, 1);
 	INIT_LIST_HEAD(&ib_conn->conn_list);
 	spin_lock_init(&ib_conn->lock);
-- 
1.5.4.5
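
Note on the counter handling in iser_post_unexpected_recvs() above: the
function claims the whole count with one atomic exchange and, if a post
fails, adds the unposted remainder back so that the next send retries it.
Below is a minimal userspace sketch of that pattern, using C11 atomics in
place of the kernel's atomic_t. The names post_one_recv(),
repost_unexpected(), budget and posted are hypothetical stand-ins and are
not part of the iSER code.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int unexpected_pdu_count; /* buffers consumed by unexpected PDUs */
static int budget;                      /* hypothetical: number of posts that will succeed */
static int posted;                      /* number of buffers actually posted */

/* Hypothetical stand-in for iser_post_receive_control(); returns 0 on success. */
static int post_one_recv(void)
{
	if (budget == 0)
		return -1;
	budget--;
	posted++;
	return 0;
}

/* Claim the whole debt with one exchange, then hand back whatever was not repaid. */
static int repost_unexpected(void)
{
	int outstanding = atomic_exchange(&unexpected_pdu_count, 0);

	assert(outstanding >= 0);
	while (outstanding > 0) {
		if (post_one_recv() != 0) {
			/* return the remainder so that a later call retries it */
			atomic_fetch_add(&unexpected_pdu_count, outstanding);
			return -1;
		}
		outstanding--;
	}
	return 0;
}
```

Because the count is taken with an exchange rather than a read followed by
a store, increments that arrive from the receive path while the loop runs
are not lost; they simply stay in the counter for the next call.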


From jackm at dev.mellanox.co.il  Thu Nov 20 23:02:01 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Fri, 21 Nov 2008 09:02:01 +0200
Subject: [ofa-general] Re: Race condition in userspace libraries with
	create/destroy qp
In-Reply-To: <adavduiggfo.fsf@cisco.com>
References: <200811201211.46527.jackm@dev.mellanox.co.il>
	<adavduiggfo.fsf@cisco.com>
Message-ID: <200811210902.03040.jackm@dev.mellanox.co.il>

On Friday 21 November 2008 00:50, Roland Dreier wrote:
> > 2. Create a mutex for this purpose, and use it to force the create and destroy qp operations
>  >    to be atomic WRT  the ibv_cmd_xxx_qp operations and the store/clear qp operations.
> 
> This looks like the best solution.
> 
> I wonder if we should just add this synchronization in libibverbs rather
> than individual drivers?  I notice that libcxgb3 seems to have the same
> bug AFAICS.  But maybe it's better to just keep the simple rule that
> driver libraries are responsible for locking their own data structures.
> 
Thanks for responding so quickly!

I prefer to keep the rule that low-level driver libraries are responsible.
It's not clear that all low-level drivers necessarily have this issue.

BTW, I notice that there is a ctx->qp_table_mutex (used only in file
libmlx4/src/qp.c). What if I steal that and move its use upwards into
procedures mlx4_create_qp/mlx4_destroy_qp? (a bit cheesy, but it saves
creating yet another mutex in the mlx4 user context).

- Jack


From sashak at voltaire.com  Fri Nov 21 01:28:37 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 21 Nov 2008 11:28:37 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_trap_rcv.c disable the port
	with the least hop count
In-Reply-To: <49251926.9090509@gmail.com>
References: <49251926.9090509@gmail.com>
Message-ID: <20081121092837.GA6965@sashak.voltaire.com>

On 10:00 Thu 20 Nov     , Eli Dorfman wrote:
> disable the port with the least hop count.
> this will address the case of inter switch link where the
> most remote port (from opensm) is sending traps.
> in that case we would like to disable the nearest switch port (from opensm).
> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Fri Nov 21 01:29:16 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 21 Nov 2008 11:29:16 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm/osm_state_mgr.c: bug fix in
	unicast cache
In-Reply-To: <492520CF.4080001@dev.mellanox.co.il>
References: <492520CF.4080001@dev.mellanox.co.il>
Message-ID: <20081121092916.GB6965@sashak.voltaire.com>

On 10:33 Thu 20 Nov     , Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> When there are errors during initialization and new
> heavy sweep is forced, unicast cache might hold a
> snapshot of the previous routing, and since there
> might be no *topology* changes, unicast cache will
> apply that cached routing, which might be wrong.
> 
> This patch invalidates cache explicitly if there
> were initialization errors in addition to few other
> cases.
> 
> V2: don't invalidate cache when
>     opt.force_heavy_sweep is on.
> 
> This fix addresses bug #1398.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Applied. Thanks.

Sasha


From sashak at voltaire.com  Fri Nov 21 01:45:14 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 21 Nov 2008 11:45:14 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_trap_rcv.c disable the port
	with the least hop count
In-Reply-To: <49251926.9090509@gmail.com>
References: <49251926.9090509@gmail.com>
Message-ID: <20081121094514.GC6965@sashak.voltaire.com>

Hi Eli,

On 10:00 Thu 20 Nov     , Eli Dorfman wrote:
> disable the port with the least hop count.
> this will address the case of inter switch link where the
> most remote port (from opensm) is sending traps.
> in that case we would like to disable the nearest switch port (from opensm).
> 
> Signed-off-by: Eli Dorfman <elid at voltaire.com>

I applied the patch. However, I have a question.

> ---
>  opensm/opensm/osm_trap_rcv.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
> index 07c5183..d1dfbd4 100644
> --- a/opensm/opensm/osm_trap_rcv.c
> +++ b/opensm/opensm/osm_trap_rcv.c
> @@ -239,8 +239,8 @@ static int disable_port(osm_sm_t *sm, osm_physp_t *p)
>  	ib_port_info_t *pi = (ib_port_info_t *)payload;
>  	int ret;
>  
> -	/* in case of endport - disable switch's peer port */
> -	if (osm_node_get_type(p->p_node) != IB_NODE_TYPE_SWITCH)
> +	/* select the nearest port to master opensm */
> +	if (p->dr_path.hop_count > p->p_remote_physp->dr_path.hop_count)
>  		p = p->p_remote_physp;

Is it possible that this noisy port is a switch external port that is
"the nearest" to the OpenSM node and has no remote port (due to an
unstable link)? We have seen such cases in practice, and OpenSM handles
them in a light sweep (see the __osm_state_mgr_get_remote_port_info()
calls in the __osm_state_mgr_light_sweep_start() function).

With the endport-only check this is impossible IMO, but I don't see why
it cannot happen with switch ports. Right?

If so then maybe the code should look like:

	if (p->p_remote_physp &&
	    p->dr_path.hop_count > p->p_remote_physp->dr_path.hop_count)
		p = p->p_remote_physp;


Sasha

>  
>  	/* If trap 131, might want to disable peer port if available */
> -- 
> 1.5.5
> 


From vlad at lists.openfabrics.org  Fri Nov 21 03:26:33 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 21 Nov 2008 03:26:33 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081121-0200 daily build status
Message-ID: <20081121112633.D12CAE60AE8@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From hal.rosenstock at gmail.com  Fri Nov 21 04:25:23 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 21 Nov 2008 07:25:23 -0500
Subject: ***SPAM*** Re: [ofa-general] [PATCH 0/3] ibnetdiscover library
	"libibnetdisc"
In-Reply-To: <20081120163809.26a3c499.weiny2@llnl.gov>
References: <20081120163809.26a3c499.weiny2@llnl.gov>
Message-ID: <f0e08f230811210425y4cadbebdk2d18318074635de3@mail.gmail.com>

Hi Ira,

On Thu, Nov 20, 2008 at 7:38 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> The following 3 patches implement "libibnetdisc" which provides the
> functionality of ibnetdiscover in a C library.
>
> I mentioned this to Sasha at the last Sonoma conference and posted the bulk of
> this code to the list a few months ago.  This library is still providing the 85%
> performance speed up of iblinkinfo.pl on our clusters.
>
> This new series is heavily tested and, for our hardware, preserves the
> functionality of ibnetdiscover.  Since I don't have a Xsigo box to test on I
> can only verify that it compiles correctly.

Have you also verified this with QLogic/Silverstorm and Cisco chassis
switches? They were supported too.

-- Hal

> Ira
>


From fenkes at de.ibm.com  Fri Nov 21 07:37:14 2008
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Fri, 21 Nov 2008 16:37:14 +0100
Subject: [ofa-general] [PATCH] IB/ehca: Fix lockdep failures for
	shca_list_lock
In-Reply-To: <48499C11.7030504@gmail.com>
References: <200806061835.43802.fenkes@de.ibm.com> <48499C11.7030504@gmail.com>
Message-ID: <200811211637.15300.fenkes@de.ibm.com>

From: Michael Ellerman <michael at ellerman.id.au>

shca_list_lock is taken from softirq context in ehca_poll_eqs, so we need to
lock IRQ safe elsewhere.

Signed-off-by: Michael Ellerman <michael at ellerman.id.au>
Acked-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_main.c |   17 ++++++++++-------
 1 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index bb02a86..021c454 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -717,6 +717,7 @@ static int __devinit ehca_probe(struct of_device *dev,
 	const u64 *handle;
 	struct ib_pd *ibpd;
 	int ret, i, eq_size;
+	u64 flags;
 
 	handle = of_get_property(dev->node, "ibm,hca-handle", NULL);
 	if (!handle) {
@@ -830,9 +831,9 @@ static int __devinit ehca_probe(struct of_device *dev,
 		ehca_err(&shca->ib_device,
 			 "Cannot create device attributes  ret=%d", ret);
 
-	spin_lock(&shca_list_lock);
+	spin_lock_irqsave(&shca_list_lock, flags);
 	list_add(&shca->shca_list, &shca_list);
-	spin_unlock(&shca_list_lock);
+	spin_unlock_irqrestore(&shca_list_lock, flags);
 
 	return 0;
 
@@ -878,6 +879,7 @@ probe1:
 static int __devexit ehca_remove(struct of_device *dev)
 {
 	struct ehca_shca *shca = dev->dev.driver_data;
+	u64 flags;
 	int ret;
 
 	sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp);
@@ -915,9 +917,9 @@ static int __devexit ehca_remove(struct of_device *dev)
 
 	ib_dealloc_device(&shca->ib_device);
 
-	spin_lock(&shca_list_lock);
+	spin_lock_irqsave(&shca_list_lock, flags);
 	list_del(&shca->shca_list);
-	spin_unlock(&shca_list_lock);
+	spin_unlock_irqrestore(&shca_list_lock, flags);
 
 	return ret;
 }
@@ -975,6 +977,7 @@ static int ehca_mem_notifier(struct notifier_block *nb,
 			     unsigned long action, void *data)
 {
 	static unsigned long ehca_dmem_warn_time;
+	unsigned long flags;
 
 	switch (action) {
 	case MEM_CANCEL_OFFLINE:
@@ -985,12 +988,12 @@ static int ehca_mem_notifier(struct notifier_block *nb,
 	case MEM_GOING_ONLINE:
 	case MEM_GOING_OFFLINE:
 		/* only ok if no hca is attached to the lpar */
-		spin_lock(&shca_list_lock);
+		spin_lock_irqsave(&shca_list_lock, flags);
 		if (list_empty(&shca_list)) {
-			spin_unlock(&shca_list_lock);
+			spin_unlock_irqrestore(&shca_list_lock, flags);
 			return NOTIFY_OK;
 		} else {
-			spin_unlock(&shca_list_lock);
+			spin_unlock_irqrestore(&shca_list_lock, flags);
 			if (printk_timed_ratelimit(&ehca_dmem_warn_time,
 						   30 * 1000))
 				ehca_gen_err("DMEM operations are not allowed"
-- 
1.5.5


From fenkes at de.ibm.com  Fri Nov 21 08:18:16 2008
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Fri, 21 Nov 2008 17:18:16 +0100
Subject: [ofa-general] [PATCH] IB/ehca: Fix locking for shca_list_lock
In-Reply-To: <1227283347.3599.8.camel@johannes.berg>
References: <200806061835.43802.fenkes@de.ibm.com>
	<200811211637.15300.fenkes@de.ibm.com>
	<1227283347.3599.8.camel@johannes.berg>
Message-ID: <200811211718.17489.fenkes@de.ibm.com>

shca_list_lock is taken from softirq context in ehca_poll_eqs, so we need to
lock IRQ safe elsewhere.

Signed-off-by: Michael Ellerman <michael at ellerman.id.au>
Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---

On Friday 21 November 2008 17:02, Johannes Berg wrote:
> On Fri, 2008-11-21 at 16:37 +0100, Joachim Fenkes wrote:
> 
> > +	u64 flags;
> 
> > -	spin_lock(&shca_list_lock);
> > +	spin_lock_irqsave(&shca_list_lock, flags);
> 
> That's wrong and I think will give a warning on all machines where
> u64 != unsigned long. Might not particularly matter in this case.

Doesn't matter for a ppc64 only driver, but you're right nonetheless. Thanks.
 
> Also, generally it seems wrong to say "fix lockdep failure" when the
> patch really fixes a bug that lockdep happened to find.

Whatever -- changed.

Here's the updated patch.

Regards,
  Joachim


 drivers/infiniband/hw/ehca/ehca_main.c |   17 ++++++++++-------
 1 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index bb02a86..169aa1a 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -717,6 +717,7 @@ static int __devinit ehca_probe(struct of_device *dev,
 	const u64 *handle;
 	struct ib_pd *ibpd;
 	int ret, i, eq_size;
+	unsigned long flags;
 
 	handle = of_get_property(dev->node, "ibm,hca-handle", NULL);
 	if (!handle) {
@@ -830,9 +831,9 @@ static int __devinit ehca_probe(struct of_device *dev,
 		ehca_err(&shca->ib_device,
 			 "Cannot create device attributes  ret=%d", ret);
 
-	spin_lock(&shca_list_lock);
+	spin_lock_irqsave(&shca_list_lock, flags);
 	list_add(&shca->shca_list, &shca_list);
-	spin_unlock(&shca_list_lock);
+	spin_unlock_irqrestore(&shca_list_lock, flags);
 
 	return 0;
 
@@ -878,6 +879,7 @@ probe1:
 static int __devexit ehca_remove(struct of_device *dev)
 {
 	struct ehca_shca *shca = dev->dev.driver_data;
+	unsigned long flags;
 	int ret;
 
 	sysfs_remove_group(&dev->dev.kobj, &ehca_dev_attr_grp);
@@ -915,9 +917,9 @@ static int __devexit ehca_remove(struct of_device *dev)
 
 	ib_dealloc_device(&shca->ib_device);
 
-	spin_lock(&shca_list_lock);
+	spin_lock_irqsave(&shca_list_lock, flags);
 	list_del(&shca->shca_list);
-	spin_unlock(&shca_list_lock);
+	spin_unlock_irqrestore(&shca_list_lock, flags);
 
 	return ret;
 }
@@ -975,6 +977,7 @@ static int ehca_mem_notifier(struct notifier_block *nb,
 			     unsigned long action, void *data)
 {
 	static unsigned long ehca_dmem_warn_time;
+	unsigned long flags;
 
 	switch (action) {
 	case MEM_CANCEL_OFFLINE:
@@ -985,12 +988,12 @@ static int ehca_mem_notifier(struct notifier_block *nb,
 	case MEM_GOING_ONLINE:
 	case MEM_GOING_OFFLINE:
 		/* only ok if no hca is attached to the lpar */
-		spin_lock(&shca_list_lock);
+		spin_lock_irqsave(&shca_list_lock, flags);
 		if (list_empty(&shca_list)) {
-			spin_unlock(&shca_list_lock);
+			spin_unlock_irqrestore(&shca_list_lock, flags);
 			return NOTIFY_OK;
 		} else {
-			spin_unlock(&shca_list_lock);
+			spin_unlock_irqrestore(&shca_list_lock, flags);
 			if (printk_timed_ratelimit(&ehca_dmem_warn_time,
 						   30 * 1000))
 				ehca_gen_err("DMEM operations are not allowed"
-- 
1.5.5


From johannes at sipsolutions.net  Fri Nov 21 08:02:27 2008
From: johannes at sipsolutions.net (Johannes Berg)
Date: Fri, 21 Nov 2008 17:02:27 +0100
Subject: [ofa-general] Re: [PATCH] IB/ehca: Fix lockdep failures for
	shca_list_lock
In-Reply-To: <200811211637.15300.fenkes@de.ibm.com>
References: <200806061835.43802.fenkes@de.ibm.com>
	<48499C11.7030504@gmail.com>  <200811211637.15300.fenkes@de.ibm.com>
Message-ID: <1227283347.3599.8.camel@johannes.berg>

On Fri, 2008-11-21 at 16:37 +0100, Joachim Fenkes wrote:

> +	u64 flags;

> -	spin_lock(&shca_list_lock);
> +	spin_lock_irqsave(&shca_list_lock, flags);

That's wrong and I think will give a warning on all machines where
u64 != unsigned long. Might not particularly matter in this case.

Also, generally it seems wrong to say "fix lockdep failure" when the
patch really fixes a bug that lockdep happened to find.

johannes
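
Note on the "u64 flags" remark: as far as I recall, kernels of this period
make spin_lock_irqsave() run typecheck(unsigned long, flags), which is
where the warning would come from on machines where u64 is unsigned long
long. The sketch below is a userspace approximation of that check. The
typecheck() macro is written from memory and should be treated as an
assumption, and fake_lock_irqsave(), fake_unlock_irqrestore() and
fake_irq_state are hypothetical stand-ins. It needs GNU C (gcc or clang).

```c
#include <assert.h>

/* Modelled on the kernel's typecheck(): comparing the two pointers makes
 * the compiler warn when the type of x differs from the expected type. */
#define typecheck(type, x) \
({	type __dummy; \
	__typeof__(x) __dummy2; \
	(void)(&__dummy == &__dummy2); \
	1; \
})

static unsigned long fake_irq_state = 1; /* 1 means "interrupts enabled" */

/* Hypothetical stand-in for spin_lock_irqsave(): save the state, then disable. */
#define fake_lock_irqsave(flags) \
	do { \
		typecheck(unsigned long, flags); \
		(flags) = fake_irq_state; \
		fake_irq_state = 0; \
	} while (0)

/* Hypothetical stand-in for spin_unlock_irqrestore(): put the saved state back. */
#define fake_unlock_irqrestore(flags) \
	do { \
		typecheck(unsigned long, flags); \
		fake_irq_state = (flags); \
	} while (0)

static unsigned long critical_section(void)
{
	/* Declaring this as unsigned long long would draw a "comparison of
	 * distinct pointer types" warning from typecheck(). */
	unsigned long flags;
	unsigned long seen;

	fake_lock_irqsave(flags);
	seen = fake_irq_state;          /* 0 while "locked" */
	assert(seen == 0);
	fake_unlock_irqrestore(flags);
	return seen;
}
```

On ppc64 the kernel's u64 is an unsigned long, so the original declaration
would compile quietly there, which matches the "might not particularly
matter in this case" comment.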

From rdreier at cisco.com  Fri Nov 21 10:28:03 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 21 Nov 2008 10:28:03 -0800
Subject: [ofa-general] Re: [PATCH] IB/ehca: Fix locking for shca_list_lock
In-Reply-To: <200811211718.17489.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Fri, 21 Nov 2008 17:18:16 +0100")
References: <200806061835.43802.fenkes@de.ibm.com>
	<200811211637.15300.fenkes@de.ibm.com>
	<1227283347.3599.8.camel@johannes.berg>
	<200811211718.17489.fenkes@de.ibm.com>
Message-ID: <adawsexexxo.fsf@cisco.com>

Looks good... I'll add this for 2.6.29, since as far as I can tell this
bug has been there approximately forever already.


From rdreier at cisco.com  Fri Nov 21 10:44:01 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 21 Nov 2008 10:44:01 -0800
Subject: [ofa-general] Re: Race condition in userspace libraries with
	create/destroy qp
In-Reply-To: <200811210902.03040.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Fri, 21 Nov 2008 09:02:01 +0200")
References: <200811201211.46527.jackm@dev.mellanox.co.il>
	<adavduiggfo.fsf@cisco.com>
	<200811210902.03040.jackm@dev.mellanox.co.il>
Message-ID: <adaod08gbri.fsf@cisco.com>

 > I prefer to keep the rule that low-level driver libraries are responsible.
 > It's not clear that all low-level drivers necessarily have this issue.

Yes, makes sense to me.

 > BTW, I notice that there is a ctx->qp_table_mutex (used only in file
 > libmlx4/src/qp.c). What if I steal that and move its use upwards into
 > procedures mlx4_create_qp/mlx4_destroy_qp? (a bit cheesy, but it saves
 > creating yet another mutex in the mlx4 user context).

Actually I don't think it's cheesy at all -- expanding the region where
qp_table_mutex is held to avoid this bug makes perfect sense to me and
seems like a clean solution.

 - R.
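
For illustration, a minimal sketch of what "expanding the region where
qp_table_mutex is held" could look like. This is not libmlx4 code: struct
toy_ctx, toy_cmd_create(), toy_cmd_destroy() and the qpn_in_use array are
hypothetical stand-ins for the mlx4 context, the ibv_cmd_xxx_qp calls and
the kernel-side QPN allocation. The point is only that the command and the
table update happen under one lock, so another thread cannot be handed a
recycled QPN and store its entry before the destroying thread clears the
slot.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define TABLE_SIZE 8

struct toy_qp {
	unsigned qpn;
};

struct toy_ctx {
	pthread_mutex_t qp_table_mutex;
	struct toy_qp *qp_table[TABLE_SIZE];
	unsigned char qpn_in_use[TABLE_SIZE]; /* stands in for kernel QPN allocation */
};

/* Hypothetical stand-in for ibv_cmd_create_qp(): hand out the lowest free QPN. */
static int toy_cmd_create(struct toy_ctx *ctx, struct toy_qp *qp)
{
	for (unsigned i = 0; i < TABLE_SIZE; i++) {
		if (!ctx->qpn_in_use[i]) {
			ctx->qpn_in_use[i] = 1;
			qp->qpn = i;
			return 0;
		}
	}
	return -1;
}

/* Hypothetical stand-in for ibv_cmd_destroy_qp(): the QPN becomes reusable. */
static void toy_cmd_destroy(struct toy_ctx *ctx, struct toy_qp *qp)
{
	ctx->qpn_in_use[qp->qpn] = 0;
}

static int toy_create_qp(struct toy_ctx *ctx, struct toy_qp *qp)
{
	int ret;

	pthread_mutex_lock(&ctx->qp_table_mutex);
	ret = toy_cmd_create(ctx, qp);
	if (ret == 0) {
		/* with the lock held, the slot cannot still hold a stale entry */
		assert(ctx->qp_table[qp->qpn] == NULL);
		ctx->qp_table[qp->qpn] = qp;
	}
	pthread_mutex_unlock(&ctx->qp_table_mutex);
	return ret;
}

static void toy_destroy_qp(struct toy_ctx *ctx, struct toy_qp *qp)
{
	pthread_mutex_lock(&ctx->qp_table_mutex);
	toy_cmd_destroy(ctx, qp);          /* QPN is free from here on ... */
	ctx->qp_table[qp->qpn] = NULL;     /* ... but nobody can reuse it yet */
	pthread_mutex_unlock(&ctx->qp_table_mutex);
}
```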


From sashak at voltaire.com  Fri Nov 21 11:24:28 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 21 Nov 2008 21:24:28 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_switch.h: use updated LFT for
	routing
In-Reply-To: <492550E3.90805@dev.mellanox.co.il>
References: <492550E3.90805@dev.mellanox.co.il>
Message-ID: <20081121192428.GB8310@sashak.voltaire.com>

Hi Yevgeny,

On 13:58 Thu 20 Nov     , Yevgeny Kliteynik wrote:
> 
> Function osm_switch_get_port_by_lid() was using the switch's
> LFT, so this LFT might not be updated to recent routing.

I guess this can happen only with the 'subnet_initialization_error' flag
up (a failed LinFwdTbl set will trigger this flag).

> I think that this was also relevant before the LFT simplification.

Yes, logically it should be so, but...

> One immediate outcome of this bug is opensm.fdbs file - when it
> is dumped from the switch LFT (and not from lft_buf),

Why is this bug triggered only now?

> it sometimes
> doesn't match the lst file.

What does this "sometimes" mean? I think the case should be investigated
more deeply. With such a patch we are just trying to hide a possible issue.

As far as I understand, opensm.fdbs (and the other routing dumps) are
generated only after all LinFwdTbl responses have arrived. When some of
them fail, the 'subnet_initialization_error' flag is up and OpenSM will
resweep. If so, why is 'opensm.fdbs' broken? It is not immediately
clear to me.

Sasha

> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  opensm/include/opensm/osm_switch.h |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)
> 
> diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
> index caa0bc5..f06931c 100644
> --- a/opensm/include/opensm/osm_switch.h
> +++ b/opensm/include/opensm/osm_switch.h
> @@ -411,7 +411,11 @@ osm_switch_get_port_by_lid(IN const osm_switch_t * const p_sw,
>  {
>  	if (lid_ho == 0 || lid_ho > IB_LID_UCAST_END_HO)
>  		return OSM_NO_PATH;
> -	return p_sw->lft[lid_ho];
> +
> +	if (p_sw->lft_buf)
> +		return p_sw->lft_buf[lid_ho];
> +	else
> +		return p_sw->lft[lid_ho];
>  }
>  /*
>  * PARAMETERS
> -- 
> 1.5.1.4
> 
> 


From chien.tin.tung at intel.com  Fri Nov 21 12:50:38 2008
From: chien.tin.tung at intel.com (Chien Tung)
Date: Fri, 21 Nov 2008 14:50:38 -0600
Subject: [ofa-general] [PATCH 01/10] RDMA/nes: Cleanup cqp_request list usage
Message-ID: <20081121205038.GA5976@ctung-MOBL>

From: Faisal Latif <faisal.latif at intel.com>

RDMA/nes: Cleanup cqp_request list usage

Use nes_free_cqp_request() from commit 1ff66e8c1faee7c2711b84b9c89e1c5fcd767839.
Change some continue statements to break in nes_cm_timer_tick.  send_entry
used to be a list processed in a loop, hence the continue.  Now that it is
a single item, break is the semantically correct way to leave the block.

Signed-off-by: Faisal Latif <faisal.latif at intel.com>
Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
--
Roland,

This patch series is a continuation of nes_cm rework/bugfix.  Most of them deal with
resource management and shutdown issues.  They have been tested with Intel MPI/DAPL
and proved to scale much better than current code base.  Please consider them for 2.6.28.

Regards,

Chien

diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 2caf9da..2a1d6c7 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -519,7 +519,7 @@ static void nes_cm_timer_tick(unsigned long pass)
 		do {
 			send_entry = cm_node->send_entry;
 			if (!send_entry)
-				continue;
+				break;
 			if (time_after(send_entry->timetosend, jiffies)) {
 				if (cm_node->state != NES_CM_STATE_TSA) {
 					if ((nexttimeout >
@@ -528,18 +528,18 @@ static void nes_cm_timer_tick(unsigned long pass)
 						nexttimeout =
 							send_entry->timetosend;
 						settimer = 1;
-						continue;
+						break;
 					}
 				} else {
 					free_retrans_entry(cm_node);
-					continue;
+					break;
 				}
 			}
 
 			if ((cm_node->state == NES_CM_STATE_TSA) ||
 				(cm_node->state == NES_CM_STATE_CLOSED)) {
 				free_retrans_entry(cm_node);
-				continue;
+				break;
 			}
 
 			if (!send_entry->retranscount ||
@@ -557,7 +557,7 @@ static void nes_cm_timer_tick(unsigned long pass)
 						NES_CM_EVENT_ABORTED);
 				spin_lock_irqsave(&cm_node->retrans_list_lock,
 					flags);
-				continue;
+				break;
 			}
 			atomic_inc(&send_entry->skb->users);
 			cm_packets_retrans++;
@@ -583,7 +583,7 @@ static void nes_cm_timer_tick(unsigned long pass)
 				send_entry->retrycount--;
 				nexttimeout = jiffies + NES_SHORT_TIME;
 				settimer = 1;
-				continue;
+				break;
 			} else {
 				cm_packets_sent++;
 			}
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index d36c9a0..4fdb724 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -1695,13 +1695,8 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries,
 			/* use 4k pbl */
 			nes_debug(NES_DBG_CQ, "pbl_entries=%u, use a 4k PBL\n", pbl_entries);
 			if (nesadapter->free_4kpbl == 0) {
-				if (cqp_request->dynamic) {
-					spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
-					kfree(cqp_request);
-				} else {
-					list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs);
-					spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
-				}
+				spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
+				nes_free_cqp_request(nesdev, cqp_request);
 				if (!context)
 					pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem,
 							nescq->hw_cq.cq_pbase);
@@ -1717,13 +1712,8 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries,
 			/* use 256 byte pbl */
 			nes_debug(NES_DBG_CQ, "pbl_entries=%u, use a 256 byte PBL\n", pbl_entries);
 			if (nesadapter->free_256pbl == 0) {
-				if (cqp_request->dynamic) {
-					spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
-					kfree(cqp_request);
-				} else {
-					list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs);
-					spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
-				}
+				spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
+				nes_free_cqp_request(nesdev, cqp_request);
 				if (!context)
 					pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem,
 							nescq->hw_cq.cq_pbase);
@@ -1928,13 +1918,8 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd,
 			/* Two level PBL */
 			if ((pbl_count+1) > nesadapter->free_4kpbl) {
 				nes_debug(NES_DBG_MR, "Out of 4KB Pbls for two level request.\n");
-				if (cqp_request->dynamic) {
-					spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
-					kfree(cqp_request);
-				} else {
-					list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs);
-					spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
-				}
+				spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
+				nes_free_cqp_request(nesdev, cqp_request);
 				return -ENOMEM;
 			} else {
 				nesadapter->free_4kpbl -= pbl_count+1;
@@ -1942,13 +1927,8 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd,
 		} else if (residual_page_count > 32) {
 			if (pbl_count > nesadapter->free_4kpbl) {
 				nes_debug(NES_DBG_MR, "Out of 4KB Pbls.\n");
-				if (cqp_request->dynamic) {
-					spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
-					kfree(cqp_request);
-				} else {
-					list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs);
-					spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
-				}
+				spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
+				nes_free_cqp_request(nesdev, cqp_request);
 				return -ENOMEM;
 			} else {
 				nesadapter->free_4kpbl -= pbl_count;
@@ -1956,13 +1936,8 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd,
 		} else {
 			if (pbl_count > nesadapter->free_256pbl) {
 				nes_debug(NES_DBG_MR, "Out of 256B Pbls.\n");
-				if (cqp_request->dynamic) {
-					spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
-					kfree(cqp_request);
-				} else {
-					list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs);
-					spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
-				}
+				spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
+				nes_free_cqp_request(nesdev, cqp_request);
 				return -ENOMEM;
 			} else {
 				nesadapter->free_256pbl -= pbl_count;


From chien.tin.tung at intel.com  Fri Nov 21 12:50:41 2008
From: chien.tin.tung at intel.com (Chien Tung)
Date: Fri, 21 Nov 2008 14:50:41 -0600
Subject: [ofa-general] [PATCH 02/10] RDMA/nes: Lock down connected_nodes list
	while processing it
Message-ID: <20081121205041.GA828@ctung-MOBL>

From: Faisal Latif <faisal.latif at intel.com>

RDMA/nes: Lock down connected_nodes list while processing it

While processing the connected_nodes list, we released the lock whenever
we needed to send a reset to the remote partner.  That created a window
in which the list could be modified.  Change this into a two-step process:
place the nodes that need processing on a local list, then process the
local list.

Signed-off-by: Faisal Latif <faisal.latif at intel.com>
Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
--
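
Illustration of the two-step process described in the commit message, as a
userspace sketch with a pthread mutex instead of the ht_lock spinlock. It
is not taken from the nes driver: struct node, slow_process() and
process_shared_list() are hypothetical, and the plain int refcount stands
in for what would be an atomic reference count in real code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct node {
	struct node *next;      /* link on the shared list, protected by list_lock */
	struct node *work_next; /* link on the caller's private work list */
	int needs_work;
	int refcount;
	int processed;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *shared_head;

/* Hypothetical slow step that must not run under list_lock (e.g. sending a reset). */
static void slow_process(struct node *n)
{
	n->processed++;
}

static int process_shared_list(void)
{
	struct node *work = NULL, *n, *next;
	int handled = 0;

	/* Step 1: with the lock held, only select nodes and pin them. */
	pthread_mutex_lock(&list_lock);
	for (n = shared_head; n; n = n->next) {
		if (n->needs_work) {
			n->refcount++;
			n->work_next = work;
			work = n;
		}
	}
	pthread_mutex_unlock(&list_lock);

	/* Step 2: the shared list is no longer touched; walk the private list. */
	for (n = work; n; n = next) {
		next = n->work_next;
		n->work_next = NULL;
		slow_process(n);
		assert(n->refcount > 0);
		n->refcount--;  /* freeing the node at zero is left out of the sketch */
		handled++;
	}
	return handled;
}
```

The second link field is what allows a node to sit on the shared list and
on the private list at the same time, which is the role timer_entry and
reset_entry play in the patch below.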
diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 2a1d6c7..257d994 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -459,13 +459,23 @@ static void nes_cm_timer_tick(unsigned long pass)
 	int ret = NETDEV_TX_OK;
 	enum nes_cm_node_state last_state;
 
+	struct list_head timer_list;
+	INIT_LIST_HEAD(&timer_list);
 	spin_lock_irqsave(&cm_core->ht_lock, flags);
 
 	list_for_each_safe(list_node, list_core_temp,
-		&cm_core->connected_nodes) {
+				&cm_core->connected_nodes) {
 		cm_node = container_of(list_node, struct nes_cm_node, list);
-		add_ref_cm_node(cm_node);
-		spin_unlock_irqrestore(&cm_core->ht_lock, flags);
+		if (!list_empty(&cm_node->recv_list) || (cm_node->send_entry)) {
+			add_ref_cm_node(cm_node);
+			list_add(&cm_node->timer_entry, &timer_list);
+		}
+	}
+	spin_unlock_irqrestore(&cm_core->ht_lock, flags);
+
+	list_for_each_safe(list_node, list_core_temp, &timer_list) {
+		cm_node = container_of(list_node, struct nes_cm_node,
+					timer_entry);
 		spin_lock_irqsave(&cm_node->recv_list_lock, flags);
 		list_for_each_safe(list_core, list_node_temp,
 			&cm_node->recv_list) {
@@ -615,14 +625,12 @@ static void nes_cm_timer_tick(unsigned long pass)
 
 		spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags);
 		rem_ref_cm_node(cm_node->cm_core, cm_node);
-		spin_lock_irqsave(&cm_core->ht_lock, flags);
 		if (ret != NETDEV_TX_OK) {
 			nes_debug(NES_DBG_CM, "rexmit failed for cm_node=%p\n",
 				cm_node);
 			break;
 		}
 	}
-	spin_unlock_irqrestore(&cm_core->ht_lock, flags);
 
 	if (settimer) {
 		if (!timer_pending(&cm_core->tcp_timer)) {
@@ -925,28 +933,36 @@ static int mini_cm_dec_refcnt_listen(struct nes_cm_core *cm_core,
 	struct list_head *list_pos = NULL;
 	struct list_head *list_temp = NULL;
 	struct nes_cm_node *cm_node = NULL;
+	struct list_head reset_list;
 
 	nes_debug(NES_DBG_CM, "attempting listener= %p free_nodes= %d, "
 		"refcnt=%d\n", listener, free_hanging_nodes,
 		atomic_read(&listener->ref_count));
 	/* free non-accelerated child nodes for this listener */
+	INIT_LIST_HEAD(&reset_list);
 	if (free_hanging_nodes) {
 		spin_lock_irqsave(&cm_core->ht_lock, flags);
 		list_for_each_safe(list_pos, list_temp,
-			&g_cm_core->connected_nodes) {
+				   &g_cm_core->connected_nodes) {
 			cm_node = container_of(list_pos, struct nes_cm_node,
 				list);
 			if ((cm_node->listener == listener) &&
-				(!cm_node->accelerated)) {
-				cleanup_retrans_entry(cm_node);
-				spin_unlock_irqrestore(&cm_core->ht_lock,
-					flags);
-				send_reset(cm_node, NULL);
-				spin_lock_irqsave(&cm_core->ht_lock, flags);
+			    (!cm_node->accelerated)) {
+				add_ref_cm_node(cm_node);
+				list_add(&cm_node->reset_entry, &reset_list);
 			}
 		}
 		spin_unlock_irqrestore(&cm_core->ht_lock, flags);
 	}
+
+	list_for_each_safe(list_pos, list_temp, &reset_list) {
+		cm_node = container_of(list_pos, struct nes_cm_node,
+					reset_entry);
+		cleanup_retrans_entry(cm_node);
+		send_reset(cm_node, NULL);
+		rem_ref_cm_node(cm_node->cm_core, cm_node);
+	}
+
 	spin_lock_irqsave(&cm_core->listen_list_lock, flags);
 	if (!atomic_dec_return(&listener->ref_count)) {
 		list_del(&listener->list);
diff --git a/drivers/infiniband/hw/nes/nes_cm.h b/drivers/infiniband/hw/nes/nes_cm.h
index 367b3d2..282a9cb 100644
--- a/drivers/infiniband/hw/nes/nes_cm.h
+++ b/drivers/infiniband/hw/nes/nes_cm.h
@@ -292,6 +292,8 @@ struct nes_cm_node {
 	int                       apbvt_set;
 	int                       accept_pend;
 	int			freed;
+	struct list_head	timer_entry;
+	struct list_head	reset_entry;
 	struct nes_qp		*nesqp;
 };
 

From chien.tin.tung at intel.com  Fri Nov 21 12:50:44 2008
From: chien.tin.tung at intel.com (Chien Tung)
Date: Fri, 21 Nov 2008 14:50:44 -0600
Subject: [ofa-general] [PATCH 03/10] RDMA/nes: Remove tx_free_list
Message-ID: <20081121205044.GA7424@ctung-MOBL>

From: Faisal Latif <faisal.latif at intel.com>

RDMA/nes: Remove tx_free_list

There is no lock protecting tx_free_list, which causes a system crash
when skb_dequeue() is called and the list is empty.  Since the list did not
give any performance boost under heavy load, remove it to simplify the code.
Change get_free_pkt() to allocate a MAX_CM_BUFFER-sized skb for connection
establishment/teardown as well as MPA request/response.

Signed-off-by: Faisal Latif <faisal.latif at intel.com>
Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
--
diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 257d994..fe07797 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -94,7 +94,7 @@ static int mini_cm_set(struct nes_cm_core *, u32, u32);
 
 static struct sk_buff *form_cm_frame(struct sk_buff *, struct nes_cm_node *,
 	void *, u32, void *, u32, u8);
-static struct sk_buff *get_free_pkt(struct nes_cm_node *cm_node);
+static struct sk_buff *get_free_pkt(u32);
 static int add_ref_cm_node(struct nes_cm_node *);
 static int rem_ref_cm_node(struct nes_cm_core *, struct nes_cm_node *);
 
@@ -356,7 +356,6 @@ static void print_core(struct nes_cm_core *core)
 
 	nes_debug(NES_DBG_CM, "State         : %u \n",  core->state);
 
-	nes_debug(NES_DBG_CM, "Tx Free cnt   : %u \n", skb_queue_len(&core->tx_free_list));
 	nes_debug(NES_DBG_CM, "Listen Nodes  : %u \n", atomic_read(&core->listen_node_cnt));
 	nes_debug(NES_DBG_CM, "Active Nodes  : %u \n", atomic_read(&core->node_cnt));
 
@@ -691,7 +690,7 @@ static int send_syn(struct nes_cm_node *cm_node, u32 sendack,
 	optionssize += 1;
 
 	if (!skb)
-		skb = get_free_pkt(cm_node);
+		skb = get_free_pkt(MAX_CM_BUFFER);
 	if (!skb) {
 		nes_debug(NES_DBG_CM, "Failed to get a Free pkt\n");
 		return -1;
@@ -716,7 +715,7 @@ static int send_reset(struct nes_cm_node *cm_node, struct sk_buff *skb)
 	int flags = SET_RST | SET_ACK;
 
 	if (!skb)
-		skb = get_free_pkt(cm_node);
+		skb = get_free_pkt(MAX_CM_BUFFER);
 	if (!skb) {
 		nes_debug(NES_DBG_CM, "Failed to get a Free pkt\n");
 		return -1;
@@ -737,7 +736,7 @@ static int send_ack(struct nes_cm_node *cm_node, struct sk_buff *skb)
 	int ret;
 
 	if (!skb)
-		skb = get_free_pkt(cm_node);
+		skb = get_free_pkt(MAX_CM_BUFFER);
 
 	if (!skb) {
 		nes_debug(NES_DBG_CM, "Failed to get a Free pkt\n");
@@ -760,7 +759,7 @@ static int send_fin(struct nes_cm_node *cm_node, struct sk_buff *skb)
 
 	/* if we didn't get a frame get one */
 	if (!skb)
-		skb = get_free_pkt(cm_node);
+		skb = get_free_pkt(MAX_CM_BUFFER);
 
 	if (!skb) {
 		nes_debug(NES_DBG_CM, "Failed to get a Free pkt\n");
@@ -777,40 +776,9 @@ static int send_fin(struct nes_cm_node *cm_node, struct sk_buff *skb)
 /**
  * get_free_pkt
  */
-static struct sk_buff *get_free_pkt(struct nes_cm_node *cm_node)
-{
-	struct sk_buff *skb, *new_skb;
-
-	/* check to see if we need to repopulate the free tx pkt queue */
-	if (skb_queue_len(&cm_node->cm_core->tx_free_list) < NES_CM_FREE_PKT_LO_WATERMARK) {
-		while (skb_queue_len(&cm_node->cm_core->tx_free_list) <
-				cm_node->cm_core->free_tx_pkt_max) {
-			/* replace the frame we took, we won't get it back */
-			new_skb = dev_alloc_skb(cm_node->cm_core->mtu);
-			BUG_ON(!new_skb);
-			/* add a replacement frame to the free tx list head */
-			skb_queue_head(&cm_node->cm_core->tx_free_list, new_skb);
-		}
-	}
-
-	skb = skb_dequeue(&cm_node->cm_core->tx_free_list);
-
-	return skb;
-}
-
-
-/**
- * make_hashkey - generate hash key from node tuple
- */
-static inline int make_hashkey(u16 loc_port, nes_addr_t loc_addr, u16 rem_port,
-		nes_addr_t rem_addr)
+static struct sk_buff *get_free_pkt(u32 pktsize)
 {
-	u32 hashkey = 0;
-
-	hashkey = loc_addr + rem_addr + loc_port + rem_port;
-	hashkey = (hashkey % NES_CM_HASHTABLE_SIZE);
-
-	return hashkey;
+		return dev_alloc_skb(pktsize);
 }
 
 
@@ -821,13 +789,9 @@ static struct nes_cm_node *find_node(struct nes_cm_core *cm_core,
 		u16 rem_port, nes_addr_t rem_addr, u16 loc_port, nes_addr_t loc_addr)
 {
 	unsigned long flags;
-	u32 hashkey;
 	struct list_head *hte;
 	struct nes_cm_node *cm_node;
 
-	/* make a hash index key for this packet */
-	hashkey = make_hashkey(loc_port, loc_addr, rem_port, rem_addr);
-
 	/* get a handle on the hte */
 	hte = &cm_core->connected_nodes;
 
@@ -895,7 +859,6 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core,
 static int add_hte_node(struct nes_cm_core *cm_core, struct nes_cm_node *cm_node)
 {
 	unsigned long flags;
-	u32 hashkey;
 	struct list_head *hte;
 
 	if (!cm_node || !cm_core)
@@ -904,11 +867,6 @@ static int add_hte_node(struct nes_cm_core *cm_core, struct nes_cm_node *cm_node
 	nes_debug(NES_DBG_CM, "Adding Node %p to Active Connection HT\n",
 		cm_node);
 
-	/* first, make an index into our hash table */
-	hashkey = make_hashkey(cm_node->loc_port, cm_node->loc_addr,
-			cm_node->rem_port, cm_node->rem_addr);
-	cm_node->hashkey = hashkey;
-
 	spin_lock_irqsave(&cm_core->ht_lock, flags);
 
 	/* get a handle on the hash table element (list head for this slot) */
@@ -2151,10 +2109,7 @@ static void mini_cm_recv_pkt(struct nes_cm_core *cm_core,
  */
 static struct nes_cm_core *nes_cm_alloc_core(void)
 {
-	int i;
-
 	struct nes_cm_core *cm_core;
-	struct sk_buff *skb = NULL;
 
 	/* setup the CM core */
 	/* alloc top level core control structure */
@@ -2172,19 +2127,6 @@ static struct nes_cm_core *nes_cm_alloc_core(void)
 
 	atomic_set(&cm_core->events_posted, 0);
 
-	/* init the packet lists */
-	skb_queue_head_init(&cm_core->tx_free_list);
-
-	for (i = 0; i < NES_CM_DEFAULT_FRAME_CNT; i++) {
-		skb = dev_alloc_skb(cm_core->mtu);
-		if (!skb) {
-			kfree(cm_core);
-			return NULL;
-		}
-		/* add 'raw' skb to free frame list */
-		skb_queue_head(&cm_core->tx_free_list, skb);
-	}
-
 	cm_core->api = &nes_cm_api;
 
 	spin_lock_init(&cm_core->ht_lock);
diff --git a/drivers/infiniband/hw/nes/nes_cm.h b/drivers/infiniband/hw/nes/nes_cm.h
index 282a9cb..89d80fc 100644
--- a/drivers/infiniband/hw/nes/nes_cm.h
+++ b/drivers/infiniband/hw/nes/nes_cm.h
@@ -161,6 +161,8 @@ struct nes_timer_entry {
 
 #define NES_CM_DEF_SEQ2      0x18ed5740
 #define NES_CM_DEF_LOCAL_ID2 0xb807
+#define	MAX_CM_BUFFER	512
+
 
 typedef u32 nes_addr_t;
 
@@ -254,8 +256,6 @@ struct nes_cm_listener {
 
 /* per connection node and node state information */
 struct nes_cm_node {
-	u32                       hashkey;
-
 	nes_addr_t                loc_addr, rem_addr;
 	u16                       loc_port, rem_port;
 
@@ -352,7 +352,6 @@ struct nes_cm_core {
 	u32                     mtu;
 	u32                     free_tx_pkt_max;
 	u32                     rx_pkt_posted;
-	struct sk_buff_head     tx_free_list;
 	atomic_t                ht_node_cnt;
 	struct list_head        connected_nodes;
 	/* struct list_head hashtable[NES_CM_HASHTABLE_SIZE]; */


From chien.tin.tung at intel.com  Fri Nov 21 12:50:46 2008
From: chien.tin.tung at intel.com (Chien Tung)
Date: Fri, 21 Nov 2008 14:50:46 -0600
Subject: [ofa-general] [PATCH 04/10] RDMA/nes: Avoid race condition between
	MPA request and reset event to rdma_cm
Message-ID: <20081121205046.GA5428@ctung-MOBL>

From: Faisal Latif <faisal.latif at intel.com>

RDMA/nes: Avoid race condition between MPA request and reset event to rdma_cm

In a passive open, after the MPA request has been indicated to rdma_cm, an
incoming RST would fire a reset event to rdma_cm, causing it to crash since
the current state is not connected.  The solution is to wait for nes_accept()
or nes_reject() before firing the reset event.  If nes_accept() or
nes_reject() is already done, the reset event is fired when the RST is
processed.
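
A minimal userspace sketch of the "second arrival fires the event" idea,
using C11 atomics in place of the kernel's atomic_t.  struct conn, bump() and
the two handlers are hypothetical names for illustration; only the three
state values follow the patch, and only the accept path is modelled.

```c
#include <assert.h>
#include <stdatomic.h>

/* Values mirror the patch: the counter is 0 once the MPA request has been
 * indicated, and each of the two racing paths adds 1. */
#define PASSIVE_STATE_INDICATED   0
#define DO_NOT_SEND_RESET_EVENT   1
#define SEND_RESET_EVENT          2

struct conn {                      /* hypothetical stand-in for a cm_node */
	atomic_int passive_state;
	int reset_events_fired;
};

/* Returns the value after the increment, like atomic_add_return(1, ...). */
static int bump(struct conn *c)
{
	return atomic_fetch_add(&c->passive_state, 1) + 1;
}

/* An RST arrives from the wire. */
static void on_rst(struct conn *c)
{
	if (bump(c) == SEND_RESET_EVENT)
		c->reset_events_fired++;   /* accept already finished */
}

/* The upper layer finishes accepting. */
static void on_accept_done(struct conn *c)
{
	if (bump(c) == SEND_RESET_EVENT)
		c->reset_events_fired++;   /* RST came first, deliver it now */
}
```

Whichever path runs second observes the value 2 and is the one that delivers
the event, so the event is raised at most once and never before the accept
has completed.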

Signed-off-by: Faisal Latif <faisal.latif at intel.com>
Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
--
diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index fe07797..01fd309 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -1318,6 +1318,7 @@ static void handle_rst_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 {
 
 	int	reset = 0;	/* whether to send reset in case of err.. */
+	int	passive_state;
 	atomic_inc(&cm_resets_recvd);
 	nes_debug(NES_DBG_CM, "Received Reset, cm_node = %p, state = %u."
 			" refcnt=%d\n", cm_node, cm_node->state,
@@ -1331,7 +1332,14 @@ static void handle_rst_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 			cm_node->listener, cm_node->state);
 		active_open_err(cm_node, skb, reset);
 		break;
-	/* For PASSIVE open states, remove the cm_node event */
+	case NES_CM_STATE_MPAREQ_RCVD:
+		passive_state = atomic_add_return(1, &cm_node->passive_state);
+		if (passive_state ==  NES_SEND_RESET_EVENT)
+			create_event(cm_node, NES_CM_EVENT_RESET);
+		cleanup_retrans_entry(cm_node);
+		cm_node->state = NES_CM_STATE_CLOSED;
+		dev_kfree_skb_any(skb);
+		break;
 	case NES_CM_STATE_ESTABLISHED:
 	case NES_CM_STATE_SYN_RCVD:
 	case NES_CM_STATE_LISTENING:
@@ -1339,7 +1347,14 @@ static void handle_rst_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 		passive_open_err(cm_node, skb, reset);
 		break;
 	case NES_CM_STATE_TSA:
+		active_open_err(cm_node, skb, reset);
+		break;
+	case NES_CM_STATE_CLOSED:
+		cleanup_retrans_entry(cm_node);
+		drop_packet(skb);
+		break;
 	default:
+		drop_packet(skb);
 		break;
 	}
 }
@@ -1368,6 +1383,9 @@ static void handle_rcv_mpa(struct nes_cm_node *cm_node, struct sk_buff *skb,
 		dev_kfree_skb_any(skb);
 		if (type == NES_CM_EVENT_CONNECTED)
 			cm_node->state = NES_CM_STATE_TSA;
+		else
+			atomic_set(&cm_node->passive_state,
+					NES_PASSIVE_STATE_INDICATED);
 		create_event(cm_node, type);
 
 	}
@@ -1944,6 +1962,7 @@ static int mini_cm_reject(struct nes_cm_core *cm_core,
 	struct ietf_mpa_frame *mpa_frame, struct nes_cm_node *cm_node)
 {
 	int ret = 0;
+	int passive_state;
 
 	nes_debug(NES_DBG_CM, "%s cm_node=%p type=%d state=%d\n",
 		__func__, cm_node, cm_node->tcp_cntxt.client, cm_node->state);
@@ -1951,9 +1970,13 @@ static int mini_cm_reject(struct nes_cm_core *cm_core,
 	if (cm_node->tcp_cntxt.client)
 		return ret;
 	cleanup_retrans_entry(cm_node);
-	cm_node->state = NES_CM_STATE_CLOSED;
 
-	ret = send_reset(cm_node, NULL);
+	passive_state = atomic_add_return(1, &cm_node->passive_state);
+	cm_node->state = NES_CM_STATE_CLOSED;
+	if (passive_state == NES_SEND_RESET_EVENT)
+		rem_ref_cm_node(cm_core, cm_node);
+	else
+		ret = send_reset(cm_node, NULL);
 	return ret;
 }
 
@@ -2355,7 +2378,6 @@ static int nes_cm_disconn_true(struct nes_qp *nesqp)
 			atomic_inc(&cm_disconnects);
 			cm_event.event = IW_CM_EVENT_DISCONNECT;
 			if (last_ae == NES_AEQE_AEID_LLP_CONNECTION_RESET) {
-				issued_disconnect_reset = 1;
 				cm_event.status = IW_CM_EVENT_STATUS_RESET;
 				nes_debug(NES_DBG_CM, "Generating a CM "
 					"Disconnect Event (status reset) for "
@@ -2505,6 +2527,7 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 	struct nes_v4_quad nes_quad;
 	u32 crc_value;
 	int ret;
+	int passive_state;
 
 	ibqp = nes_get_qp(cm_id->device, conn_param->qpn);
 	if (!ibqp)
@@ -2672,8 +2695,6 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 			conn_param->private_data_len +
 			sizeof(struct ietf_mpa_frame));
 
-	attr.qp_state = IB_QPS_RTS;
-	nes_modify_qp(&nesqp->ibqp, &attr, IB_QP_STATE, NULL);
 
 	/* notify OF layer that accept event was successfull */
 	cm_id->add_ref(cm_id);
@@ -2686,6 +2707,8 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 	cm_event.private_data = NULL;
 	cm_event.private_data_len = 0;
 	ret = cm_id->event_handler(cm_id, &cm_event);
+	attr.qp_state = IB_QPS_RTS;
+	nes_modify_qp(&nesqp->ibqp, &attr, IB_QP_STATE, NULL);
 	if (cm_node->loopbackpartner) {
 		cm_node->loopbackpartner->mpa_frame_size =
 			nesqp->private_data_len;
@@ -2698,6 +2721,9 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 		printk(KERN_ERR "%s[%u] OFA CM event_handler returned, "
 			"ret=%d\n", __func__, __LINE__, ret);
 
+	passive_state = atomic_add_return(1, &cm_node->passive_state);
+	if (passive_state == NES_SEND_RESET_EVENT)
+		create_event(cm_node, NES_CM_EVENT_RESET);
 	return 0;
 }
 
@@ -3180,6 +3206,18 @@ static void cm_event_reset(struct nes_cm_event *event)
 	cm_event.private_data_len = 0;
 
 	ret = cm_id->event_handler(cm_id, &cm_event);
+	cm_id->add_ref(cm_id);
+	atomic_inc(&cm_closes);
+	cm_event.event = IW_CM_EVENT_CLOSE;
+	cm_event.status = IW_CM_EVENT_STATUS_OK;
+	cm_event.provider_data = cm_id->provider_data;
+	cm_event.local_addr = cm_id->local_addr;
+	cm_event.remote_addr = cm_id->remote_addr;
+	cm_event.private_data = NULL;
+	cm_event.private_data_len = 0;
+	nes_debug(NES_DBG_CM, "NODE %p Generating CLOSE\n", event->cm_node);
+	ret = cm_id->event_handler(cm_id, &cm_event);
+
 	nes_debug(NES_DBG_CM, "OFA CM event_handler returned, ret=%d\n", ret);
 
 
diff --git a/drivers/infiniband/hw/nes/nes_cm.h b/drivers/infiniband/hw/nes/nes_cm.h
index 89d80fc..6f01095 100644
--- a/drivers/infiniband/hw/nes/nes_cm.h
+++ b/drivers/infiniband/hw/nes/nes_cm.h
@@ -76,6 +76,10 @@ enum nes_timer_type {
 	NES_TIMER_TYPE_CLOSE,
 };
 
+#define NES_PASSIVE_STATE_INDICATED	0
+#define NES_DO_NOT_SEND_RESET_EVENT	1
+#define NES_SEND_RESET_EVENT		2
+
 #define MAX_NES_IFS 4
 
 #define SET_ACK 1
@@ -295,6 +299,7 @@ struct nes_cm_node {
 	struct list_head	timer_entry;
 	struct list_head	reset_entry;
 	struct nes_qp		*nesqp;
+	atomic_t 		passive_state;
 };
 
 /* structure for client or CM to fill when making CM api calls. */


From chien.tin.tung at intel.com  Fri Nov 21 12:50:49 2008
From: chien.tin.tung at intel.com (Chien Tung)
Date: Fri, 21 Nov 2008 14:50:49 -0600
Subject: [ofa-general] [PATCH 05/10] RDMA/nes: Forward packets for a new
	connection with stale APBVT entry
Message-ID: <20081121205049.GA6388@ctung-MOBL>

From: Faisal Latif <faisal.latif at intel.com>

RDMA/nes: Forward packets for a new connection with stale APBVT entry

Under heavy traffic, there is a small window when an APBVT entry is not
yet removed and a new connection is established.  Packets for the new
connection are dropped until the APBVT entry is removed.  This patch
forwards the packets instead of dropping them.
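
The ownership handoff can be sketched as below: the CM receive routine
reports whether it consumed the packet, and the caller forwards anything that
was not consumed.  This is a simplified model that covers only packets with
no existing connection node; struct pkt, have_listener(), cm_recv() and
rx_handler() are hypothetical names, not the driver's functions.

```c
#include <assert.h>
#include <stddef.h>

struct pkt { int is_syn; int is_ack; int port; };   /* hypothetical packet */

static int cm_consumed, stack_delivered;

/* Hypothetical listener lookup: only port 4000 has a listener. */
static int have_listener(int port) { return port == 4000; }

/* Returns 1 if the CM took ownership of the packet, 0 if the caller
 * still owns it and should pass it on. */
static int cm_recv(struct pkt *p)
{
	if (!p)
		return 0;
	if (!p->is_syn || p->is_ack)
		return 0;                 /* not a bare SYN for a new connection */
	if (!have_listener(p->port))
		return 0;
	cm_consumed++;
	return 1;
}

static void rx_handler(struct pkt *p, int matches_port_table)
{
	if (matches_port_table && cm_recv(p))
		p = NULL;                 /* consumed by the CM */
	if (p)
		stack_delivered++;        /* forward instead of dropping */
}
```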

Signed-off-by: Faisal Latif <faisal.latif at intel.com>
Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
--
diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 01fd309..fd2dba7 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -86,7 +86,7 @@ static int mini_cm_accept(struct nes_cm_core *, struct ietf_mpa_frame *,
 	struct nes_cm_node *);
 static int mini_cm_reject(struct nes_cm_core *, struct ietf_mpa_frame *,
 	struct nes_cm_node *);
-static void mini_cm_recv_pkt(struct nes_cm_core *, struct nes_vnic *,
+static int mini_cm_recv_pkt(struct nes_cm_core *, struct nes_vnic *,
 	struct sk_buff *);
 static int mini_cm_dealloc_core(struct nes_cm_core *);
 static int mini_cm_get(struct nes_cm_core *);
@@ -2034,7 +2034,7 @@ static int mini_cm_close(struct nes_cm_core *cm_core, struct nes_cm_node *cm_nod
  * recv_pkt - recv an ETHERNET packet, and process it through CM
  * node state machine
  */
-static void mini_cm_recv_pkt(struct nes_cm_core *cm_core,
+static int mini_cm_recv_pkt(struct nes_cm_core *cm_core,
 	struct nes_vnic *nesvnic, struct sk_buff *skb)
 {
 	struct nes_cm_node *cm_node = NULL;
@@ -2042,23 +2042,16 @@ static void mini_cm_recv_pkt(struct nes_cm_core *cm_core,
 	struct iphdr *iph;
 	struct tcphdr *tcph;
 	struct nes_cm_info nfo;
+	int skb_handled = 1;
 
 	if (!skb)
-		return;
+		return 0;
 	if (skb->len < sizeof(struct iphdr) + sizeof(struct tcphdr)) {
-		dev_kfree_skb_any(skb);
-		return;
+		return 0;
 	}
 
 	iph = (struct iphdr *)skb->data;
 	tcph = (struct tcphdr *)(skb->data + sizeof(struct iphdr));
-	skb_reset_network_header(skb);
-	skb_set_transport_header(skb, sizeof(*tcph));
-	if (!tcph) {
-		dev_kfree_skb_any(skb);
-		return;
-	}
-	skb->len = ntohs(iph->tot_len);
 
 	nfo.loc_addr = ntohl(iph->daddr);
 	nfo.loc_port = ntohs(tcph->dest);
@@ -2079,23 +2072,21 @@ static void mini_cm_recv_pkt(struct nes_cm_core *cm_core,
 			/* Only type of packet accepted are for */
 			/* the PASSIVE open (syn only) */
 			if ((!tcph->syn) || (tcph->ack)) {
-				cm_packets_dropped++;
+				skb_handled = 0;
 				break;
 			}
 			listener = find_listener(cm_core, nfo.loc_addr,
 				nfo.loc_port,
 				NES_CM_LISTENER_ACTIVE_STATE);
-			if (listener) {
-				nfo.cm_id = listener->cm_id;
-				nfo.conn_type = listener->conn_type;
-			} else {
-				nes_debug(NES_DBG_CM, "Unable to find listener "
-					"for the pkt\n");
-				cm_packets_dropped++;
-				dev_kfree_skb_any(skb);
+			if (!listener) {
+				nfo.cm_id = NULL;
+				nfo.conn_type = 0;
+				nes_debug(NES_DBG_CM, "Unable to find listener for the pkt\n");
+				skb_handled = 0;
 				break;
 			}
-
+			nfo.cm_id = listener->cm_id;
+			nfo.conn_type = listener->conn_type;
 			cm_node = make_cm_node(cm_core, nesvnic, &nfo,
 				listener);
 			if (!cm_node) {
@@ -2121,9 +2112,13 @@ static void mini_cm_recv_pkt(struct nes_cm_core *cm_core,
 			dev_kfree_skb_any(skb);
 			break;
 		}
+		skb_reset_network_header(skb);
+		skb_set_transport_header(skb, sizeof(*tcph));
+		skb->len = ntohs(iph->tot_len);
 		process_packet(cm_node, skb, cm_core);
 		rem_ref_cm_node(cm_core, cm_node);
 	} while (0);
+	return skb_handled;
 }
 
 
@@ -2927,15 +2922,16 @@ int nes_destroy_listen(struct iw_cm_id *cm_id)
  */
 int nes_cm_recv(struct sk_buff *skb, struct net_device *netdevice)
 {
+	int rc = 0;
 	cm_packets_received++;
 	if ((g_cm_core) && (g_cm_core->api)) {
-		g_cm_core->api->recv_pkt(g_cm_core, netdev_priv(netdevice), skb);
+		rc = g_cm_core->api->recv_pkt(g_cm_core, netdev_priv(netdevice), skb);
 	} else {
 		nes_debug(NES_DBG_CM, "Unable to process packet for CM,"
 				" cm is not setup properly.\n");
 	}
 
-	return 0;
+	return rc;
 }
 
 
diff --git a/drivers/infiniband/hw/nes/nes_cm.h b/drivers/infiniband/hw/nes/nes_cm.h
index 6f01095..fafa350 100644
--- a/drivers/infiniband/hw/nes/nes_cm.h
+++ b/drivers/infiniband/hw/nes/nes_cm.h
@@ -396,7 +396,7 @@ struct nes_cm_ops {
 			struct nes_cm_node *);
 	int (*reject)(struct nes_cm_core *, struct ietf_mpa_frame *,
 			struct nes_cm_node *);
-	void (*recv_pkt)(struct nes_cm_core *, struct nes_vnic *,
+	int (*recv_pkt)(struct nes_cm_core *, struct nes_vnic *,
 			struct sk_buff *);
 	int (*destroy_cm_core)(struct nes_cm_core *);
 	int (*get)(struct nes_cm_core *);
diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c
index 7c49cc8..8f70ff2 100644
--- a/drivers/infiniband/hw/nes/nes_hw.c
+++ b/drivers/infiniband/hw/nes/nes_hw.c
@@ -2700,27 +2700,33 @@ void nes_nic_ce_handler(struct nes_device *nesdev, struct nes_hw_nic_cq *cq)
 							pkt_type, (pkt_type & NES_PKT_TYPE_APBVT_MASK)); */
 
 				if ((pkt_type & NES_PKT_TYPE_APBVT_MASK) == NES_PKT_TYPE_APBVT_BITS) {
-					nes_cm_recv(rx_skb, nesvnic->netdev);
+					if (nes_cm_recv(rx_skb, nesvnic->netdev))
+						rx_skb = NULL;
+				}
+				if (rx_skb == NULL)
+					goto skip_rx_indicate0;
+
+
+				if ((cqe_misc & NES_NIC_CQE_TAG_VALID) &&
+				    (nesvnic->vlan_grp != NULL)) {
+					vlan_tag = (u16)(le32_to_cpu(
+							cq->cq_vbase[head].cqe_words[NES_NIC_CQE_TAG_PKT_TYPE_IDX])
+							>> 16);
+					nes_debug(NES_DBG_CQ, "%s: Reporting stripped VLAN packet. Tag = 0x%04X\n",
+							nesvnic->netdev->name, vlan_tag);
+					if (nes_use_lro)
+						lro_vlan_hwaccel_receive_skb(&nesvnic->lro_mgr, rx_skb,
+								nesvnic->vlan_grp, vlan_tag, NULL);
+					else
+						nes_vlan_rx(rx_skb, nesvnic->vlan_grp, vlan_tag);
 				} else {
-					if ((cqe_misc & NES_NIC_CQE_TAG_VALID) && (nesvnic->vlan_grp != NULL)) {
-						vlan_tag = (u16)(le32_to_cpu(
-								cq->cq_vbase[head].cqe_words[NES_NIC_CQE_TAG_PKT_TYPE_IDX])
-								>> 16);
-						nes_debug(NES_DBG_CQ, "%s: Reporting stripped VLAN packet. Tag = 0x%04X\n",
-								nesvnic->netdev->name, vlan_tag);
-						if (nes_use_lro)
-							lro_vlan_hwaccel_receive_skb(&nesvnic->lro_mgr, rx_skb,
-									nesvnic->vlan_grp, vlan_tag, NULL);
-						else
-							nes_vlan_rx(rx_skb, nesvnic->vlan_grp, vlan_tag);
-					} else {
-						if (nes_use_lro)
-							lro_receive_skb(&nesvnic->lro_mgr, rx_skb, NULL);
-						else
-							nes_netif_rx(rx_skb);
-					}
+					if (nes_use_lro)
+						lro_receive_skb(&nesvnic->lro_mgr, rx_skb, NULL);
+					else
+						nes_netif_rx(rx_skb);
 				}
 
+skip_rx_indicate0:
 				nesvnic->netdev->last_rx = jiffies;
 				/* nesvnic->netstats.rx_packets++; */
 				/* nesvnic->netstats.rx_bytes += rx_pkt_size; */


From chien.tin.tung at intel.com  Fri Nov 21 12:50:52 2008
From: chien.tin.tung at intel.com (Chien Tung)
Date: Fri, 21 Nov 2008 14:50:52 -0600
Subject: [ofa-general] [PATCH 06/10] RDMA/nes: Fix TCP compliance test
	failures
Message-ID: <20081121205052.GA6468@ctung-MOBL>

From: Faisal Latif <faisal.latif at intel.com>

RDMA/nes: Fix TCP compliance test failures

From ANVL testing, we are not handling all cm_node states during connection
establishment.  Add missing state handlers.
Fix sequence number handling.
Send reset in handle_tcp_options().

Signed-off-by: Faisal Latif <faisal.latif at intel.com>
Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
--
diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index fd2dba7..cc10da1 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -1466,7 +1466,7 @@ static void handle_syn_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 	int optionsize;
 
 	optionsize = (tcph->doff << 2) - sizeof(struct tcphdr);
-	skb_pull(skb, tcph->doff << 2);
+	skb_trim(skb, 0);
 	inc_sequence = ntohl(tcph->seq);
 
 	switch (cm_node->state) {
@@ -1499,6 +1499,10 @@ static void handle_syn_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 		cm_node->state = NES_CM_STATE_SYN_RCVD;
 		send_syn(cm_node, 1, skb);
 		break;
+	case NES_CM_STATE_CLOSED:
+		cleanup_retrans_entry(cm_node);
+		send_reset(cm_node, skb);
+		break;
 	case NES_CM_STATE_TSA:
 	case NES_CM_STATE_ESTABLISHED:
 	case NES_CM_STATE_FIN_WAIT1:
@@ -1507,7 +1511,6 @@ static void handle_syn_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 	case NES_CM_STATE_LAST_ACK:
 	case NES_CM_STATE_CLOSING:
 	case NES_CM_STATE_UNKNOWN:
-	case NES_CM_STATE_CLOSED:
 	default:
 		drop_packet(skb);
 		break;
@@ -1523,7 +1526,7 @@ static void handle_synack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 	int optionsize;
 
 	optionsize = (tcph->doff << 2) - sizeof(struct tcphdr);
-	skb_pull(skb, tcph->doff << 2);
+	skb_trim(skb, 0);
 	inc_sequence = ntohl(tcph->seq);
 	switch (cm_node->state) {
 	case NES_CM_STATE_SYN_SENT:
@@ -1547,6 +1550,12 @@ static void handle_synack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 		/* passive open, so should not be here */
 		passive_open_err(cm_node, skb, 1);
 		break;
+	case NES_CM_STATE_LISTENING:
+	case NES_CM_STATE_CLOSED:
+		cm_node->tcp_cntxt.loc_seq_num = ntohl(tcph->ack_seq);
+		cleanup_retrans_entry(cm_node);
+		send_reset(cm_node, skb);
+		break;
 	case NES_CM_STATE_ESTABLISHED:
 	case NES_CM_STATE_FIN_WAIT1:
 	case NES_CM_STATE_FIN_WAIT2:
@@ -1554,7 +1563,6 @@ static void handle_synack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 	case NES_CM_STATE_TSA:
 	case NES_CM_STATE_CLOSING:
 	case NES_CM_STATE_UNKNOWN:
-	case NES_CM_STATE_CLOSED:
 	case NES_CM_STATE_MPAREQ_SENT:
 	default:
 		drop_packet(skb);
@@ -1569,6 +1577,13 @@ static void handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 	u32 inc_sequence;
 	u32 rem_seq_ack;
 	u32 rem_seq;
+	int ret;
+	int optionsize;
+	u32 temp_seq = cm_node->tcp_cntxt.loc_seq_num;
+
+	optionsize = (tcph->doff << 2) - sizeof(struct tcphdr);
+	cm_node->tcp_cntxt.loc_seq_num = ntohl(tcph->ack_seq);
+
 	if (check_seq(cm_node, tcph, skb))
 		return;
 
@@ -1581,7 +1596,18 @@ static void handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 	switch (cm_node->state) {
 	case NES_CM_STATE_SYN_RCVD:
 		/* Passive OPEN */
+		ret = handle_tcp_options(cm_node, tcph, skb, optionsize, 1);
+		if (ret)
+			break;
 		cm_node->tcp_cntxt.rem_ack_num = ntohl(tcph->ack_seq);
+		cm_node->tcp_cntxt.loc_seq_num = temp_seq;
+		if (cm_node->tcp_cntxt.rem_ack_num !=
+		    cm_node->tcp_cntxt.loc_seq_num) {
+			nes_debug(NES_DBG_CM, "rem_ack_num != loc_seq_num\n");
+			cleanup_retrans_entry(cm_node);
+			send_reset(cm_node, skb);
+			return;
+		}
 		cm_node->state = NES_CM_STATE_ESTABLISHED;
 		if (datasize) {
 			cm_node->tcp_cntxt.rcv_nxt = inc_sequence + datasize;
@@ -1613,11 +1639,15 @@ static void handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 			dev_kfree_skb_any(skb);
 		}
 		break;
+	case NES_CM_STATE_LISTENING:
+	case NES_CM_STATE_CLOSED:
+		cleanup_retrans_entry(cm_node);
+		send_reset(cm_node, skb);
+		break;
 	case NES_CM_STATE_FIN_WAIT1:
 	case NES_CM_STATE_SYN_SENT:
 	case NES_CM_STATE_FIN_WAIT2:
 	case NES_CM_STATE_TSA:
-	case NES_CM_STATE_CLOSED:
 	case NES_CM_STATE_MPAREQ_RCVD:
 	case NES_CM_STATE_LAST_ACK:
 	case NES_CM_STATE_CLOSING:
@@ -1640,9 +1670,9 @@ static int handle_tcp_options(struct nes_cm_node *cm_node, struct tcphdr *tcph,
 			nes_debug(NES_DBG_CM, "%s: Node %p, Sending RESET\n",
 				__func__, cm_node);
 			if (passive)
-				passive_open_err(cm_node, skb, 0);
+				passive_open_err(cm_node, skb, 1);
 			else
-				active_open_err(cm_node, skb, 0);
+				active_open_err(cm_node, skb, 1);
 			return 1;
 		}
 	}


From chien.tin.tung at intel.com  Fri Nov 21 12:50:55 2008
From: chien.tin.tung at intel.com (Chien Tung)
Date: Fri, 21 Nov 2008 14:50:55 -0600
Subject: [ofa-general] [PATCH 07/10] RDMA/nes: Check cqp_avail_reqs is empty
	after locking the list
Message-ID: <20081121205055.GA4888@ctung-MOBL>

From: Faisal Latif <faisal.latif at intel.com>

RDMA/nes: Check cqp_avail_reqs is empty after locking the list

Between the first empty list check and locking the list, the list
can change.  Check it again after it is locked to make sure
the list is not empty.
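
A small userspace sketch of the check/lock/re-check pattern, assuming a
hypothetical free list (avail), a modelled lock, and calloc() standing in for
the kernel allocator; it is not the driver's nes_get_cqp_request().

```c
#include <assert.h>
#include <stdlib.h>

struct req { struct req *next; int dynamic; };

static struct req *avail;          /* hypothetical free list */
static int lock_held;              /* models the list lock, illustration only */
static void lock(void)   { assert(!lock_held); lock_held = 1; }
static void unlock(void) { assert(lock_held);  lock_held = 0; }

static struct req *get_req(void)
{
	struct req *r = NULL;

	if (avail) {                   /* unlocked check: a hint only */
		lock();
		if (avail) {               /* authoritative check under the lock */
			r = avail;
			avail = r->next;
			r->next = NULL;
		}
		unlock();
	}
	if (!r) {                      /* list was empty, or emptied meanwhile */
		r = calloc(1, sizeof(*r));
		if (r)
			r->dynamic = 1;
	}
	return r;
}
```

Testing the result pointer rather than using an else branch means the
fallback allocation also covers the case where the list was drained between
the first check and taking the lock.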

Signed-off-by: Faisal Latif <faisal.latif at intel.com>
Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
--
diff --git a/drivers/infiniband/hw/nes/nes_utils.c b/drivers/infiniband/hw/nes/nes_utils.c
index fb8cbd7..5611a73 100644
--- a/drivers/infiniband/hw/nes/nes_utils.c
+++ b/drivers/infiniband/hw/nes/nes_utils.c
@@ -540,11 +540,14 @@ struct nes_cqp_request *nes_get_cqp_request(struct nes_device *nesdev)
 
 	if (!list_empty(&nesdev->cqp_avail_reqs)) {
 		spin_lock_irqsave(&nesdev->cqp.lock, flags);
-		cqp_request = list_entry(nesdev->cqp_avail_reqs.next,
+		if (!list_empty(&nesdev->cqp_avail_reqs)) {
+			cqp_request = list_entry(nesdev->cqp_avail_reqs.next,
 				struct nes_cqp_request, list);
-		list_del_init(&cqp_request->list);
+			list_del_init(&cqp_request->list);
+		}
 		spin_unlock_irqrestore(&nesdev->cqp.lock, flags);
-	} else {
+	}
+	if (cqp_request == NULL) {
 		cqp_request = kzalloc(sizeof(struct nes_cqp_request), GFP_KERNEL);
 		if (cqp_request) {
 			cqp_request->dynamic = 1;


From chien.tin.tung at intel.com  Fri Nov 21 12:50:58 2008
From: chien.tin.tung at intel.com (Chien Tung)
Date: Fri, 21 Nov 2008 14:50:58 -0600
Subject: [ofa-general] [PATCH 08/10] RDMA/nes: Change accept_pend_cnt to
	atomic
Message-ID: <20081121205058.GA8184@ctung-MOBL>

From: Faisal Latif <faisal.latif at intel.com>

RDMA/nes: Change accept_pend_cnt to atomic

There is a race condition on accept_pend_cnt.  Change it to atomic.

Signed-off-by: Faisal Latif <faisal.latif at intel.com>
Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
--
diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index cc10da1..0025a7e 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -976,7 +976,7 @@ static inline int mini_cm_accelerated(struct nes_cm_core *cm_core,
 	u32 was_timer_set;
 	cm_node->accelerated = 1;
 
-	if (cm_node->accept_pend) {
+	if (atomic_dec_and_test(&cm_node->accept_pend)) {
 		BUG_ON(!cm_node->listener);
 		atomic_dec(&cm_node->listener->pend_accepts_cnt);
 		BUG_ON(atomic_read(&cm_node->listener->pend_accepts_cnt) < 0);
@@ -1091,7 +1091,7 @@ static struct nes_cm_node *make_cm_node(struct nes_cm_core *cm_core,
 	atomic_inc(&cm_core->node_cnt);
 	cm_node->conn_type = cm_info->conn_type;
 	cm_node->apbvt_set = 0;
-	cm_node->accept_pend = 0;
+	atomic_set(&cm_node->accept_pend, 0);
 
 	cm_node->nesvnic = nesvnic;
 	/* get some device handles, for arp lookup */
@@ -1156,7 +1156,7 @@ static int rem_ref_cm_node(struct nes_cm_core *cm_core,
 	spin_unlock_irqrestore(&cm_node->cm_core->ht_lock, flags);
 
 	/* if the node is destroyed before connection was accelerated */
-	if (!cm_node->accelerated && cm_node->accept_pend) {
+	if (!cm_node->accelerated && atomic_read(&cm_node->accept_pend)) {
 		BUG_ON(!cm_node->listener);
 		atomic_dec(&cm_node->listener->pend_accepts_cnt);
 		BUG_ON(atomic_read(&cm_node->listener->pend_accepts_cnt) < 0);
@@ -1477,25 +1477,25 @@ static void handle_syn_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 		break;
 	case NES_CM_STATE_LISTENING:
 		/* Passive OPEN */
-		cm_node->accept_pend = 1;
-		atomic_inc(&cm_node->listener->pend_accepts_cnt);
 		if (atomic_read(&cm_node->listener->pend_accepts_cnt) >
 				cm_node->listener->backlog) {
 			nes_debug(NES_DBG_CM, "drop syn due to backlog "
 				"pressure \n");
 			cm_backlog_drops++;
-			passive_open_err(cm_node, skb, 0);
+			rem_ref_cm_node(cm_node->cm_core, cm_node);
+			dev_kfree_skb_any(skb);
 			break;
 		}
 		ret = handle_tcp_options(cm_node, tcph, skb, optionsize,
 			1);
 		if (ret) {
-			passive_open_err(cm_node, skb, 0);
-			/* drop pkt */
 			break;
 		}
 		cm_node->tcp_cntxt.rcv_nxt = inc_sequence + 1;
 		BUG_ON(cm_node->send_entry);
+		atomic_set(&cm_node->accept_pend, 1);
+		atomic_inc(&cm_node->listener->pend_accepts_cnt);
+
 		cm_node->state = NES_CM_STATE_SYN_RCVD;
 		send_syn(cm_node, 1, skb);
 		break;
diff --git a/drivers/infiniband/hw/nes/nes_cm.h b/drivers/infiniband/hw/nes/nes_cm.h
index fafa350..7600365 100644
--- a/drivers/infiniband/hw/nes/nes_cm.h
+++ b/drivers/infiniband/hw/nes/nes_cm.h
@@ -294,7 +294,7 @@ struct nes_cm_node {
 	enum nes_cm_conn_type     conn_type;
 	struct nes_vnic           *nesvnic;
 	int                       apbvt_set;
-	int                       accept_pend;
+	atomic_t                  accept_pend;
 	int			freed;
 	struct list_head	timer_entry;
 	struct list_head	reset_entry;


From chien.tin.tung at intel.com  Fri Nov 21 12:51:01 2008
From: chien.tin.tung at intel.com (Chien Tung)
Date: Fri, 21 Nov 2008 14:51:01 -0600
Subject: [ofa-general] [PATCH 09/10] RDMA/nes: Cleanup warnings
Message-ID: <20081121205101.GA1492@ctung-MOBL>

From:  Chien Tung <chien.tin.tung at intel.com>

RDMA/nes: Cleanup warnings

Wrapped NES_DEBUG and assert macros with do while (0) to avoid ambiguous else.
No one is using sk_buff * returned from form_cm_frame, take it out.
drop_packet() should not be incrementing reset counter on receiving a FIN.
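
The ambiguous else mentioned above for the NES_DEBUG and assert macros can
be illustrated with a small userspace sketch. The names LOG, debug_level,
logged, else_taken and report() are hypothetical and are not part of the
nes driver.

```c
int debug_level;	/* hypothetical: 0 means debugging is off */
int logged;		/* how many times the macro body ran */
int else_taken;		/* how many times the caller's else branch ran */

/*
 * A bare form such as
 *     #define LOG(level) if ((level) & debug_level) logged++
 * would let an "else" that follows LOG(x); bind to the macro's hidden "if".
 * Wrapping the body makes the macro expand to exactly one statement.
 */
#define LOG(level) \
do { \
	if ((level) & debug_level) \
		logged++; \
} while (0)

/* The else below belongs to "if (cond)", as the indentation suggests. */
void report(int cond)
{
	if (cond)
		LOG(1);
	else
		else_taken++;
}
```

With the bare form, report(1) with debugging off would run the else branch;
with the wrapped form it does nothing, which is what the caller intended.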

Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
--
diff --git a/drivers/infiniband/hw/nes/nes.h b/drivers/infiniband/hw/nes/nes.h
index 1595dc7..13a5bb1 100644
--- a/drivers/infiniband/hw/nes/nes.h
+++ b/drivers/infiniband/hw/nes/nes.h
@@ -137,14 +137,18 @@
 
 #ifdef CONFIG_INFINIBAND_NES_DEBUG
 #define nes_debug(level, fmt, args...) \
+do { \
 	if (level & nes_debug_level) \
-		printk(KERN_ERR PFX "%s[%u]: " fmt, __func__, __LINE__, ##args)
-
-#define assert(expr)                                                \
-if (!(expr)) {                                                       \
-	printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n",  \
-		   #expr, __FILE__, __func__, __LINE__);                \
-}
+		printk(KERN_ERR PFX "%s[%u]: " fmt, __func__, __LINE__, ##args); \
+} while (0)
+
+#define assert(expr) \
+do { \
+	if (!(expr)) { \
+		printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n", \
+			   #expr, __FILE__, __func__, __LINE__); \
+	} \
+} while (0)
 
 #define NES_EVENT_TIMEOUT   1200000
 #else
diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 0025a7e..24855ec 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -92,7 +92,7 @@ static int mini_cm_dealloc_core(struct nes_cm_core *);
 static int mini_cm_get(struct nes_cm_core *);
 static int mini_cm_set(struct nes_cm_core *, u32, u32);
 
-static struct sk_buff *form_cm_frame(struct sk_buff *, struct nes_cm_node *,
+static void form_cm_frame(struct sk_buff *, struct nes_cm_node *,
 	void *, u32, void *, u32, u8);
 static struct sk_buff *get_free_pkt(u32);
 static int add_ref_cm_node(struct nes_cm_node *);
@@ -251,7 +251,7 @@ static int parse_mpa(struct nes_cm_node *cm_node, u8 *buffer, u32 len)
  * form_cm_frame - get a free packet and build empty frame Use
  * node info to build.
  */
-static struct sk_buff *form_cm_frame(struct sk_buff *skb,
+static void form_cm_frame(struct sk_buff *skb,
 	struct nes_cm_node *cm_node, void *options, u32 optionsize,
 	void *data, u32 datasize, u8 flags)
 {
@@ -339,7 +339,6 @@ static struct sk_buff *form_cm_frame(struct sk_buff *skb,
 	skb_shinfo(skb)->nr_frags = 0;
 	cm_packets_created++;
 
-	return skb;
 }
 
 
@@ -380,8 +379,6 @@ int schedule_nes_timer(struct nes_cm_node *cm_node, struct sk_buff *skb,
 	int ret = 0;
 	u32 was_timer_set;
 
-	if (!cm_node)
-		return -EINVAL;
 	new_send = kzalloc(sizeof(*new_send), GFP_ATOMIC);
 	if (!new_send)
 		return -1;
@@ -1280,7 +1277,6 @@ static void drop_packet(struct sk_buff *skb)
 static void handle_fin_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
 	struct tcphdr *tcph)
 {
-	atomic_inc(&cm_resets_recvd);
 	nes_debug(NES_DBG_CM, "Received FIN, cm_node = %p, state = %u. "
 		"refcnt=%d\n", cm_node, cm_node->state,
 		atomic_read(&cm_node->ref_count));


From chien.tin.tung at intel.com  Fri Nov 21 12:51:04 2008
From: chien.tin.tung at intel.com (Chien Tung)
Date: Fri, 21 Nov 2008 14:51:04 -0600
Subject: [ofa-general] [PATCH 10/10] RDMA/nes: Add loopback check to
	make_cm_node()
Message-ID: <20081121205104.GA3720@ctung-MOBL>

From:  Chien Tung <chien.tin.tung at intel.com>

RDMA/nes: Add loopback check to make_cm_node()

Check for loopback connection in make_cm_node()

Signed-off-by: Chien Tung <chien.tin.tung at intel.com>
--
diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 24855ec..9cbea51 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -1097,7 +1097,10 @@ static struct nes_cm_node *make_cm_node(struct nes_cm_core *cm_core,
 
 	cm_node->loopbackpartner = NULL;
 	/* get the mac addr for the remote node */
-	arpindex = nes_arp_table(nesdev, cm_node->rem_addr, NULL, NES_ARP_RESOLVE);
+	if (ipv4_is_loopback(htonl(cm_node->rem_addr)))
+		arpindex = nes_arp_table(nesdev, ntohl(nesvnic->local_ipaddr), NULL, NES_ARP_RESOLVE);
+	else
+		arpindex = nes_arp_table(nesdev, cm_node->rem_addr, NULL, NES_ARP_RESOLVE);
 	if (arpindex < 0) {
 		arpindex = nes_addr_resolve_neigh(nesvnic, cm_info->rem_addr);
 		if (arpindex < 0) {


From jackm at dev.mellanox.co.il  Fri Nov 21 13:11:23 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Fri, 21 Nov 2008 23:11:23 +0200
Subject: [ofa-general] Re: Race condition in userspace libraries with
	create/destroy qp
In-Reply-To: <adaod08gbri.fsf@cisco.com>
References: <200811201211.46527.jackm@dev.mellanox.co.il>
	<200811210902.03040.jackm@dev.mellanox.co.il>
	<adaod08gbri.fsf@cisco.com>
Message-ID: <200811212311.23665.jackm@dev.mellanox.co.il>

On Friday 21 November 2008 20:44, Roland Dreier wrote:
> expanding the region where
> qp_table_mutex is held to avoid this bug makes perfect sense to me and
> seems like a clean solution.
> 
>  - R.

I'll send a patch on Sunday.


From jackm at dev.mellanox.co.il  Sat Nov 22 01:53:34 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sat, 22 Nov 2008 11:53:34 +0200
Subject: [ofa-general] [PATCH 0 of 2] Fix race condition in userspace
	libraries in create/destroy qp
Message-ID: <200811221153.36089.jackm@dev.mellanox.co.il>

The two patches in this series fix a race condition between
create_qp and destroy_qp which results in a newly-created QP not
being found by xxx_find_qp during CQ polling.

The low-level create_qp and destroy_qp functions are not atomic
WRT each other. If one thread is destroying a QP while another is
creating a qp, there is a race hole.  The destroying thread can lose
its timeslice after it has deleted the QP from kernel space, but before
it has cleared it from userspace store (xxx_clear_qp).

If the other thread creates a qp during this break, it gets the same
QP base number and overwrites the destroyed QP's entry with xxx_store_qp().

When destroy_qp then deletes the qp number from the userspace store it
deletes the newly-created qp number, resulting in that QP not being found
in poll_cq.
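
The approach taken by the fix, holding the table mutex across both the
kernel call and the update of the userspace store, can be sketched as
follows. This is a simplified userspace model, not the libmlx4 or libmthca
code: struct ctx, fake_kernel_create() and the in_use array are hypothetical
stand-ins for the provider context and the kernel's QP number allocation.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 8	/* hypothetical size, chosen only for this sketch */

struct qp {
	unsigned int qpn;
};

struct ctx {
	pthread_mutex_t table_mutex;
	struct qp *table[TABLE_SIZE];	/* userspace store: QP number -> QP */
	int in_use[TABLE_SIZE];		/* stand-in for the kernel's QP numbers */
};

/* Stand-in for the kernel create call: returns the lowest free QP number. */
static int fake_kernel_create(struct ctx *c, unsigned int *qpn)
{
	unsigned int i;

	for (i = 0; i < TABLE_SIZE; i++) {
		if (!c->in_use[i]) {
			c->in_use[i] = 1;
			*qpn = i;
			return 0;
		}
	}
	return -1;
}

int ctx_init(struct ctx *c)
{
	memset(c, 0, sizeof(*c));
	return pthread_mutex_init(&c->table_mutex, NULL);
}

/* Kernel create and table store happen under one lock acquisition. */
int create_qp(struct ctx *c, struct qp *qp)
{
	int ret;

	pthread_mutex_lock(&c->table_mutex);
	ret = fake_kernel_create(c, &qp->qpn);
	if (!ret)
		c->table[qp->qpn] = qp;
	pthread_mutex_unlock(&c->table_mutex);
	return ret;
}

/*
 * The QP number becomes reusable as soon as the "kernel" releases it, so
 * the table entry is cleared before the lock is dropped. A concurrent
 * create_qp() cannot store a new QP under the same number in between.
 */
void destroy_qp(struct ctx *c, struct qp *qp)
{
	pthread_mutex_lock(&c->table_mutex);
	c->in_use[qp->qpn] = 0;		/* stand-in for the kernel destroy */
	c->table[qp->qpn] = NULL;
	pthread_mutex_unlock(&c->table_mutex);
}

/* Lookup as done during CQ polling (locking of readers is not modeled). */
struct qp *find_qp(struct ctx *c, unsigned int qpn)
{
	return qpn < TABLE_SIZE ? c->table[qpn] : NULL;
}
```

After a destroy followed by a create that reuses the same number, find_qp()
returns the new QP, because the stale entry was cleared before the number
could be handed out again.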

This patch series fixes Bugzilla 1389 for the libmlx4 and libmthca libraries.

- Jack


From jackm at dev.mellanox.co.il  Sat Nov 22 01:53:48 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sat, 22 Nov 2008 11:53:48 +0200
Subject: [ofa-general] [PATCH 1 of 2] libmlx4: Fix race condition in
	create/destroy QP
Message-ID: <200811221153.49156.jackm@dev.mellanox.co.il>

Index: libmlx4/src/qp.c
===================================================================
--- libmlx4.orig/src/qp.c	2008-11-20 11:46:58.000000000 +0200
+++ libmlx4/src/qp.c	2008-11-22 09:44:13.000000000 +0200
@@ -667,37 +667,25 @@ struct mlx4_qp *mlx4_find_qp(struct mlx4
 int mlx4_store_qp(struct mlx4_context *ctx, uint32_t qpn, struct mlx4_qp *qp)
 {
 	int tind = (qpn & (ctx->num_qps - 1)) >> ctx->qp_table_shift;
-	int ret = 0;
-
-	pthread_mutex_lock(&ctx->qp_table_mutex);
 
 	if (!ctx->qp_table[tind].refcnt) {
 		ctx->qp_table[tind].table = calloc(ctx->qp_table_mask + 1,
 						   sizeof (struct mlx4_qp *));
-		if (!ctx->qp_table[tind].table) {
-			ret = -1;
-			goto out;
-		}
+		if (!ctx->qp_table[tind].table)
+			return -1;
 	}
 
 	++ctx->qp_table[tind].refcnt;
 	ctx->qp_table[tind].table[qpn & ctx->qp_table_mask] = qp;
-
-out:
-	pthread_mutex_unlock(&ctx->qp_table_mutex);
-	return ret;
+	return 0;
 }
 
 void mlx4_clear_qp(struct mlx4_context *ctx, uint32_t qpn)
 {
 	int tind = (qpn & (ctx->num_qps - 1)) >> ctx->qp_table_shift;
 
-	pthread_mutex_lock(&ctx->qp_table_mutex);
-
 	if (!--ctx->qp_table[tind].refcnt)
 		free(ctx->qp_table[tind].table);
 	else
 		ctx->qp_table[tind].table[qpn & ctx->qp_table_mask] = NULL;
-
-	pthread_mutex_unlock(&ctx->qp_table_mutex);
 }
Index: libmlx4/src/verbs.c
===================================================================
--- libmlx4.orig/src/verbs.c	2008-11-20 11:46:58.000000000 +0200
+++ libmlx4/src/verbs.c	2008-11-22 11:05:44.000000000 +0200
@@ -452,6 +452,8 @@ struct ibv_qp *mlx4_create_qp(struct ibv
 	cmd.sq_no_prefetch = 0;	/* OK for ABI 2: just a reserved field */
 	memset(cmd.reserved, 0, sizeof cmd.reserved);
 
+	pthread_mutex_lock(&to_mctx(pd->context)->qp_table_mutex);
+
 	ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd,
 				&resp, sizeof resp);
 	if (ret)
@@ -460,6 +462,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv
 	ret = mlx4_store_qp(to_mctx(pd->context), qp->ibv_qp.qp_num, qp);
 	if (ret)
 		goto err_destroy;
+	pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex);
 
 	qp->rq.wqe_cnt = qp->rq.max_post = attr->cap.max_recv_wr;
 	qp->rq.max_gs  = attr->cap.max_recv_sge;
@@ -477,6 +480,7 @@ err_destroy:
 	ibv_cmd_destroy_qp(&qp->ibv_qp);
 
 err_rq_db:
+	pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex);
 	if (!attr->srq)
 		mlx4_free_db(to_mctx(pd->context), MLX4_DB_TYPE_RQ, qp->db);
 
@@ -580,9 +584,12 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
 	struct mlx4_qp *qp = to_mqp(ibqp);
 	int ret;
 
+	pthread_mutex_lock(&to_mctx(ibqp->context)->qp_table_mutex);
 	ret = ibv_cmd_destroy_qp(ibqp);
-	if (ret)
+	if (ret) {
+		pthread_mutex_unlock(&to_mctx(ibqp->context)->qp_table_mutex);
 		return ret;
+	}
 
 	mlx4_lock_cqs(ibqp);
 
@@ -594,6 +601,7 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
 	mlx4_clear_qp(to_mctx(ibqp->context), ibqp->qp_num);
 
 	mlx4_unlock_cqs(ibqp);
+	pthread_mutex_unlock(&to_mctx(ibqp->context)->qp_table_mutex);
 
 	if (!ibqp->srq)
 		mlx4_free_db(to_mctx(ibqp->context), MLX4_DB_TYPE_RQ, qp->db);


From jackm at dev.mellanox.co.il  Sat Nov 22 01:54:01 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sat, 22 Nov 2008 11:54:01 +0200
Subject: [ofa-general] [PATCH 2 of 2] libmthca: Fix race condition in
	create/destroy QP
Message-ID: <200811221154.02427.jackm@dev.mellanox.co.il>

Index: libmthca/src/verbs.c
===================================================================
--- libmthca.orig/src/verbs.c	2008-11-22 10:33:08.000000000 +0200
+++ libmthca/src/verbs.c	2008-11-22 10:58:01.258153000 +0200
@@ -566,6 +566,7 @@ struct ibv_qp *mthca_create_qp(struct ib
 		cmd.sq_db_index = cmd.rq_db_index = 0;
 	}
 
+	pthread_mutex_lock(&to_mctx(pd->context)->qp_table_mutex);
 	ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd,
 				&resp, sizeof resp);
 	if (ret)
@@ -579,6 +580,7 @@ struct ibv_qp *mthca_create_qp(struct ib
 	ret = mthca_store_qp(to_mctx(pd->context), qp->ibv_qp.qp_num, qp);
 	if (ret)
 		goto err_destroy;
+	pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex);
 
 	qp->sq.max 	    = attr->cap.max_send_wr;
 	qp->rq.max 	    = attr->cap.max_recv_wr;
@@ -592,6 +594,7 @@ err_destroy:
 	ibv_cmd_destroy_qp(&qp->ibv_qp);
 
 err_rq_db:
+	pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex);
 	if (mthca_is_memfree(pd->context))
 		mthca_free_db(to_mctx(pd->context)->db_tab, MTHCA_DB_TYPE_RQ,
 			      qp->rq.db_index);
@@ -686,9 +689,12 @@ int mthca_destroy_qp(struct ibv_qp *qp)
 {
 	int ret;
 
+	pthread_mutex_lock(&to_mctx(qp->context)->qp_table_mutex);
 	ret = ibv_cmd_destroy_qp(qp);
-	if (ret)
+	if (ret) {
+		pthread_mutex_unlock(&to_mctx(qp->context)->qp_table_mutex);
 		return ret;
+	}
 
 	mthca_lock_cqs(qp);
 
@@ -700,6 +706,7 @@ int mthca_destroy_qp(struct ibv_qp *qp)
 	mthca_clear_qp(to_mctx(qp->context), qp->qp_num);
 
 	mthca_unlock_cqs(qp);
+	pthread_mutex_unlock(&to_mctx(qp->context)->qp_table_mutex);
 
 	if (mthca_is_memfree(qp->context)) {
 		mthca_free_db(to_mctx(qp->context)->db_tab, MTHCA_DB_TYPE_RQ,
Index: libmthca/src/qp.c
===================================================================
--- libmthca.orig/src/qp.c	2008-11-22 10:33:08.000000000 +0200
+++ libmthca/src/qp.c	2008-11-22 10:55:33.313592000 +0200
@@ -909,39 +909,27 @@ struct mthca_qp *mthca_find_qp(struct mt
 int mthca_store_qp(struct mthca_context *ctx, uint32_t qpn, struct mthca_qp *qp)
 {
 	int tind = (qpn & (ctx->num_qps - 1)) >> ctx->qp_table_shift;
-	int ret = 0;
-
-	pthread_mutex_lock(&ctx->qp_table_mutex);
 
 	if (!ctx->qp_table[tind].refcnt) {
 		ctx->qp_table[tind].table = calloc(ctx->qp_table_mask + 1,
 						   sizeof (struct mthca_qp *));
-		if (!ctx->qp_table[tind].table) {
-			ret = -1;
-			goto out;
-		}
+		if (!ctx->qp_table[tind].table)
+			return -1;
 	}
 
 	++ctx->qp_table[tind].refcnt;
 	ctx->qp_table[tind].table[qpn & ctx->qp_table_mask] = qp;
-
-out:
-	pthread_mutex_unlock(&ctx->qp_table_mutex);
-	return ret;
+	return 0;
 }
 
 void mthca_clear_qp(struct mthca_context *ctx, uint32_t qpn)
 {
 	int tind = (qpn & (ctx->num_qps - 1)) >> ctx->qp_table_shift;
 
-	pthread_mutex_lock(&ctx->qp_table_mutex);
-
 	if (!--ctx->qp_table[tind].refcnt)
 		free(ctx->qp_table[tind].table);
 	else
 		ctx->qp_table[tind].table[qpn & ctx->qp_table_mask] = NULL;
-
-	pthread_mutex_unlock(&ctx->qp_table_mutex);
 }
 
 int mthca_free_err_wqe(struct mthca_qp *qp, int is_send,


From vlad at lists.openfabrics.org  Sat Nov 22 03:22:06 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 22 Nov 2008 03:22:06 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081122-0200 daily build status
Message-ID: <20081122112206.1A044E60BD5@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From sashak at voltaire.com  Sat Nov 22 03:51:33 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 22 Nov 2008 13:51:33 +0200
Subject: [ofa-general] [PATCH] opensm/osm_sa_link_record: prevent potential
	endless recursion
Message-ID: <20081122115133.GI8310@sashak.voltaire.com>


This patch eliminates the use of osm_node_get_any_physp_ptr(), which can
return an invalid port in case of "port moving". In that case an SA
LinkRecord query issued without source and destination LIDs causes
endless recursion and an OpenSM crash.

The problem is easily reproducible, for example, when a two-port HCA
originally connected to a fabric by one port is quickly reconnected
(in less than the OpenSM discovery cycle time) by the other port and then
(after the OpenSM sweep has finished) we run 'saquery LinkRecord'.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_sa_link_record.c |   24 ++++++++++++------------
 1 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/opensm/opensm/osm_sa_link_record.c b/opensm/opensm/osm_sa_link_record.c
index c48df14..b92845e 100644
--- a/opensm/opensm/osm_sa_link_record.c
+++ b/opensm/opensm/osm_sa_link_record.c
@@ -342,18 +342,18 @@ __osm_lr_rcv_get_port_links(IN osm_sa_t * sa,
 			p_node = (osm_node_t *)cl_qmap_head(p_node_tbl);
 
 			while (p_node != (osm_node_t *)cl_qmap_end(p_node_tbl)) {
-				/*
-				   Get only one port for each node.
-				   After the recursive call, this function will
-				   scan all the ports of this node anyway.
-				 */
-				p_src_physp = osm_node_get_any_physp_ptr(p_node);
-				p_src_port = osm_get_port_by_guid(sa->p_subn,
-				        osm_physp_get_port_guid(p_src_physp));
-				__osm_lr_rcv_get_port_links(sa, p_lr,
-							    p_src_port, NULL,
-							    comp_mask, p_list,
-							    p_req_physp);
+				num_ports = osm_node_get_num_physp(p_node);
+				for (port_num = 1; port_num < num_ports;
+				     port_num++) {
+					p_src_physp =
+					    osm_node_get_physp_ptr(p_node,
+								   port_num);
+					if (p_src_physp)
+						__osm_lr_rcv_get_physp_link
+						    (sa, p_lr, p_src_physp,
+						     NULL, comp_mask, p_list,
+						     p_req_physp);
+				}
 				p_node = (osm_node_t *) cl_qmap_next(&p_node->
 								     map_item);
 			}
-- 
1.6.0.3.517.g759a


From michael at ellerman.id.au  Fri Nov 21 19:41:08 2008
From: michael at ellerman.id.au (Michael Ellerman)
Date: Sat, 22 Nov 2008 14:41:08 +1100
Subject: [ofa-general] Re: [PATCH] IB/ehca: Fix lockdep failures for
	shca_list_lock
In-Reply-To: <1227283347.3599.8.camel@johannes.berg>
References: <200806061835.43802.fenkes@de.ibm.com>
	<48499C11.7030504@gmail.com>  <200811211637.15300.fenkes@de.ibm.com>
	<1227283347.3599.8.camel@johannes.berg>
Message-ID: <1227325268.10134.2.camel@localhost>

On Fri, 2008-11-21 at 17:02 +0100, Johannes Berg wrote:
> On Fri, 2008-11-21 at 16:37 +0100, Joachim Fenkes wrote:
> 
> > +	u64 flags;
> 
> > -	spin_lock(&shca_list_lock);
> > +	spin_lock_irqsave(&shca_list_lock, flags);
> 
> That's wrong and I think will give a warning on all machines where
> u64 != unsigned long. Might not particularly matter in this case.

Crud, sorry.

> Also, generally it seems wrong to say "fix lockdep failure" when the
> patch really fixes a bug that lockdep happened to find.

True. I guess it should be "fix locking error found with lockdep", to
make it clear no one has actually hit the bug.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

From sashak at voltaire.com  Sat Nov 22 07:41:48 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 22 Nov 2008 17:41:48 +0200
Subject: [ofa-general] [PATCH] opensm/osm_sw_info_rcv: eliminate
	osm_node_get_any_physp_ptr() use
Message-ID: <20081122154148.GJ8310@sashak.voltaire.com>


The function osm_node_get_any_physp_ptr() is dangerous because it uses a
potentially outdated local port number from NodeInfo. It is wrongly
used in the commented-out functions __osm_si_rcv_get_fwd_tbl() and
__osm_si_rcv_get_mcast_fwd_tbl() for direct path determination.

In the __osm_ni_rcv_process_switch() function the usage is safe (for the
port GUID only), but because the DR path is potentially outdated, it was
extracted from the MAD instead.

To unify all this, we update the DR path of switch port 0 when NodeInfo
is received and use it later in the discovery process (instead of the
potentially outdated DR path extracted from the port returned by
osm_node_get_any_physp_ptr()).

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_node_info_rcv.c |    9 +++++----
 opensm/opensm/osm_sw_info_rcv.c   |   32 +++++++++-----------------------
 2 files changed, 14 insertions(+), 27 deletions(-)

diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index 20b16d1..c52c0d5 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -501,15 +501,16 @@ __osm_ni_rcv_process_switch(IN osm_sm_t * sm,
 {
 	ib_api_status_t status = IB_SUCCESS;
 	osm_madw_context_t context;
-	osm_dr_path_t dr_path;
+	osm_dr_path_t *path;
 	ib_smp_t *p_smp;
 
 	OSM_LOG_ENTER(sm->p_log);
 
 	p_smp = osm_madw_get_smp_ptr(p_madw);
 
-	osm_dr_path_init(&dr_path,
-			 osm_madw_get_bind_handle(p_madw),
+	/* update DR path of already initialized switch port 0 */
+	path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_node, 0));
+	osm_dr_path_init(path, osm_madw_get_bind_handle(p_madw),
 			 p_smp->hop_count, p_smp->initial_path);
 
 	context.si_context.node_guid = osm_node_get_node_guid(p_node);
@@ -517,7 +518,7 @@ __osm_ni_rcv_process_switch(IN osm_sm_t * sm,
 	context.si_context.light_sweep = FALSE;
 
 	/* Request a SwitchInfo attribute */
-	status = osm_req_get(sm, &dr_path, IB_MAD_ATTR_SWITCH_INFO,
+	status = osm_req_get(sm, path, IB_MAD_ATTR_SWITCH_INFO,
 			     0, CL_DISP_MSGID_NONE, &context);
 	if (status != IB_SUCCESS)
 		/* continue despite error */
diff --git a/opensm/opensm/osm_sw_info_rcv.c b/opensm/opensm/osm_sw_info_rcv.c
index e9973e3..ce86adb 100644
--- a/opensm/opensm/osm_sw_info_rcv.c
+++ b/opensm/opensm/osm_sw_info_rcv.c
@@ -59,17 +59,13 @@
  The plock must be held before calling this function.
 **********************************************************************/
 static void
-__osm_si_rcv_get_port_info(IN osm_sm_t * sm,
-			   IN osm_switch_t * const p_sw,
-			   IN const osm_madw_t * const p_madw)
+__osm_si_rcv_get_port_info(IN osm_sm_t * sm, IN osm_switch_t * const p_sw)
 {
 	osm_madw_context_t context;
 	uint8_t port_num;
 	osm_physp_t *p_physp;
 	osm_node_t *p_node;
 	uint8_t num_ports;
-	osm_dr_path_t dr_path;
-	const ib_smp_t *p_smp;
 	ib_api_status_t status = IB_SUCCESS;
 
 	OSM_LOG_ENTER(sm->p_log);
@@ -77,19 +73,13 @@ __osm_si_rcv_get_port_info(IN osm_sm_t * sm,
 	CL_ASSERT(p_sw);
 
 	p_node = p_sw->p_node;
-	p_smp = osm_madw_get_smp_ptr(p_madw);
 
 	CL_ASSERT(osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH);
 
 	/*
 	   Request PortInfo attribute for each port on the switch.
-	   Don't trust the port's own DR Path, since it may no longer
-	   be a legitimate path through the subnet.
-	   Build a path from the mad instead, since we know that path works.
-	   The port's DR Path info gets updated when the PortInfo
-	   attribute is received.
 	 */
-	p_physp = osm_node_get_any_physp_ptr(p_node);
+	p_physp = osm_node_get_physp_ptr(p_node, 0);
 
 	context.pi_context.node_guid = osm_node_get_node_guid(p_node);
 	context.pi_context.port_guid = osm_physp_get_port_guid(p_physp);
@@ -98,12 +88,10 @@ __osm_si_rcv_get_port_info(IN osm_sm_t * sm,
 	context.pi_context.active_transition = FALSE;
 
 	num_ports = osm_node_get_num_physp(p_node);
-	osm_dr_path_init(&dr_path, osm_madw_get_bind_handle(p_madw),
-			 p_smp->hop_count, p_smp->initial_path);
 
 	for (port_num = 0; port_num < num_ports; port_num++) {
-		status = osm_req_get(sm, &dr_path, IB_MAD_ATTR_PORT_INFO,
-				     cl_hton32(port_num),
+		status = osm_req_get(sm, osm_physp_get_dr_path_ptr(p_physp),
+				     IB_MAD_ATTR_PORT_INFO, cl_hton32(port_num),
 				     CL_DISP_MSGID_NONE, &context);
 		if (status != IB_SUCCESS)
 			/* continue the loop despite the error */
@@ -138,13 +126,12 @@ __osm_si_rcv_get_fwd_tbl(IN osm_sm_t * sm, IN osm_switch_t * const p_sw)
 
 	CL_ASSERT(osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH);
 
-	p_physp = osm_node_get_any_physp_ptr(p_node);
-
 	context.lft_context.node_guid = osm_node_get_node_guid(p_node);
 	context.lft_context.set_method = FALSE;
 
 	max_block_id_ho = osm_switch_get_max_block_id_in_use(p_sw);
 
+	p_physp = osm_node_get_physp_ptr(p_node, 0);
 	p_dr_path = osm_physp_get_dr_path_ptr(p_physp);
 
 	for (block_id_ho = 0; block_id_ho <= max_block_id_ho; block_id_ho++) {
@@ -197,12 +184,10 @@ __osm_si_rcv_get_mcast_fwd_tbl(IN osm_sm_t * sm, IN osm_switch_t * const p_sw)
 		goto Exit;
 	}
 
-	p_physp = osm_node_get_any_physp_ptr(p_node);
-	p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
-
 	context.mft_context.node_guid = osm_node_get_node_guid(p_node);
 	context.mft_context.set_method = FALSE;
 
+	p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
 	max_block_id_ho = osm_mcast_tbl_get_max_block(p_tbl);
 
 	if (max_block_id_ho > IB_MCAST_MAX_BLOCK_ID) {
@@ -221,6 +206,7 @@ __osm_si_rcv_get_mcast_fwd_tbl(IN osm_sm_t * sm, IN osm_switch_t * const p_sw)
 		"Max MFT block = %u, Max position = %u\n", max_block_id_ho,
 		max_position);
 
+	p_physp = osm_node_get_physp_ptr(p_node, 0);
 	p_dr_path = osm_physp_get_dr_path_ptr(p_physp);
 
 	for (block_id_ho = 0; block_id_ho <= max_block_id_ho; block_id_ho++) {
@@ -331,7 +317,7 @@ __osm_si_rcv_process_new(IN osm_sm_t * sm,
 	/*
 	   Get the PortInfo attribute for every port.
 	 */
-	__osm_si_rcv_get_port_info(sm, p_sw, p_madw);
+	__osm_si_rcv_get_port_info(sm, p_sw);
 
 	/*
 	   Don't bother retrieving the current unicast and multicast tables
@@ -426,7 +412,7 @@ __osm_si_rcv_process_existing(IN osm_sm_t * sm,
 
 			/* If this is the first discovery - then get the port_info */
 			if (p_sw->discovery_count == 1)
-				__osm_si_rcv_get_port_info(sm, p_sw, p_madw);
+				__osm_si_rcv_get_port_info(sm, p_sw);
 			else
 				OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
 					"Not discovering again through switch:0x%"
-- 
1.6.0.4.766.g6fc4a


From ogerlitz at voltaire.com  Sat Nov 22 23:22:48 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 23 Nov 2008 09:22:48 +0200
Subject: [ofa-general] infiniband problem, no NICs
In-Reply-To: <4925BD78.4030003@tu-berlin.de>
References: <4925BD78.4030003@tu-berlin.de>
Message-ID: <492904C8.7000402@voltaire.com>

Michael Oevermann wrote:

> However, when I directly start a mpi job (without using a scheduler) via:
> /usr/mpi/gcc4/openmpi-1.2.2-1/bin/mpirun -np 4 -hostfile 
> /home/sysgen/infiniband-mpi-test/machine/usr/mpi/gcc4/openmpi-1.2.2-1/tests/IMB-2.3/IMB-MPI1 
>
>
> I get the error message:
>
> 0,1,0]: uDAPL on host n01 was unable to find any NICs. Another 
> transport will be used instead, although this may result in lower 
> performance.
The BTL you are working with uses a library named uDAPL, and this library
relies on the existence of IPoIB (IP over InfiniBand) NICs (e.g. ib0, ib1).
Assuming these NICs are not configured on your system, you can either
configure them (modprobe ib_ipoib / ifconfig ib0 x.y.z.w) or use a verbs
(native IB access layer) BTL, which does not rely on an operational IPoIB.
   
Or.


From sashak at voltaire.com  Sun Nov 23 00:34:05 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Nov 2008 10:34:05 +0200
Subject: [ofa-general] [PATCH] opensm: remove osm_node_get_any_dr_path_ptr()
	function
Message-ID: <20081123083405.GD21967@sashak.voltaire.com>


The function osm_node_get_any_dr_path_ptr() is dangerous because it uses a
potentially outdated local port number from NodeInfo.

Port moving in combination with a PortInfo Get failure may cause the
wrong DR path to be used, so that the subnet never comes up (the issue was
simulated with ibsim).

This patch removes this function completely and instead uses the DR path of
switch port 0, which is always up to date.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_node.h |   39 --------------------------------------
 opensm/opensm/osm_mcast_mgr.c    |    4 +--
 opensm/opensm/osm_state_mgr.c    |    2 +-
 opensm/opensm/osm_ucast_mgr.c    |    4 +--
 4 files changed, 3 insertions(+), 46 deletions(-)

diff --git a/opensm/include/opensm/osm_node.h b/opensm/include/opensm/osm_node.h
index 24e399e..8d90f88 100644
--- a/opensm/include/opensm/osm_node.h
+++ b/opensm/include/opensm/osm_node.h
@@ -272,45 +272,6 @@ static inline osm_physp_t *osm_node_get_any_physp_ptr(IN const osm_node_t *
 *	Node object
 *********/
 
-/****f* OpenSM: Node/osm_node_get_any_path
-* NAME
-*	osm_node_get_any_path
-*
-* DESCRIPTION
-*	Returns a pointer to the physical port object at the
-*	specified local port number.
-*
-* SYNOPSIS
-*/
-static inline osm_dr_path_t *osm_node_get_any_dr_path_ptr(IN const osm_node_t *
-							  const p_node)
-{
-	CL_ASSERT(p_node);
-	return (osm_physp_get_dr_path_ptr
-		(&p_node->
-		 physp_table[ib_node_info_get_local_port_num
-			     (&p_node->node_info)]));
-}
-
-/*
-* PARAMETERS
-*	p_node
-*		[in] Pointer to an osm_node_t object.
-*
-*	port_num
-*		[in] Local port number.
-*
-* RETURN VALUES
-*	Returns a pointer to the physical port object at the
-*	specified local port number.
-*	A return value of zero means the port number was out of range.
-*
-* NOTES
-*
-* SEE ALSO
-*	Node object
-*********/
-
 /****f* OpenSM: Node/osm_node_get_type
 * NAME
 *	osm_node_get_type
diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c
index 6d26694..2f9cb5e 100644
--- a/opensm/opensm/osm_mcast_mgr.c
+++ b/opensm/opensm/osm_mcast_mgr.c
@@ -356,9 +356,7 @@ __osm_mcast_mgr_set_tbl(osm_sm_t * sm, IN osm_switch_t * const p_sw)
 
 	CL_ASSERT(p_node);
 
-	p_path = osm_node_get_any_dr_path_ptr(p_node);
-
-	CL_ASSERT(p_path);
+	p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_node, 0));
 
 	/*
 	   Send multicast forwarding table blocks to the switch
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 9404e24..599af0a 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -135,7 +135,7 @@ static void __osm_state_mgr_get_sw_info(IN cl_map_item_t * const p_object,
 	OSM_LOG_ENTER(sm->p_log);
 
 	p_node = p_sw->p_node;
-	p_dr_path = osm_node_get_any_dr_path_ptr(p_node);
+	p_dr_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_node, 0));
 
 	memset(&mad_context, 0, sizeof(mad_context));
 
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 175817c..1409e15 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -336,9 +336,7 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr,
 
 	CL_ASSERT(p_node);
 
-	p_path = osm_node_get_any_dr_path_ptr(p_node);
-
-	CL_ASSERT(p_path);
+	p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_node, 0));
 
 	/*
 	   Set the top of the unicast forwarding table.
-- 
1.6.0.4.766.g6fc4a


From sashak at voltaire.com  Sun Nov 23 01:05:57 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Nov 2008 11:05:57 +0200
Subject: [ofa-general] Re: [PATCH] opensm/opensm/osm_state_mgr.c: Add check
	for valid physical port before using pointer.
In-Reply-To: <20081118140608.19ac0963.weiny2@llnl.gov>
References: <20081104095744.35893d4a.weiny2@llnl.gov>
	<20081110201333.GM313@sashak.voltaire.com>
	<20081110131140.52561f42.weiny2@llnl.gov>
	<20081112185457.GD27271@sashak.voltaire.com>
	<20081118123000.GO10251@sashak.voltaire.com>
	<20081118140608.19ac0963.weiny2@llnl.gov>
Message-ID: <20081123090557.GF21967@sashak.voltaire.com>

Hi Ira,

On 14:06 Tue 18 Nov     , Ira Weiny wrote:
> I am not sure this will fix my bug.
> 
> The stack trace in my bug ended with:
> 
>    #0  osm_vendor_get (h_bind=0x0, mad_size=256, p_vw=0x69bbe8) at
> 
> The h_bind was being extracted from the osm_physp_t object.  Would this fix
> ensure that the h_bind pointer was valid in the osm_physp_t object returned?

Not always :(. It will protect against port moving, but may not help in
case of a PortInfo Get failure (which, as far as I understand now, is
your case).

Finally I just removed osm_node_get_any_physp_ptr() (as well as
osm_node_get_any_dr_path(), which makes a similar assumption about the
local port number in NodeInfo) in all places where it was used.

I think we can do the same in __osm_state_mgr_get_node_desc() and remove
osm_node_get_any_physp_ptr() completely. The patch will follow shortly.

Sasha
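
The difference between trusting the cached local port number and scanning
for a usable port can be sketched as follows. This is only an illustration
under simplifying assumptions: struct port, struct node,
get_port_by_cached_index() and find_valid_port() are hypothetical stand-ins,
not OpenSM types or functions, and the valid flag stands in for whatever
osm_physp_is_valid() checks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for osm_physp_t / osm_node_t; not the OpenSM types. */
struct port {
	int valid;		/* set once the port has been initialized */
	void *h_bind;		/* bind handle, NULL until then */
};

struct node {
	unsigned local_port_num;	/* cached from NodeInfo, may be stale */
	unsigned num_ports;
	struct port *ports;
};

/* Fragile: trusts the cached index, so it may return an uninitialized port. */
static struct port *get_port_by_cached_index(struct node *n)
{
	assert(n && n->local_port_num < n->num_ports);
	return &n->ports[n->local_port_num];
}

/* Safer: return the first port that is actually valid, or NULL if none is. */
static struct port *find_valid_port(struct node *n)
{
	unsigned i;

	assert(n);
	for (i = 0; i < n->num_ports; i++)
		if (n->ports[i].valid)
			return &n->ports[i];
	return NULL;
}
```

A caller of find_valid_port() has to handle the NULL case explicitly, which
is the point: the failure becomes visible instead of surfacing later as a
NULL h_bind.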


From alekseys at voltaire.com  Sun Nov 23 01:16:34 2008
From: alekseys at voltaire.com (Aleksey Senin)
Date: Sun, 23 Nov 2008 09:16:34 +0000
Subject: [ofa-general] RDMA CM and IPv6 support
Message-ID: <1227431794.4180.7.camel@alst60.voltaire.com>

Hi, Roland.
There was a set of kernel patches written by me and approved by Sean for
RDMA CM to support the IPv6 protocol. Is there any reason why they were
not applied? I'll be glad to fix them.
Here is the reference to this thread:


http://lists.openfabrics.org/pipermail/general/2008-August/053663.html


From ogerlitz at voltaire.com  Sun Nov 23 01:23:37 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 23 Nov 2008 11:23:37 +0200
Subject: [ofa-general] [PATCH] iser: avoid recv buf exhaustion
In-Reply-To: <1227247845-16023-1-git-send-email-ddiss@sgi.com>
References: <1227247845-16023-1-git-send-email-ddiss@sgi.com>
Message-ID: <49292119.9080105@voltaire.com>

David Disseldorp wrote:
> iSCSI/iSER targets may send PDUs without a prior request from the initiator, RFC 5046 refers to these PDUs as "unexpected". NOP-In PDUs with itt=RESERVED and Asynchronous Message PDUs occupy this category. Currently when an iSER target sends an "unexpected" PDU, the initiators recv buffer consumed by the PDU is not replaced. If over initial_post_recv_bufs_num "unexpected" PDUs are received then the receive queue will run out of receive work requests.
Assuming these target-initiated NOP-Ins are echoed back by the
initiator, the current code of iser_send_control would post a receive
buffer when sending the NOP-Out, which will account for the buffer
consumed by the NOP-In. So we are left with the Asynchronous PDUs,
for which your patch indeed seems to fix a hole in the implementation.
>
> This patch ensures recv buffers consumed by "unexpected" PDUs are replaced prior to sending the next control-type PDU.
The practice used by the patch is to account for unexpected receives and
refill the receive buffer queue whenever possible with as many buffers as
unexpected receives that took place since the last refill attempt. For
ease of future maintenance and debugging, and for simplicity of the code,
I would prefer a patch with zero footprint in the iser_send_xxx functions,
something like accounting for --async-- receives and filling in the
missing buffers when calling iser_post_receive_control.

> @@ -586,6 +635,21 @@ void iser_rcv_completion(struct iser_desc *rx_desc,
>  	 * parallel to the execution of iser_conn_term. So the code that waits *
>  	 * for the posted rx bufs refcount to become zero handles everything   */
>  	atomic_dec(&conn->ib_conn->post_recv_buf_count);
> +
> +	/*
> +	 * if an unexpected PDU was received then the recv wr consumed must
> +	 * be replaced, this is done in the next send of a control-type PDU
> +	 */
> +	if ((opcode == ISCSI_OP_NOOP_IN)
> +	 && (hdr->itt == RESERVED_ITT)) {
> +		/* nop-in with itt = 0xffffffff */
> +		atomic_inc(&conn->ib_conn->unexpected_pdu_count);
> +	}
As I wrote above, this seems to be unneeded

Or.
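
The accounting suggested above (count the unsolicited receives, refill
when the receive for the next control PDU is posted) might look roughly
like the sketch below. It is a simplified model, not iSER code: struct
rx_state, on_recv_completion() and post_receive_control() are hypothetical
names, and the counters are plain ints where the driver would use atomics.

```c
#include <assert.h>

/* Hypothetical connection state; field names are illustrative only. */
struct rx_state {
	int posted;	/* receive buffers currently posted */
	int async_rx;	/* unsolicited PDUs received since the last refill */
};

/* Receive completion: one posted buffer was consumed. */
static void on_recv_completion(struct rx_state *s, int unsolicited)
{
	assert(s->posted > 0);
	s->posted--;
	if (unsolicited)
		s->async_rx++;
}

/*
 * Before sending a control PDU: post one buffer for the expected response
 * plus one for every unsolicited PDU seen since the last refill.
 * Returns the number of buffers posted.
 */
static int post_receive_control(struct rx_state *s)
{
	int n = 1 + s->async_rx;

	s->async_rx = 0;
	s->posted += n;
	return n;
}
```

With this shape the send path does not need to know about unexpected PDUs
at all; the refill is folded into the one place that already posts
receive buffers.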


From sashak at voltaire.com  Sun Nov 23 01:32:08 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Nov 2008 11:32:08 +0200
Subject: [ofa-general] [PATCH] opensm: remove osm_node_get_any_physp_ptr()
	function
Message-ID: <20081123093208.GG21967@sashak.voltaire.com>


The function osm_node_get_any_physp_ptr() is dangerous because it uses
a potentially outdated local port number from NodeInfo. Port moving
and/or PortInfo Get failures may cause a pointer to a wrong
(uninitialized) port to be returned. This patch removes this function
completely.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/include/opensm/osm_node.h |   36 ------------------------------------
 opensm/opensm/osm_state_mgr.c    |   13 +++++++++----
 2 files changed, 9 insertions(+), 40 deletions(-)

diff --git a/opensm/include/opensm/osm_node.h b/opensm/include/opensm/osm_node.h
index 8d90f88..50b3598 100644
--- a/opensm/include/opensm/osm_node.h
+++ b/opensm/include/opensm/osm_node.h
@@ -236,42 +236,6 @@ static inline osm_physp_t *osm_node_get_physp_ptr(IN osm_node_t * const p_node,
 *	Node object
 *********/
 
-/****f* OpenSM: Node/osm_node_get_any_physp_ptr
-* NAME
-*	osm_node_get_any_physp_ptr
-*
-* DESCRIPTION
-*	Returns a pointer to any valid physical port object associated
-*	with this node.  This operation is mostly meaningful for switches,
-*	in which case all the Physical Ports share the same GUID.
-*
-* SYNOPSIS
-*/
-static inline osm_physp_t *osm_node_get_any_physp_ptr(IN const osm_node_t *
-						      const p_node)
-{
-	CL_ASSERT(p_node);
-	return ((osm_physp_t *) & p_node->
-		physp_table[ib_node_info_get_local_port_num
-			    (&p_node->node_info)]);
-}
-
-/*
-* PARAMETERS
-*	p_node
-*		[in] Pointer to an osm_node_t object.
-*
-* RETURN VALUES
-*	Returns a pointer to any valid physical port object associated
-*	with this node.  This operation is mostly meaningful for switches,
-*	in which case all the Physical Ports share the same GUID.
-*
-* NOTES
-*
-* SEE ALSO
-*	Node object
-*********/
-
 /****f* OpenSM: Node/osm_node_get_type
 * NAME
 *	osm_node_get_type
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 599af0a..56212fe 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -524,7 +524,8 @@ static void __osm_state_mgr_get_node_desc(IN cl_map_item_t * const p_object,
 	osm_madw_context_t mad_context;
 	osm_node_t *const p_node = (osm_node_t *) p_object;
 	osm_sm_t *sm = context;
-	osm_physp_t *p_physp;
+	osm_physp_t *p_physp = NULL;
+	unsigned i, num_ports;
 	ib_api_status_t status;
 
 	OSM_LOG_ENTER(sm->p_log);
@@ -541,10 +542,14 @@ static void __osm_state_mgr_get_node_desc(IN cl_map_item_t * const p_object,
 		cl_ntoh64(osm_node_get_node_guid (p_node)));
 
 	/* get a physp to request from. */
-	p_physp = osm_node_get_any_physp_ptr(p_node);
-	if (!osm_physp_is_valid(p_physp)) {
+	num_ports = osm_node_get_num_physp(p_node);
+	for (i = 0; i < num_ports; i++)
+		if ((p_physp = osm_node_get_physp_ptr(p_node, i)))
+			break;
+
+	if (!p_physp) {
 		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 331C: "
-			"Failed to get valid physical port object\n");
+			"Failed to find any valid physical port object.\n");
 		goto exit;
 	}
 
-- 
1.6.0.4.766.g6fc4a


From sashak at voltaire.com  Sun Nov 23 03:05:56 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Nov 2008 13:05:56 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm: free lft_buf if it matches
	switch's lft
In-Reply-To: <49255C13.5030503@dev.mellanox.co.il>
References: <4909DAC8.4040602@dev.mellanox.co.il>
	<20081030214519.GN7502@sashak.voltaire.com>
	<490A2C5D.4080309@dev.mellanox.co.il>
	<20081031043226.GH16455@sashak.voltaire.com>
	<49255C13.5030503@dev.mellanox.co.il>
Message-ID: <20081123110556.GH21967@sashak.voltaire.com>

Hi Yevgeny,

On 14:46 Thu 20 Nov     , Yevgeny Kliteynik wrote:
>
> I can do something like the following patch, but I have
> some strange feeling that I'm missing something...

I cannot see any errors here. But probably you can use a simpler approach
- just clean up all switches' lft_buf separately after ucast_mgr is
finished (including wait_for_pending_transactions()). Something like
below (if it is fine with you I can just apply this patch).

BTW, what about renaming lft_buf to new_lft (to improve readability)?

Sasha


diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 56212fe..c810106 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1001,6 +1001,23 @@ static void __osm_state_mgr_check_tbl_consistency(IN osm_sm_t * sm)
 	OSM_LOG_EXIT(sm->p_log);
 }
 
+static void cleanup_switch(cl_map_item_t *item, void *log)
+{
+	osm_switch_t *sw = (osm_switch_t *)item;
+
+	if (!sw->lft_buf)
+		return;
+	
+	if (memcmp(sw->lft, sw->lft_buf, IB_LID_UCAST_END_HO + 1))
+		osm_log(log, OSM_LOG_ERROR, "ERR 331D: "
+			"LFT of switch 0x%016" PRIx64 " is not up to date.\n",
+			cl_ntoh64(sw->p_node->node_info.node_guid));
+	else {
+		free(sw->lft_buf);
+		sw->lft_buf = NULL;
+	}
+}
+
 /**********************************************************************
  **********************************************************************/
 int wait_for_pending_transactions(osm_stats_t * stats)
@@ -1254,6 +1271,9 @@ _repeat_discovery:
 	if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats))
 		return;
 
+	/* cleanup switch lft buffers */
+	cl_qmap_apply_func(&sm->p_subn->sw_guid_tbl, cleanup_switch, sm->p_log);
+
 	/* We are done setting all LFTs so clear the ignore existing.
 	 * From now on, as long as we are still master, we want to
 	 * take into account these lfts. */
diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
index 642dcd1..c446f4f 100644
--- a/opensm/opensm/osm_switch.c
+++ b/opensm/opensm/osm_switch.c
@@ -114,13 +114,6 @@ osm_switch_init(IN osm_switch_t * const p_sw,
 	/* Initialize the table to OSM_NO_PATH, which is "invalid port" */
 	memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
 
-	p_sw->lft_buf = malloc(IB_LID_UCAST_END_HO + 1);
-	if (!p_sw->lft_buf) {
-		status = IB_INSUFFICIENT_MEMORY;
-		goto Exit;
-	}
-	memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
-
 	p_sw->p_prof = malloc(sizeof(*p_sw->p_prof) * num_ports);
 	if (p_sw->p_prof == NULL) {
 		status = IB_INSUFFICIENT_MEMORY;
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 1409e15..3d47640 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -397,13 +397,6 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr,
 		goto Exit;
 	}
 
-	if (!p_sw->need_update &&
-	    !memcmp(p_sw->lft, p_sw->lft_buf, IB_LID_UCAST_END_HO + 1)) {
-		free(p_sw->lft_buf);
-		p_sw->lft_buf = NULL;
-		goto Exit;
-	}
-
 	for (block_id_ho = 0;
 	     osm_switch_get_lft_block(p_sw, block_id_ho, block);
 	     block_id_ho++) {


From vlad at lists.openfabrics.org  Sun Nov 23 03:22:02 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 23 Nov 2008 03:22:02 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081123-0200 daily build status
Message-ID: <20081123112202.4115AE60CD2@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From kliteyn at dev.mellanox.co.il  Sun Nov 23 03:58:20 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 23 Nov 2008 13:58:20 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm: free lft_buf if it matches
	switch's lft
In-Reply-To: <20081123110556.GH21967@sashak.voltaire.com>
References: <4909DAC8.4040602@dev.mellanox.co.il>
	<20081030214519.GN7502@sashak.voltaire.com>
	<490A2C5D.4080309@dev.mellanox.co.il>
	<20081031043226.GH16455@sashak.voltaire.com>
	<49255C13.5030503@dev.mellanox.co.il>
	<20081123110556.GH21967@sashak.voltaire.com>
Message-ID: <4929455C.2080407@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 14:46 Thu 20 Nov     , Yevgeny Kliteynik wrote:
>> I can do something like the following patch, but I have
>> some strange feeling that I'm missing something...
> 
> I cannot see any errors here. But probably you can use simpler approach
> - just cleanup all switch's lft_buf separately after ucast_mgr is
> finished (including wait_for_pending_transactions()). Something like
> below (if it is fine for you I can just apply this patch).

In general, looks good. See below.

> BTW, what about to rename lft_buf to new_lft (to improve readability)?

Sure, why not.

> Sasha
> 
> 
> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
> index 56212fe..c810106 100644
> --- a/opensm/opensm/osm_state_mgr.c
> +++ b/opensm/opensm/osm_state_mgr.c
> @@ -1001,6 +1001,23 @@ static void __osm_state_mgr_check_tbl_consistency(IN osm_sm_t * sm)
>  	OSM_LOG_EXIT(sm->p_log);
>  }
>  
> +static void cleanup_switch(cl_map_item_t *item, void *log)
> +{
> +	osm_switch_t *sw = (osm_switch_t *)item;
> +
> +	if (!sw->lft_buf)
> +		return;
> +	
> +	if (memcmp(sw->lft, sw->lft_buf, IB_LID_UCAST_END_HO + 1))

Should it turn on the p_subn->subnet_initialization_error flag?

> +		osm_log(log, OSM_LOG_ERROR, "ERR 331D: "
> +			"LFT of switch 0x%016" PRIx64 " is not up to date.\n",
> +			cl_ntoh64(sw->p_node->node_info.node_guid));
> +	else {
> +		free(sw->lft_buf);
> +		sw->lft_buf = NULL;
> +	}
> +}
> +
>  /**********************************************************************
>   **********************************************************************/
>  int wait_for_pending_transactions(osm_stats_t * stats)
> @@ -1254,6 +1271,9 @@ _repeat_discovery:
>  	if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats))
>  		return;
>  
> +	/* cleanup switch lft buffers */
> +	cl_qmap_apply_func(&sm->p_subn->sw_guid_tbl, cleanup_switch, sm->p_log);
> +
>  	/* We are done setting all LFTs so clear the ignore existing.
>  	 * From now on, as long as we are still master, we want to
>  	 * take into account these lfts. */
> diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
> index 642dcd1..c446f4f 100644
> --- a/opensm/opensm/osm_switch.c
> +++ b/opensm/opensm/osm_switch.c
> @@ -114,13 +114,6 @@ osm_switch_init(IN osm_switch_t * const p_sw,
>  	/* Initialize the table to OSM_NO_PATH, which is "invalid port" */
>  	memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
>  
> -	p_sw->lft_buf = malloc(IB_LID_UCAST_END_HO + 1);
> -	if (!p_sw->lft_buf) {
> -		status = IB_INSUFFICIENT_MEMORY;
> -		goto Exit;
> -	}
> -	memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
> -

This part is relevant even w/o the rest of the patch, right?

-- Yevgeny

>  	p_sw->p_prof = malloc(sizeof(*p_sw->p_prof) * num_ports);
>  	if (p_sw->p_prof == NULL) {
>  		status = IB_INSUFFICIENT_MEMORY;
> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
> index 1409e15..3d47640 100644
> --- a/opensm/opensm/osm_ucast_mgr.c
> +++ b/opensm/opensm/osm_ucast_mgr.c
> @@ -397,13 +397,6 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr,
>  		goto Exit;
>  	}
>  
> -	if (!p_sw->need_update &&
> -	    !memcmp(p_sw->lft, p_sw->lft_buf, IB_LID_UCAST_END_HO + 1)) {
> -		free(p_sw->lft_buf);
> -		p_sw->lft_buf = NULL;
> -		goto Exit;
> -	}
> -
>  	for (block_id_ho = 0;
>  	     osm_switch_get_lft_block(p_sw, block_id_ho, block);
>  	     block_id_ho++) {
> 


From kliteyn at dev.mellanox.co.il  Sun Nov 23 04:03:57 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 23 Nov 2008 14:03:57 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm: free lft_buf if it matches
	switch's lft
In-Reply-To: <20081123110556.GH21967@sashak.voltaire.com>
References: <4909DAC8.4040602@dev.mellanox.co.il>
	<20081030214519.GN7502@sashak.voltaire.com>
	<490A2C5D.4080309@dev.mellanox.co.il>
	<20081031043226.GH16455@sashak.voltaire.com>
	<49255C13.5030503@dev.mellanox.co.il>
	<20081123110556.GH21967@sashak.voltaire.com>
Message-ID: <492946AD.5090308@dev.mellanox.co.il>

Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 14:46 Thu 20 Nov     , Yevgeny Kliteynik wrote:
>> I can do something like the following patch, but I have
>> some strange feeling that I'm missing something...
> 
> I cannot see any errors here. But probably you can use simpler approach
> - just cleanup all switch's lft_buf separately after ucast_mgr is
> finished (including wait_for_pending_transactions()).

I've been doing some thinking...
Basically, what you're saying is that at the end of each and
every heavy sweep you will free ALL the lft_buf arrays, unless
there was some error, that should trigger a new heavy sweep
anyway. So what's the point of having lft_buf in the first place?

It was relevant in the beginning of ucast cache implementation,
but now after all the lft simplifications, I don't see how it
is used. Am I missing something here, or should we just remove
all these lft_buf and go back to single ucast_mgr_t.lft_buf?

-- Yevgeny
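
The buffer lifecycle being questioned here can be sketched in isolation.
This is a reduced model under stated assumptions, not the OpenSM code:
struct sw and LFT_SIZE are hypothetical, new_lft plays the role of
lft_buf, and the return value replaces the error logging of the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define LFT_SIZE 16	/* illustrative; OpenSM uses IB_LID_UCAST_END_HO + 1 */

/* Hypothetical reduced switch object. */
struct sw {
	unsigned char lft[LFT_SIZE];	/* table as confirmed by the switch */
	unsigned char *new_lft;		/* table computed by routing, or NULL */
};

/*
 * End-of-sweep cleanup for one switch.
 * Returns 1 if the switch is in sync (buffer released or never allocated),
 * 0 if the confirmed table still differs from the computed one.
 */
static int cleanup_switch(struct sw *s)
{
	if (!s->new_lft)
		return 1;
	if (memcmp(s->lft, s->new_lft, LFT_SIZE))
		return 0;	/* mismatch: keep the buffer for the next sweep */
	free(s->new_lft);
	s->new_lft = NULL;
	return 1;
}
```

In this model the buffer outlives a sweep only when a mismatch remains,
which is the case the question above is about.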


From sashak at voltaire.com  Sun Nov 23 04:17:38 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Nov 2008 14:17:38 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm: free lft_buf if it matches
	switch's lft
In-Reply-To: <4929455C.2080407@dev.mellanox.co.il>
References: <4909DAC8.4040602@dev.mellanox.co.il>
	<20081030214519.GN7502@sashak.voltaire.com>
	<490A2C5D.4080309@dev.mellanox.co.il>
	<20081031043226.GH16455@sashak.voltaire.com>
	<49255C13.5030503@dev.mellanox.co.il>
	<20081123110556.GH21967@sashak.voltaire.com>
	<4929455C.2080407@dev.mellanox.co.il>
Message-ID: <20081123121738.GJ21967@sashak.voltaire.com>

On 13:58 Sun 23 Nov     , Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> Sasha Khapyorsky wrote:
>> Hi Yevgeny,
>> On 14:46 Thu 20 Nov     , Yevgeny Kliteynik wrote:
>>> I can do something like the following patch, but I have
>>> some strange feeling that I'm missing something...
>> I cannot see any errors here. But probably you can use simpler approach
>> - just cleanup all switch's lft_buf separately after ucast_mgr is
>> finished (including wait_for_pending_transactions()). Something like
>> below (if it is fine for you I can just apply this patch).
>
> In general, looks good. See below.
>
>> BTW, what about to rename lft_buf to new_lft (to improve readability)?
>
> Sure, why not.
>
>> Sasha
>> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
>> index 56212fe..c810106 100644
>> --- a/opensm/opensm/osm_state_mgr.c
>> +++ b/opensm/opensm/osm_state_mgr.c
>> @@ -1001,6 +1001,23 @@ static void 
>> __osm_state_mgr_check_tbl_consistency(IN osm_sm_t * sm)
>>  	OSM_LOG_EXIT(sm->p_log);
>>  }
>>  +static void cleanup_switch(cl_map_item_t *item, void *log)
>> +{
>> +	osm_switch_t *sw = (osm_switch_t *)item;
>> +
>> +	if (!sw->lft_buf)
>> +		return;
>> +	
>> +	if (memcmp(sw->lft, sw->lft_buf, IB_LID_UCAST_END_HO + 1))
>
> Should it turn on the p_subn->subnet_initialization_error flag?

Maybe, but I'm not sure - this is more for bug#1401 materials :),
basically I would expect the subnet_initialization_error flag to be set
when LFT Set fails.

>
>> +		osm_log(log, OSM_LOG_ERROR, "ERR 331D: "
>> +			"LFT of switch 0x%016" PRIx64 " is not up to date.\n",
>> +			cl_ntoh64(sw->p_node->node_info.node_guid));
>> +	else {
>> +		free(sw->lft_buf);
>> +		sw->lft_buf = NULL;
>> +	}
>> +}
>> +
>>  /**********************************************************************
>>   **********************************************************************/
>>  int wait_for_pending_transactions(osm_stats_t * stats)
>> @@ -1254,6 +1271,9 @@ _repeat_discovery:
>>  	if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats))
>>  		return;
>>  +	/* cleanup switch lft buffers */
>> +	cl_qmap_apply_func(&sm->p_subn->sw_guid_tbl, cleanup_switch, sm->p_log);
>> +
>>  	/* We are done setting all LFTs so clear the ignore existing.
>>  	 * From now on, as long as we are still master, we want to
>>  	 * take into account these lfts. */
>> diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
>> index 642dcd1..c446f4f 100644
>> --- a/opensm/opensm/osm_switch.c
>> +++ b/opensm/opensm/osm_switch.c
>> @@ -114,13 +114,6 @@ osm_switch_init(IN osm_switch_t * const p_sw,
>>  	/* Initialize the table to OSM_NO_PATH, which is "invalid port" */
>>  	memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
>>  -	p_sw->lft_buf = malloc(IB_LID_UCAST_END_HO + 1);
>> -	if (!p_sw->lft_buf) {
>> -		status = IB_INSUFFICIENT_MEMORY;
>> -		goto Exit;
>> -	}
>> -	memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
>> -
>
> This part is relevant even w/o the rest of the patch, right?

Yes.

Sasha

>
> -- Yevgeny
>
>>  	p_sw->p_prof = malloc(sizeof(*p_sw->p_prof) * num_ports);
>>  	if (p_sw->p_prof == NULL) {
>>  		status = IB_INSUFFICIENT_MEMORY;
>> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
>> index 1409e15..3d47640 100644
>> --- a/opensm/opensm/osm_ucast_mgr.c
>> +++ b/opensm/opensm/osm_ucast_mgr.c
>> @@ -397,13 +397,6 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * 
>> const p_mgr,
>>  		goto Exit;
>>  	}
>>  -	if (!p_sw->need_update &&
>> -	    !memcmp(p_sw->lft, p_sw->lft_buf, IB_LID_UCAST_END_HO + 1)) {
>> -		free(p_sw->lft_buf);
>> -		p_sw->lft_buf = NULL;
>> -		goto Exit;
>> -	}
>> -
>>  	for (block_id_ho = 0;
>>  	     osm_switch_get_lft_block(p_sw, block_id_ho, block);
>>  	     block_id_ho++) {
>


From kliteyn at dev.mellanox.co.il  Sun Nov 23 04:20:37 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 23 Nov 2008 14:20:37 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_switch.h: use updated LFT for
	routing
In-Reply-To: <20081121192428.GB8310@sashak.voltaire.com>
References: <492550E3.90805@dev.mellanox.co.il>
	<20081121192428.GB8310@sashak.voltaire.com>
Message-ID: <49294A95.3060100@dev.mellanox.co.il>

Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 13:58 Thu 20 Nov     , Yevgeny Kliteynik wrote:
>> Function osm_switch_get_port_by_lid() was using the switch's
>> LFT, so this LFT might not be updated to recent routing.
> 
> I guess it could be only with 'subnet_initialization_error' flag up
> (failed LinFwdTbl set will trigger this flag).
>> I think that this was also relevant before the LFT simplification.
> 
> Yes, logically it should be so, but...
> 
>> One immediate outcome of this bug is opensm.fdbs file - when it
>> is dumped from the switch LFT (and not from lft_buf),
> 
> Why this bug is triggered only now?

I sometimes had errors in simulations, and after some analysis
I decided that they were timing problems with the tests.
Now that I did some stress testing of ucast cache, I started
to see more of these errors.

>> it sometimes
>> doesn't match the lst file.
> 
> What this "sometimes" mean? I think the case should be investigated
> deeper. By such patch we are just trying to hide a possible issue.
> 
> As far as I understand opensm.fdbs (and other routing dump) are
> generated only after all LinFwdTbl responses are arrived, when some of
> them failed 'subnet_initialization_error' flag is up and OpenSM will
> resweep. If so why is 'opensm.fdbs' broken? It is not immediately
> clear for me.

I didn't see 'subnet_initialization_error' in such cases.
Anyway, here's what I can do: at the end of each ucast_mgr_process
I'll compare lft and lft_buf (something that the other patch is
doing, the one that frees lft_buf), and if there is a difference,
then we have a problem. If not - then I'll look for the cause
elsewhere.

-- Yevgeny
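
For the comparison proposed above, a helper that reports where the two
tables diverge may be more useful while debugging than a bare memcmp().
The following is a minimal sketch; first_lft_mismatch() is a hypothetical
name and the tables are passed as plain byte arrays rather than taken
from osm_switch_t.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Compare the table confirmed by the switch (lft) with the one computed
 * by routing (lft_buf). Returns the first LID whose entries differ, or -1
 * if the first 'size' entries are identical.
 */
static long first_lft_mismatch(const unsigned char *lft,
			       const unsigned char *lft_buf, size_t size)
{
	size_t lid;

	assert(lft && lft_buf);
	for (lid = 0; lid < size; lid++)
		if (lft[lid] != lft_buf[lid])
			return (long)lid;
	return -1;
}
```

Logging the returned LID together with both port values would show
whether the divergence is a single entry or a whole block that was never
acknowledged.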

> Sasha
> 
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
>>  opensm/include/opensm/osm_switch.h |    6 +++++-
>>  1 files changed, 5 insertions(+), 1 deletions(-)
>>
>> diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
>> index caa0bc5..f06931c 100644
>> --- a/opensm/include/opensm/osm_switch.h
>> +++ b/opensm/include/opensm/osm_switch.h
>> @@ -411,7 +411,11 @@ osm_switch_get_port_by_lid(IN const osm_switch_t * const p_sw,
>>  {
>>  	if (lid_ho == 0 || lid_ho > IB_LID_UCAST_END_HO)
>>  		return OSM_NO_PATH;
>> -	return p_sw->lft[lid_ho];
>> +
>> +	if (p_sw->lft_buf)
>> +		return p_sw->lft_buf[lid_ho];
>> +	else
>> +		return p_sw->lft[lid_ho];
>>  }
>>  /*
>>  * PARAMETERS
>> -- 
>> 1.5.1.4
>>
>>
> 


From sashak at voltaire.com  Sun Nov 23 04:24:07 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Nov 2008 14:24:07 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm: free lft_buf if it matches
	switch's lft
In-Reply-To: <492946AD.5090308@dev.mellanox.co.il>
References: <4909DAC8.4040602@dev.mellanox.co.il>
	<20081030214519.GN7502@sashak.voltaire.com>
	<490A2C5D.4080309@dev.mellanox.co.il>
	<20081031043226.GH16455@sashak.voltaire.com>
	<49255C13.5030503@dev.mellanox.co.il>
	<20081123110556.GH21967@sashak.voltaire.com>
	<492946AD.5090308@dev.mellanox.co.il>
Message-ID: <20081123122407.GK21967@sashak.voltaire.com>

On 14:03 Sun 23 Nov     , Yevgeny Kliteynik wrote:
> Sasha,
>
> Sasha Khapyorsky wrote:
>> Hi Yevgeny,
>> On 14:46 Thu 20 Nov     , Yevgeny Kliteynik wrote:
>>> I can do something like the following patch, but I have
>>> some strange feeling that I'm missing something...
>> I cannot see any errors here. But probably you can use simpler approach
>> - just cleanup all switch's lft_buf separately after ucast_mgr is
>> finished (including wait_for_pending_transactions()).
>
> I've been doing some thinking...
> Basically, what you're saying is that at the end of each and
> every heavy sweep you will free ALL the lft_buf arrays, unless
> there was some error, that should trigger a new heavy sweep
> anyway. So what's the point of having lft_buf in the first place?
>
> It was relevant in the beginning of ucast cache implementation,
> but now after all the lft simplifications, I don't see how it
> is used. Am I missing something here, or should we just remove
> all these lft_buf and go back to single ucast_mgr_t.lft_buf?

As far as I remember it was your idea to use the newly generated lft_buf in
the cache regardless of the state of the current LFTs. No?

Also we still have the strange bug#1406...

Sasha


From sashak at voltaire.com  Sun Nov 23 04:25:09 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Nov 2008 14:25:09 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm: free lft_buf if it matches
	switch's lft
In-Reply-To: <20081123121738.GJ21967@sashak.voltaire.com>
References: <4909DAC8.4040602@dev.mellanox.co.il>
	<20081030214519.GN7502@sashak.voltaire.com>
	<490A2C5D.4080309@dev.mellanox.co.il>
	<20081031043226.GH16455@sashak.voltaire.com>
	<49255C13.5030503@dev.mellanox.co.il>
	<20081123110556.GH21967@sashak.voltaire.com>
	<4929455C.2080407@dev.mellanox.co.il>
	<20081123121738.GJ21967@sashak.voltaire.com>
Message-ID: <20081123122509.GL21967@sashak.voltaire.com>

On 14:17 Sun 23 Nov     , Sasha Khapyorsky wrote:
> >> +	if (!sw->lft_buf)
> >> +		return;
> >> +	
> >> +	if (memcmp(sw->lft, sw->lft_buf, IB_LID_UCAST_END_HO + 1))
> >
> > Should it turn on the p_subn->subnet_initialization_error flag?
> 
> Maybe, but I'm not sure - this is more for bug#1401 materials :),

bug#1406

Sasha


From sashak at voltaire.com  Sun Nov 23 04:33:00 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Nov 2008 14:33:00 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_switch.h: use updated LFT for
	routing
In-Reply-To: <49294A95.3060100@dev.mellanox.co.il>
References: <492550E3.90805@dev.mellanox.co.il>
	<20081121192428.GB8310@sashak.voltaire.com>
	<49294A95.3060100@dev.mellanox.co.il>
Message-ID: <20081123123300.GN21967@sashak.voltaire.com>

On 14:20 Sun 23 Nov     , Yevgeny Kliteynik wrote:
>>> One immediate outcome of this bug is opensm.fdbs file - when it
>>> is dumped from the switch LFT (and not from lft_buf),
>> Why this bug is triggered only now?
>
> I had sometimes errors in simulations, and after some analysis
> I decided that they are timing problems with the tests.
> Now that I did some stress testing of ucast cache, I started
> to see more of these errors.

If you are sure that these are simulator or test problems then just close
#1406 as invalid. Obviously we don't need such a patch then.

>
>>> it sometimes
>>> doesn't match the lst file.
>> What this "sometimes" mean? I think the case should be investigated
>> deeper. By such patch we are just trying to hide a possible issue.
>> As far as I understand opensm.fdbs (and other routing dump) are
>> generated only after all LinFwdTbl responses are arrived, when some of
>> them failed 'subnet_initialization_error' flag is up and OpenSM will
>> resweep. If so why is 'opensm.fdbs' broken? It is not immediately
>> clear for me.
>
> I didn't see 'subnet_initialization_error' in such cases.
> Anyway, here's what I can do: at the end of each ucast_mgr_process
> I'll compare lft and lft_buf (something that the other patch is
> doing, the one that frees lft_buf), and if there is a difference,
> then we have a problem. If not - then I'll look for the cause
> elsewhere.

Yes, seems deeper investigation is needed here. Thanks.

Sasha
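
The check proposed in the quoted text above (compare lft and lft_buf at
the end of each ucast_mgr_process) could be sketched roughly as follows.
This is only an illustration under stated assumptions: struct demo_switch,
LFT_SIZE and count_lft_mismatches() are hypothetical stand-ins, not the
real OpenSM structures or functions.

```c
#include <stdint.h>
#include <stdio.h>

#define LFT_SIZE 16	/* hypothetical table size, for illustration only */

/* Hypothetical stand-in for a switch object; not the real osm_switch_t. */
struct demo_switch {
	uint8_t lft[LFT_SIZE];		/* table as last confirmed by the switch */
	uint8_t lft_buf[LFT_SIZE];	/* table computed by the routing engine */
};

/*
 * Report every LID whose output port differs between the two tables.
 * Returns the number of mismatching entries (0 means the tables agree).
 */
static int count_lft_mismatches(const struct demo_switch *sw, FILE *out)
{
	int lid, mismatches = 0;

	for (lid = 0; lid < LFT_SIZE; lid++) {
		if (sw->lft[lid] != sw->lft_buf[lid]) {
			fprintf(out, "lid %d: lft port %u, lft_buf port %u\n",
				lid, (unsigned)sw->lft[lid],
				(unsigned)sw->lft_buf[lid]);
			mismatches++;
		}
	}
	return mismatches;
}
```

Calling count_lft_mismatches(sw, stderr) once routing has completed and
getting a non-zero result would indicate that the two tables diverged,
which is the case that needs the deeper investigation.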


From kliteyn at dev.mellanox.co.il  Sun Nov 23 05:24:37 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 23 Nov 2008 15:24:37 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_switch.h: use updated LFT for
	routing
In-Reply-To: <20081123123300.GN21967@sashak.voltaire.com>
References: <492550E3.90805@dev.mellanox.co.il>
	<20081121192428.GB8310@sashak.voltaire.com>
	<49294A95.3060100@dev.mellanox.co.il>
	<20081123123300.GN21967@sashak.voltaire.com>
Message-ID: <49295995.7080304@dev.mellanox.co.il>

Sasha,

Sasha Khapyorsky wrote:
> On 14:20 Sun 23 Nov     , Yevgeny Kliteynik wrote:
>>>> One immediate outcome of this bug is opensm.fdbs file - when it
>>>> is dumped from the switch LFT (and not from lft_buf),
>>> Why this bug is triggered only now?
>> I sometimes had errors in simulations, and after some analysis
>> I decided that they are timing problems with the tests.
>> Now that I did some stress testing of ucast cache, I started
>> to see more of these errors.
> 
> If you are sure that this is simulator or test problems then just close
> #1406 as invalid. Obviously we don't need such patch then.

No, I'm not sure. My original patch has eliminated this problem.
In any case, deeper investigation is needed.

-- Yevgeny

>>>> it sometimes
>>>> doesn't match the lst file.
>>> What this "sometimes" mean? I think the case should be investigated
>>> deeper. By such patch we are just trying to hide a possible issue.
>>> As far as I understand opensm.fdbs (and other routing dump) are
>>> generated only after all LinFwdTbl responses are arrived, when some of
>>> them failed 'subnet_initialization_error' flag is up and OpenSM will
>>> resweep. If so why is 'opensm.fdbs' broken? It is not immediately
>>> clear for me.
>> I didn't see 'subnet_initialization_error' in such cases.
>> Anyway, here's what I can do: at the end of each ucast_mgr_process
>> I'll compare lft and lft_buf (something that the other patch is
>> doing, the one that frees lft_buf), and if there is a difference,
>> then we have a problem. If not - then I'll look for the cause
>> elsewhere.
> 
> Yes, seems deeper investigation is needed here. Thanks.
> 
> Sasha
> 


From sashak at voltaire.com  Sun Nov 23 06:16:54 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Nov 2008 16:16:54 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_switch.h: use updated LFT for
	routing
In-Reply-To: <49295995.7080304@dev.mellanox.co.il>
References: <492550E3.90805@dev.mellanox.co.il>
	<20081121192428.GB8310@sashak.voltaire.com>
	<49294A95.3060100@dev.mellanox.co.il>
	<20081123123300.GN21967@sashak.voltaire.com>
	<49295995.7080304@dev.mellanox.co.il>
Message-ID: <20081123141654.GP21967@sashak.voltaire.com>

On 15:24 Sun 23 Nov     , Yevgeny Kliteynik wrote:
>> If you are sure that this is simulator or test problems then just close
>> #1406 as invalid. Obviously we don't need such patch then.
>
> No, I'm not sure. My original patch has eliminated this problem.

So? Should we work around simulator/test bugs in OpenSM code? I think we
shouldn't.

> In any case, deeper investigation is needed.

Ok.

Sasha


From sashak at voltaire.com  Sun Nov 23 10:27:41 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Nov 2008 20:27:41 +0200
Subject: [ofa-general] Re: [PATCH 0/3] ibnetdiscover library "libibnetdisc"
In-Reply-To: <20081120163809.26a3c499.weiny2@llnl.gov>
References: <20081120163809.26a3c499.weiny2@llnl.gov>
Message-ID: <20081123182741.GS21967@sashak.voltaire.com>

Hi Ira,

On 16:38 Thu 20 Nov     , Ira Weiny wrote:
> The following 3 patches implement "libibnetdisc" which provides the
> functionality of ibnetdiscover in a C library.
> 
> I mentioned this to Sasha at the last Sonoma conference and posted the bulk of
> this code to the list a few months ago.  This library is still providing the 85%
> performance speed up of iblinkinfo.pl on our clusters.

This is great!

Don't you think this library should be part of infiniband-diags rather
than a separate package/management sub-project? Personally I would
prefer to have this as part of infiniband-diags.

Sasha


From sashak at voltaire.com  Sun Nov 23 10:35:17 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Nov 2008 20:35:17 +0200
Subject: [ofa-general] Re: [PATCH 3/3] Convert ibnetdiscover to use new
	ibnetdisc library.
In-Reply-To: <20081120163815.5cd110fb.weiny2@llnl.gov>
References: <20081120163815.5cd110fb.weiny2@llnl.gov>
Message-ID: <20081123183517.GT21967@sashak.voltaire.com>

Hi Ira,

On 16:38 Thu 20 Nov     , Ira Weiny wrote:
> From e2b8bac5d651c2278719d511dee2ab2e8ad05706 Mon Sep 17 00:00:00 2001
> From: Ira Weiny <weiny2 at llnl.gov>
> Date: Thu, 20 Nov 2008 09:29:57 -0800
> Subject: [PATCH] Convert ibnetdiscover to use new ibnetdisc library.
> 
>    Removed -e and -v since they were somewhat redundant with the -d option.

I think it would be better to preserve these options for backward
compatibility. At least '-v' is used in dump_ftts.sh, and it may be used
in other scripts...

Sasha
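
One way to keep the removed options working could look roughly like the
sketch below. It is only an illustration: struct demo_opts and
demo_parse_args() are hypothetical names, and mapping -e and -v onto -d
is an assumption based on the patch description calling them "somewhat
redundant with the -d option", not something the patch itself does.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical option state; the field names are illustrative only. */
struct demo_opts {
	int debug;	/* -d, may be given several times */
	int show;	/* -s */
};

/*
 * Parse the command line. -e and -v are still accepted so that existing
 * scripts do not break; here they are simply counted as extra -d flags.
 * Returns 0 on success, -1 on an unknown option.
 */
static int demo_parse_args(int argc, char *const argv[], struct demo_opts *o)
{
	int ch;

	o->debug = 0;
	o->show = 0;
	optind = 1;	/* restart scanning in case getopt() was used before */

	while ((ch = getopt(argc, argv, "devs")) != -1) {
		switch (ch) {
		case 'd':
			o->debug++;
			break;
		case 'e':	/* deprecated, kept for old scripts */
		case 'v':	/* deprecated, kept for old scripts */
			fprintf(stderr, "%s: -%c is deprecated, treating it as -d\n",
				argv[0], ch);
			o->debug++;
			break;
		case 's':
			o->show = 1;
			break;
		default:
			return -1;
		}
	}
	return 0;
}
```

With this, a script that still passes "-v" keeps running and only gets a
deprecation notice on stderr, while the program output on stdout stays
unchanged.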

> 
>    All other functionality is preserved
> 
> Signed-off-by: Ira Weiny <weiny2 at llnl.gov>
> ---
>  infiniband-diags/Makefile.am         |    4 +-
>  infiniband-diags/man/ibnetdiscover.8 |   10 +-
>  infiniband-diags/src/ibnetdiscover.c |  910 ++++++++++------------------------
>  3 files changed, 254 insertions(+), 670 deletions(-)
> 
> diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am
> index 8f26749..420c69e 100644
> --- a/infiniband-diags/Makefile.am
> +++ b/infiniband-diags/Makefile.am
> @@ -35,9 +35,9 @@ sbin_SCRIPTS = scripts/ibcheckerrs scripts/ibchecknet scripts/ibchecknode \
>  src_ibaddr_SOURCES = src/ibaddr.c src/ibdiag_common.c
>  src_ibaddr_CFLAGS = -Wall $(DBGFLAGS)
>  
> -src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c src/ibdiag_common.c
> +src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/ibdiag_common.c
>  src_ibnetdiscover_CFLAGS = -Wall $(DBGFLAGS)
> -src_ibnetdiscover_LDFLAGS = -Wl,--rpath -Wl,$(libdir)
> +src_ibnetdiscover_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -libnetdisc
>  
>  src_iblinkinfo_pl_SOURCES = src/iblinkinfo.c
>  src_iblinkinfo_pl_CFLAGS = -Wall $(DBGFLAGS)
> diff --git a/infiniband-diags/man/ibnetdiscover.8 b/infiniband-diags/man/ibnetdiscover.8
> index 958efa9..768d392 100644
> --- a/infiniband-diags/man/ibnetdiscover.8
> +++ b/infiniband-diags/man/ibnetdiscover.8
> @@ -5,7 +5,7 @@ ibnetdiscover \- discover InfiniBand topology
>  
>  .SH SYNOPSIS
>  .B ibnetdiscover
> -[\-d(ebug)] [\-e(rr_show)] [\-v(erbose)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map <node-name-map>] [\-p(orts)] [\-h(elp)] [<topology-file>]
> +[\-d(ebug)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map <node-name-map>] [\-p(orts)] [\-h(elp)] [<topology-file>]
>  
>  .SH DESCRIPTION
>  .PP
> @@ -37,7 +37,7 @@ List of connected switches
>  List of connected routers
>  .TP
>  \fB\-s\fR, \fB\-\-show\fR
> -Show more information
> +Show progress information during discovery.
>  .TP
>  \fB\-\-node\-name\-map\fR <node-name-map>
>  Specify a node name map.  The node name map file maps GUIDs to more user friendly
> @@ -57,15 +57,9 @@ using the util_name -h syntax.
>  # Debugging flags
>  .PP
>  \-d      raise the IB debugging level.
> -        May be used several times (-ddd or -d -d -d).
> -.PP
> -\-e      show send and receive errors (timeouts and others)
>  .PP
>  \-h      show the usage message
>  .PP
> -\-v      increase the application verbosity level.
> -        May be used several times (-vv or -v -v -v)
> -.PP
>  \-V      show the version info.
>  
>  # Other common flags:
> diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
> index 2cfaa8a..d8ead48 100644
> --- a/infiniband-diags/src/ibnetdiscover.c
> +++ b/infiniband-diags/src/ibnetdiscover.c
> @@ -1,6 +1,7 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire Inc.  All rights reserved.
>   * Copyright (c) 2007 Xsigo Systems Inc.  All rights reserved.
> + * Copyright (c) 2008 Lawrence Livermore National Lab.  All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -47,483 +48,108 @@
>  #include <errno.h>
>  #include <inttypes.h>
>  
> -#include <infiniband/common.h>
> -#include <infiniband/umad.h>
> -#include <infiniband/mad.h>
>  #include <infiniband/complib/cl_nodenamemap.h>
> +#include <infiniband/ibnetdisc.h>
> +#include <infiniband/common.h>
>  
> -#include "ibnetdiscover.h"
> -#include "grouping.h"
>  #include "ibdiag_common.h"
>  
> -static char *node_type_str[] = {
> -	"???",
> -	"ca",
> -	"switch",
> -	"router",
> -	"iwarp rnic"
> -};
> -
> -static char *linkwidth_str[] = {
> -	"??",
> -	"1x",
> -	"4x",
> -	"??",
> -	"8x",
> -	"??",
> -	"??",
> -	"??",
> -	"12x"
> -};
> -
> -static char *linkspeed_str[] = {
> -	"???",
> -	"SDR",
> -	"DDR",
> -	"???",
> -	"QDR"
> -};
> -
> -static int timeout = 2000;		/* ms */
> -static int dumplevel = 0;
> +static int debug;
>  static int verbose;
> -static FILE *f;
> +#define LIST_CA_NODE	 (1 << IBND_CA_NODE)
> +#define LIST_SWITCH_NODE (1 << IBND_SWITCH_NODE)
> +#define LIST_ROUTER_NODE (1 << IBND_ROUTER_NODE)
>  
>  char *argv0 = "ibnetdiscover";
> +static FILE *f;
>  
>  static char *node_name_map_file = NULL;
>  static nn_map_t *node_name_map = NULL;
>  
> -Node *nodesdist[MAXHOPS+1];     /* last is Ca list */
> -Node *mynode;
> -int maxhops_discovered = 0;
> -
> -struct ChassisList *chassis = NULL;
> -
> -static char *
> -get_linkwidth_str(int linkwidth)
> -{
> -	if (linkwidth > 8)
> -		return linkwidth_str[0];
> -	else
> -		return linkwidth_str[linkwidth];
> -}
> -
> -static char *
> -get_linkspeed_str(int linkspeed)
> -{
> -	if (linkspeed > 4)
> -		return linkspeed_str[0];
> -	else
> -		return linkspeed_str[linkspeed];
> -}
> -
> -static inline const char*
> -node_type_str2(Node *node)
> -{
> -	switch(node->type) {
> -	case SWITCH_NODE: return "SW";
> -	case CA_NODE:     return "CA";
> -	case ROUTER_NODE: return "RT";
> -	}
> -	return "??";
> -}
> -
> -void
> -decode_port_info(void *pi, Port *port)
> -{
> -	mad_decode_field(pi, IB_PORT_LID_F, &port->lid);
> -	mad_decode_field(pi, IB_PORT_LMC_F, &port->lmc);
> -	mad_decode_field(pi, IB_PORT_STATE_F, &port->state);
> -	mad_decode_field(pi, IB_PORT_PHYS_STATE_F, &port->physstate);
> -	mad_decode_field(pi, IB_PORT_LINK_WIDTH_ACTIVE_F, &port->linkwidth);
> -	mad_decode_field(pi, IB_PORT_LINK_SPEED_ACTIVE_F, &port->linkspeed);
> -}
> -
> -
> -int
> -get_port(Port *port, int portnum, ib_portid_t *portid)
> -{
> -	char portinfo[64];
> -	void *pi = portinfo;
> -
> -	port->portnum = portnum;
> -
> -	if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout))
> -		return -1;
> -	decode_port_info(pi, port);
> -
> -	DEBUG("portid %s portnum %d: lid %d state %d physstate %d %s %s",
> -		portid2str(portid), portnum, port->lid, port->state, port->physstate, get_linkwidth_str(port->linkwidth), get_linkspeed_str(port->linkspeed));
> -	return 1;
> -}
> -/*
> - * Returns 0 if non switch node is found, 1 if switch is found, -1 if error.
> - */
> -int
> -get_node(Node *node, Port *port, ib_portid_t *portid)
> -{
> -	char portinfo[64];
> -	char switchinfo[64];
> -	void *pi = portinfo, *ni = node->nodeinfo, *nd = node->nodedesc;
> -	void *si = switchinfo;
> -
> -	if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, timeout))
> -		return -1;
> -
> -	mad_decode_field(ni, IB_NODE_GUID_F, &node->nodeguid);
> -	mad_decode_field(ni, IB_NODE_TYPE_F, &node->type);
> -	mad_decode_field(ni, IB_NODE_NPORTS_F, &node->numports);
> -	mad_decode_field(ni, IB_NODE_DEVID_F, &node->devid);
> -	mad_decode_field(ni, IB_NODE_VENDORID_F, &node->vendid);
> -	mad_decode_field(ni, IB_NODE_SYSTEM_GUID_F, &node->sysimgguid);
> -	mad_decode_field(ni, IB_NODE_PORT_GUID_F, &node->portguid);
> -	mad_decode_field(ni, IB_NODE_LOCAL_PORT_F, &node->localport);
> -	port->portnum = node->localport;
> -	port->portguid = node->portguid;
> -
> -	if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, timeout))
> -		return -1;
> -
> -	if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, 0, timeout))
> -		return -1;
> -	decode_port_info(pi, port);
> -
> -	if (node->type != SWITCH_NODE)
> -		return 0;
> -
> -	node->smalid = port->lid;
> -	node->smalmc = port->lmc;
> -
> -	/* after we have the sma information find out the real PortInfo for this port */
> -	if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, node->localport, timeout))
> -	        return -1;
> -	decode_port_info(pi, port);
> -
> -        if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout))
> -                node->smaenhsp0 = 0;	/* assume base SP0 */
> -	else
> -        	mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0);
> -
> -	DEBUG("portid %s: got switch node %" PRIx64 " '%s'",
> -	      portid2str(portid), node->nodeguid, node->nodedesc);
> -	return 1;
> -}
> -
> -static int
> -extend_dpath(ib_dr_path_t *path, int nextport)
> -{
> -	if (path->cnt+2 >= sizeof(path->p))
> -		return -1;
> -	++path->cnt;
> -	if (path->cnt > maxhops_discovered)
> -		maxhops_discovered = path->cnt;
> -	path->p[path->cnt] = nextport;
> -	return path->cnt;
> -}
> -
> -static void
> -dump_endnode(ib_portid_t *path, char *prompt, Node *node, Port *port)
> -{
> -	if (!dumplevel)
> -		return;
> -
> -	fprintf(f, "%s -> %s %s {%016" PRIx64 "} portnum %d lid %d-%d\"%s\"\n",
> -		portid2str(path), prompt,
> -		(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
> -		node->nodeguid, node->type == SWITCH_NODE ? 0 : port->portnum,
> -		port->lid, port->lid + (1 << port->lmc) - 1,
> -		clean_nodedesc(node->nodedesc));
> -}
> -
> -#define HASHGUID(guid)		((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
> -#define HTSZ 137
> -
> -static Node *nodestbl[HTSZ];
> -
> -static Node *
> -find_node(Node *new)
> -{
> -	int hash = HASHGUID(new->nodeguid) % HTSZ;
> -	Node *node;
> -
> -	for (node = nodestbl[hash]; node; node = node->htnext)
> -		if (node->nodeguid == new->nodeguid)
> -			return node;
> -
> -	return NULL;
> -}
> -
> -static Node *
> -create_node(Node *temp, ib_portid_t *path, int dist)
> -{
> -	Node *node;
> -	int hash = HASHGUID(temp->nodeguid) % HTSZ;
> -
> -	node = malloc(sizeof(*node));
> -	if (!node)
> -		return NULL;
> -
> -	memcpy(node, temp, sizeof(*node));
> -	node->dist = dist;
> -	node->path = *path;
> -
> -	node->htnext = nodestbl[hash];
> -	nodestbl[hash] = node;
> -
> -	if (node->type != SWITCH_NODE)
> -		dist = MAXHOPS; 	/* special Ca list */
> -
> -	node->dnext = nodesdist[dist];
> -	nodesdist[dist] = node;
> -
> -	return node;
> -}
> -
> -static Port *
> -find_port(Node *node, Port *port)
> -{
> -	Port *old;
> -
> -	for (old = node->ports; old; old = old->next)
> -		if (old->portnum == port->portnum)
> -			return old;
> -
> -	return NULL;
> -}
> -
> -static Port *
> -create_port(Node *node, Port *temp)
> -{
> -	Port *port;
> -
> -	port = malloc(sizeof(*port));
> -	if (!port)
> -		return NULL;
> -
> -	memcpy(port, temp, sizeof(*port));
> -	port->node = node;
> -	port->next = node->ports;
> -	node->ports = port;
> -
> -	return port;
> -}
> -
> -static void
> -link_ports(Node *node, Port *port, Node *remotenode, Port *remoteport)
> -{
> -	DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64 " %p->%p:%u",
> -		node->nodeguid, node, port, port->portnum,
> -		remotenode->nodeguid, remotenode, remoteport, remoteport->portnum);
> -	if (port->remoteport)
> -		port->remoteport->remoteport = NULL;
> -	if (remoteport->remoteport)
> -		remoteport->remoteport->remoteport = NULL;
> -	port->remoteport = remoteport;
> -	remoteport->remoteport = port;
> -}
> -
> -static int
> -handle_port(Node *node, Port *port, ib_portid_t *path, int portnum, int dist)
> -{
> -	Node node_buf;
> -	Port port_buf;
> -	Node *remotenode, *oldnode;
> -	Port *remoteport, *oldport;
> -
> -	memset(&node_buf, 0, sizeof(node_buf));
> -	memset(&port_buf, 0, sizeof(port_buf));
> -
> -	DEBUG("handle node %p port %p:%d dist %d", node, port, portnum, dist);
> -	if (port->physstate != 5)	/* LinkUp */
> -		return -1;
> -
> -	if (extend_dpath(&path->drpath, portnum) < 0)
> -		return -1;
> -
> -	if (get_node(&node_buf, &port_buf, path) < 0) {
> -		IBWARN("NodeInfo on %s failed, skipping port",
> -			portid2str(path));
> -		path->drpath.cnt--;	/* restore path */
> -		return -1;
> -	}
> -
> -	oldnode = find_node(&node_buf);
> -	if (oldnode)
> -		remotenode = oldnode;
> -	else if (!(remotenode = create_node(&node_buf, path, dist + 1)))
> -		IBERROR("no memory");
> -
> -	oldport = find_port(remotenode, &port_buf);
> -	if (oldport) {
> -		remoteport = oldport;
> -		if (node != remotenode || port != remoteport)
> -			IBWARN("port moving...");
> -	} else if (!(remoteport = create_port(remotenode, &port_buf)))
> -		IBERROR("no memory");
> -
> -	dump_endnode(path, oldnode ? "known remote" : "new remote",
> -		     remotenode, remoteport);
> -
> -	link_ports(node, port, remotenode, remoteport);
> -
> -	path->drpath.cnt--;	/* restore path */
> -	return 0;
> -}
> -
> -/*
> - * Return 1 if found, 0 if not, -1 on errors.
> - */
> -static int
> -discover(ib_portid_t *from)
> -{
> -	Node node_buf;
> -	Port port_buf;
> -	Node *node;
> -	Port *port;
> -	int i;
> -	int dist = 0;
> -	ib_portid_t *path;
> -
> -	DEBUG("from %s", portid2str(from));
> -
> -	memset(&node_buf, 0, sizeof(node_buf));
> -	memset(&port_buf, 0, sizeof(port_buf));
> -
> -	if (get_node(&node_buf, &port_buf, from) < 0) {
> -		IBWARN("can't reach node %s", portid2str(from));
> -		return -1;
> -	}
> -
> -	node = create_node(&node_buf, from, 0);
> -	if (!node)
> -		IBERROR("out of memory");
> -
> -	mynode = node;
> -
> -	port = create_port(node, &port_buf);
> -	if (!port)
> -		IBERROR("out of memory");
> -
> -	if (node->type != SWITCH_NODE &&
> -	    handle_port(node, port, from, node->localport, 0) < 0)
> -		return 0;
> -
> -	for (dist = 0; dist < MAXHOPS; dist++) {
> -
> -		for (node = nodesdist[dist]; node; node = node->dnext) {
> -
> -			path = &node->path;
> -
> -			DEBUG("dist %d node %p", dist, node);
> -			dump_endnode(path, "processing", node, port);
> -
> -			for (i = 1; i <= node->numports; i++) {
> -				if (i == node->localport)
> -					continue;
> -
> -				if (get_port(&port_buf, i, path) < 0) {
> -					IBWARN("can't reach node %s port %d", portid2str(path), i);
> -					continue;
> -				}
> -
> -				port = find_port(node, &port_buf);
> -				if (port)
> -					continue;
> -
> -				port = create_port(node, &port_buf);
> -				if (!port)
> -					IBERROR("out of memory");
> -
> -				/* If switch, set port GUID to node GUID */
> -				if (node->type == SWITCH_NODE)
> -					port->portguid = node->portguid;
> -
> -				handle_port(node, port, path, i, dist);
> -			}
> -		}
> -	}
> +static int timeout_ms = 2000;
> +static int dumplevel = 0;
>  
> -	return 0;
> -}
>  
>  char *
> -node_name(Node *node)
> +node_name(ibnd_node_t *node)
>  {
>  	static char buf[256];
>  
> -	switch(node->type) {
> -	case SWITCH_NODE:
> -		sprintf(buf, "\"%s", "S");
> -		break;
> -	case CA_NODE:
> +	switch(node->info.type) {
> +	case IBND_CA_NODE:
>  		sprintf(buf, "\"%s", "H");
>  		break;
> -	case ROUTER_NODE:
> +	case IBND_SWITCH_NODE:
> +		sprintf(buf, "\"%s", "S");
> +		break;
> +	case IBND_ROUTER_NODE:
>  		sprintf(buf, "\"%s", "R");
>  		break;
>  	default:
>  		sprintf(buf, "\"%s", "?");
>  		break;
>  	}
> -	sprintf(buf+2, "-%016" PRIx64 "\"", node->nodeguid);
> +	sprintf(buf+2, "-%016" PRIx64 "\"", node->info.nodeguid);
>  
>  	return buf;
>  }
>  
>  void
> -list_node(Node *node)
> +list_node(ibnd_node_t *node, void *user_data)
>  {
> -	char *node_type;
> -	char *nodename = remap_node_name(node_name_map, node->nodeguid,
> +	char *nodename = remap_node_name(node_name_map, node->info.nodeguid,
>  					      node->nodedesc);
>  
> -	switch(node->type) {
> -	case SWITCH_NODE:
> -		node_type = "Switch";
> -		break;
> -	case CA_NODE:
> -		node_type = "Ca";
> -		break;
> -	case ROUTER_NODE:
> -		node_type = "Router";
> -		break;
> -	default:
> -		node_type = "???";
> -		break;
> -	}
>  	fprintf(f, "%s\t : 0x%016" PRIx64 " ports %d devid 0x%x vendid 0x%x \"%s\"\n",
> -		node_type,
> -		node->nodeguid, node->numports, node->devid, node->vendid,
> +		ibnd_node_type_str(node),
> +		node->info.nodeguid, node->info.numports, node->info.devid,
> +		node->info.vendid,
>  		nodename);
>  
>  	free(nodename);
>  }
>  
>  void
> -out_ids(Node *node, int group, char *chname)
> +list_nodes(ibnd_fabric_t *fabric, int list)
> +{
> +	if (list & LIST_CA_NODE) {
> +		ibnd_iter_nodes_type(fabric, list_node, IBND_CA_NODE, NULL);
> +	}
> +	if (list & LIST_SWITCH_NODE) {
> +		ibnd_iter_nodes_type(fabric, list_node, IBND_SWITCH_NODE, NULL);
> +	}
> +	if (list & LIST_ROUTER_NODE) {
> +		ibnd_iter_nodes_type(fabric, list_node, IBND_ROUTER_NODE, NULL);
> +	}
> +}
> +
> +void
> +out_ids(ibnd_node_t *node, int group, char *chname)
>  {
> -	fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid);
> -	if (node->sysimgguid)
> -		fprintf(f, "sysimgguid=0x%" PRIx64, node->sysimgguid);
> +	fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->info.vendid, node->info.devid);
> +	if (node->info.sysimgguid)
> +		fprintf(f, "sysimgguid=0x%" PRIx64, node->info.sysimgguid);
>  	if (group
>  	    && node->chrecord && node->chrecord->chassisnum) {
>  		fprintf(f, "\t\t# Chassis %d", node->chrecord->chassisnum);
>  		if (chname)
> -			fprintf(f, " (%s)", chname);
> -		if (is_xsigo_tca(node->nodeguid) && node->ports->remoteport)
> -			fprintf(f, " slot %d", node->ports->remoteport->portnum);
> +			fprintf(f, " (%s)", clean_nodedesc(chname));
> +		if (ibnd_is_xsigo_tca(node->info.nodeguid)
> +				&& node->ports[1]
> +				&& node->ports[1]->remoteport)
> +			fprintf(f, " slot %d", node->ports[1]->remoteport->portnum);
>  	}
>  	fprintf(f, "\n");
>  }
>  
> +
>  uint64_t
> -out_chassis(int chassisnum)
> +out_chassis(ibnd_fabric_t *fabric, int chassisnum)
>  {
>  	uint64_t guid;
>  
>  	fprintf(f, "\nChassis %d", chassisnum);
> -	guid = get_chassis_guid(chassisnum);
> +	guid = ibnd_get_chassis_guid(fabric, chassisnum);
>  	if (guid)
>  		fprintf(f, " (guid 0x%" PRIx64 ")", guid);
>  	fprintf(f, "\n");
> @@ -531,54 +157,49 @@ out_chassis(int chassisnum)
>  }
>  
>  void
> -out_switch(Node *node, int group, char *chname)
> +out_switch(ibnd_node_t *node, int group, char *chname)
>  {
>  	char *str;
> +	char  str2[256];
>  	char *nodename = NULL;
>  
>  	out_ids(node, group, chname);
> -	fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid);
> -	fprintf(f, "(%" PRIx64 ")", node->portguid);
> -	/* Currently, only if Voltaire chassis */
> -	if (group
> -	    && node->chrecord && node->chrecord->chassisnum
> -	    && node->vendid == VTR_VENDOR_ID) {
> -		str = get_chassis_type(node->chrecord->chassistype);
> +	fprintf(f, "switchguid=0x%" PRIx64, node->info.nodeguid);
> +	fprintf(f, "(%" PRIx64 ")", node->info.nodeportguid);
> +	if (group) {
> +		str = ibnd_get_chassis_type(node);
>  		if (str)
>  			fprintf(f, "%s ", str);
> -		str = get_chassis_slot(node->chrecord->chassisslot);
> +		str = ibnd_get_chassis_slot_str(node, str2, 256);
>  		if (str)
> -			fprintf(f, "%s ", str);
> -		fprintf(f, "%d Chip %d", node->chrecord->slotnum, node->chrecord->anafanum);
> +			fprintf(f, "%s", str);
>  	}
>  
> -	nodename = remap_node_name(node_name_map, node->nodeguid,
> +	nodename = remap_node_name(node_name_map, node->info.nodeguid,
>  				node->nodedesc);
>  
>  	fprintf(f, "\nSwitch\t%d %s\t\t# \"%s\" %s port 0 lid %d lmc %d\n",
> -		node->numports, node_name(node),
> +		node->info.numports, node_name(node),
>  		nodename,
> -		node->smaenhsp0 ? "enhanced" : "base",
> +		node->sw_info.smaenhsp0 ? "enhanced" : "base",
>  		node->smalid, node->smalmc);
>  
>  	free(nodename);
>  }
>  
>  void
> -out_ca(Node *node, int group, char *chname)
> +out_ca(ibnd_node_t *node, int group, char *chname)
>  {
>  	char *node_type;
>  	char *node_type2;
> -	char *nodename = remap_node_name(node_name_map, node->nodeguid,
> -					      node->nodedesc);
>  
>  	out_ids(node, group, chname);
> -	switch(node->type) {
> -	case CA_NODE:
> +	switch(node->info.type) {
> +	case IBND_CA_NODE:
>  		node_type = "ca";
>  		node_type2 = "Ca";
>  		break;
> -	case ROUTER_NODE:
> +	case IBND_ROUTER_NODE:
>  		node_type = "rt";
>  		node_type2 = "Rt";
>  		break;
> @@ -588,37 +209,37 @@ out_ca(Node *node, int group, char *chname)
>  		break;
>  	}
>  
> -	fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->nodeguid);
> +	fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->info.nodeguid);
>  	fprintf(f, "%s\t%d %s\t\t# \"%s\"",
> -		node_type2, node->numports, node_name(node),
> -		nodename);
> -	if (group && is_xsigo_hca(node->nodeguid))
> +		node_type2, node->info.numports, node_name(node),
> +		clean_nodedesc(node->nodedesc));
> +	if (group && ibnd_is_xsigo_hca(node->info.nodeguid))
>  		fprintf(f, " (scp)");
>  	fprintf(f, "\n");
> -
> -	free(nodename);
>  }
>  
> +#define OUT_BUFFER_SIZE 16
>  static char *
> -out_ext_port(Port *port, int group)
> +out_ext_port(ibnd_port_t *port, int group)
>  {
> -	char *str = NULL;
> +	static char mapping[OUT_BUFFER_SIZE];
>  
> -	/* Currently, only if Voltaire chassis */
> -	if (group
> -	    && port->node->chrecord && port->node->vendid == VTR_VENDOR_ID)
> -		str = portmapstring(port);
> +	if (group && port->ext_portnum != 0) {
> +		snprintf(mapping, OUT_BUFFER_SIZE,
> +			"[ext %d]", port->ext_portnum);
> +		return (mapping);
> +	}
>  
> -	return (str);
> +	return (NULL);
>  }
>  
>  void
> -out_switch_port(Port *port, int group)
> +out_switch_port(ibnd_port_t *port, int group)
>  {
>  	char *ext_port_str = NULL;
>  	char *rem_nodename = NULL;
>  
> -	DEBUG("port %p:%d remoteport %p", port, port->portnum, port->remoteport);
> +	DEBUG("port %p:%d remoteport %p\n", port, port->portnum, port->remoteport);
>  	fprintf(f, "[%d]", port->portnum);
>  
>  	ext_port_str = out_ext_port(port, group);
> @@ -626,7 +247,7 @@ out_switch_port(Port *port, int group)
>  		fprintf(f, "%s", ext_port_str);
>  
>  	rem_nodename = remap_node_name(node_name_map,
> -				port->remoteport->node->nodeguid,
> +				port->remoteport->node->info.nodeguid,
>  				port->remoteport->node->nodedesc);
>  
>  	ext_port_str = out_ext_port(port->remoteport, group);
> @@ -634,17 +255,17 @@ out_switch_port(Port *port, int group)
>  		node_name(port->remoteport->node),
>  		port->remoteport->portnum,
>  		ext_port_str ? ext_port_str : "");
> -	if (port->remoteport->node->type != SWITCH_NODE)
> -		fprintf(f, "(%" PRIx64 ") ", port->remoteport->portguid);
> +	if (port->remoteport->node->info.type != IBND_SWITCH_NODE)
> +		fprintf(f, "(%" PRIx64 ") ", port->remoteport->guid);
>  	fprintf(f, "\t\t# \"%s\" lid %d %s%s",
>  		rem_nodename,
> -		port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid,
> -		get_linkwidth_str(port->linkwidth),
> -		get_linkspeed_str(port->linkspeed));
> +		port->remoteport->node->info.type == IBND_SWITCH_NODE ?  port->remoteport->node->smalid : port->remoteport->info.lid,
> +		ibnd_linkwidth_str(port->info.link_width_active),
> +		ibnd_linkspeed_str(port->info.link_speed_active));
>  
> -	if (is_xsigo_tca(port->remoteport->portguid))
> +	if (ibnd_is_xsigo_tca(port->remoteport->guid))
>  		fprintf(f, " slot %d", port->portnum);
> -	else if (is_xsigo_hca(port->remoteport->portguid))
> +	else if (ibnd_is_xsigo_hca(port->remoteport->guid))
>  		fprintf(f, " (scp)");
>  	fprintf(f, "\n");
>  
> @@ -652,68 +273,80 @@ out_switch_port(Port *port, int group)
>  }
>  
>  void
> -out_ca_port(Port *port, int group)
> +out_ca_port(ibnd_port_t *port, int group)
>  {
>  	char *str = NULL;
>  	char *rem_nodename = NULL;
>  
>  	fprintf(f, "[%d]", port->portnum);
> -	if (port->node->type != SWITCH_NODE)
> -		fprintf(f, "(%" PRIx64 ") ", port->portguid);
> +	if (port->node->info.type != IBND_SWITCH_NODE)
> +		fprintf(f, "(%" PRIx64 ") ", port->guid);
>  	fprintf(f, "\t%s[%d]",
>  		node_name(port->remoteport->node),
>  		port->remoteport->portnum);
>  	str = out_ext_port(port->remoteport, group);
>  	if (str)
>  		fprintf(f, "%s", str);
> -	if (port->remoteport->node->type != SWITCH_NODE)
> -		fprintf(f, " (%" PRIx64 ") ", port->remoteport->portguid);
> +	if (port->remoteport->node->info.type != IBND_SWITCH_NODE)
> +		fprintf(f, " (%" PRIx64 ") ", port->remoteport->guid);
>  
>  	rem_nodename = remap_node_name(node_name_map,
> -				port->remoteport->node->nodeguid,
> +				port->remoteport->node->info.nodeguid,
>  				port->remoteport->node->nodedesc);
>  
>  	fprintf(f, "\t\t# lid %d lmc %d \"%s\" lid %d %s%s\n",
> -		port->lid, port->lmc, rem_nodename,
> -		port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid,
> -		get_linkwidth_str(port->linkwidth),
> -		get_linkspeed_str(port->linkspeed));
> +		port->info.lid, port->info.lmc, rem_nodename,
> +		port->remoteport->node->info.type == IBND_SWITCH_NODE ?  port->remoteport->node->smalid : port->remoteport->info.lid,
> +		ibnd_linkwidth_str(port->info.link_width_active),
> +		ibnd_linkspeed_str(port->info.link_speed_active));
>  
>  	free(rem_nodename);
>  }
>  
>  int
> -dump_topology(int listtype, int group)
> +dump_topology(int group, ibnd_fabric_t *fabric)
>  {
> -	Node *node;
> -	Port *port;
> -	int i = 0, dist = 0;
> +	ibnd_node_t *node;
> +	ibnd_port_t *port;
> +	int i = 0, dist = 0, p = 0;
>  	time_t t = time(0);
>  	uint64_t chguid;
>  	char *chname = NULL;
>  
> -	if (!listtype) {
> -		fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t));
> -		fprintf(f, "# Max of %d hops discovered\n", maxhops_discovered);
> -		fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", mynode->nodeguid, mynode->portguid);
> -	}
> +	fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t));
> +	fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered);
> +	fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n",
> +		fabric->from_node->info.nodeguid, fabric->from_node->info.nodeportguid);
>  
>  	/* Make pass on switches */
> -	if (group && !listtype) {
> -		ChassisList *ch = NULL;
> +	if (group) {
> +		ibnd_chassis_list_t *ch = NULL;
>  
>  		/* Chassis based switches first */
> -		for (ch = chassis; ch; ch = ch->next) {
> +		for (ch = fabric->chassis; ch; ch = ch->next) {
>  			int n = 0;
>  
>  			if (!ch->chassisnum)
>  				continue;
> -			chguid = out_chassis(ch->chassisnum);
> -			if (chname)
> -				free(chname);
> +			chguid = out_chassis(fabric, ch->chassisnum);
> +
>  			chname = NULL;
> -			if (is_xsigo_guid(chguid)) {
> -				for (node = nodesdist[MAXHOPS]; node; node = node->dnext) {
> +/**
> + * Hal will this work for Xsigo?
> + */
> +			if (ibnd_is_xsigo_guid(chguid)) {
> +				for (node = ch->nodes; node; node = node->chassis_next) {
> +					if (ibnd_is_xsigo_hca(node->info.nodeguid)) {
> +						chname = node->nodedesc;
> +						fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc));
> +					}
> +				}
> +
> +#if 0
> +/**
> + * vs. this?
> + */
> +				for (node = fabric->nodesdist[MAXHOPS]; node; node = node->dnext) {
>  					if (!node->chrecord ||
>  					    !node->chrecord->chassisnum)
>  						continue;
> @@ -721,209 +354,171 @@ dump_topology(int listtype, int group)
>  					if (node->chrecord->chassisnum != ch->chassisnum)
>  						continue;
>  
> -					if (is_xsigo_hca(node->nodeguid)) {
> -						chname = remap_node_name(node_name_map,
> -								node->nodeguid,
> -								node->nodedesc);
> -						fprintf(f, "Hostname: %s\n", chname);
> +					if (ibnd_is_xsigo_hca(node->nodeguid)) {
> +						chname = node->nodedesc;
> +						fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc));
>  					}
>  				}
> +#endif
>  			}
>  
>  			fprintf(f, "\n# Spine Nodes");
> -			for (n = 1; n <= (SPINES_MAX_NUM+1); n++) {
> +			for (n = 1; n <= SPINES_MAX_NUM; n++) {
>  				if (ch->spinenode[n]) {
>  					out_switch(ch->spinenode[n], group, chname);
> -					for (port = ch->spinenode[n]->ports; port; port = port->next, i++)
> -						if (port->remoteport)
> +					for (p = 1; p <= ch->spinenode[n]->info.numports; p++) {
> +						port = ch->spinenode[n]->ports[p];
> +						if (port && port->remoteport)
>  							out_switch_port(port, group);
> +					}
>  				}
>  			}
>  			fprintf(f, "\n# Line Nodes");
> -			for (n = 1; n <= (LINES_MAX_NUM+1); n++) {
> +			for (n = 1; n <= LINES_MAX_NUM; n++) {
>  				if (ch->linenode[n]) {
>  					out_switch(ch->linenode[n], group, chname);
> -					for (port = ch->linenode[n]->ports; port; port = port->next, i++)
> -						if (port->remoteport)
> +					for (p = 1; p <= ch->linenode[n]->info.numports; p++) {
> +						port = ch->linenode[n]->ports[p];
> +						if (port && port->remoteport)
>  							out_switch_port(port, group);
> +					}
>  				}
>  			}
>  
>  			fprintf(f, "\n# Chassis Switches");
> -			for (dist = 0; dist <= maxhops_discovered; dist++) {
> -
> -				for (node = nodesdist[dist]; node; node = node->dnext) {
> -
> -					/* Non Voltaire chassis */
> -					if (node->vendid == VTR_VENDOR_ID)
> -						continue;
> -					if (!node->chrecord ||
> -					    !node->chrecord->chassisnum)
> -						continue;
> -
> -					if (node->chrecord->chassisnum != ch->chassisnum)
> -						continue;
> -
> +			for (node = ch->nodes; node; node = node->chassis_next) {
> +				if (node->info.type == IBND_SWITCH_NODE) {
>  					out_switch(node, group, chname);
> -					for (port = node->ports; port; port = port->next, i++)
> -						if (port->remoteport)
> +					for (p = 1; p <= node->info.numports; p++) {
> +						port = node->ports[p];
> +						if (port && port->remoteport)
>  							out_switch_port(port, group);
> -
> +					}
>  				}
> -
>  			}
>  
>  			fprintf(f, "\n# Chassis CAs");
> -			for (node = nodesdist[MAXHOPS]; node; node = node->dnext) {
> -				if (!node->chrecord ||
> -				    !node->chrecord->chassisnum)
> -					continue;
> -
> -				if (node->chrecord->chassisnum != ch->chassisnum)
> -					continue;
> -
> -				out_ca(node, group, chname);
> -				for (port = node->ports; port; port = port->next, i++)
> -					if (port->remoteport)
> -						out_ca_port(port, group);
> -
> +			for (node = ch->nodes; node; node = node->chassis_next) {
> +				if (node->info.type == IBND_CA_NODE) {
> +					out_ca(node, group, chname);
> +					for (p = 1; p <= node->info.numports; p++) {
> +						port = node->ports[p];
> +						if (port && port->remoteport)
> +							out_ca_port(port, group);
> +					}
> +				}
>  			}
>  
>  		}
>  
> -	} else {
> -		for (dist = 0; dist <= maxhops_discovered; dist++) {
> -
> -			for (node = nodesdist[dist]; node; node = node->dnext) {
> -
> -				DEBUG("SWITCH: dist %d node %p", dist, node);
> -				if (!listtype)
> -					out_switch(node, group, chname);
> -				else {
> -					if (listtype & LIST_SWITCH_NODE)
> -						list_node(node);
> -					continue;
> -				}
> -
> -				for (port = node->ports; port; port = port->next, i++)
> -					if (port->remoteport)
> +	} else { /* !group */
> +		for (node = fabric->switches; node; node = node->type_next) {
> +				DEBUG("SWITCH: dist %d node %p\n", dist, node);
> +				out_switch(node, group, chname);
> +				for (p = 1; p <= node->info.numports; p++) {
> +					port = node->ports[p];
> +					if (port && port->remoteport)
>  						out_switch_port(port, group);
> -			}
> +				}
>  		}
>  	}
>  
> -	if (chname)
> -		free(chname);
>  	chname = NULL;
> -	if (group && !listtype) {
> -
> +	if (group) {
>  		fprintf(f, "\nNon-Chassis Nodes\n");
> -
> -		for (dist = 0; dist <= maxhops_discovered; dist++) {
> -
> -			for (node = nodesdist[dist]; node; node = node->dnext) {
> -
> -				DEBUG("SWITCH: dist %d node %p", dist, node);
> +		for (node = fabric->switches; node; node = node->type_next) {
> +				DEBUG("SWITCH: dist %d node %p\n", dist, node);
>  				/* Now, skip chassis based switches */
>  				if (node->chrecord &&
>  				    node->chrecord->chassisnum)
>  					continue;
>  				out_switch(node, group, chname);
>  
> -				for (port = node->ports; port; port = port->next, i++)
> -					if (port->remoteport)
> +				for (p = 1; p <= node->info.numports; p++) {
> +					port = node->ports[p];
> +					if (port && port->remoteport)
>  						out_switch_port(port, group);
> -			}
> -
> +				}
>  		}
>  
>  	}
>  
>  	/* Make pass on CAs */
> -	for (node = nodesdist[MAXHOPS]; node; node = node->dnext) {
> -
> -		DEBUG("CA: dist %d node %p", dist, node);
> -		if (!listtype) {
> -			/* Now, skip chassis based CAs */
> -			if (group && node->chrecord &&
> -			    node->chrecord->chassisnum)
> -				continue;
> -			out_ca(node, group, chname);
> -		} else {
> -			if (((listtype & LIST_CA_NODE) && (node->type == CA_NODE)) ||
> -			    ((listtype & LIST_ROUTER_NODE) && (node->type == ROUTER_NODE)))
> -				list_node(node);
> +	for (node = fabric->ch_adapters; node; node = node->type_next) {
> +		DEBUG("CA: dist %d node %p\n", dist, node);
> +		/* Now, skip chassis based CAs */
> +		if (group && node->chrecord &&
> +		    node->chrecord->chassisnum)
>  			continue;
> -		}
> +		out_ca(node, group, chname);
>  
> -		for (port = node->ports; port; port = port->next, i++)
> -			if (port->remoteport)
> +		for (p = 1; p <= node->info.numports; p++) {
> +			port = node->ports[p];
> +			if (port && port->remoteport)
>  				out_ca_port(port, group);
> +		}
>  	}
>  
> -	if (chname)
> -		free(chname);
> +	/* make pass on routers */
> +	for (node = fabric->routers; node; node = node->type_next) {
> +		DEBUG("RT: dist %d node %p\n", dist, node);
> +		/* Now, skip chassis based CAs */
> +		if (group && node->chrecord &&
> +		    node->chrecord->chassisnum)
> +			continue;
> +		out_ca(node, group, chname);
> +		for (p = 1; p <= node->info.numports; p++) {
> +			port = node->ports[p];
> +			if (port && port->remoteport)
> +				out_ca_port(port, group);
> +		}
> +	}
>  
>  	return i;
>  }
>  
> -void dump_ports_report ()
> +
> +void dump_ports_report (ibnd_node_t *node, void *user_data)
>  {
> -	int b, n = 0, p;
> -	Node *node;
> -	Port *port;
> -
> -	// If switch and LID == 0, search of other switch ports with
> -	// valid LID and assign it to all ports of that switch
> -	for (b = 0; b <= MAXHOPS; b++)
> -		for (node = nodesdist[b]; node; node = node->dnext)
> -			if (node->type == SWITCH_NODE) {
> -				int swlid = 0;
> -				for (p = 0, port = node->ports;
> -				     p < node->numports && port && !swlid;
> -				     port = port->next)
> -					if (port->lid != 0)
> -						swlid = port->lid;
> -				for (p = 0, port = node->ports;
> -				     p < node->numports && port;
> -				     port = port->next)
> -					port->lid = swlid;
> -			}
> +	int p = 0;
> +	ibnd_port_t *port = NULL;
> +
> +	/* for each port */
> +	for (p = node->info.numports, port = node->ports[p];
> +	     p > 0;
> +	     port = node->ports[--p]) {
> +		if (port == NULL)
> +			continue;
>  
> -	for (b = 0; b <= MAXHOPS; b++)
> -		for (node = nodesdist[b]; node; node = node->dnext) {
> -			for (p = 0, port = node->ports;
> -			     p < node->numports && port;
> -			     p++, port = port->next) {
> -				fprintf(stdout,
> -					"%2s %5d %2d 0x%016" PRIx64 " %s %s",
> -					node_type_str2(port->node), port->lid,
> -					port->portnum,
> -					port->portguid,
> -					get_linkwidth_str(port->linkwidth),
> -					get_linkspeed_str(port->linkspeed));
> -				if (port->remoteport)
> -					fprintf(stdout,
> -						" - %2s %5d %2d 0x%016" PRIx64
> -						" ( '%s' - '%s' )\n",
> -						node_type_str2(port->remoteport->node),
> -						port->remoteport->lid,
> -						port->remoteport->portnum,
> -						port->remoteport->portguid,
> -						port->node->nodedesc,
> -						port->remoteport->node->nodedesc);
> -				else
> -					fprintf(stdout, "%36s'%s'\n", "",
> -						port->node->nodedesc);
> -			}
> -			n++;
> -		}
> +		fprintf(stdout,
> +			"%2s %5d %2d 0x%016" PRIx64 " %s %s",
> +			ibnd_node_type_str_short(node),
> +			node->info.type == IBND_SWITCH_NODE ? node->smalid : port->info.lid,
> +			port->portnum,
> +			port->guid,
> +			ibnd_linkwidth_str(port->info.link_width_active),
> +			ibnd_linkspeed_str(port->info.link_speed_active));
> +		if (port->remoteport)
> +			fprintf(stdout,
> +				" - %2s %5d %2d 0x%016" PRIx64
> +				" ( '%s' - '%s' )\n",
> +				ibnd_node_type_str_short(port->remoteport->node),
> +				port->remoteport->node->info.type == IBND_SWITCH_NODE ?
> +					port->remoteport->node->smalid : port->remoteport->info.lid,
> +				port->remoteport->portnum,
> +				port->remoteport->guid,
> +				port->node->nodedesc,
> +				port->remoteport->node->nodedesc);
> +		else
> +			fprintf(stdout, "%36s'%s'\n", "",
> +				port->node->nodedesc);
> +	}
>  }
>  
>  void
>  usage(void)
>  {
> -	fprintf(stderr, "Usage: %s [-d(ebug)] -e(rr_show) -v(erbose) -s(how) -l(ist) -g(rouping) -H(ca_list) -S(witch_list) -R(outer_list) -V(ersion) -C ca_name -P ca_port "
> +	fprintf(stderr, "Usage: %s [-d(ebug)] -s(how) -l(ist) -g(rouping) -H(ca_list) -S(witch_list) -R(outer_list) -V(ersion) -C ca_name -P ca_port "
>  			"-t(imeout) timeout_ms --node-name-map node-name-map] -p(orts) [<topology-file>]\n",
>  			argv0);
>  	fprintf(stderr, "       --node-name-map <node-name-map> specify a node name map file\n");
> @@ -933,20 +528,18 @@ usage(void)
>  int
>  main(int argc, char **argv)
>  {
> -	int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS};
> -	ib_portid_t my_portid = {0};
> -	int udebug = 0, list = 0;
> +	int list = 0;
>  	char *ca = 0;
>  	int ca_port = 0;
>  	int group = 0;
>  	int ports_report = 0;
> +	ibnd_fabric_t *fabric = NULL;
>  
>  	static char const str_opts[] = "C:P:t:devslgHSRpVhu";
>  	static const struct option long_opts[] = {
>  		{ "C", 1, 0, 'C'},
>  		{ "P", 1, 0, 'P'},
>  		{ "debug", 0, 0, 'd'},
> -		{ "err_show", 0, 0, 'e'},
>  		{ "verbose", 0, 0, 'v'},
>  		{ "show", 0, 0, 's'},
>  		{ "list", 0, 0, 'l'},
> @@ -982,23 +575,17 @@ main(int argc, char **argv)
>  			ca_port = strtoul(optarg, 0, 0);
>  			break;
>  		case 'd':
> -			ibdebug++;
> -			madrpc_show_errors(1);
> -			umad_debug(udebug);
> -			udebug++;
> +			debug = 1;
> +			ibnd_debug(1);
>  			break;
>  		case 't':
> -			timeout = strtoul(optarg, 0, 0);
> +			timeout_ms = strtoul(optarg, 0, 0);
>  			break;
>  		case 'v':
>  			verbose++;
> -			dumplevel++;
>  			break;
>  		case 's':
> -			dumplevel = 1;
> -			break;
> -		case 'e':
> -			madrpc_show_errors(1);
> +			ibnd_show_progress(1);
>  			break;
>  		case 'l':
>  			list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE;
> @@ -1007,13 +594,13 @@ main(int argc, char **argv)
>  			group = 1;
>  			break;
>  		case 'S':
> -			list = LIST_SWITCH_NODE;
> +			list |= LIST_SWITCH_NODE;
>  			break;
>  		case 'H':
> -			list = LIST_CA_NODE;
> +			list |= LIST_CA_NODE;
>  			break;
>  		case 'R':
> -			list = LIST_ROUTER_NODE;
> +			list |= LIST_ROUTER_NODE;
>  			break;
>  		case 'V':
>  			fprintf(stderr, "%s %s\n", argv0, get_build_version() );
> @@ -1030,22 +617,25 @@ main(int argc, char **argv)
>  	argv += optind;
>  
>  	if (argc && !(f = fopen(argv[0], "w")))
> -		IBERROR("can't open file %s for writing", argv[0]);
> +		fprintf(stderr, "can't open file %s for writing", argv[0]);
>  
> -	madrpc_init(ca, ca_port, mgmt_classes, 2);
>  	node_name_map = open_node_name_map(node_name_map_file);
>  
> -	if (discover(&my_portid) < 0)
> -		IBERROR("discover");
> -
> -	if (group)
> -		chassis = group_nodes();
> +	if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) {
> +		fprintf(stderr, "discover failed\n");
> +		exit(1);
> +	}
>  
>  	if (ports_report)
> -		dump_ports_report();
> +		ibnd_iter_nodes(fabric,
> +				dump_ports_report,
> +				NULL);
> +	else if (list)
> +		list_nodes(fabric, list);
>  	else
> -		dump_topology(list, group);
> +		dump_topology(group, fabric);
>  
> +	ibnd_destroy_fabric(fabric);
>  	close_node_name_map(node_name_map);
>  	exit(0);
>  }
> -- 
> 1.5.4.5
> 
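
The change from `list = LIST_SWITCH_NODE` to `list |= LIST_SWITCH_NODE` in the option handling above means -S, -H and -R can now be combined instead of the last one winning. A minimal sketch of that accumulation follows. The flag values and the helper name list_mask_from_opts() are assumptions made for illustration only; the real LIST_* definitions are not part of the quoted hunks.

```c
/* Hypothetical flag values; the real LIST_* macros live elsewhere */
enum {
	LIST_CA_NODE     = 1 << 0,
	LIST_SWITCH_NODE = 1 << 1,
	LIST_ROUTER_NODE = 1 << 2
};

/*
 * Hypothetical helper: fold a string of option letters into a list mask,
 * mirroring the 'l', 'S', 'H' and 'R' cases of the patched main().
 */
static int list_mask_from_opts(const char *opts)
{
	int list = 0;

	for (; *opts; opts++) {
		switch (*opts) {
		case 'l':	/* -l selects every node type */
			list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE;
			break;
		case 'S':
			list |= LIST_SWITCH_NODE;
			break;
		case 'H':
			list |= LIST_CA_NODE;
			break;
		case 'R':
			list |= LIST_ROUTER_NODE;
			break;
		default:	/* other options do not touch the mask */
			break;
		}
	}
	return list;
}
```

With plain assignment, "-S -H" would have listed only CAs; with the OR it lists switches and CAs, and "-H -S -R" is equivalent to "-l".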


From sashak at voltaire.com  Sun Nov 23 10:58:36 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 23 Nov 2008 20:58:36 +0200
Subject: [ofa-general] Re: [PATCH] Opensm: main exit codes
In-Reply-To: <4923678D.3080701@llnl.gov>
References: <4923678D.3080701@llnl.gov>
Message-ID: <20081123185836.GU21967@sashak.voltaire.com>

Hi Tim,

On 17:10 Tue 18 Nov     , Timothy A. Meier wrote:
> 
>   I thought it would be useful to define a set of exit codes for opensm.  A quick examination of main.c
> showed a few different ways to terminate.  How about this patch?  Obviously this doesn't catch every
> possible exit scenario, but it's a start that can be built upon.

Personally I read 'exit(0)' faster than 'exit(OSM_EXIT_TYPE_NORMAL)',
but maybe it is just me :).

Maybe the error codes could be formalized, but I'm not sure it would be
beneficial without practical uses (and a clear understanding of the
requirements). We could end up in the middle of a total mess, similar
to how OSM_LOG_* is used today.
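
For context, the kind of formalization under discussion might look roughly like the sketch below. Only OSM_EXIT_TYPE_NORMAL is a name taken from the patch; the other two codes and the osm_exit_type_str() helper are hypothetical and only illustrate the idea.

```c
/* Sketch of a formalized exit-code set for a daemon's main() */
enum osm_exit_type {
	OSM_EXIT_TYPE_NORMAL = 0,	/* name used in the proposed patch */
	OSM_EXIT_TYPE_BAD_ARGS = 1,	/* hypothetical */
	OSM_EXIT_TYPE_INIT_FAILED = 2	/* hypothetical */
};

/* Hypothetical helper: map an exit code to a short description */
static const char *osm_exit_type_str(int code)
{
	switch (code) {
	case OSM_EXIT_TYPE_NORMAL:
		return "normal";
	case OSM_EXIT_TYPE_BAD_ARGS:
		return "bad arguments";
	case OSM_EXIT_TYPE_INIT_FAILED:
		return "initialization failed";
	default:
		return "unknown";
	}
}
```

A set like this only pays off if something (an init script, a monitoring tool) actually distinguishes the values, which is the requirement question raised above.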

Sasha


From jeff at splitrockpr.com  Sun Nov 23 16:56:08 2008
From: jeff at splitrockpr.com (Jeffrey Scott)
Date: Sun, 23 Nov 2008 16:56:08 -0800
Subject: [ofa-general] We want your input for Sonoma 2009
Message-ID: <8F3AA2A8A5174958B80AF670B24BC534@Gaucho>

OFA Members-

We're putting together the agenda for the 2009 Sonoma Workshop.  We'd like
your input.  Please let us know what topics or content would be of most
interest to you.  Would you like to hear about vendor support for the OFA
software stack?  Real-world implementations by end users?  Are there
specific technologies that you'd like to see covered on the agenda?  Would
you like to spend more time discussing the future of OFED and WinOF,
including possible new features?  Would you like to hear presentations from
the major OS vendors?  Do you want to discuss InfiniBand/Ethernet issues?
Other topics that are of particular interest to you?

 
Also, are there things you'd like to see changed from last year's event?
Now is your chance to weigh in.  Please help the MWG make the Sonoma
Workshop as compelling and valuable as possible.

 
One final request: please let us know if you think it would be worthwhile to
host a hands-on training event at Sonoma to familiarize end users with the
OFA software stack.

 
Thanks in advance for your input.

-MWG

 
-----------------------------------

Jeffrey Scott

Split Rock Communications

 
408-884-4017

408-348-3651 Mobile

408-884-3900 Fax

www.SplitRockPR.com

 

From sashak at voltaire.com  Sun Nov 23 22:17:25 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 24 Nov 2008 08:17:25 +0200
Subject: [ofa-general] Re: your mail
In-Reply-To: <003701c94420$67840f80$368c2e80$@com>
References: <003701c94420$67840f80$368c2e80$@com>
Message-ID: <20081124061725.GV21967@sashak.voltaire.com>

Hi Bob,

On 11:10 Tue 11 Nov     , Robert Pearson wrote:
> 
> Here is the sixth patch implementing the mesh analysis algorithm.

Could you provide a description for this [PATCH 6]?

Also note that your mailer breaks long lines and corrupts patches (not
in this patch, but in others where long lines are used).

Sasha

> 
> This patch implements
>       - a table of polynomials for all 2D and 3D regular Cartesian meshes
>       - a routine to classify each switch based on the table
> 
> Regards,
> 
> Bob Pearson
> 
> Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
> ----
> diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
> index 9254de3..30d09c2 100644
> --- a/opensm/opensm/osm_mesh.c
> +++ b/opensm/opensm/osm_mesh.c
> @@ -48,6 +48,76 @@
>  #include <opensm/osm_mesh.h>
>  #include <opensm/osm_ucast_lash.h>
>  
> +#define MAX_DIMENSION (4)
> +#define MAX_DEGREE (10)
> +
> +/*
> + * characteristic polynomials for 2d and 3d regular tori
> + * since 4 == 2x2 we choose to take 2x2
> + */
> +struct _mesh_info {
> +	int dimension;			/* dimension of the torus */
> +	int size[MAX_DIMENSION];	/* size of the torus */
> +	int degree;			/* degree of polynomial */
> +	int poly[MAX_DEGREE+1];		/* polynomial */
> +} mesh_info[] = {
> +	{0, {0},       0, {0},					},
> +
> +	{2, {2, 2},    2, {-4, 0, 1},				},
> +	{2, {3, 2},    3, {8, 9, 0, -1},			},
> +	//{2, {4, 2},    3, {16, 12, 0, -1},			},
> +	{2, {5, 2},    3, {24, 17, 0, -1},			},
> +	{2, {6, 2},    3, {32, 24, 0, -1},			},
> +	{2, {3, 3},    4, {-15, -32, -18, 0, 1},		},
> +	//{2, {4, 3},    4, {-28, -48, -21, 0, 1},		},
> +	{2, {5, 3},    4, {-39, -64, -26, 0, 1},		},
> +	{2, {6, 3},    4, {-48, -80, -33, 0, 1},		},
> +	//{2, {4, 4},    4, {-48, -64, -24, 0, 1},		},
> +	//{2, {5, 4},    4, {-60, -80, -29, 0, 1},		},
> +	//{2, {6, 4},    4, {-64, -96, -36, 0, 1},		},
> +	{2, {5, 5},    4, {-63, -96, -34, 0, 1},		},
> +	{2, {6, 5},    4, {-48, -112, -41, 0, 1},		},
> +	{2, {6, 6},    4, {0, -128, -48, 0, 1},			},
> +
> +	{3, {2, 2, 2}, 3, {16, 12, 0, -1},			},
> +	{3, {3, 2, 2}, 4, {-28, -48, -21, 0, 1},		},
> +	{3, {4, 2, 2}, 4, {-48, -64, -24, 0, 1},		},
> +	{3, {5, 2, 2}, 4, {-60, -80, -29, 0, 1},		},
> +	{3, {6, 2, 2}, 4, {-64, -96, -36, 0, 1},		},
> +	{3, {3, 3, 2}, 5, {48, 127, 112, 34, 0, -1},		},
> +	{3, {4, 3, 2}, 5, {80, 180, 136, 37, 0, -1},		},
> +	{3, {5, 3, 2}, 5, {96, 215, 160, 42, 0, -1},		},
> +	{3, {6, 3, 2}, 5, {96, 232, 184, 49, 0, -1},		},
> +	{3, {4, 4, 2}, 5, {128, 240, 160, 40, 0, -1},		},
> +	{3, {5, 4, 2}, 5, {144, 276, 184, 45, 0, -1},		},
> +	{3, {6, 4, 2}, 5, {128, 288, 208, 52, 0, -1},		},
> +	{3, {5, 5, 2}, 5, {144, 303, 208, 50, 0, -1},		},
> +	{3, {6, 5, 2}, 5, {96, 296, 232, 57, 0, -1},		},
> +	{3, {6, 6, 2}, 5, {0, 256, 256, 64, 0, -1},		},
> +	{3, {3, 3, 3}, 6, {-81, -288, -381, -224, -51, 0, 1},	},
> +	{3, {4, 3, 3}, 6, {-132, -416, -487, -256, -54, 0, 1},	},
> +	{3, {5, 3, 3}, 6, {-153, -480, -557, -288, -59, 0, 1},	},
> +	{3, {6, 3, 3}, 6, {-144, -480, -591, -320, -66, 0, 1},	},
> +	{3, {4, 4, 3}, 6, {-208, -576, -600, -288, -57, 0, 1},	},
> +	{3, {5, 4, 3}, 6, {-228, -640, -671, -320, -62, 0, 1},	},
> +	{3, {6, 4, 3}, 6, {-192, -608, -700, -352, -69, 0, 1},	},
> +	{3, {5, 5, 3}, 6, {-225, -672, -733, -352, -67, 0, 1},	},
> +	{3, {6, 5, 3}, 6, {-144, -576, -743, -384, -74, 0, 1},	},
> +	{3, {6, 6, 3}, 6, {0, -384, -720, -416, -81, 0, 1},	},
> +	{3, {4, 4, 4}, 6, {-320, -768, -720, -320, -60, 0, 1},	},
> +	{3, {5, 4, 4}, 6, {-336, -832, -792, -352, -65, 0, 1},	},
> +	{3, {6, 4, 4}, 6, {-256, -768, -816, -384, -72, 0, 1},	},
> +	{3, {5, 5, 4}, 6, {-324, -864, -855, -384, -70, 0, 1},	},
> +	{3, {6, 5, 4}, 6, {-192, -736, -860, -416, -77, 0, 1},	},
> +	{3, {6, 6, 4}, 6, {0, -512, -832, -448, -84, 0, 1},	},
> +	{3, {5, 5, 5}, 6, {-297, -864, -909, -416, -75, 0, 1},	},
> +	{3, {6, 5, 5}, 6, {-144, -672, -895, -448, -82, 0, 1},	},
> +	{3, {6, 6, 5}, 6, {0, -384, -848, -480, -89, 0, 1},	},
> +	{3, {6, 6, 6}, 6, {0, 0, -768, -512, -96, 0, 1},	},
> +
> +	{-1, {0,}, 0, {0, },					},
> +};
> +
>  /*
>   * poly_alloc
>   * 
> @@ -489,6 +559,30 @@ static void classify_switch(lash_t *p_lash, int sw)
>  }
>  
>  /*
> + * classify_mesh_type
> + *
> + * try to look up node polynomial in table
> + */
> +static void classify_mesh_type(lash_t *p_lash, int sw)
> +{
> +	int i;
> +	switch_t *s = p_lash->switches[sw];
> +	struct _mesh_info *t;
> +
> +	for (i = 1; (t = &mesh_info[i])->dimension != -1; i++) {
> +		if (poly_diff(t->degree, t->poly, s))
> +			continue;
> +
> +		s->node->type = i;
> +		s->node->dimension = t->dimension;
> +		return;
> +	}
> +
> +	s->node->type = 0;
> +	return;
> +}
> +
> +/*
>   * get_local_geometry
>   *
>   * analyze the local geometry around each switch
> @@ -500,6 +594,7 @@ static void get_local_geometry(lash_t *p_lash)
>  	for (sw = 0; sw < p_lash->num_switches; sw++) {
>  		get_switch_metric(p_lash, sw);
>  		classify_switch(p_lash, sw);
> +		classify_mesh_type(p_lash, sw);
>  	}
>  }
>  
> 
> 
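
To make the table lookup in classify_mesh_type() easier to follow, here is a small standalone sketch of the same idea: walk a sentinel-terminated table and return the index of the first entry whose polynomial matches. The struct layout and three of the rows are copied from the patch; lookup_mesh_type() and its memcmp-based comparison are assumptions standing in for poly_diff(), whose body is not shown here.

```c
#include <string.h>

#define MAX_DIMENSION (4)
#define MAX_DEGREE (10)

struct mesh_entry {
	int dimension;			/* dimension of the torus */
	int size[MAX_DIMENSION];	/* size of the torus */
	int degree;			/* degree of polynomial */
	int poly[MAX_DEGREE + 1];	/* polynomial coefficients */
};

/* Index 0 means "unknown"; dimension == -1 terminates the table */
static const struct mesh_entry mesh_table[] = {
	{0, {0}, 0, {0}},
	{2, {2, 2}, 2, {-4, 0, 1}},
	{2, {3, 2}, 3, {8, 9, 0, -1}},
	{3, {2, 2, 2}, 3, {16, 12, 0, -1}},
	{-1, {0}, 0, {0}},
};

/*
 * Hypothetical stand-in for the poly_diff() based loop: return the table
 * index whose degree and coefficients match, or 0 if nothing matches.
 */
static int lookup_mesh_type(int degree, const int *poly)
{
	int i;

	for (i = 1; mesh_table[i].dimension != -1; i++) {
		if (mesh_table[i].degree != degree)
			continue;
		if (memcmp(mesh_table[i].poly, poly,
			   (degree + 1) * sizeof(int)))
			continue;
		return i;
	}
	return 0;
}
```

Because the first match wins, two shapes with the same polynomial cannot both be listed. That appears to be why the {4, 2} row is commented out in the patch: its polynomial {16, 12, 0, -1} is identical to the {2, 2, 2} row.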


From sashak at voltaire.com  Sun Nov 23 22:25:42 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 24 Nov 2008 08:25:42 +0200
Subject: [ofa-general] Re: [PATCH][8] opensm: measure size and reorder links
In-Reply-To: <004501c94424$23551620$69ff4260$@com>
References: <004501c94424$23551620$69ff4260$@com>
Message-ID: <20081124062542.GW21967@sashak.voltaire.com>

Hi Bob,

On 11:37 Tue 11 Nov     , Robert Pearson wrote:
> 
> Here is the eighth patch implementing the mesh analysis algorithm.

All whitespace is mangled in this patch, so I cannot apply it. Could you
resend it in plain-text format? Thanks.

Sasha

> 
> This patch implements
>       - routine to reorder links and measure the size of the mesh
> 
> Regards,
> 
> Bob Pearson
> 
> Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
> ----
> diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
> index 65afae6..a248522 100644
> --- a/opensm/opensm/osm_mesh.c
> +++ b/opensm/opensm/osm_mesh.c
> @@ -832,6 +832,183 @@ next_j:
>  }
>  
>  /*
> + * return |a| < |b|
> + */
> +static inline int ltmag(int a, int b)
> +{
> +     int a1 = (a >= 0)? a : -a;
> +     int b1 = (b >= 0)? b : -b;
> +
> +     return (a1 < b1) || (a1 == b1 && a > b);
> +}
> +
> +/*
> + * reorder_links
> + *
> + * reorder the links out of a switch in sign/dimension order
> + */
> +static int reorder_links(lash_t *p_lash, int sw)
> +{
> +     osm_log_t *p_log = &p_lash->p_osm->log;
> +     switch_t *s = p_lash->switches[sw];
> +     mesh_node_t *node = s->node;
> +     int n = node->num_links;
> +     link_t **links;
> +     int *axes;
> +     int i, j;
> +     int c;
> +     int next = 0;
> +
> +     if (!(links = calloc(n, sizeof(link_t *)))) {
> +           OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating temp array - out of memory\n");
> +           return -1;
> +     }
> +
> +     if (!(axes = calloc(n, sizeof(int)))) {
> +           free(links);
> +           OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating temp array - out of memory\n");
> +           return -1;
> +     }
> +
> +     /*
> +     * find the links with axes
> +     */
> +     for (j = 1; j <= 2*node->dimension; j++) {
> +           c = j;
> +           if (node->coord[(c-1)/2] > 0)
> +                 c = opposite(s, c);
> +
> +           for (i = 0; i < n; i++) {
> +                 if (!node->links[i])
> +                       continue;
> +                 if (node->axes[i] == c) {
> +                       links[next] = node->links[i];
> +                       axes[next] = node->axes[i];
> +                       node->links[i] = NULL;
> +                       next++;
> +                 }
> +           }
> +     }
> +
> +     /*
> +     * get the rest
> +     */
> +     for (i = 0; i < n; i++) {
> +           if (!node->links[i])
> +                 continue;
> +
> +           links[next] = node->links[i];
> +           axes[next] = node->axes[i];
> +           node->links[i] = NULL;
> +           next++;
> +     }
> +
> +     for (i = 0; i < n; i++) {
> +           node->links[i] = links[i];
> +           node->axes[i] = axes[i];
> +     }
> +
> +     free(links);
> +     free(axes);
> +
> +     return 0;
> +}
> +
> +/*
> + * measure geometry
> + */
> +static int measure_geometry(lash_t *p_lash, int seed)
> +{
> +     int i, j, k;
> +     int sw;
> +     switch_t *s, *s1;
> +     int change;
> +     int dimension = p_lash->mesh->dimension;
> +     int num_switches = p_lash->num_switches;
> +     int assigned_axes = 0, unassigned_axes = 0;
> +     int *max, *min;
> +
> +     for (sw = 0; sw < num_switches; sw++) {
> +           s = p_lash->switches[sw];
> +
> +           s->node->coord = calloc(dimension, sizeof(int));
> +           for (i = 0; i < dimension; i++)
> +                 s->node->coord[i] = (sw == seed)? 0 : 0x7fffffff;
> +
> +           for (i = 0; i < s->node->num_links; i++)
> +                 if (s->node->axes[i] == 0)
> +                       unassigned_axes++;
> +                 else
> +                       assigned_axes++;
> +     }
> +
> +     printf("lash: %d/%d unassigned/assigned axes\n", unassigned_axes, assigned_axes);
> +
> +     do {
> +           change = 0;
> +
> +           for (sw = 0; sw < num_switches; sw++) {
> +                 s = p_lash->switches[sw];
> +
> +                 if (s->node->coord[0] == 0x7fffffff)
> +                       continue;
> +
> +                 for (j = 0; j < s->node->num_links; j++) {
> +                       if (!s->node->axes[j])
> +                             continue;
> +
> +                       s1 = p_lash->switches[s->node->links[j]->switch_id];
> +
> +                       for (k = 0; k < dimension; k++) {
> +                             int coord = s->node->coord[k];
> +                             int axis = s->node->axes[j] - 1;
> +
> +                             if (k == axis/2)
> +                                   coord += (axis & 1)? -1 : +1;
> +
> +                             if (ltmag(coord, s1->node->coord[k])) {
> +                                   s1->node->coord[k] = coord;
> +                                   change++;
> +                             }
> +                       }
> +                 }
> +           }
> +     } while (change);
> +
> +     for (sw = 0; sw < num_switches; sw++) {
> +           if (reorder_links(p_lash, sw))
> +                 return -1;
> +     }
> +
> +     max = calloc(dimension, sizeof(int));
> +     min = calloc(dimension, sizeof(int));
> +     p_lash->mesh->size = calloc(dimension, sizeof(int));
> +
> +     for (i = 0; i < dimension; i++) {
> +           max[i] = -0x7fffffff;
> +           min[i] = 0x7fffffff;
> +     }
> +
> +     for (sw = 0; sw < num_switches; sw++) {
> +           s = p_lash->switches[sw];
> +
> +           for (i = 0; i < dimension; i++) {
> +                 if (s->node->coord[i] == 0x7fffffff)
> +                       continue;
> +                 if (s->node->coord[i] > max[i])
> +                       max[i] = s->node->coord[i];
> +                 if (s->node->coord[i] < min[i])
> +                       min[i] = s->node->coord[i];
> +           }
> +     }
> +
> +     for (i = 0; i < dimension; i++)
> +           p_lash->mesh->size[i] = max[i] - min[i] + 1;
> +
> +     return 0;
> +}
> +
> +/*
>   * osm_mesh_cleanup - free per mesh resources
>   */
>  void osm_mesh_cleanup(lash_t *p_lash)
> @@ -941,6 +1118,14 @@ int osm_do_mesh_analysis(lash_t *p_lash)
>  
>       if (s->node->type) {
>             make_geometry(p_lash, max_class_type);
> +
> +           if (measure_geometry(p_lash, max_class_type))
> +                 return -1;
> +
> +           printf("lash: found ");
> +           for (i = 0; i < mesh->dimension; i++)
> +                 printf("%s%d", i? "X" : "", mesh->size[i]);
> +           printf(" mesh\n");
>       }
>  
>       OSM_LOG_EXIT(p_log);
>  
> 
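
One detail worth spelling out is that ltmag() is not a pure |a| < |b| test: on equal magnitudes it prefers the positive value. The sketch below copies ltmag() from the quoted patch and adds relax_coord(), a hypothetical helper that isolates the update step from the do/while loop in measure_geometry(). The UNASSIGNED name is also an assumption; the patch uses the literal 0x7fffffff.

```c
#define UNASSIGNED 0x7fffffff	/* marker the patch uses for "no coordinate yet" */

/* Copied from the patch: |a| < |b|, with ties resolved towards a > b */
static inline int ltmag(int a, int b)
{
	int a1 = (a >= 0) ? a : -a;
	int b1 = (b >= 0) ? b : -b;

	return (a1 < b1) || (a1 == b1 && a > b);
}

/*
 * Hypothetical helper: replace *cur with candidate when the candidate is
 * "smaller" in the ltmag() sense. Returns 1 if *cur changed, else 0.
 */
static int relax_coord(int *cur, int candidate)
{
	if (!ltmag(candidate, *cur))
		return 0;
	*cur = candidate;
	return 1;
}
```

The tie-break matters on a torus: a switch that can be reached as both +3 and -3 along an axis settles on +3 regardless of visiting order, so the relaxation loop terminates with a deterministic coordinate.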


From sashak at voltaire.com  Sun Nov 23 23:01:00 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 24 Nov 2008 09:01:00 +0200
Subject: [ofa-general] [PATCH] opensm/ftree: save lft_buf memory allocations
Message-ID: <20081124070100.GZ21967@sashak.voltaire.com>


Use OpenSM switch lft_buf directly and save memory (48k per switch) in
local structures.
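
The 48k figure is simply the size of one full unicast LFT at one byte per LID. A small sketch of the arithmetic, assuming IB_LID_UCAST_END_HO is 0xBFFF as in iba/ib_types.h; lft_bytes_saved() is a hypothetical helper used only for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Highest unicast LID in host order (assumed to match iba/ib_types.h) */
#define IB_LID_UCAST_END_HO 0xBFFF

/* One uint8_t port entry per LID, LIDs 0..IB_LID_UCAST_END_HO inclusive */
#define LFT_LEN (IB_LID_UCAST_END_HO + 1)

/* Hypothetical helper: bytes no longer duplicated for num_switches switches */
static size_t lft_bytes_saved(size_t num_switches)
{
	return num_switches * LFT_LEN * sizeof(uint8_t);
}
```

For one switch this gives 49152 bytes (48 KiB); a fabric with 100 switches saves about 4.7 MiB of duplicated tables.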

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/opensm/osm_ucast_ftree.c |   53 +++++++-------------------------------
 1 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index fb26247..875954b 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -48,7 +48,6 @@
 #include <errno.h>
 #include <iba/ib_types.h>
 #include <complib/cl_qmap.h>
-#include <complib/cl_pool.h>
 #include <complib/cl_debug.h>
 #include <opensm/osm_opensm.h>
 #include <opensm/osm_switch.h>
@@ -119,15 +118,6 @@ typedef struct {
 
 /***************************************************
  **
- **  ftree_fwd_tbl_t definition
- **
- ***************************************************/
-
-typedef uint8_t *ftree_fwd_tbl_t;
-#define FTREE_FWD_TBL_LEN (IB_LID_UCAST_END_HO + 1)
-
-/***************************************************
- **
  **  ftree_port_t definition
  **
  ***************************************************/
@@ -184,7 +174,6 @@ typedef struct ftree_sw_t_ {
 	uint8_t down_port_groups_num;
 	ftree_port_group_t **up_port_groups;
 	uint8_t up_port_groups_num;
-	ftree_fwd_tbl_t lft_buf;
 	boolean_t is_leaf;
 	int down_port_groups_idx;
 } ftree_sw_t;
@@ -222,7 +211,6 @@ typedef struct ftree_fabric_t_ {
 	ftree_sw_t **leaf_switches;
 	uint32_t leaf_switches_num;
 	uint16_t max_cn_per_leaf;
-	cl_pool_t sw_fwd_tbl_pool;
 	uint16_t lft_max_lid_ho;
 	boolean_t fabric_built;
 } ftree_fabric_t;
@@ -579,9 +567,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN ftree_fabric_t * p_ftree,
 	p_sw->up_port_groups_num = 0;
 
 	/* initialize lft buffer */
-	p_sw->lft_buf =
-	    (ftree_fwd_tbl_t) cl_pool_get(&p_ftree->sw_fwd_tbl_pool);
-	memset(p_sw->lft_buf, OSM_NO_PATH, FTREE_FWD_TBL_LEN);
+	memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
 
 	p_sw->down_port_groups_idx = -1;
 
@@ -607,10 +593,6 @@ static void __osm_ftree_sw_destroy(IN ftree_fabric_t * p_ftree,
 	if (p_sw->up_port_groups)
 		free(p_sw->up_port_groups);
 
-	/* return switch fwd_tbl to pool */
-	if (p_sw->lft_buf)
-		cl_pool_put(&p_ftree->sw_fwd_tbl_pool, (void *)p_sw->lft_buf);
-
 	free(p_sw);
 }				/* __osm_ftree_sw_destroy() */
 
@@ -892,7 +874,6 @@ __osm_ftree_hca_add_port(IN ftree_hca_t * p_hca,
 
 static ftree_fabric_t *__osm_ftree_fabric_create()
 {
-	cl_status_t status;
 	ftree_fabric_t *p_ftree =
 	    (ftree_fabric_t *) malloc(sizeof(ftree_fabric_t));
 	if (p_ftree == NULL)
@@ -907,16 +888,6 @@ static ftree_fabric_t *__osm_ftree_fabric_create()
 
 	cl_qlist_init(&p_ftree->root_guid_list);
 
-	status = cl_pool_init(&p_ftree->sw_fwd_tbl_pool, 8,	/* min pool size */
-			      0,	/* max pool size - unlimited */
-			      8,	/* grow size */
-			      FTREE_FWD_TBL_LEN,	/* object_size */
-			      NULL,	/* object initializer */
-			      NULL,	/* object destructor */
-			      NULL);	/* context */
-	if (status != CL_SUCCESS)
-		return NULL;
-
 	return p_ftree;
 }
 
@@ -1008,7 +979,6 @@ static void __osm_ftree_fabric_destroy(ftree_fabric_t * p_ftree)
 	if (!p_ftree)
 		return;
 	__osm_ftree_fabric_clear(p_ftree);
-	cl_pool_destroy(&p_ftree->sw_fwd_tbl_pool);
 	free(p_ftree);
 }
 
@@ -1924,9 +1894,6 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item,
 	ftree_fabric_t *p_ftree = (ftree_fabric_t *) context;
 
 	p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid_ho;
-
-	memcpy(p_sw->p_osm_sw->new_lft, p_sw->lft_buf,
-	       IB_LID_UCAST_END_HO + 1);
 	osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr,
 				    p_sw->p_osm_sw);
 }
@@ -2065,13 +2032,13 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
 		/* second case: skip the port group if the remote (lower)
 		   switch has been already configured for this target LID */
 		if (is_real_lid && !is_main_path &&
-		    p_remote_sw->lft_buf[cl_ntoh16(target_lid)] != OSM_NO_PATH)
+		    p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] != OSM_NO_PATH)
 			continue;
 
 		/* setting fwd tbl port only if this is real LID */
 		if (is_real_lid) {
-			p_remote_sw->lft_buf[cl_ntoh16(target_lid)] =
-                                p_min_port->remote_port_num;
+			p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] =
+				p_min_port->remote_port_num;
 			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
 				"Switch %s: set path to CA LID %u through port %u\n",
 				__osm_ftree_tuple_to_str(p_remote_sw->tuple),
@@ -2249,7 +2216,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 		p_min_group->counter_down++;
 		p_min_port->counter_down++;
 		if (is_real_lid) {
-			p_remote_sw->lft_buf[cl_ntoh16(target_lid)] =
+			p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] =
 				p_min_port->remote_port_num;
 			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
 				"Switch %s: set path to CA LID %u through port %u\n",
@@ -2325,7 +2292,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 		p_remote_sw = p_group->remote_hca_or_sw.p_sw;
 
 		/* skip if target lid has been already set on remote switch fwd tbl */
-		if (p_remote_sw->lft_buf[cl_ntoh16(target_lid)] != OSM_NO_PATH)
+		if (p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] != OSM_NO_PATH)
 			continue;
 
 		if (p_sw->is_leaf) {
@@ -2343,7 +2310,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
 		   trying to balance these routes - always pick port 0. */
 
 		cl_ptr_vector_at(&p_group->ports, 0, (void *)&p_port);
-		p_remote_sw->lft_buf[cl_ntoh16(target_lid)] =
+		p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] =
 			p_port->remote_port_num;
 
 		/* On the remote switch that is pointed by the p_group,
@@ -2435,7 +2402,7 @@ static void __osm_ftree_fabric_route_to_cns(IN ftree_fabric_t * p_ftree)
 			/* set local LFT(LID) to the port that is connected to HCA */
 			cl_ptr_vector_at(&p_leaf_port_group->ports, 0,
 					 (void *)&p_port);
-			p_sw->lft_buf[cl_ntoh16(hca_lid)] = p_port->port_num;
+			p_sw->p_osm_sw->new_lft[cl_ntoh16(hca_lid)] = p_port->port_num;
 
 			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
 				"Switch %s: set path to CN LID %u through port %u\n",
@@ -2544,7 +2511,7 @@ static void __osm_ftree_fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree)
 			cl_ptr_vector_at(&p_hca_port_group->ports, 0,
 					 (void *)&p_hca_port);
 			port_num_on_switch = p_hca_port->remote_port_num;
-			p_sw->lft_buf[cl_ntoh16(hca_lid)] = port_num_on_switch;
+			p_sw->p_osm_sw->new_lft[cl_ntoh16(hca_lid)] = port_num_on_switch;
 
 			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
 				"Switch %s: set path to non-CN HCA LID %u through port %u\n",
@@ -2600,7 +2567,7 @@ static void __osm_ftree_fabric_route_to_switches(IN ftree_fabric_t * p_ftree)
 		p_next_sw = (ftree_sw_t *) cl_qmap_next(&p_sw->map_item);
 
 		/* set local LFT(LID) to 0 (route to itself) */
-		p_sw->lft_buf[cl_ntoh16(p_sw->base_lid)] = 0;
+		p_sw->p_osm_sw->new_lft[cl_ntoh16(p_sw->base_lid)] = 0;
 
 		OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
 			"Switch %s (LID %u): routing switch-to-switch paths\n",
-- 
1.6.0.4.766.g6fc4a


From dorfman.eli at gmail.com  Mon Nov 24 00:01:27 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Mon, 24 Nov 2008 10:01:27 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_trap_rcv.c disable
	the port with the least hop count
In-Reply-To: <20081121094514.GC6965@sashak.voltaire.com>
References: <49251926.9090509@gmail.com>
	<20081121094514.GC6965@sashak.voltaire.com>
Message-ID: <694d48600811240001g1673d3aeo26ff7bc3bce0a6e8@mail.gmail.com>

On Fri, Nov 21, 2008 at 11:45 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> Hi Eli,
>
> On 10:00 Thu 20 Nov     , Eli Dorfman wrote:
>> disable the port with the least hop count.
>> this will address the case of inter switch link where the
>> most remote port (from opensm) is sending traps.
>> in that case we would like to disable the nearest switch port (from opensm).
>>
>> Signed-off-by: Eli Dorfman <elid at voltaire.com>
>
> I applied the patch. However, I have a question.
>
>> ---
>>  opensm/opensm/osm_trap_rcv.c |    4 ++--
>>  1 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
>> index 07c5183..d1dfbd4 100644
>> --- a/opensm/opensm/osm_trap_rcv.c
>> +++ b/opensm/opensm/osm_trap_rcv.c
>> @@ -239,8 +239,8 @@ static int disable_port(osm_sm_t *sm, osm_physp_t *p)
>>       ib_port_info_t *pi = (ib_port_info_t *)payload;
>>       int ret;
>>
>> -     /* in case of endport - disable switch's peer port */
>> -     if (osm_node_get_type(p->p_node) != IB_NODE_TYPE_SWITCH)
>> +     /* select the nearest port to master opensm */
>> +     if (p->dr_path.hop_count > p->p_remote_physp->dr_path.hop_count)
>>               p = p->p_remote_physp;
>
> Is it possible that this noisy port is a switch external port, "the
> nearest" to the OpenSM node, and doesn't have a remote port (due to an
> unstable link)? We saw such cases in practice, and they are handled by
> OpenSM in a light sweep (see the __osm_state_mgr_get_remote_port_info()
> calls in the __osm_state_mgr_light_sweep_start() function).
>
> With the endports-only check this is impossible IMO, but I don't see why
> it cannot happen with switch ports. Right?
>
> If so then maybe the code should look like:
>
>        if (p->p_remote_physp &&
>            p->dr_path.hop_count > p->p_remote_physp->dr_path.hop_count)
>                p = p->p_remote_physp;
>


you are absolutely right. please add the above fix.

Thanks,
Eli

>
> Sasha
>
>>
>>       /* If trap 131, might want to disable peer port if available */
>> --
>> 1.5.5
>>
>


From sashak at voltaire.com  Mon Nov 24 00:20:55 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 24 Nov 2008 10:20:55 +0200
Subject: [ofa-general] Re: [PATCH] opensm/osm_trap_rcv.c disable the port
	with the least hop count
In-Reply-To: <694d48600811240001g1673d3aeo26ff7bc3bce0a6e8@mail.gmail.com>
References: <49251926.9090509@gmail.com>
	<20081121094514.GC6965@sashak.voltaire.com>
	<694d48600811240001g1673d3aeo26ff7bc3bce0a6e8@mail.gmail.com>
Message-ID: <20081124082055.GE21967@sashak.voltaire.com>

On 10:01 Mon 24 Nov     , Eli Dorfman wrote:
> >
> > If so then maybe the code should look like:
> >
> >        if (p->p_remote_physp &&
> >            p->dr_path.hop_count > p->p_remote_physp->dr_path.hop_count)
> >                p = p->p_remote_physp;
> >
> 
> 
> you are absolutely right. please add the above fix.

Applied.
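
For reference, a minimal standalone sketch of the guarded selection
discussed above. The struct names and select_port_to_disable() are
hypothetical stand-ins for illustration only; the real code operates on
osm_physp_t inside disable_port().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for osm_physp_t and its DR path. */
struct dr_path {
	uint8_t hop_count;
};

struct physp {
	struct dr_path dr_path;
	struct physp *p_remote_physp;	/* NULL when no peer is known */
};

/* Return the port to disable: the peer only if it exists and is nearer. */
static struct physp *select_port_to_disable(struct physp *p)
{
	if (p->p_remote_physp &&
	    p->dr_path.hop_count > p->p_remote_physp->dr_path.hop_count)
		p = p->p_remote_physp;
	return p;
}
```

Without the NULL test, a port whose link is unstable (and so has no
recorded peer) would be dereferenced through p_remote_physp.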

Sasha


From kliteyn at dev.mellanox.co.il  Mon Nov 24 01:06:54 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 24 Nov 2008 11:06:54 +0200
Subject: [ofa-general] Re: [PATCH] opensm/ftree: save lft_buf memory
	allocations
In-Reply-To: <20081124070100.GZ21967@sashak.voltaire.com>
References: <20081124070100.GZ21967@sashak.voltaire.com>
Message-ID: <492A6EAE.7020909@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Use OpenSM switch lft_buf directly and save memory (48k per switch) in
> local structures.

Looks good, thanks.
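
To make the "48k per switch" figure concrete, here is a small sketch of
what the patch amounts to, assuming IB_LID_UCAST_END_HO is 0xBFFF and
OSM_NO_PATH is 0xFF (values quoted from memory, please check the
headers). The struct and function names are hypothetical stand-ins, not
the real osm_switch_t / ftree_sw_t.

```c
#include <stdint.h>
#include <string.h>

/* Assumed values of IB_LID_UCAST_END_HO and OSM_NO_PATH. */
#define LID_UCAST_END_HO 0xBFFF
#define NO_PATH 0xFF
#define LFT_LEN (LID_UCAST_END_HO + 1)	/* 0xC000 = 49152 bytes = 48k */

/* Hypothetical stand-in for osm_switch_t: it owns the only table. */
struct osm_sw {
	uint8_t new_lft[LFT_LEN];
};

/* Hypothetical stand-in for ftree_sw_t: no private lft_buf copy. */
struct ftree_sw {
	struct osm_sw *p_osm_sw;
};

/* Mark every LID as unrouted, as __osm_ftree_sw_create() now does. */
static void ftree_sw_init_lft(struct ftree_sw *p_sw)
{
	memset(p_sw->p_osm_sw->new_lft, NO_PATH, LFT_LEN);
}

/* Returns 0 if the entry was set, -1 if out of range or already routed. */
static int ftree_sw_set_path(struct ftree_sw *p_sw, uint16_t lid_ho,
			     uint8_t port)
{
	uint8_t *lft = p_sw->p_osm_sw->new_lft;

	if (lid_ho > LID_UCAST_END_HO || lft[lid_ho] != NO_PATH)
		return -1;
	lft[lid_ho] = port;
	return 0;
}
```

Since the routing code writes straight into the switch's table, the
memcpy() in __osm_ftree_set_sw_fwd_table() is no longer needed.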

-- Yevgeny

> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
>  opensm/opensm/osm_ucast_ftree.c |   53 +++++++-------------------------------
>  1 files changed, 10 insertions(+), 43 deletions(-)
> 
> diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
> index fb26247..875954b 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -48,7 +48,6 @@
>  #include <errno.h>
>  #include <iba/ib_types.h>
>  #include <complib/cl_qmap.h>
> -#include <complib/cl_pool.h>
>  #include <complib/cl_debug.h>
>  #include <opensm/osm_opensm.h>
>  #include <opensm/osm_switch.h>
> @@ -119,15 +118,6 @@ typedef struct {
>  
>  /***************************************************
>   **
> - **  ftree_fwd_tbl_t definition
> - **
> - ***************************************************/
> -
> -typedef uint8_t *ftree_fwd_tbl_t;
> -#define FTREE_FWD_TBL_LEN (IB_LID_UCAST_END_HO + 1)
> -
> -/***************************************************
> - **
>   **  ftree_port_t definition
>   **
>   ***************************************************/
> @@ -184,7 +174,6 @@ typedef struct ftree_sw_t_ {
>  	uint8_t down_port_groups_num;
>  	ftree_port_group_t **up_port_groups;
>  	uint8_t up_port_groups_num;
> -	ftree_fwd_tbl_t lft_buf;
>  	boolean_t is_leaf;
>  	int down_port_groups_idx;
>  } ftree_sw_t;
> @@ -222,7 +211,6 @@ typedef struct ftree_fabric_t_ {
>  	ftree_sw_t **leaf_switches;
>  	uint32_t leaf_switches_num;
>  	uint16_t max_cn_per_leaf;
> -	cl_pool_t sw_fwd_tbl_pool;
>  	uint16_t lft_max_lid_ho;
>  	boolean_t fabric_built;
>  } ftree_fabric_t;
> @@ -579,9 +567,7 @@ static ftree_sw_t *__osm_ftree_sw_create(IN ftree_fabric_t * p_ftree,
>  	p_sw->up_port_groups_num = 0;
>  
>  	/* initialize lft buffer */
> -	p_sw->lft_buf =
> -	    (ftree_fwd_tbl_t) cl_pool_get(&p_ftree->sw_fwd_tbl_pool);
> -	memset(p_sw->lft_buf, OSM_NO_PATH, FTREE_FWD_TBL_LEN);
> +	memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
>  
>  	p_sw->down_port_groups_idx = -1;
>  
> @@ -607,10 +593,6 @@ static void __osm_ftree_sw_destroy(IN ftree_fabric_t * p_ftree,
>  	if (p_sw->up_port_groups)
>  		free(p_sw->up_port_groups);
>  
> -	/* return switch fwd_tbl to pool */
> -	if (p_sw->lft_buf)
> -		cl_pool_put(&p_ftree->sw_fwd_tbl_pool, (void *)p_sw->lft_buf);
> -
>  	free(p_sw);
>  }				/* __osm_ftree_sw_destroy() */
>  
> @@ -892,7 +874,6 @@ __osm_ftree_hca_add_port(IN ftree_hca_t * p_hca,
>  
>  static ftree_fabric_t *__osm_ftree_fabric_create()
>  {
> -	cl_status_t status;
>  	ftree_fabric_t *p_ftree =
>  	    (ftree_fabric_t *) malloc(sizeof(ftree_fabric_t));
>  	if (p_ftree == NULL)
> @@ -907,16 +888,6 @@ static ftree_fabric_t *__osm_ftree_fabric_create()
>  
>  	cl_qlist_init(&p_ftree->root_guid_list);
>  
> -	status = cl_pool_init(&p_ftree->sw_fwd_tbl_pool, 8,	/* min pool size */
> -			      0,	/* max pool size - unlimited */
> -			      8,	/* grow size */
> -			      FTREE_FWD_TBL_LEN,	/* object_size */
> -			      NULL,	/* object initializer */
> -			      NULL,	/* object destructor */
> -			      NULL);	/* context */
> -	if (status != CL_SUCCESS)
> -		return NULL;
> -
>  	return p_ftree;
>  }
>  
> @@ -1008,7 +979,6 @@ static void __osm_ftree_fabric_destroy(ftree_fabric_t * p_ftree)
>  	if (!p_ftree)
>  		return;
>  	__osm_ftree_fabric_clear(p_ftree);
> -	cl_pool_destroy(&p_ftree->sw_fwd_tbl_pool);
>  	free(p_ftree);
>  }
>  
> @@ -1924,9 +1894,6 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item,
>  	ftree_fabric_t *p_ftree = (ftree_fabric_t *) context;
>  
>  	p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid_ho;
> -
> -	memcpy(p_sw->p_osm_sw->new_lft, p_sw->lft_buf,
> -	       IB_LID_UCAST_END_HO + 1);
>  	osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr,
>  				    p_sw->p_osm_sw);
>  }
> @@ -2065,13 +2032,13 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
>  		/* second case: skip the port group if the remote (lower)
>  		   switch has been already configured for this target LID */
>  		if (is_real_lid && !is_main_path &&
> -		    p_remote_sw->lft_buf[cl_ntoh16(target_lid)] != OSM_NO_PATH)
> +		    p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] != OSM_NO_PATH)
>  			continue;
>  
>  		/* setting fwd tbl port only if this is real LID */
>  		if (is_real_lid) {
> -			p_remote_sw->lft_buf[cl_ntoh16(target_lid)] =
> -                                p_min_port->remote_port_num;
> +			p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] =
> +				p_min_port->remote_port_num;
>  			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
>  				"Switch %s: set path to CA LID %u through port %u\n",
>  				__osm_ftree_tuple_to_str(p_remote_sw->tuple),
> @@ -2249,7 +2216,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
>  		p_min_group->counter_down++;
>  		p_min_port->counter_down++;
>  		if (is_real_lid) {
> -			p_remote_sw->lft_buf[cl_ntoh16(target_lid)] =
> +			p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] =
>  				p_min_port->remote_port_num;
>  			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
>  				"Switch %s: set path to CA LID %u through port %u\n",
> @@ -2325,7 +2292,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
>  		p_remote_sw = p_group->remote_hca_or_sw.p_sw;
>  
>  		/* skip if target lid has been already set on remote switch fwd tbl */
> -		if (p_remote_sw->lft_buf[cl_ntoh16(target_lid)] != OSM_NO_PATH)
> +		if (p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] != OSM_NO_PATH)
>  			continue;
>  
>  		if (p_sw->is_leaf) {
> @@ -2343,7 +2310,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree,
>  		   trying to balance these routes - always pick port 0. */
>  
>  		cl_ptr_vector_at(&p_group->ports, 0, (void *)&p_port);
> -		p_remote_sw->lft_buf[cl_ntoh16(target_lid)] =
> +		p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] =
>  			p_port->remote_port_num;
>  
>  		/* On the remote switch that is pointed by the p_group,
> @@ -2435,7 +2402,7 @@ static void __osm_ftree_fabric_route_to_cns(IN ftree_fabric_t * p_ftree)
>  			/* set local LFT(LID) to the port that is connected to HCA */
>  			cl_ptr_vector_at(&p_leaf_port_group->ports, 0,
>  					 (void *)&p_port);
> -			p_sw->lft_buf[cl_ntoh16(hca_lid)] = p_port->port_num;
> +			p_sw->p_osm_sw->new_lft[cl_ntoh16(hca_lid)] = p_port->port_num;
>  
>  			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
>  				"Switch %s: set path to CN LID %u through port %u\n",
> @@ -2544,7 +2511,7 @@ static void __osm_ftree_fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree)
>  			cl_ptr_vector_at(&p_hca_port_group->ports, 0,
>  					 (void *)&p_hca_port);
>  			port_num_on_switch = p_hca_port->remote_port_num;
> -			p_sw->lft_buf[cl_ntoh16(hca_lid)] = port_num_on_switch;
> +			p_sw->p_osm_sw->new_lft[cl_ntoh16(hca_lid)] = port_num_on_switch;
>  
>  			OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
>  				"Switch %s: set path to non-CN HCA LID %u through port %u\n",
> @@ -2600,7 +2567,7 @@ static void __osm_ftree_fabric_route_to_switches(IN ftree_fabric_t * p_ftree)
>  		p_next_sw = (ftree_sw_t *) cl_qmap_next(&p_sw->map_item);
>  
>  		/* set local LFT(LID) to 0 (route to itself) */
> -		p_sw->lft_buf[cl_ntoh16(p_sw->base_lid)] = 0;
> +		p_sw->p_osm_sw->new_lft[cl_ntoh16(p_sw->base_lid)] = 0;
>  
>  		OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
>  			"Switch %s (LID %u): routing switch-to-switch paths\n",


From sashak at voltaire.com  Mon Nov 24 01:27:40 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 24 Nov 2008 11:27:40 +0200
Subject: [ofa-general] Re: [PATCH] opensm/ftree: save lft_buf memory
	allocations
In-Reply-To: <492A6EAE.7020909@dev.mellanox.co.il>
References: <20081124070100.GZ21967@sashak.voltaire.com>
	<492A6EAE.7020909@dev.mellanox.co.il>
Message-ID: <20081124092740.GG21967@sashak.voltaire.com>

Hi Yevgeny,

On 11:06 Mon 24 Nov     , Yevgeny Kliteynik wrote:
>
> Sasha Khapyorsky wrote:
>> Use OpenSM switch lft_buf directly and save memory (48k per switch) in
>> local structures.
>
> Looks good, thanks.

The only potential downside I can see here is that this will require
some handling if we remove the new_lft field (after #1406 and other
debugging).

Sasha


From kliteyn at dev.mellanox.co.il  Mon Nov 24 01:57:34 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 24 Nov 2008 11:57:34 +0200
Subject: [ofa-general] Re: [PATCH] opensm/ftree: save lft_buf memory
	allocations
In-Reply-To: <20081124092740.GG21967@sashak.voltaire.com>
References: <20081124070100.GZ21967@sashak.voltaire.com>
	<492A6EAE.7020909@dev.mellanox.co.il>
	<20081124092740.GG21967@sashak.voltaire.com>
Message-ID: <492A7A8E.1050109@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 11:06 Mon 24 Nov     , Yevgeny Kliteynik wrote:
>> Sasha Khapyorsky wrote:
>>> Use OpenSM switch lft_buf directly and save memory (48k per switch) in
>>> local structures.
>> Looks good, thanks.
> 
> The only potential downside I could see here - is that this will require
> some handling if we will remove new_lft field (after #1406 and other
> debugging).

Right, I thought about it too, but I decided that removing new_lft
might not be so good. Right now we have two types of routing engines:
engines that base their decisions on the min_hop tables, and engines
that make their own decisions and create min_hop tables as a by-product,
just for multicast routing.
The only example of the latter at this point is fat-tree routing, but I'm
sure that more will follow. New routing for 3D mesh/torus comes to mind (not
necessarily the one that was already posted to the list). For this type
of routing you will need new_lft anyway, so instead of having it inside
every routing engine (as it was with fat-tree before the unicast cache and
lft simplification), we'd better have one in osm_switch_t.

-- Yevgeny

> Sasha
> 


From vlad at lists.openfabrics.org  Mon Nov 24 03:23:04 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 24 Nov 2008 03:23:04 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081124-0200 daily build status
Message-ID: <20081124112304.A9909E608DC@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From amirv at mellanox.co.il  Mon Nov 24 03:53:36 2008
From: amirv at mellanox.co.il (Amir Vadai)
Date: Mon, 24 Nov 2008 13:53:36 +0200
Subject: [ofa-general] Re: [ewg] OFED 1.4 - delay the GA to Dec 4
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD01006654@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD01006654@mtlexch01.mtl.com>
Message-ID: <492A95C0.9050500@mellanox.co.il>

Both bugs on SDP are fixed (BUG1348, BUG1349) - Currently there are no
major bugs relevant to this release.


- Amir


Tziporet Koren wrote:

> Hi All,
>
> I have Just reviewed bugs status with Vlad.
>
> We have 11 major and critical bugs, and we will not be able to fix all
> of them in one week
>
> Thus - I delay the GA release to Dec 4 (since we have thanks-giving
> holiday next week)
>
> I also suggest we will create RC6 by end of next week - since most of
> the bugs are assigned to people in Israel and we do not have vacation
> next week
>
> We will review the release status at the EWG meeting next week.
>
> Bug owners - please reply with status update and also update bug report
>
> Bugs list:
>
> 1370  blo  vlad at mellanox.co.il      Ping over IPoIB I/F fails after ifconfig down and up
> 1242  cri  yannick.cote at qlogic.com  kernel panic while running mpi2007 against ofed1.4 -- ib_...
> 1198  cri  yosefe at voltaire.com      hang during ipoib create_child/ifdown
> 1348  maj  amirv at mellanox.co.il     Sdp sockets doesn't closed after programs end
> 1349  maj  amirv at mellanox.co.il     Kernel panic on sdp
> 1289  maj  jackm at mellanox.co.il     Ib and ipoib doesn't respond while running multiple tests ...
> 1389  maj  jackm at mellanox.co.il     poll_cq sometimes fail in a multithreaded test
> 1401  maj  sashak at voltaire.com      segmentation fault when running opensm -Q
> 1377  maj  vu at mellanox.com          Deadlock occurred during HA test
> 1380  maj  vu at mellanox.com          Cannot unload ib_srpt module on SRP target
> 1395  maj  vu at mellanox.com          kernel panic during SRP HA test
>
>
> Tziporet & Vlad


From wangchen at cn.fujitsu.com  Mon Nov 24 01:31:59 2008
From: wangchen at cn.fujitsu.com (Wang Chen)
Date: Mon, 24 Nov 2008 17:31:59 +0800
Subject: [ofa-general] [PATCH next]infiniband: Kill directly reference of
	netdev->priv
Message-ID: <492A748F.9040308@cn.fujitsu.com>

This use of netdev->priv is wrong.
The right way is:
- call alloc_netdev() with no memory for private data;
- make netdev->ml_priv point to c2_dev.

I am doing this kind of cleanup for the net-next tree,
so I am sending this patch to Dave, although he is not the
infiniband maintainer.

Signed-off-by: Wang Chen <wangchen at cn.fujitsu.com>
---
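A userspace sketch of the pattern, for readers who do not have the
driver at hand. All names ending in _stub are hypothetical stand-ins;
the real types are struct net_device and struct c2_dev, and the real
allocator is alloc_netdev(). The point is only that the net device gets
no private area of its own and ml_priv borrows a pointer to state owned
elsewhere.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct c2_dev (driver state owned elsewhere). */
struct c2_dev_stub {
	int id;
};

/* Hypothetical stand-in for struct net_device. */
struct netdev_stub {
	void *ml_priv;	/* borrowed pointer to driver state, not owned */
};

/* Mimics alloc_netdev(sizeof_priv, ...): extra bytes only if requested. */
static struct netdev_stub *alloc_netdev_stub(size_t sizeof_priv)
{
	return calloc(1, sizeof(struct netdev_stub) + sizeof_priv);
}

/* Allocate with a zero-sized private area and point ml_priv at c2dev. */
static struct netdev_stub *pseudo_netdev_init(struct c2_dev_stub *c2dev)
{
	struct netdev_stub *netdev = alloc_netdev_stub(0);

	if (!netdev)
		return NULL;
	netdev->ml_priv = c2dev;
	return netdev;
}

/* What c2_pseudo_up()/c2_pseudo_down() do to get back to the device. */
static struct c2_dev_stub *pseudo_to_c2dev(struct netdev_stub *netdev)
{
	return netdev->ml_priv;
}
```
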
 drivers/infiniband/hw/amso1100/c2_provider.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index 69580e2..5119d65 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -653,7 +653,7 @@ static int c2_service_destroy(struct iw_cm_id *cm_id)
 static int c2_pseudo_up(struct net_device *netdev)
 {
 	struct in_device *ind;
-	struct c2_dev *c2dev = netdev->priv;
+	struct c2_dev *c2dev = netdev->ml_priv;
 
 	ind = in_dev_get(netdev);
 	if (!ind)
@@ -678,7 +678,7 @@ static int c2_pseudo_up(struct net_device *netdev)
 static int c2_pseudo_down(struct net_device *netdev)
 {
 	struct in_device *ind;
-	struct c2_dev *c2dev = netdev->priv;
+	struct c2_dev *c2dev = netdev->ml_priv;
 
 	ind = in_dev_get(netdev);
 	if (!ind)
@@ -746,14 +746,14 @@ static struct net_device *c2_pseudo_netdev_init(struct c2_dev *c2dev)
 	/* change ethxxx to iwxxx */
 	strcpy(name, "iw");
 	strcat(name, &c2dev->netdev->name[3]);
-	netdev = alloc_netdev(sizeof(*netdev), name, setup);
+	netdev = alloc_netdev(0, name, setup);
 	if (!netdev) {
 		printk(KERN_ERR PFX "%s -  etherdev alloc failed",
 			__func__);
 		return NULL;
 	}
 
-	netdev->priv = c2dev;
+	netdev->ml_priv = c2dev;
 
 	SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev);
 
-- 
1.5.3.4


From vlad at mellanox.co.il  Mon Nov 24 07:37:00 2008
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Mon, 24 Nov 2008 17:37:00 +0200
Subject: [ofa-general] [PATCH] IPoIB: Prevent address handles leak.
Message-ID: <20081124153700.GA27848@mellanox.co.il>

When the ib_ipoib module is removed, ipoib_ib_dev_stop() is called
and all address handles (ahs) in the dead_ahs list are reaped.
But some ahs are added to the dead list by ipoib_mcast_free() after
ipoib_ib_dev_stop() has finished. These ahs are never freed.

The solution here is to wait until multicast_list is empty, so that all
ahs have been added to the dead_ahs list.

Signed-off-by: Vladimir Sokolovsky <vlad at mellanox.co.il>
---
Roland,
There may be some extremely slight window for leaking address handles
still, since the multicast list is emptied in ipoib_mcast_dev_flush() before
it calls ipoib_mcast_free() (which adds address handles to the dead list).

However, this seems to be the best compromise that I can see without
a lot of nasty (and possibly buggy) changes.
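
To illustrate the changed loop condition, here is a small userspace
model. It is only a sketch: the two lists are reduced to counters, the
_stub and helper names are hypothetical, the multicast teardown (which
really runs elsewhere) is modelled inline, and max_iters stands in for
the "begin + HZ" timeout.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two lists in struct ipoib_dev_priv. */
struct priv_stub {
	int dead_ahs;		/* handles waiting to be reaped */
	int multicast_list;	/* groups whose handles are not dead yet */
};

/* Stand-in for ipoib_mcast_free(): the group's handle becomes dead. */
static void mcast_free_one(struct priv_stub *priv)
{
	if (priv->multicast_list > 0) {
		priv->multicast_list--;
		priv->dead_ahs++;
	}
}

/* Stand-in for __ipoib_reap_ah(): free everything on the dead list. */
static int reap_ah(struct priv_stub *priv)
{
	int n = priv->dead_ahs;

	priv->dead_ahs = 0;
	return n;
}

/* Loop until both lists are empty, or give up after max_iters passes. */
static bool drain_ahs(struct priv_stub *priv, int max_iters, int *reaped)
{
	int i = 0;

	*reaped = 0;
	while (priv->dead_ahs > 0 || priv->multicast_list > 0) {
		*reaped += reap_ah(priv);
		if (i++ >= max_iters)
			return false;	/* timed out, handles may leak */
		mcast_free_one(priv);	/* models concurrent teardown */
	}
	return true;
}
```

With only the dead_ahs test, the loop would exit as soon as the dead list
was empty, even though groups still on multicast_list would add more
handles afterwards.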

 drivers/infiniband/ulp/ipoib/ipoib_ib.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 66cafa2..6cc0c59 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -863,7 +863,7 @@ timeout:
 
 	begin = jiffies;
 
-	while (!list_empty(&priv->dead_ahs)) {
+	while (!list_empty(&priv->dead_ahs) || !list_empty(&priv->multicast_list)) {
 		__ipoib_reap_ah(dev);
 
 		if (time_after(jiffies, begin + HZ)) {
-- 
1.5.6.3


From tziporet at mellanox.co.il  Mon Nov 24 07:59:02 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 24 Nov 2008 17:59:02 +0200
Subject: [ofa-general] OFED 1.4 meeting agenda for today - Nov 24
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD010AC6CC@mtlexch01.mtl.com>

This is the agenda for the OFED meeting today

1. Bugs status review:
1370  blo  vlad at mellanox.co.il      Ping over IPoIB I/F fails after ifconfig down and up - there is a fix but it's not integrated
1242  cri  yannick.cote at qlogic.com  kernel panic while running mpi2007 against ofed1.4 -- ib_...
1410  cri  vlad at mellanox.co.il      Memory leak (address handle not reaped) in IPoIB
1289  maj  jackm at mellanox.co.il     Ib and ipoib doesn't respond while running multiple tests ...
1407  maj  monis at voltaire.com       Active-Backup failure when disabling an active slave inte...
1377  maj  vu at mellanox.com          Deadlock occurred during HA test
1380  maj  vu at mellanox.com          Cannot unload ib_srpt module on SRP target
1395  maj  vu at mellanox.com          kernel panic during SRP HA test
1384  maj  eli at mellanox.co.il       netperf latency small messages increase 5%
1385  maj  eli at mellanox.co.il       ofed 1.4 - netperf udp BW small messages decrease ~8%
1386  maj  eli at mellanox.co.il       ofed 1.4 - iperf tcp connected mode BW large messages dec...


2. Decide on release date: Not clear if we can make it for Dec 4, since
we have many open bugs

3. Decide on next meetings dates:
    I suggest Dec 1, Dec 8 and Dec 15 (only if the release is not done)

4. Open discussion

Tziporet

From rdreier at cisco.com  Mon Nov 24 09:00:59 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 24 Nov 2008 09:00:59 -0800
Subject: [ofa-general] Re: RDMA CM and IPv6 support
In-Reply-To: <1227431794.4180.7.camel@alst60.voltaire.com> (Aleksey Senin's
	message of "Sun, 23 Nov 2008 09:16:34 +0000")
References: <1227431794.4180.7.camel@alst60.voltaire.com>
Message-ID: <adaej11f48k.fsf@cisco.com>

 > There was a set of kernel patches written by me and approved by Sean for
 > RDMA CM to support the IPv6 protocol. Is there any reason why they were
 > not applied? I'll be glad to fix them.

I didn't see any comment on them last time as I recall.  I would prefer
to get an ack from Sean before applying them.

In any case I've lost the original mails from my mailbox.  I think it
would be a good idea for you to repost the patches against the latest
kernel to move things forward.

 - R.


From rdreier at cisco.com  Mon Nov 24 09:01:28 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 24 Nov 2008 09:01:28 -0800
Subject: [ofa-general] Re: [PATCH next]infiniband: Kill directly reference of
	netdev->priv
In-Reply-To: <492A748F.9040308@cn.fujitsu.com> (Wang Chen's message of "Mon,
	24 Nov 2008 17:31:59 +0800")
References: <492A748F.9040308@cn.fujitsu.com>
Message-ID: <adaabbpf47r.fsf@cisco.com>

Looks fine to me.

Acked-by: Roland Dreier <rolandd at cisco.com>


From rdreier at cisco.com  Mon Nov 24 09:02:33 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 24 Nov 2008 09:02:33 -0800
Subject: [ofa-general] Re: [PATCH] IPoIB: Prevent address handles leak.
In-Reply-To: <20081124153700.GA27848@mellanox.co.il> (Vladimir Sokolovsky's
	message of "Mon, 24 Nov 2008 17:37:00 +0200")
References: <20081124153700.GA27848@mellanox.co.il>
Message-ID: <ada63mdf45y.fsf@cisco.com>

 > There may be some extremely slight window for leaking address handles
 > still, since the multicast list is emptied in ipoib_mcast_dev_flush() before
 > it calls ipoib_mcast_free() (which adds address handles to the dead list).
 > 
 > However, this seems to be the best compromise that I can see without
 > a lot of nasty (and possibly buggy) changes.

The impact of this bug seems very low to me, so this is 2.6.29 material
anyway.  I would really rather fix this bug for real rather than just
reducing the window and leaving the bug to cause problems in the future,
so could you try and think of a solution that doesn't leave a window at all?


From yossi.openib at gmail.com  Mon Nov 24 09:13:59 2008
From: yossi.openib at gmail.com (Yossi Etigin)
Date: Mon, 24 Nov 2008 19:13:59 +0200
Subject: Re: [ofa-general] [PATCH] IPoIB: Prevent address handles
	leak.
In-Reply-To: <20081124153700.GA27848@mellanox.co.il>
References: <20081124153700.GA27848@mellanox.co.il>
Message-ID: <492AE0D7.5010508@gmail.com>

I think the problem is that multicast is not really flushed when the
interface is downed. Therefore, a join can start after the device was
brought down, and the fix below will not reap the ah's.

How about reaping all remaining ah's after the multicast device is really
flushed, that is in ipoib_ib_dev_cleanup(), which is called when
ib_ipoib is unloaded? This way you can have only a limited number of
non-reaped dead ah's while the interface is down (until the reap task is
back on), and you can be certain that all of them will be reaped when module
is unloaded.
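
Roughly, the ordering I have in mind looks like the model below. This is
only a sketch with counters instead of lists; dev_stub and the function
names are hypothetical stand-ins for the ipoib structures and for
ipoib_ib_dev_stop() / ipoib_ib_dev_cleanup(), not the real code.

```c
/* Hypothetical stand-in for the per-device state. */
struct dev_stub {
	int dead_ahs;		/* handles on the dead list */
	int pending_mcast;	/* groups not yet flushed */
	int freed;		/* handles actually destroyed */
};

/* Free everything currently on the dead list. */
static void reap(struct dev_stub *d)
{
	d->freed += d->dead_ahs;
	d->dead_ahs = 0;
}

/* Interface down: reap only what is already dead. */
static void dev_stop(struct dev_stub *d)
{
	reap(d);
}

/* Late multicast flush: these handles become dead after dev_stop(). */
static void mcast_flush(struct dev_stub *d)
{
	d->dead_ahs += d->pending_mcast;
	d->pending_mcast = 0;
}

/* Module unload: flush multicast for real, then do a final reap. */
static void dev_cleanup(struct dev_stub *d)
{
	mcast_flush(d);
	reap(d);
}
```

After dev_stop() some handles may still be outstanding, but nothing is
left once dev_cleanup() has run.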

--Yossi

Vladimir Sokolovsky wrote:
> In case of removing ib_ipoib module ipoib_ib_dev_stop() function will be
> called and all address handles (ah) in dead_ahs list will be reaped.
> But some ah will be added to the dead list after ipoib_ib_dev_stop done
> by ipoib_mcast_free. These ahs will not be freed.
> 
> The solution here is to wait till multicast_list will be empty. So, all
> ahs will be added to dead_ahs list.
> 
> Signed-off-by: Vladimir Sokolovsky <vlad at mellanox.co.il>
> ---
> Roland,
> There may be some extremely slight window for leaking address handles
> still, since the multicast list is emptied in ipoib_mcast_dev_flush() before
> it calls ipoib_mcast_free() (which adds address handles to the dead list).
> 
> However, this seems to be the best compromise that I can see without
> a lot of nasty (and possibly buggy) changes.
> 
>  drivers/infiniband/ulp/ipoib/ipoib_ib.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> index 66cafa2..6cc0c59 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> @@ -863,7 +863,7 @@ timeout:
>  
>  	begin = jiffies;
>  
> -	while (!list_empty(&priv->dead_ahs)) {
> +	while (!list_empty(&priv->dead_ahs) || !list_empty(&priv->multicast_list)) {
>  		__ipoib_reap_ah(dev);
>  
>  		if (time_after(jiffies, begin + HZ)) {


From weiny2 at llnl.gov  Mon Nov 24 09:16:05 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 24 Nov 2008 09:16:05 -0800
Subject: [ofa-general] [PATCH 0/3] ibnetdiscover library "libibnetdisc"
In-Reply-To: <f0e08f230811210425y4cadbebdk2d18318074635de3@mail.gmail.com>
References: <20081120163809.26a3c499.weiny2@llnl.gov>
	<f0e08f230811210425y4cadbebdk2d18318074635de3@mail.gmail.com>
Message-ID: <20081124091605.298547e9.weiny2@llnl.gov>

On Fri, 21 Nov 2008 07:25:23 -0500
"Hal Rosenstock" <hal.rosenstock at gmail.com> wrote:

> Hi Ira,
> 
> On Thu, Nov 20, 2008 at 7:38 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> > The following 3 patches implement "libibnetdisc" which provides the
> > functionality of ibnetdiscover in a C library.
> >
> > I mentioned this to Sasha at the last Sonoma conference and posted the bulk of
> > this code to the list a few months ago.  This library is still providing the 85%
> > performance speed up of iblinkinfo.pl on our clusters.
> >
> > This new series is heavily tested and, for our hardware, preserves the
> > functionality of ibnetdiscover.  Since I don't have a Xsigo box to test on I
> > can only verify that it compiles correctly.
> 
> Have you also verified this on QLogic/SilverStorm and Cisco chassis
> switches? They were supported too.

I did not see the code for their support.  I probably missed something.  We
have some QLogic switches on Hyperion now so I will test that.

Thanks for the catch,
Ira

> 
> -- Hal
> 
> > Ira
> >
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >
> 


From celine.bourde at ext.bull.net  Mon Nov 24 09:30:02 2008
From: celine.bourde at ext.bull.net (Celine Bourde)
Date: Mon, 24 Nov 2008 18:30:02 +0100
Subject: [ofa-general] QoS implementation
Message-ID: <492AE49A.7090607@ext.bull.net>

Hi,

I'm testing QoS on opensm.
I work with OFED-1.4-20081123-0600.tgz and opensm-3.2.4_20081122_c732c34.
I've set up a qos-policy file, SL2VL and the VLArbitration table (all in
attachment).
The results still have the wrong values.

I launched opensm -Q /etc/ofa/opensm.conf and used the qperf tool to test
the QoS implementation:

bandwidth test :
----------------
SL1 should have 33% of the bandwidth (1:64), SL2 should have 66% of the
bandwidth (2:128)

cmd server :
qperf -lp 19766 & qperf -lp 19764

cmd client :
qperf 192.168.0.3 -lp 19766 -sl 1 rc_rdma_write_bw > sl1.txt & qperf 192.168.0.3 -lp 19764 -sl 2 rc_rdma_write_bw > sl2.txt

results :
sl1.txt : rc_rdma_write_bw:
    bw  =  1.7 GB/sec

sl2.txt : rc_rdma_write_bw:
    bw  =  1.7 GB/sec

latency test :
--------------
cmd server:
qperf -lp 19766 & qperf -lp 19764

cmd client :
qperf 192.168.0.3 -lp 19766 -sl 1 rc_rdma_write_lat > sl1.txt & qperf 192.168.0.3 -lp 19764 -sl 2 rc_rdma_write_lat > sl2.txt

results :
sl1.txt : rc_rdma_write_lat
        latency  =  12.9 us
sl2.txt : rc_rdma_write_lat:
        latency  =  12.9 us

I've tested PC to PC, without a switch in between.
Each PC has a Mellanox ConnectX card with the following features:
ConnectX® IB QDR, ConnectX IB HCA/TCA IC, dual-port, QDR, PCIe 2.0
PCIe 2.0 5.0GT/s

Firmware has been updated with latest version 2.5.9

[]# ibstat
CA 'mlx4_0'
   CA type: MT26428
   Number of ports: 2
   Firmware version: 2.5.900
   Hardware version: a0
   Node GUID: 0x0002c903000290aa
   System image GUID: 0x0002c903000290ad
   Capability mask: 0x0251086a


When I use smpquery, results are the following :

VLCap:...........................VL0-7
VLHighLimit:.....................0
VLArbHighCap:....................8
VLArbLowCap:.....................8
VLStallCount:....................0

So my configuration (cf. opensm.conf in attachment) doesn't match the
smpquery results.
I've tried restarting openibd on both PCs and restarting opensm (stop
and start), and I've tested "opensm -Q conf_file", but my configuration
is never applied.

Did I miss something, or is it a bug?

Thanks for your help.

Céline Bourde.


MY CONFIGURATION IS THE FOLLOWING :

I've added "options mlx4_core enable_qos=1" in modprobe.conf to turn QoS on.

I've configured the qos-policy file with the following rules:
-----------------------
qos-levels

      qos-level
              name: DEFAULT
              sl: 0
      end-qos-level

      qos-level
              name: MPI
              sl: 1
      end-qos-level

      qos-level
              name: Lustre
              sl: 2
      end-qos-level


end-qos-levels
----------------------


my qos settings in /etc/ofa/opensm.conf
-----------------------
#
# QoS OPTIONS
#
# Enable QoS setup
qos TRUE

# QoS policy file to be used
qos_policy_file /etc/opensm/qos-policy.conf

# QoS default options
qos_max_vls 15
qos_high_limit 0
qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

# QoS CA options
qos_ca_max_vls 15
qos_ca_high_limit 0
qos_ca_vlarb_high 0:0,1:0,2:0
qos_ca_vlarb_low 0:1,1:64,2:128
qos_ca_sl2vl 0,1,2,3,4,6,7,8,9,10,11,12,13,14,7,5

# QoS Switch external ports options
qos_swe_max_vls 15
qos_swe_high_limit 0
qos_swe_vlarb_high 0:0,1:0,2:0
qos_swe_vlarb_low 0:1,1:64,2:128
qos_swe_sl2vl 0,1,2,3,4,6,7,8,9,10,11,12,13,14,7,5
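
As a sanity check on the 33% / 66% expectation stated in the bandwidth test, here is a minimal sketch of the arithmetic. It assumes that only VL1 and VL2 carry traffic, that both always have data queued, and that the low-priority table is served as a plain weighted round-robin. The struct and function names are made up for illustration and are not part of opensm or qperf.

```c
#include <assert.h>
#include <stddef.h>

/* One entry of a VL arbitration table, written "vl:weight" in opensm.conf. */
struct vlarb_entry {
    unsigned vl;
    unsigned weight;
};

/*
 * Expected share of link bandwidth for `vl`, in percent, under the
 * simplifying assumption that every VL flagged in `busy_mask` always has
 * packets queued and all other VLs are idle.  Returns 0.0 if `vl` is idle
 * or if no busy VL has a non-zero weight.
 */
double vlarb_expected_share(const struct vlarb_entry *tbl, size_t n,
                            unsigned busy_mask, unsigned vl)
{
    unsigned total = 0, mine = 0;

    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        assert(tbl[i].vl < 16);              /* data VLs are numbered 0-14 */
        if (!(busy_mask & (1u << tbl[i].vl)))
            continue;                        /* idle VL: takes no bandwidth */
        total += tbl[i].weight;
        if (tbl[i].vl == vl)
            mine += tbl[i].weight;
    }
    return total ? 100.0 * mine / total : 0.0;
}
```

With the qos_ca_vlarb_low entries 0:1,1:64,2:128 and only VL1 and VL2 busy, this gives 64/192 (about 33%) for VL1 and 128/192 (about 67%) for VL2. The weights can only show up in a measurement if the two streams actually contend for the same link.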


From weiny2 at llnl.gov  Mon Nov 24 09:42:43 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 24 Nov 2008 09:42:43 -0800
Subject: [ofa-general] Re: [PATCH 0/3] ibnetdiscover library "libibnetdisc"
In-Reply-To: <20081123182741.GS21967@sashak.voltaire.com>
References: <20081120163809.26a3c499.weiny2@llnl.gov>
	<20081123182741.GS21967@sashak.voltaire.com>
Message-ID: <20081124094243.4dbcff51.weiny2@llnl.gov>

On Sun, 23 Nov 2008 20:27:41 +0200
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi Ira,
> 
> On 16:38 Thu 20 Nov     , Ira Weiny wrote:
> > The following 3 patches implement "libibnetdisc" which provides the
> > functionality of ibnetdiscover in a C library.
> > 
> > I mentioned this to Sasha at the last Sonoma conference and posted the bulk of
> > this code to the list a few months ago.  This library is still providing the 85%
> > performance speed up of iblinkinfo.pl on our clusters.
> 
> This is great!
> 
> Don't you think this library should be part of infiniband-diags
> rather than a separate package/management sub-project? Personally I would
> prefer to have this as part of infiniband-diags.

No, I would like to see it be a stand alone library.  Currently
infiniband-diags does not provide any library functionality and simply depends
on the libraries provided by the rest of the management tree.  Don't you think
this is a good model to follow?

Ira


From weiny2 at llnl.gov  Mon Nov 24 09:55:07 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 24 Nov 2008 09:55:07 -0800
Subject: [ofa-general] Re: [PATCH 3/3] Convert ibnetdiscover to use new
	ibnetdisc library.
In-Reply-To: <20081123183517.GT21967@sashak.voltaire.com>
References: <20081120163815.5cd110fb.weiny2@llnl.gov>
	<20081123183517.GT21967@sashak.voltaire.com>
Message-ID: <20081124095507.785be95a.weiny2@llnl.gov>

On Sun, 23 Nov 2008 20:35:17 +0200
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi Ira,
> 
> On 16:38 Thu 20 Nov     , Ira Weiny wrote:
> > From e2b8bac5d651c2278719d511dee2ab2e8ad05706 Mon Sep 17 00:00:00 2001
> > From: Ira Weiny <weiny2 at llnl.gov>
> > Date: Thu, 20 Nov 2008 09:29:57 -0800
> > Subject: [PATCH] Convert ibnetdiscover to use new ibnetdisc library.
> > 
> >    Removed -e and -v since they were somewhat redundant with the -d option.
> 
> I think it would be better to preserve these options for backward
> compatibility. At least '-v' is used in dump_ftts.sh. It can be used in
> other scripts...
> 

Ah, ok...  Actually dump_[lm]fts.sh use output which is provided by the "-s"
option.  <sigh> I did not think any of the scripts would use "debugging" output
for their processing...

More testing is obviously needed.

Thanks,
Ira


From yosefe at Voltaire.COM  Mon Nov 24 09:58:22 2008
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Mon, 24 Nov 2008 19:58:22 +0200
Subject: [ofa-general] [PATCH] ipoib: do not join broadcast group if
	interface is brought down
In-Reply-To: <49246EB7.3070607@Voltaire.COM>
References: <49246EB7.3070607@Voltaire.COM>
Message-ID: <492AEB3E.4030202@Voltaire.COM>

Roland,
Can you please comment on this?

Yossi Etigin wrote:
> Because ipoib_workqueue is not flushed when the ipoib interface is brought
> down, ipoib_mcast_join() may trigger a join to the broadcast group after
> priv->broadcast was set to NULL (during cleanup). This will cause ipoib to
> be joined to the broadcast group when the interface is down.
> As a side effect, this breaks the optimization of setting the qkey only
> when joining the broadcast group.
> 
> Signed-off-by: Yossi Etigin <yosefe at voltaire.com>
> 
> -- 
> 
> Fix bugzilla 1370.
> 
> Index: b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> ===================================================================
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c    2008-11-19 21:33:54.000000000 +0200
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c    2008-11-19 21:40:12.000000000 +0200
> @@ -565,7 +565,8 @@ void ipoib_mcast_join_task(struct work_s
>             ipoib_warn(priv, "ib_query_port failed\n");
>     }
> 
> -    if (!priv->broadcast) {
> +    rtnl_lock();
> +    if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags) && !priv->broadcast) {
>         struct ipoib_mcast *broadcast;
> 
>         broadcast = ipoib_mcast_alloc(dev, 1);
> @@ -576,6 +577,7 @@ void ipoib_mcast_join_task(struct work_s
>                 queue_delayed_work(ipoib_workqueue,
>                            &priv->mcast_join_task, HZ);
>             mutex_unlock(&mcast_mutex);
> +            rtnl_unlock();
>             return;
>         }
> 
> @@ -587,6 +589,7 @@ void ipoib_mcast_join_task(struct work_s
>         __ipoib_mcast_add(dev, priv->broadcast);
>         spin_unlock_irq(&priv->lock);
>     }
> +    rtnl_unlock();
> 
>     if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
>         if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))

-- 
--Yossi


From sashak at voltaire.com  Mon Nov 24 10:02:18 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 24 Nov 2008 20:02:18 +0200
Subject: [ofa-general] QoS implementation
In-Reply-To: <492AE49A.7090607@ext.bull.net>
References: <492AE49A.7090607@ext.bull.net>
Message-ID: <20081124180218.GR6183@sashak.voltaire.com>

Hi,

On 18:30 Mon 24 Nov     , Celine Bourde wrote:
>
> I'm testing QoS on opensm.
> I work with OFED-1.4-20081123-0600.tgz and opensm-3.2.4_20081122_c732c34.
> I've set up a qos-policy file, SL2VL and the VLArbitration table (all in
> attachment).
> The results still have the wrong values.
>
> I launched opensm -Q /etc/ofa/opensm.conf

Maybe you need: opensm -Q -F /etc/ofa/opensm.conf ?

Sasha


From cap at nsc.liu.se  Mon Nov 24 10:16:55 2008
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Mon, 24 Nov 2008 19:16:55 +0100
Subject: [ofa-general] infiniband problem, no NICs
In-Reply-To: <492904C8.7000402@voltaire.com>
References: <4925BD78.4030003@tu-berlin.de> <492904C8.7000402@voltaire.com>
Message-ID: <200811241917.00503.cap@nsc.liu.se>

On Sunday 23 November 2008, Or Gerlitz wrote:
> Michael Oevermann wrote:
> > However, when I directly start a mpi job (without using a scheduler) via:
> > /usr/mpi/gcc4/openmpi-1.2.2-1/bin/mpirun -np 4 -hostfile
> > /home/sysgen/infiniband-mpi-test/machine
> > /usr/mpi/gcc4/openmpi-1.2.2-1/tests/IMB-2.3/IMB-MPI1
> >
> >
> > I get the error message:
> >
> > [0,1,0]: uDAPL on host n01 was unable to find any NICs...
...
> The BTL you are working with uses a library named udapl, and this library
> relies on the existence of IPoIB (IP over InfiniBand) NICs (e.g. ib0, ib1).
> Assuming these NICs are not configured on your system, you can either
> configure them (modprobe ib_ipoib / ifconfig ib0 x.y.z.w) or use a verbs
> (native IB access layer) BTL, which does not rely on an operational IPoIB.

Using verbs is the way to go. OpenMPI, afaik, does not recommend the udapl
btl. I would recommend checking for the btl "openib", which is the verbs btl.
If it does not exist, rebuild OpenMPI (you will need libibverbs-devel).

/Peter

> Or.

From sashak at voltaire.com  Mon Nov 24 10:28:06 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 24 Nov 2008 20:28:06 +0200
Subject: [ofa-general] [PATCH] opensm/man/opensm.8: add some missing stuff
Message-ID: <20081124182806.GS6183@sashak.voltaire.com>


Add some missing options, add arguments where needed.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 opensm/man/opensm.8.in |   90 ++++++++++++++++++++++++++++--------------------
 1 files changed, 53 insertions(+), 37 deletions(-)

diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
index b64daba..5c08de9 100644
--- a/opensm/man/opensm.8.in
+++ b/opensm/man/opensm.8.in
@@ -6,28 +6,44 @@ opensm \- InfiniBand subnet manager and administration (SM/SA)
 .SH SYNOPSIS
 .B opensm
 [\-\-version]]
-[\-F | \-\-config <file_name>] [\-c(reate-config) <file_name>]
-[\-g(uid) <GUID in hex>] [\-l(mc) <LMC>]
-[\-p(riority) <PRIORITY>] [\-smkey <SM_Key>] [\-r(eassign_lids)]
+[\-F | \-\-config <file_name>]
+[\-c(reate-config) <file_name>]
+[\-g(uid) <GUID in hex>]
+[\-l(mc) <LMC>]
+[\-p(riority) <PRIORITY>]
+[\-smkey <SM_Key>]
+[\-r(eassign_lids)]
 [\-R <engine name(s)> | \-\-routing_engine <engine name(s)>]
-[\-A | \-\-ucast_cache] [\-z | \-\-connect_roots]
+[\-A | \-\-ucast_cache]
+[\-z | \-\-connect_roots]
 [\-M <file name> | \-\-lid_matrix_file <file name>]
 [\-U <file name> | \-\-lfts_file <file name>]
-[\-S | \-\-sadb_file <file name>] [\-a | \-\-root_guid_file <path to file>]
+[\-S | \-\-sadb_file <file name>]
+[\-a | \-\-root_guid_file <path to file>]
 [\-u | \-\-cn_guid_file <path to file>]
 [\-X | \-\-guid_routing_order_file <path to file>]
 [\-m | \-\-ids_guid_file <path to file>]
-[\-o(nce)] [\-s(weep) <interval>]
-[\-t(imeout) <milliseconds>] [\-maxsmps <number>]
-[\-console [off | local | socket | loopback]] [\-console-port <port>]
-[\-i(gnore-guids) <equalize-ignore-guids-file>] [\-f | \-\-log_file]
-[\-L | \-\-log_limit <size in MB>] [\-e(rase_log_file)] [\-P(config)]
+[\-o(nce)]
+[\-s(weep) <interval>]
+[\-t(imeout) <milliseconds>]
+[\-maxsmps <number>]
+[\-console [off | local | socket | loopback]]
+[\-console-port <port>]
+[\-i(gnore-guids) <equalize-ignore-guids-file>]
+[\-f <log file path> | \-\-log_file <log file path> ]
+[\-L | \-\-log_limit <size in MB>] [\-e(rase_log_file)]
+[\-P(config) <partition config file> ]
+[\-N | \-\-no_part_enforce]
 [\-Q | \-\-qos [\-Y | \-\-qos_policy_file <file name>]]
-[\-N | \-\-no_part_enforce] [\-y | \-\-stay_on_fatal]
-[\-B | \-\-daemon] [\-I | \-\-inactive]
-[\-\-perfmgr] [\-\-perfmgr_sweep_time_s <seconds>]
+[\-y | \-\-stay_on_fatal]
+[\-B | \-\-daemon]
+[\-I | \-\-inactive]
+[\-\-perfmgr]
+[\-\-perfmgr_sweep_time_s <seconds>]
 [\-\-prefix_routes_file <path>]
-[\-v(erbose)] [\-V] [\-D <flags>] [\-d(ebug) <number>] [\-h(elp)] [\-?]
+[\-\-consolidate_ipv6_snm_req]
+[\-v(erbose)] [\-V] [\-D <flags>] [\-d(ebug) <number>]
+[\-h(elp)] [\-?]
 
 .SH DESCRIPTION
 .PP
@@ -68,15 +84,15 @@ setup the subnet correctly.
 \fB\-\-version\fR
 Prints OpenSM version and exits.
 .TP
-\fB\-F\fR, \fB\-\-config\fR
+\fB\-F\fR, \fB\-\-config\fR <config file>
 The name of the OpenSM config file. When not specified
 \fB\% @OPENSM_CONFIG_DIR@/@OPENSM_CONFIG_FILE@\fP will be used (if exists).
 .TP
-\fB\-c\fR, \fB\-\-create-config\fR
+\fB\-c\fR, \fB\-\-create-config\fR <file name>
 OpenSM will dump its configuration to the specified file and exit.
 This is a way to generate OpenSM configuration file template.
 .TP
-\fB\-g\fR, \fB\-\-guid\fR
+\fB\-g\fR, \fB\-\-guid\fR <GUID in hex>
 This option specifies the local port GUID value
 with which OpenSM should bind.  OpenSM may be
 bound to 1 port at a time.
@@ -84,7 +100,7 @@ If GUID given is 0, OpenSM displays a list
 of possible port GUIDs and waits for user input.
 Without -g, OpenSM tries to use the default port.
 .TP
-\fB\-l\fR, \fB\-\-lmc\fR
+\fB\-l\fR, \fB\-\-lmc\fR <LMC value>
 This option specifies the subnet's LMC value.
 The number of LIDs assigned to each port is 2^LMC.
 The LMC value must be in the range 0-7.
@@ -95,13 +111,13 @@ ports, i.e. multiple interconnects between switches.
 Without -l, OpenSM defaults to LMC = 0, which allows
 one path between any two ports.
 .TP
-\fB\-p\fR, \fB\-\-priority\fR
+\fB\-p\fR, \fB\-\-priority\fR <Priority value>
 This option specifies the SM\'s PRIORITY.
 This will effect the handover cases, where master
 is chosen by priority and GUID.  Range goes from 0
 (default and lowest priority) to 15 (highest).
 .TP
-\fB\-smkey\fR
+\fB\-smkey\fR <SM_Key value>
 This option specifies the SM\'s SM_Key (64 bits).
 This will effect SM authentication.
 Note that OpenSM version 3.2.1 and below used the default value '1'
@@ -115,7 +131,7 @@ may disrupt subnet traffic.
 Without -r, OpenSM attempts to preserve existing
 LID assignments resolving multiple use of same LID.
 .TP
-\fB\-R\fR, \fB\-\-routing_engine\fR
+\fB\-R\fR, \fB\-\-routing_engine\fR <Routing engine names>
 This option chooses routing engine(s) to use instead of Min Hop
 algorithm (default).  Multiple routing engines can be specified
 separated by commas so that specific ordering of routing algorithms
@@ -140,33 +156,33 @@ only) to make connectivity between root switches and in
 this way to be fully IBA complaint. In many cases this can
 violate "pure" deadlock free algorithm, so use it carefully.
 .TP
-\fB\-M\fR, \fB\-\-lid_matrix_file\fR
+\fB\-M\fR, \fB\-\-lid_matrix_file\fR <file name>
 This option specifies the name of the lid matrix dump file
 from where switch lid matrices (min hops tables will be
 loaded.
 .TP
-\fB\-U\fR, \fB\-\-lfts_file\fR
+\fB\-U\fR, \fB\-\-lfts_file\fR <file name>
 This option specifies the name of the LFTs file
 from where switch forwarding tables will be loaded.
 .TP
-\fB\-S\fR, \fB\-\-sadb_file\fR
+\fB\-S\fR, \fB\-\-sadb_file\fR <file name>
 This option specifies the name of the SA DB dump file
 from where SA database will be loaded.
 .TP
-\fB\-a\fR, \fB\-\-root_guid_file\fR
+\fB\-a\fR, \fB\-\-root_guid_file\fR <file name>
 Set the root nodes for the Up/Down or Fat-Tree routing
 algorithm to the guids provided in the given file (one to a line).
 .TP
-\fB\-u\fR, \fB\-\-cn_guid_file\fR
+\fB\-u\fR, \fB\-\-cn_guid_file\fR <file name>
 Set the compute nodes for the Fat-Tree routing algorithm
 to the guids provided in the given file (one to a line).
 .TP
-\fB\-m\fR, \fB\-\-ids_guid_file\fR
+\fB\-m\fR, \fB\-\-ids_guid_file\fR <file name>
 Name of the map file with set of the IDs which will be used
 by Up/Down routing algorithm instead of node GUIDs
 (format: <guid> <id> per line).
 .TP
-\fB\-X\fR, \fB\-\-guid_routing_order_file\fR
+\fB\-X\fR, \fB\-\-guid_routing_order_file\fR <file name>
 Set the order port guids will be routed for the MinHop
 and Up/Down routing algorithms to the guids provided in the
 given file (one to a line).
@@ -175,20 +191,20 @@ given file (one to a line).
 This option causes OpenSM to configure the subnet
 once, then exit.  Ports remain in the ACTIVE state.
 .TP
-\fB\-s\fR, \fB\-\-sweep\fR
+\fB\-s\fR, \fB\-\-sweep\fR <interval value>
 This option specifies the number of seconds between
 subnet sweeps.  Specifying -s 0 disables sweeping.
 Without -s, OpenSM defaults to a sweep interval of
 10 seconds.
 .TP
-\fB\-t\fR, \fB\-\-timeout\fR
+\fB\-t\fR, \fB\-\-timeout\fR <value>
 This option specifies the time in milliseconds
 used for transaction timeouts.
 Specifying -t 0 disables timeouts.
 Without -t, OpenSM defaults to a timeout value of
 200 milliseconds.
 .TP
-\fB\-maxsmps\fR
+\fB\-maxsmps\fR <number>
 This option specifies the number of VL15 SMP MADs
 allowed on the wire at any one time.
 Specifying -maxsmps 0 allows unlimited outstanding
@@ -217,7 +233,7 @@ when it comes out of Standby state, if such file exists
 under OSM_CACHE_DIR, and is valid.
 By default, this is FALSE.
 .TP
-\fB\-f\fR, \fB\-\-log_file\fR
+\fB\-f\fR, \fB\-\-log_file\fR <file name>
 This option defines the log to be the given file.
 By default, the log goes to /var/log/opensm.log.
 For the log to go to standard output use -f stdout.
@@ -232,11 +248,11 @@ This option will cause deletion of the log file
 (if it previously exists). By default, the log file
 is accumulative.
 .TP
-\fB\-P\fR, \fB\-\-Pconfig\fR
+\fB\-P\fR, \fB\-\-Pconfig\fR <partition config file>
 This option defines the optional partition configuration file.
 The default name is \fB\%@OPENSM_CONFIG_DIR@/@PARTITION_CONFIG_FILE@\fP.
 .TP
-.BI --prefix_routes_file= path
+\fB\-\-prefix_routes_file\fR <file name>
 Prefix routes control how the SA responds to path record queries for
 off-subnet DGIDs.  By default, the SA fails such queries. The
 .B PREFIX ROUTES
@@ -246,7 +262,7 @@ The default path is \fB\%@OPENSM_CONFIG_DIR@/prefix\-routes.conf\fP.
 \fB\-Q\fR, \fB\-\-qos\fR
 This option enables QoS setup. It is disabled by default.
 .TP
-\fB\-Y\fR, \fB\-\-qos_policy_file\fR
+\fB\-Y\fR, \fB\-\-qos_policy_file\fR <file name>
 This option defines the optional QoS policy file. The default
 name is \fB\%@OPENSM_CONFIG_DIR@/@QOS_POLICY_FILE@\fP.
 .TP
@@ -295,7 +311,7 @@ The -V option is equivalent to \'-D 0xFF -d 2\'.
 See the -D option for more information about
 log verbosity.
 .TP
-\fB\-D\fR
+\fB\-D\fR <value>
 This option sets the log verbosity level.
 A flags field must follow the -D option.
 A bit set/clear in the flags enables/disables a
@@ -318,7 +334,7 @@ Specifying -D 0xFF enables all messages (see -V).
 High verbosity levels may require increasing
 the transaction timeout with the -t option.
 .TP
-\fB\-d\fR, \fB\-\-debug\fR
+\fB\-d\fR, \fB\-\-debug\fR <value>
 This option specifies a debug option.
 These options are not normally needed.
 The number following -d selects the debug
-- 
1.6.0.3.517.g759a


From meier3 at llnl.gov  Mon Nov 24 10:34:22 2008
From: meier3 at llnl.gov (Timothy A. Meier)
Date: Mon, 24 Nov 2008 10:34:22 -0800
Subject: [ofa-general] Re: [PATCH] Opensm: main exit codes
In-Reply-To: <20081123185836.GU21967@sashak.voltaire.com>
References: <4923678D.3080701@llnl.gov>
	<20081123185836.GU21967@sashak.voltaire.com>
Message-ID: <492AF3AE.3060605@llnl.gov>

Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Tim,
> 
> On 17:10 Tue 18 Nov     , Timothy A. Meier wrote:
>>   I thought it would be useful to define a set of exit codes for opensm.  A quick examination of main.c
>> showed a few different ways to terminate.  How about this patch?  Obviously this doesn't catch every
>> possible exit scenario, but it's a start that can be built upon.
> 
> Personally I read 'exit(0)' faster than 'exit(OSM_EXIT_TYPE_NORMAL)',
> but maybe it is just me :).

Me too :^)  Not much confusion over a return code of 0.

The audience for this change wouldn't be the people writing the software, but admins, scripts, and tools that
start/stop/monitor opensm.  At least that is our use case.

> 
> Maybe error codes could be formalized, but I'm not sure that it would be
> beneficial without any practical uses (and a clear understanding of the
> requirements). In the end we could find ourselves in the middle of a total
> mess similar to how OSM_LOG_* is used today.
> 
> Sasha
>

So the uses/requirements would be to formalize how opensm handles the non-ideal termination condition,
for the purpose of providing quick, convenient, and consistent information for other system level tools
that are responsible for starting/stopping/monitoring/reporting opensm.

I can't think of any other reasons or needs.
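
To make the use case concrete, here is a minimal sketch of how a supervising tool could act on such codes. Only OSM_EXIT_TYPE_NORMAL is a name that appears in this thread; the other enum members, their values, and osm_action_for_exit() are hypothetical and exist only to illustrate the idea.

```c
#include <assert.h>

/* OSM_EXIT_TYPE_NORMAL is the name quoted in this thread; the remaining
 * members and all numeric values are hypothetical. */
enum osm_exit_type {
    OSM_EXIT_TYPE_NORMAL = 0,   /* clean shutdown */
    OSM_EXIT_TYPE_CONFIG = 1,   /* hypothetical: bad option or config file */
    OSM_EXIT_TYPE_BIND   = 2,   /* hypothetical: could not bind to a port */
    OSM_EXIT_TYPE_FATAL  = 3    /* hypothetical: unrecoverable runtime error */
};

/* What a start/stop/monitor tool might do after opensm exits. */
enum osm_action { OSM_ACT_NONE, OSM_ACT_RESTART, OSM_ACT_ALERT };

enum osm_action osm_action_for_exit(int code)
{
    assert(code >= 0 && code <= 255);   /* process exit statuses fit in 8 bits */

    switch (code) {
    case OSM_EXIT_TYPE_NORMAL:
        return OSM_ACT_NONE;            /* nothing to do */
    case OSM_EXIT_TYPE_BIND:
    case OSM_EXIT_TYPE_FATAL:
        return OSM_ACT_RESTART;         /* possibly transient: try again */
    case OSM_EXIT_TYPE_CONFIG:
        return OSM_ACT_ALERT;           /* a restart would fail the same way */
    default:
        return OSM_ACT_ALERT;           /* unknown code: escalate to an admin */
    }
}
```

The point of the sketch is that a script can branch on the exit status alone, without parsing the opensm log.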


-- 
Timothy A. Meier
Computer Scientist
ICCD/High Performance Computing
meier3 at llnl.gov


From sashak at voltaire.com  Mon Nov 24 11:02:51 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 24 Nov 2008 21:02:51 +0200
Subject: [ofa-general] Re: [PATCH] Opensm: main exit codes
In-Reply-To: <492AF3AE.3060605@llnl.gov>
References: <4923678D.3080701@llnl.gov>
	<20081123185836.GU21967@sashak.voltaire.com>
	<492AF3AE.3060605@llnl.gov>
Message-ID: <20081124190251.GT6183@sashak.voltaire.com>

On 10:34 Mon 24 Nov     , Timothy A. Meier wrote:
> Hi Sasha,
> 
> Sasha Khapyorsky wrote:
> > Hi Tim,
> > 
> > On 17:10 Tue 18 Nov     , Timothy A. Meier wrote:
> >>   I thought it would be useful to define a set of exit codes for opensm.  A quick examination of main.c
> >> showed a few different ways to terminate.  How about this patch?  Obviously this doesn't catch every
> >> possible exit scenario, but it's a start that can be built upon.
> > 
> > Personally I read 'exit(0)' faster than 'exit(OSM_EXIT_TYPE_NORMAL)',
> > but maybe it is just me :).
> 
> Me too :^)  Not much confusion over a return code of 0.
> 
> The audience for this change wouldn't be the people writing the software,

Somehow we need to care about ourselves too :)

> but admins, scripts, and tools that
> start/stop/monitor opensm.  At least that is our use case.
> 
> > 
> > Maybe error codes could be formalized, but I'm not sure that it would be
> > beneficial without any practical uses (and a clear understanding of the
> > requirements). In the end we could find ourselves in the middle of a total
> > mess similar to how OSM_LOG_* is used today.
> > 
> > Sasha
> >
> 
> So the uses/requirements would be to formalize how opensm handles the non-ideal termination condition,
> for the purpose of providing quick, convenient, and consistent information for other system level tools
> that are responsible for starting/stopping/monitoring/reporting opensm.

And are there any such tools? Or any *real* use?

Sasha


From halr at obsidianresearch.com  Mon Nov 24 11:06:59 2008
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Mon, 24 Nov 2008 12:06:59 -0700
Subject: [ofa-general] [PATCH][TRIVIAL] opensm.8.in: Update email address
Message-ID: <492AFB53.7010007@obsidianresearch.com>

Sasha,

Attached patch is a trivial update of email address in opensm man page.

-- Hal
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch-osm-man2
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081124/aee7bc1b/attachment.ksh>

From hal.rosenstock at gmail.com  Mon Nov 24 11:08:18 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 24 Nov 2008 14:08:18 -0500
Subject: [ofa-general] [PATCH 0/3] ibnetdiscover library "libibnetdisc"
In-Reply-To: <20081124091605.298547e9.weiny2@llnl.gov>
References: <20081120163809.26a3c499.weiny2@llnl.gov>
	<f0e08f230811210425y4cadbebdk2d18318074635de3@mail.gmail.com>
	<20081124091605.298547e9.weiny2@llnl.gov>
Message-ID: <f0e08f230811241108n6f9900cw353a7c8710a553e6@mail.gmail.com>

On Mon, Nov 24, 2008 at 12:16 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
> On Fri, 21 Nov 2008 07:25:23 -0500
> "Hal Rosenstock" <hal.rosenstock at gmail.com> wrote:
>
>> Hi Ira,
>>
>> On Thu, Nov 20, 2008 at 7:38 PM, Ira Weiny <weiny2 at llnl.gov> wrote:
>> > The following 3 patches implement "libibnetdisc" which provides the
>> > functionality of ibnetdiscover in a C library.
>> >
>> > I mentioned this to Sasha at the last Sonoma conference and posted the bulk of
>> > this code to the list a few months ago.  This library is still providing the 85%
>> > performance speed up of iblinkinfo.pl on our clusters.
>> >
>> > This new series is heavily tested and, for our hardware, preserves the
>> > functionality of ibnetdiscover.  Since I don't have a Xsigo box to test on I
>> > can only verify that it compiles correctly.
>>
>> Have you also verified this on QLogic/SilverStorm and Cisco chassis
>> switches? They were supported too.
>
> I did not see the code for their support.  I probably missed something.  We
> have some QLogic switches on Hyperion now so I will test that.

Just to be sure: it's the grouping option which should be tested.

-- Hal

> Thanks for the catch,
> Ira
>
>>
>> -- Hal
>>
>> > Ira
>> >
>>
>


From sashak at voltaire.com  Mon Nov 24 11:10:50 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 24 Nov 2008 21:10:50 +0200
Subject: [ofa-general] Re: [PATCH 0/3] ibnetdiscover library "libibnetdisc"
In-Reply-To: <20081124094243.4dbcff51.weiny2@llnl.gov>
References: <20081120163809.26a3c499.weiny2@llnl.gov>
	<20081123182741.GS21967@sashak.voltaire.com>
	<20081124094243.4dbcff51.weiny2@llnl.gov>
Message-ID: <20081124191050.GU6183@sashak.voltaire.com>

On 09:42 Mon 24 Nov     , Ira Weiny wrote:
> > 
> > Don't you think this library should be part of infiniband-diags
> > rather than a separate package/management sub-project? Personally I would
> > prefer to have this as part of infiniband-diags.
> 
> No, I would like to see it be a stand alone library.  Currently
> infiniband-diags does not provide any library functionality and simply depends
> on the libraries provided by the rest of the management tree.  Don't you think
> this is a good model to follow?

Why must it be so? infiniband-diags will be useless without this library.

And I would really hate to handle one more package (actually not just one
- libibnetdisc, libibnetdisc-devel, libibnetdisc-static, etc.). I wanted
to remove libibcommon...

Sasha


From sashak at voltaire.com  Mon Nov 24 11:13:04 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 24 Nov 2008 21:13:04 +0200
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm.8.in: Update email address
In-Reply-To: <492AFB53.7010007@obsidianresearch.com>
References: <492AFB53.7010007@obsidianresearch.com>
Message-ID: <20081124191304.GV6183@sashak.voltaire.com>

On 12:06 Mon 24 Nov     , Hal Rosenstock wrote:
> Sasha,
>
> Attached patch is a trivial update of email address in opensm man page.
>
> -- Hal

> opensm.8.in: Update email address
> 
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>

Applied. Thanks.

Sasha


From weiny2 at llnl.gov  Mon Nov 24 11:30:05 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 24 Nov 2008 11:30:05 -0800
Subject: [ofa-general] Re: [PATCH 0/3] ibnetdiscover library "libibnetdisc"
In-Reply-To: <20081124191050.GU6183@sashak.voltaire.com>
References: <20081120163809.26a3c499.weiny2@llnl.gov>
	<20081123182741.GS21967@sashak.voltaire.com>
	<20081124094243.4dbcff51.weiny2@llnl.gov>
	<20081124191050.GU6183@sashak.voltaire.com>
Message-ID: <20081124113005.4261cfd1.weiny2@llnl.gov>

On Mon, 24 Nov 2008 21:10:50 +0200
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 09:42 Mon 24 Nov     , Ira Weiny wrote:
> > > 
> > > Don't you think this library should be part of infiniband-diags
> > > rather than a separate package/management sub-project? Personally I would
> > > prefer to have this as part of infiniband-diags.
> > 
> > No, I would like to see it be a stand alone library.  Currently
> > infiniband-diags does not provide any library functionality and simply depends
> > on the libraries provided by the rest of the management tree.  Don't you think
> > this is a good model to follow?
> 
> Why must it be so? infiniband-diags will be useless without this library.
> 
> And I would really hate to handle one more package (actually not just one
> - libibnetdisc, libibnetdisc-devel, libibnetdisc-static, etc.). I wanted
> to remove libibcommon...
> 

I think the argument against ibcommon is that it does not provide enough
additional functionality to warrant an entire new library.  On the other hand
infiniband-diags depends on many libraries:

   AC_CHECK_LIB(ibcommon, ...  <== delete this...
   
And you still have the following...

   AC_CHECK_LIB(ibumad, ...
   AC_CHECK_LIB(ibmad, ...
   AC_CHECK_LIB(osmcomp, ...
   AC_CHECK_LIB(osmvendor, ...
   AC_CHECK_LIB(opensm, ...

I don't think it is inappropriate to have utilities which are dependent on
libraries; it is done all the time.

Ira


From sashak at voltaire.com  Mon Nov 24 12:01:51 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 24 Nov 2008 22:01:51 +0200
Subject: [ofa-general] Re: [PATCH 0/3] ibnetdiscover library "libibnetdisc"
In-Reply-To: <20081124113005.4261cfd1.weiny2@llnl.gov>
References: <20081120163809.26a3c499.weiny2@llnl.gov>
	<20081123182741.GS21967@sashak.voltaire.com>
	<20081124094243.4dbcff51.weiny2@llnl.gov>
	<20081124191050.GU6183@sashak.voltaire.com>
	<20081124113005.4261cfd1.weiny2@llnl.gov>
Message-ID: <20081124200151.GX6183@sashak.voltaire.com>

On 11:30 Mon 24 Nov     , Ira Weiny wrote:
> On Mon, 24 Nov 2008 21:10:50 +0200
> Sasha Khapyorsky <sashak at voltaire.com> wrote:
> 
> > On 09:42 Mon 24 Nov     , Ira Weiny wrote:
> > > > 
> > > > Do not you think this library should be rather part of infiniband-diags,
> > > > rather than separate package/management sub-project? Personally I would
> > > > prefer to have this as part of infiniband-diags.
> > > 
> > > No, I would like to see it be a stand alone library.  Currently
> > > infiniband-diags does not provide any library functionality and simply depends
> > > on the libraries provided by the rest of the management tree.  Don't you think
> > > this is a good model to follow?
> > 
> > Why it must be so - infiniband-diags will be useless without this library.
> > 
> > And I would really hate to handle one more package (actually not just one
> > - libibnetdisc, libibnetdisc-devel, libibnetdisc-static, etc.). I wanted
> > to remove libibcommon...
> > 
> 
> I think the argument against ibcommon is that it does not provide enough
> additional functionality to warrant an entire new library.

It is probably the same case with libibnetdisc (at least now).

> On the other hand
> infiniband-diags depends on many libraries:
> 
>    AC_CHECK_LIB(ibcommon, ...  <== delete this...
>    
> And you still have the following...
> 
>    AC_CHECK_LIB(ibumad, ...
>    AC_CHECK_LIB(ibmad, ...
>    AC_CHECK_LIB(osmcomp, ...
>    AC_CHECK_LIB(osmvendor, ...
>    AC_CHECK_LIB(opensm, ...
> 
> I don't think it is inappropriate to have utilities which are dependent on
> libraries; it is done all the time.

OTOH it doesn't mean that any new shared code must be done as separate
subproject.

The code is new. I think it is better to integrate it in smaller
iterations: start with the code and functionality, and do not bother
with packaging, dependencies, etc. yet. If there is a reason to make
a separate library later we can do it, and by then we will already have
stable code.

Sasha


From tziporet at mellanox.co.il  Mon Nov 24 13:15:32 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 24 Nov 2008 23:15:32 +0200
Subject: [ofa-general] OFED Nov 24, 2008 meeting minutes
In-Reply-To: <458BC6B0F287034F92FE78908BD01CE84EF35EF0@mtlexch01.mtl.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD0FE7A8@mtlexch01.mtl.com>

OFED Nov 24, 2008 meeting minutes
===========================

Meeting Summary:
==============
*	OFED 1.4 release: RC6 on Nov 28, GA on Dec 8
*	UNH will test RC6 as part of Logo program (will start with RC5
this week)
*	OFED documentation and training - Jim Ryan will raise in next
XWG meeting


Details:
=======

1. Bugs status review:
*	1370 (blo, vlad at mellanox.co.il): Ping over IPoIB I/F fails after
ifconfig down and up - there is a fix but it's not integrated - Vlad to
take it
*	1242 (cri, yannick.cote at qlogic.com): kernel panic while running
mpi2007 against ofed1.4 -- ib_... - should be delayed for 1.4.1 - move to
normal
*	1410 (cri, vlad at mellanox.co.il): Memory leak (address handler not
reped) in IPoIB - we have a fix, need to decide
*	1289 (maj, jackm at mellanox.co.il): Ib and ipoib doesn't respond while
running multiple tests ... - should be fixed - ask Mellanox QA to check
*	1407 (maj, monis at voltaire.com): Active-Backup failure when disabling
an active slave inte... - fixed with new bonding package
*	1377 (maj, vu at mellanox.com): Deadlock occurred during HA test - in
progress
*	1380 (maj, vu at mellanox.com): Cannot unload ib_srpt module on SRP
target - moved to normal (involves scst's mid-layer module, over which we
don't have *a lot of* control)
*	1395 (maj, vu at mellanox.com): kernel panic during SRP HA test - in
progress
*	1384 (maj, eli at mellanox.co.il): netperf latency small messages
increase 5% - not reproduced on SLES10 SP2 with FW 2.5.0
*	1385 (maj, eli at mellanox.co.il): ofed 1.4 - netperf udp BW small
messages decrease ~8%
*	1386 (maj, eli at mellanox.co.il): ofed 1.4 - iperf tcp connected mode
BW large messages dec...
*	mvapich2 - Going to update package + RN

2. Decided on release date:
*	RC6 - Nov 28
*	GA - Dec 8 after UNH testing for the logo program

3. Decided on next meetings dates:
*	Dec 1, Dec 8 and Dec 15 (only if the release is not done)

4. Logo program and OFED release:
*	We wish that UNH will test it on each RC, as each vendor does,
so we will not find surprises in the last RC.
*	Rupert will see if they can start with RC5 this week.
*	UNH will test RC6 next week
*	Need to add Windows - Linux interop to the Logo program - Rupert
will lead the change to the test plan. Should be done in next interop
event.

5. Training and documentation:
*	In OFA BOF there was a request for documentation and training
for using OFED.
*	Jim Ryan will raise the subject of training/documents in the XWG
meeting, and see if OFA can finance it
*	Options - some company in the Bay Area (I did not capture the
name) and UNH on the East Coast
*	Olga to check with Voltaire if they have something to contribute
*	Tziporet will check with Mellanox to see if we can contribute
our verbs user manual


Tziporet

From weiny2 at llnl.gov  Mon Nov 24 13:49:38 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Mon, 24 Nov 2008 13:49:38 -0800
Subject: [ofa-general] Re: [PATCH 0/3] ibnetdiscover library "libibnetdisc"
In-Reply-To: <20081124200151.GX6183@sashak.voltaire.com>
References: <20081120163809.26a3c499.weiny2@llnl.gov>
	<20081123182741.GS21967@sashak.voltaire.com>
	<20081124094243.4dbcff51.weiny2@llnl.gov>
	<20081124191050.GU6183@sashak.voltaire.com>
	<20081124113005.4261cfd1.weiny2@llnl.gov>
	<20081124200151.GX6183@sashak.voltaire.com>
Message-ID: <20081124134938.61c345e0.weiny2@llnl.gov>

On Mon, 24 Nov 2008 22:01:51 +0200
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 11:30 Mon 24 Nov     , Ira Weiny wrote:
> > On Mon, 24 Nov 2008 21:10:50 +0200
> > Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > 
> > > On 09:42 Mon 24 Nov     , Ira Weiny wrote:
> > > > > 
> > > > > Do not you think this library should be rather part of infiniband-diags,
> > > > > rather than separate package/management sub-project? Personally I would
> > > > > prefer to have this as part of infiniband-diags.
> > > > 
> > > > No, I would like to see it be a stand alone library.  Currently
> > > > infiniband-diags does not provide any library functionality and simply depends
> > > > on the libraries provided by the rest of the management tree.  Don't you think
> > > > this is a good model to follow?
> > > 
> > > Why it must be so - infiniband-diags will be useless without this library.
> > > 
> > > And I would really hate to handle one more package (actually not just one
> > > - libibnetdisc, libibnetdisc-devel, libibnetdisc-static, etc.). I wanted
> > > to remove libibcommon...
> > > 
> > 
> > I think the argument against ibcommon is that it does not provide enough
> > additional functionality to warrant an entire new library.
> 
> It is probably the same case with libibnetdisc (at least now).
> 
> > On the other hand
> > infiniband-diags depends on many libraries:
> > 
> >    AC_CHECK_LIB(ibcommon, ...  <== delete this...
> >    
> > And you still have the following...
> > 
> >    AC_CHECK_LIB(ibumad, ...
> >    AC_CHECK_LIB(ibmad, ...
> >    AC_CHECK_LIB(osmcomp, ...
> >    AC_CHECK_LIB(osmvendor, ...
> >    AC_CHECK_LIB(opensm, ...
> > 
> > I don't think it is inappropriate to have utilities which are dependent on
> > libraries; it is done all the time.
> 
> OTOH it doesn't mean that any new shared code must be done as separate
> subproject.
> 
> The stuff is new. I think it is better to integrate it in smaller
> iterations, to start with the code and functionality and to not bother
> with packaging, dependencies, etc.. If there will be a reason to make
> separate library we can do it, but then we will have a stable code
> already.

As long as the library exists any dependent package can of course use the
library from whatever package we choose (libibnetdisc or infiniband-diags).  We
have some code which is prototyped against ibnetdiscover but we plan on using
this library instead.  This would be separate from infiniband-diags.  But we
can just as easily put a dependency on infiniband-diags as on libibnetdisc.

The fact is that it was actually easier to put this in a new package rather
than try and integrate with infiniband-diags.  I thought it made for a very
clean conversion by putting the library in as a new patch and then we could
convert the diags as appropriate.

Anyway, I will integrate it as you say and resubmit the patch.

Ira


From rdreier at cisco.com  Mon Nov 24 13:52:55 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 24 Nov 2008 13:52:55 -0800
Subject: [ofa-general] Re: [PATCH 1 of 2] libmlx4: Fix race condition in
	create/destroy QP
In-Reply-To: <200811221153.49156.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Sat, 22 Nov 2008 11:53:48 +0200")
References: <200811221153.49156.jackm@dev.mellanox.co.il>
Message-ID: <ada1vx0g5ag.fsf@cisco.com>

I think I see one bug at least:

 > @@ -580,9 +584,12 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
 >  	struct mlx4_qp *qp = to_mqp(ibqp);
 >  	int ret;
 >  
 > +	pthread_mutex_lock(&to_mctx(ibqp->context)->qp_table_mutex);
 >  	ret = ibv_cmd_destroy_qp(ibqp);
 > -	if (ret)
 > +	if (ret) {
 > +		pthread_mutex_lock(&to_mctx(ibqp->context)->qp_table_mutex);

The second one should be unlock.

I'm too tired to check everything carefully enough to be sure it's right
though.  Can you double-check your lock balancing and error paths, and
resend fixed patches?
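
For reference, a minimal userspace sketch of the balanced pattern being asked
for.  The names (table_mutex, fake_destroy, destroy_object) are hypothetical
stand-ins and not libmlx4 code; the only point is that every return path
releases the mutex it took.

```c
#include <pthread.h>

/* Hypothetical stand-ins for illustration; not libmlx4 code. */
static pthread_mutex_t table_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Pretend command that fails on request (nonzero return = error). */
static int fake_destroy(int should_fail)
{
	return should_fail ? -1 : 0;
}

/* Takes table_mutex and releases it on both the error and success paths. */
int destroy_object(int should_fail)
{
	int ret;

	pthread_mutex_lock(&table_mutex);
	ret = fake_destroy(should_fail);
	if (ret) {
		pthread_mutex_unlock(&table_mutex);	/* unlock, not lock */
		return ret;
	}

	/* ... table cleanup would happen here, still under the mutex ... */

	pthread_mutex_unlock(&table_mutex);
	return 0;
}

/* Returns 1 if table_mutex is currently free, 0 if it is held. */
int table_mutex_is_free(void)
{
	if (pthread_mutex_trylock(&table_mutex) != 0)
		return 0;
	pthread_mutex_unlock(&table_mutex);
	return 1;
}
```

A quick check after each call to destroy_object() with table_mutex_is_free()
is one cheap way to catch an unbalanced error path like the one above.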

Thanks,
  Roland


From rdreier at cisco.com  Mon Nov 24 13:56:50 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 24 Nov 2008 13:56:50 -0800
Subject: [ofa-general] Re: [PATCH 03/10] RDMA/nes: Remove tx_free_list
In-Reply-To: <20081121205044.GA7424@ctung-MOBL> (Chien Tung's message of "Fri, 
	21 Nov 2008 14:50:44 -0600")
References: <20081121205044.GA7424@ctung-MOBL>
Message-ID: <adawseseqjh.fsf@cisco.com>

 > +static struct sk_buff *get_free_pkt(u32 pktsize)
 >  {
 > -	u32 hashkey = 0;
 > -
 > -	hashkey = loc_addr + rem_addr + loc_port + rem_port;
 > -	hashkey = (hashkey % NES_CM_HASHTABLE_SIZE);
 > -
 > -	return hashkey;
 > +		return dev_alloc_skb(pktsize);
 >  }

Given this, is there any reason to have get_free_pkt() at all?  Why not
just use dev_alloc_skb() directly?

 - R.


From rdreier at cisco.com  Mon Nov 24 13:59:43 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 24 Nov 2008 13:59:43 -0800
Subject: [ofa-general] Re: [PATCH 08/10] RDMA/nes: Change accept_pend_cnt to
	atomic
In-Reply-To: <20081121205058.GA8184@ctung-MOBL> (Chien Tung's message of "Fri, 
	21 Nov 2008 14:50:58 -0600")
References: <20081121205058.GA8184@ctung-MOBL>
Message-ID: <adaskpgeqeo.fsf@cisco.com>

 > There is a race condition on accept_pend_cnt.  Change it to atomic.

This is much too terse, so I don't know what the race is or how the
patch fixes it.  But...

 > +	if (atomic_dec_and_test(&cm_node->accept_pend)) {

you do atomic_dec_and_test() but then the only other manipulations of
accept_pend that I see are:

 > +	atomic_set(&cm_node->accept_pend, 0);
 > +		atomic_set(&cm_node->accept_pend, 1);

and there's no particular ordering between atomic_set() and
atomic_dec_and_test() that I know of to protect against races.

So at least a better description of the patch, please.
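
To make the concern concrete, here is a userspace sketch using C11 atomics
(not the kernel atomic_t API, and not the nes code).  Making a variable
atomic does not by itself order a plain set on one path against a
decrement-and-test on another.  One possible alternative, assuming the
counter is really just a "pending" flag, is to let the consumer claim the
flag with a single atomic exchange.  All names below are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical pending flag: 1 = an accept is pending, 0 = none. */
static atomic_int accept_pend_flag;

/* Producer side: mark an accept as pending. */
void mark_accept_pending(void)
{
	atomic_store(&accept_pend_flag, 1);
}

/*
 * Consumer side: read the old value and clear the flag in one atomic step.
 * Only one caller can observe the old value 1, so the pending accept is
 * claimed once even if several paths race to handle it.
 */
bool claim_pending_accept(void)
{
	return atomic_exchange(&accept_pend_flag, 0) == 1;
}
```

Whether this fits depends on what the race actually is, which is why the
patch description matters.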

 - R.


From meier3 at llnl.gov  Mon Nov 24 14:03:38 2008
From: meier3 at llnl.gov (Timothy A. Meier)
Date: Mon, 24 Nov 2008 14:03:38 -0800
Subject: [ofa-general] Re: [PATCH] Opensm: main exit codes
In-Reply-To: <20081124190251.GT6183@sashak.voltaire.com>
References: <4923678D.3080701@llnl.gov>
	<20081123185836.GU21967@sashak.voltaire.com>
	<492AF3AE.3060605@llnl.gov>
	<20081124190251.GT6183@sashak.voltaire.com>
Message-ID: <492B24BA.40303@llnl.gov>

Hi Sasha,

I guess I viewed this patch as just cleaning up the interface between the program and the system.

Sasha Khapyorsky wrote:
> On 10:34 Mon 24 Nov     , Timothy A. Meier wrote:
>> Hi Sasha,
>>
>> Sasha Khapyorsky wrote:
>>> Hi Tim,
>>>
>>> On 17:10 Tue 18 Nov     , Timothy A. Meier wrote:
>>>>   I thought it would be useful to define a set of exit codes for opensm.  A quick examination of main.c
>>>> showed a few different ways to terminate.  How about this patch?  Obviously this doesn't catch every
>>>> possible exit scenario, but its a start that can be built upon.
>>> Personally I read 'exit(0)' faster than 'exit(OSM_EXIT_TYPE_NORMAL)',
>>> but maybe it is just me :).
>> Me too :^)  Not much confusion over a return code of 0.
>>
>> The audience for this change wouldn't be the people writing the software,
> 
> Somehow we need to care about yourselves too :)
> 
>> but admins, scripts, and tools that
>> start/stop/monitor opensm.  At least that is our use case.
>>
>>> Maybe error codes could be formalized, but I'm not sure that it would be
>>> beneficial without any practical uses (and clear requirements
>>> understanding). Finally we can found us in a middle of the total mess
>>> similar to how OSM_LOG_* is used today.
>>>
>>> Sasha
>>>
>> So the uses/requirements would be to formalize how opensm handles the non-ideal termination condition,
>> for the purpose of providing quick, convenient, and consistent information for other system level tools
>> that are responsible for starting/stopping/monitoring/reporting opensm.
> 
> And are there any of such tools? Or any *real* use?
>

Chicken/Egg?  Currently, we depend on only ZERO or non-zero.  Although OpenSM returns "other" values
on exit, they aren't really formalized or documented.  Hence the patch. ;^)

Personally, I have (and create) several different versions of opensm with small customizations,
and test them on our cluster testbeds.  I often will start/stop them in a variety of configurations
(with and without plugins, more than one sm on a node, etc.) and if and when opensm doesn't
start up normally, it would be nice to have a meaningful exit code.

Perhaps others might find it useful as well, or for some future use.

But again, I originally considered this more as code cleanup: converting the exits, returns, and aborts
to provide a more consistent interface to the system.
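
As a rough illustration of the idea (not the patch itself), a set of exit
codes plus a lookup that a wrapper script or monitor could log might look
like the sketch below.  Only OSM_EXIT_TYPE_NORMAL is a name mentioned in this
thread; the other names and all numeric values are assumptions made up for
the example.

```c
/*
 * Hypothetical exit codes.  Only OSM_EXIT_TYPE_NORMAL appears in this
 * thread; the remaining names and values are invented for illustration.
 */
enum osm_exit_type {
	OSM_EXIT_TYPE_NORMAL = 0,
	OSM_EXIT_TYPE_BAD_OPTION = 1,	/* assumed */
	OSM_EXIT_TYPE_INIT_FAILED = 2,	/* assumed */
	OSM_EXIT_TYPE_BIND_FAILED = 3	/* assumed */
};

/* Maps an exit status to a short label suitable for a log line. */
const char *osm_exit_type_str(int status)
{
	switch (status) {
	case OSM_EXIT_TYPE_NORMAL:
		return "normal";
	case OSM_EXIT_TYPE_BAD_OPTION:
		return "bad option";
	case OSM_EXIT_TYPE_INIT_FAILED:
		return "init failed";
	case OSM_EXIT_TYPE_BIND_FAILED:
		return "bind failed";
	default:
		return "unknown";
	}
}
```

Keeping 0 for the normal case means existing scripts that only test for
zero or non-zero keep working.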

-- 
Timothy A. Meier
Computer Scientist
ICCD/High Performance Computing
meier3 at llnl.gov


From chien.tin.tung at intel.com  Mon Nov 24 14:14:30 2008
From: chien.tin.tung at intel.com (Tung, Chien Tin)
Date: Mon, 24 Nov 2008 15:14:30 -0700
Subject: [ofa-general] RE: [PATCH 03/10] RDMA/nes: Remove tx_free_list
In-Reply-To: <adawseseqjh.fsf@cisco.com>
References: <20081121205044.GA7424@ctung-MOBL> <adawseseqjh.fsf@cisco.com>
Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA3830310DC721E@azsmsx501.amr.corp.intel.com>


>Given this, is there any reason to have get_free_pkt() at all?  Why not
>just use dev_alloc_skb() directly?

We were trying to make minimal changes to the code.  There is no reason left
for get_free_pkt().  I can rework the patch to remove it.

Chien

From davem at davemloft.net  Mon Nov 24 15:34:11 2008
From: davem at davemloft.net (David Miller)
Date: Mon, 24 Nov 2008 15:34:11 -0800 (PST)
Subject: [ofa-general] Re: [PATCH next]infiniband: Kill directly reference of
	netdev->priv
In-Reply-To: <adaabbpf47r.fsf@cisco.com>
References: <492A748F.9040308@cn.fujitsu.com>
	<adaabbpf47r.fsf@cisco.com>
Message-ID: <20081124.153411.148586270.davem@davemloft.net>

From: Roland Dreier <rdreier at cisco.com>
Date: Mon, 24 Nov 2008 09:01:28 -0800

> Looks fine to me.
> 
> Acked-by: Roland Dreier <rolandd at cisco.com>

Applied, thanks everyone.


From jackm at dev.mellanox.co.il  Mon Nov 24 22:36:45 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 25 Nov 2008 08:36:45 +0200
Subject: [ofa-general] Re: [PATCH 1 of 2] libmlx4: Fix race condition in
	create/destroy QP
In-Reply-To: <ada1vx0g5ag.fsf@cisco.com>
References: <200811221153.49156.jackm@dev.mellanox.co.il>
	<ada1vx0g5ag.fsf@cisco.com>
Message-ID: <200811250836.45468.jackm@dev.mellanox.co.il>

On Monday 24 November 2008 23:52, Roland Dreier wrote:
> I think I see one bug at least:
> 
>  > @@ -580,9 +584,12 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
>  >  	struct mlx4_qp *qp = to_mqp(ibqp);
>  >  	int ret;
>  >  
>  > +	pthread_mutex_lock(&to_mctx(ibqp->context)->qp_table_mutex);
>  >  	ret = ibv_cmd_destroy_qp(ibqp);
>  > -	if (ret)
>  > +	if (ret) {
>  > +		pthread_mutex_lock(&to_mctx(ibqp->context)->qp_table_mutex);
> 
> The second one should be unlock.
> 
> I'm too tired to check everything carefully enough to be sure it's right
> though.  Can you double-check your lock balancing and error paths, and
> resend fixed patches?
> 
> Thanks,
>   Roland

Ouch!  That is the only bug (after my careful review).
I guess I was tired when I sent them.

I'm resending this patch (fixed) only -- the libmthca patch is OK.

- Jack 


From jackm at dev.mellanox.co.il  Mon Nov 24 22:40:07 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 25 Nov 2008 08:40:07 +0200
Subject: [ofa-general] [PATCH 1 of 2 V2] libmlx4: Fix race condition in
	create/destroy QP
Message-ID: <200811250840.07944.jackm@dev.mellanox.co.il>

Index: libmlx4/src/qp.c
===================================================================
--- libmlx4.orig/src/qp.c	2008-11-20 11:46:58.000000000 +0200
+++ libmlx4/src/qp.c	2008-11-22 09:44:13.000000000 +0200
@@ -667,37 +667,25 @@ struct mlx4_qp *mlx4_find_qp(struct mlx4
 int mlx4_store_qp(struct mlx4_context *ctx, uint32_t qpn, struct mlx4_qp *qp)
 {
 	int tind = (qpn & (ctx->num_qps - 1)) >> ctx->qp_table_shift;
-	int ret = 0;
-
-	pthread_mutex_lock(&ctx->qp_table_mutex);
 
 	if (!ctx->qp_table[tind].refcnt) {
 		ctx->qp_table[tind].table = calloc(ctx->qp_table_mask + 1,
 						   sizeof (struct mlx4_qp *));
-		if (!ctx->qp_table[tind].table) {
-			ret = -1;
-			goto out;
-		}
+		if (!ctx->qp_table[tind].table)
+			return -1;
 	}
 
 	++ctx->qp_table[tind].refcnt;
 	ctx->qp_table[tind].table[qpn & ctx->qp_table_mask] = qp;
-
-out:
-	pthread_mutex_unlock(&ctx->qp_table_mutex);
-	return ret;
+	return 0;
 }
 
 void mlx4_clear_qp(struct mlx4_context *ctx, uint32_t qpn)
 {
 	int tind = (qpn & (ctx->num_qps - 1)) >> ctx->qp_table_shift;
 
-	pthread_mutex_lock(&ctx->qp_table_mutex);
-
 	if (!--ctx->qp_table[tind].refcnt)
 		free(ctx->qp_table[tind].table);
 	else
 		ctx->qp_table[tind].table[qpn & ctx->qp_table_mask] = NULL;
-
-	pthread_mutex_unlock(&ctx->qp_table_mutex);
 }
Index: libmlx4/src/verbs.c
===================================================================
--- libmlx4.orig/src/verbs.c	2008-11-20 11:46:58.000000000 +0200
+++ libmlx4/src/verbs.c	2008-11-25 08:31:26.000000000 +0200
@@ -452,6 +452,8 @@ struct ibv_qp *mlx4_create_qp(struct ibv
 	cmd.sq_no_prefetch = 0;	/* OK for ABI 2: just a reserved field */
 	memset(cmd.reserved, 0, sizeof cmd.reserved);
 
+	pthread_mutex_lock(&to_mctx(pd->context)->qp_table_mutex);
+
 	ret = ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd,
 				&resp, sizeof resp);
 	if (ret)
@@ -460,6 +462,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv
 	ret = mlx4_store_qp(to_mctx(pd->context), qp->ibv_qp.qp_num, qp);
 	if (ret)
 		goto err_destroy;
+	pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex);
 
 	qp->rq.wqe_cnt = qp->rq.max_post = attr->cap.max_recv_wr;
 	qp->rq.max_gs  = attr->cap.max_recv_sge;
@@ -477,6 +480,7 @@ err_destroy:
 	ibv_cmd_destroy_qp(&qp->ibv_qp);
 
 err_rq_db:
+	pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex);
 	if (!attr->srq)
 		mlx4_free_db(to_mctx(pd->context), MLX4_DB_TYPE_RQ, qp->db);
 
@@ -580,9 +584,12 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
 	struct mlx4_qp *qp = to_mqp(ibqp);
 	int ret;
 
+	pthread_mutex_lock(&to_mctx(ibqp->context)->qp_table_mutex);
 	ret = ibv_cmd_destroy_qp(ibqp);
-	if (ret)
+	if (ret) {
+		pthread_mutex_unlock(&to_mctx(ibqp->context)->qp_table_mutex);
 		return ret;
+	}
 
 	mlx4_lock_cqs(ibqp);
 
@@ -594,6 +601,7 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
 	mlx4_clear_qp(to_mctx(ibqp->context), ibqp->qp_num);
 
 	mlx4_unlock_cqs(ibqp);
+	pthread_mutex_unlock(&to_mctx(ibqp->context)->qp_table_mutex);
 
 	if (!ibqp->srq)
 		mlx4_free_db(to_mctx(ibqp->context), MLX4_DB_TYPE_RQ, qp->db);
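
For readers following the table handling, below is a standalone sketch of a
two-level, reference-counted lookup table of the kind the qp.c hunks touch.
It is a simplified model with hypothetical names and sizes, not the libmlx4
source.  As in the patch above, the helpers take no lock themselves and
assume the caller holds the table mutex around both the command and the
table update.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sizes: 64 QP numbers split into 4 chunks of 16 entries. */
#define NUM_QPS		64
#define TABLE_SHIFT	4
#define TABLE_MASK	((1 << TABLE_SHIFT) - 1)

struct chunk {
	void **table;	/* second-level array, allocated on first use */
	int    refcnt;	/* number of live entries in this chunk */
};

static struct chunk qp_table[NUM_QPS >> TABLE_SHIFT];

/* Caller is assumed to hold the table lock. */
int store_entry(uint32_t qpn, void *qp)
{
	int tind = (qpn & (NUM_QPS - 1)) >> TABLE_SHIFT;

	if (!qp_table[tind].refcnt) {
		qp_table[tind].table = calloc(TABLE_MASK + 1, sizeof(void *));
		if (!qp_table[tind].table)
			return -1;
	}

	++qp_table[tind].refcnt;
	qp_table[tind].table[qpn & TABLE_MASK] = qp;
	return 0;
}

/* Returns the stored pointer, or NULL if the chunk has no live entries. */
void *find_entry(uint32_t qpn)
{
	int tind = (qpn & (NUM_QPS - 1)) >> TABLE_SHIFT;

	if (!qp_table[tind].refcnt)
		return NULL;
	return qp_table[tind].table[qpn & TABLE_MASK];
}

/* Caller is assumed to hold the table lock. */
void clear_entry(uint32_t qpn)
{
	int tind = (qpn & (NUM_QPS - 1)) >> TABLE_SHIFT;

	if (!--qp_table[tind].refcnt)
		free(qp_table[tind].table);	/* last entry: drop the chunk */
	else
		qp_table[tind].table[qpn & TABLE_MASK] = NULL;
}
```

With the lock held by the caller, a QP number presumably cannot be handed
out again by a concurrent create between the destroy command and the
clear_entry() step.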


From vlad at lists.openfabrics.org  Tue Nov 25 03:42:57 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 25 Nov 2008 03:42:57 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081125-0200 daily build status
Message-ID: <20081125114257.AFA86E60939@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From fenkes at de.ibm.com  Tue Nov 25 04:58:06 2008
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Tue, 25 Nov 2008 13:58:06 +0100
Subject: [ofa-general] [PATCH] IB/ehca: Change misleading error message
In-Reply-To: <48499C11.7030504@gmail.com>
References: <200806061835.43802.fenkes@de.ibm.com> <48499C11.7030504@gmail.com>
Message-ID: <200811251358.06729.fenkes@de.ibm.com>

The error message printed when the eHCA driver prevents memory hotplug is
misleading -- the user might think that hot-removing the lhca, hotplugging
memory, then hot-adding the lhca again will work, but it doesn't.

Signed-off-by: Joachim Fenkes <fenkes at de.ibm.com>
---
 drivers/infiniband/hw/ehca/ehca_main.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index bb02a86..bec7e02 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -994,8 +994,7 @@ static int ehca_mem_notifier(struct notifier_block *nb,
 			if (printk_timed_ratelimit(&ehca_dmem_warn_time,
 						   30 * 1000))
 				ehca_gen_err("DMEM operations are not allowed"
-					     "as long as an ehca adapter is"
-					     "attached to the LPAR");
+					     " in conjunction with eHCA");
 			return NOTIFY_BAD;
 		}
 	}
-- 
1.5.5
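
A side note on messages split across source lines, as in the ehca_gen_err()
call in this hunk: adjacent string literals in C are joined with nothing in
between, so each join needs an explicit space.  A small sketch with
made-up message text:

```c
/* Hypothetical message text, split the way long kernel strings often are. */

/* No space at the join: the words run together. */
const char *msg_no_space = "operations are not allowed"
			   "in this state";

/* Leading space on the continuation line keeps the words apart. */
const char *msg_with_space = "operations are not allowed"
			     " in this state";
```

The first string ends up containing "allowedin"; the second reads as
intended.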


From Robert at saq.co.uk  Tue Nov 25 06:20:43 2008
From: Robert at saq.co.uk (Robert Dunkley)
Date: Tue, 25 Nov 2008 14:20:43 -0000
Subject: [ofa-general] Mellanox Gen3,
	Linux and ibpanic - "Resource Temporarily unavailable"
Message-ID: <C1EAC9C5E752D24C968FF091D446D8232DE802@ALTERNATEREALIT>

Hi everyone,

I'm using a setup of two machines (let's call them A and B) directly
connected by one cable. Each machine has a Mellanox MT25204 (Gen3 Mellanox
PCI-E InfiniBand card) and uses IPoIB; they run CentOS 5.2 with OFED 1.3
installed, and Machine B runs OpenSM.

All was working fine. I shut down Machine A, did some maintenance and then
powered it on again, and everything was OK again. I then shut down Machine B
(the one running OpenSM), and this seemed to really upset Machine A. After
booting Machine B again, Machine B looks OK with the port down and in
polling state. Machine A, however, gives the following error if I run
ibstat:

ibpanic: [11406] main: stat of IB device 'mthca0' failed:
(Resource temporarily unavailable)

I don't want to reboot Machine A as it must sync data with Machine B
over the InfiniBand link first. Does anyone have any idea how to fix
Machine A?
Thanks,

Rob

The SAQ Group

Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
SEMTEC Limited Trading as SAQ is Registered in England & Wales
Company Number: 06481952

 
http://www.saqnet.co.uk AS29219

SAQ Group Delivers high quality, honestly priced communication and I.T. services to UK Business.

DSL : Domains : Email : Hosting : CoLo : Servers : Racks : Transit : Backups : Managed Networks : Remote Support.

Find us in http://www.thebestof.co.uk/petersfield


From Robert at saq.co.uk  Tue Nov 25 06:39:21 2008
From: Robert at saq.co.uk (Robert Dunkley)
Date: Tue, 25 Nov 2008 14:39:21 -0000
Subject: [ofa-general] Mellanox Gen3,
	Linux and ibpanic - "Resource Temporarily unavailable"
References: <C1EAC9C5E752D24C968FF091D446D8232DE802@ALTERNATEREALIT>
	<4DCBAA39733E8048992FB7737126041910FFD96A@gsmbnbp23es.firmwide.corp.gs.com>
Message-ID: <C1EAC9C5E752D24C968FF091D446D8232DE805@ALTERNATEREALIT>

Hi Eric,

Thanks for the response. OpenSM is running and set to start on bootup on
MachineB:
ps aux | grep open
root      5616  0.0  0.1 142004  1396 ?        Sl   13:39   0:00
/usr/sbin/opensm -t 200 -f /var/log/opensm.log -g 0

The log on Machine B just logs this every 10 seconds:
Nov 25 14:34:21 148541 [477A7940] 0x01 ->
__osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal
OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
Nov 25 14:34:31 153173 [477A7940] 0x80 -> SM port is down

ibstat confirms the port is in polling state on Machine B. Machine A, however,
is in a bad state. I tried the openibd restart command; it accepted the
command, but after 5 minutes it shows no sign of progress and is still
sitting at the cursor. Is some sort of forced restart of openibd possible?

Thanks,

Rob


-----Original Message-----
From: Baur, Eric [mailto:Eric.Baur at gs.com] 
Sent: 25 November 2008 14:31
To: Robert Dunkley
Subject: RE: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
Temporarily unavailable"

Robert-

Is OpenSM set to start on boot? 
		chkconfig --list | grep opensmd

If not: 	chkconfig opensmd on 
and: 		/etc/init.d/opensmd start

You can also restart openib without rebooting the machines.
		/etc/init.d/openibd restart

-Eric



From hal.rosenstock at gmail.com  Tue Nov 25 06:46:28 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 25 Nov 2008 09:46:28 -0500
Subject: Re: [ofa-general] Mellanox Gen3,
	Linux and ibpanic - "Resource Temporarily unavailable"
In-Reply-To: <C1EAC9C5E752D24C968FF091D446D8232DE802@ALTERNATEREALIT>
References: <C1EAC9C5E752D24C968FF091D446D8232DE802@ALTERNATEREALIT>
Message-ID: <f0e08f230811250646r67fdb054qe70ff1876e8c74a7@mail.gmail.com>

On Tue, Nov 25, 2008 at 9:20 AM, Robert Dunkley <Robert at saq.co.uk> wrote:
> Hi everyone,
>
> I'm using a setup of two machines (Lets call them A and B) directly
> connected by 1 cable. Each machine has a Mellanox MT25204 (Gen3 Mellanox
> PCI-E Infiniband card) and uses IPOIB, they run Centos 5.2 with OFED 1.3
> installed, Machine B runs OpenSM.
>
> All was working fine. I shutdown Machine A did some maintenance and then
> powered it on again, everything is OK again. I then shutdown Machine B
> (The one running OpenSM), this seemed to really upset Machine A. After
> booting Machine B again, Machine B looks OK with the port down and in
> polling state.

Is this with machine A powered off?

> Machine A however gives the following error if I run
> ibstat: ibpanic: [11406] main: stat of IB device 'mthca0' failed:
> (Resource temporarily unavailable)

Does /sys/class/infiniband/mthca0 exist on machine A? If so, what
files are there?
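
If a shell listing is not convenient, the same check can be done from a small
C helper.  This is only a sketch: list_dir() is a hypothetical name, and it
simply reports whether the directory can be opened and what it contains.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Prints the entries of a directory, one per line, skipping "." and "..".
 * Returns the number of entries printed, or -1 if the directory cannot be
 * opened (for example because the device is not registered).
 */
int list_dir(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *ent;
	int count = 0;

	if (!dir)
		return -1;

	while ((ent = readdir(dir)) != NULL) {
		if (!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, ".."))
			continue;
		printf("%s\n", ent->d_name);
		count++;
	}

	closedir(dir);
	return count;
}
```

Calling list_dir("/sys/class/infiniband/mthca0") on machine A would then
answer both questions at once: a return of -1 means the directory is missing.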

-- Hal

> I don't want to reboot Machine A as it must synch data with Machine B
> over the Infiniband link first. Does anyone have any idea how to fix
> machine A?
>
> Thanks,
>
> Rob


From hal.rosenstock at gmail.com  Tue Nov 25 06:49:23 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 25 Nov 2008 09:49:23 -0500
Subject: Re: [ofa-general] Mellanox Gen3,
	Linux and ibpanic - "Resource Temporarily unavailable"
In-Reply-To: <C1EAC9C5E752D24C968FF091D446D8232DE805@ALTERNATEREALIT>
References: <C1EAC9C5E752D24C968FF091D446D8232DE802@ALTERNATEREALIT>
	<4DCBAA39733E8048992FB7737126041910FFD96A@gsmbnbp23es.firmwide.corp.gs.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE805@ALTERNATEREALIT>
Message-ID: <f0e08f230811250649p2562820dxff1f5ca7c175bb83@mail.gmail.com>

On Tue, Nov 25, 2008 at 9:39 AM, Robert Dunkley <Robert at saq.co.uk> wrote:
> Hi Eric,
>
> Thanks for the response. OpenSM is running and set to start on bootup on
> MachineB:
> ps aux | grep open
> root      5616  0.0  0.1 142004  1396 ?        Sl   13:39   0:00
> /usr/sbin/opensm -t 200 -f /var/log/opensm.log -g 0
>
> The log on Machine B just logs this every 10 seconds:
> Nov 25 14:34:21 148541 [477A7940] 0x01 ->
> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal
> OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
> Nov 25 14:34:31 153173 [477A7940] 0x80 -> SM port is down
>
> Ibstat confirms port is in polling state on MachineB.

Is the port in init or down ?

> MachineA however is in a bad state,

Any additional details on this ?

Can you kill/unload all the ib stuff and reload it ? That would be
gentler than rebooting.

-- Hal

>I tried the openibd restart command, it accepted the
> command but after 5 minutes shows no progress of doing anything and is
> just at the cursor. Is some sort of forced restart of openibd possible?
>
> Thanks,
>
> Rob
>
>
> -----Original Message-----
> From: Baur, Eric [mailto:Eric.Baur at gs.com]
> Sent: 25 November 2008 14:31
> To: Robert Dunkley
> Subject: RE: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
> Temporarily unavailable"
>
> Robert-
>
> Is OpenSM set to start on boot?
>                chkconfig --list | grep opensmd
>
> If not:         chkconfig opensmd on
> and:            /etc/init.d/opensmd start
>
> You can also restart openib without rebooting the machines.
>                /etc/init.d/openibd restart
>
> -Eric
>


From Robert at saq.co.uk  Tue Nov 25 06:46:56 2008
From: Robert at saq.co.uk (Robert Dunkley)
Date: Tue, 25 Nov 2008 14:46:56 -0000
Subject: [ofa-general] Mellanox Gen3,
	Linux and ibpanic - "Resource Temporarily unavailable"
References: <C1EAC9C5E752D24C968FF091D446D8232DE802@ALTERNATEREALIT>
	<f0e08f230811250646r67fdb054qe70ff1876e8c74a7@mail.gmail.com>
Message-ID: <C1EAC9C5E752D24C968FF091D446D8232DE806@ALTERNATEREALIT>

Hi Hal,

Machine A is powered on. It was after powering down machine B and OpenSM
with it that Machine A went weird.

/sys/class/infiniband/mthca0 exists on Machine A; its contents are:
board_id  fw_ver    hw_rev     node_guid  ports      sys_image_guid
device    hca_type  node_desc  node_type  subsystem  uevent


Thanks,

Rob



From vlad at mellanox.co.il  Tue Nov 25 06:56:04 2008
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 25 Nov 2008 16:56:04 +0200
Subject: [ofa-general] [PATCH] IPoIB: Prevent address handles leak.
Message-ID: <20081125145604.GA22726@mellanox.co.il>

When removing the ib_ipoib module, ipoib_ib_dev_stop() is
called and all address handles (ah) in the dead_ahs list are reaped.
However, some ah's may still be added to the dead list by ipoib_mcast_free()
after ipoib_ib_dev_stop() is called. These ah's will not be freed.

The solution is to reap any remaining ah's after the multicast device has
actually been flushed during cleanup.

Based on a recommendation by Yossi Etigin.
This fixes Bugzilla https://bugs.openfabrics.org/show_bug.cgi?id=1410

Signed-off-by: Vladimir Sokolovsky <vlad at mellanox.co.il>
---
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 66cafa2..2b77bbd 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -640,6 +640,25 @@ void ipoib_reap_ah(struct work_struct *work)
 				   round_jiffies_relative(HZ));
 }
 
+static void ipoib_ah_dev_cleanup(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	unsigned long begin;
+
+	begin = jiffies;
+
+	while (!list_empty(&priv->dead_ahs)) {
+		__ipoib_reap_ah(dev);
+
+		if (time_after(jiffies, begin + HZ)) {
+			ipoib_warn(priv, "timing out; will leak address handles\n");
+			break;
+		}
+
+		msleep(1);
+	}
+}
+
 static void ipoib_ib_tx_timer_func(unsigned long ctx)
 {
 	drain_tx_cq((struct net_device *)ctx);
@@ -861,18 +880,7 @@ timeout:
 	if (flush)
 		flush_workqueue(ipoib_workqueue);
 
-	begin = jiffies;
-
-	while (!list_empty(&priv->dead_ahs)) {
-		__ipoib_reap_ah(dev);
-
-		if (time_after(jiffies, begin + HZ)) {
-			ipoib_warn(priv, "timing out; will leak address handles\n");
-			break;
-		}
-
-		msleep(1);
-	}
+	ipoib_ah_dev_cleanup(dev);
 
 	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
 
@@ -1005,6 +1013,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
 	ipoib_mcast_stop_thread(dev, 1);
 	ipoib_mcast_dev_flush(dev);
 
+	ipoib_ah_dev_cleanup(dev);
 	ipoib_transport_dev_cleanup(dev);
 }
 
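For readers who want to experiment with the pattern outside the kernel, the
loop that this patch factors out into ipoib_ah_dev_cleanup() can be sketched
in user space. This is only an illustration in Python, not driver code: the
names drain_with_deadline, pending and reap_once are hypothetical, the
one-second default stands in for HZ, and poll_s stands in for msleep(1).

```python
import time


def drain_with_deadline(pending, reap_once, timeout_s=1.0, poll_s=0.001,
                        clock=time.monotonic, sleep=time.sleep):
    """Call reap_once() until `pending` is empty or timeout_s has elapsed.

    Returns True if `pending` was fully drained and False if the deadline
    passed first (the case where the driver warns that it will leak
    address handles).
    """
    begin = clock()
    while pending:
        reap_once()                      # try to free whatever can be freed now
        if clock() - begin > timeout_s:  # same role as time_after(jiffies, begin + HZ)
            return False
        sleep(poll_s)                    # give outstanding work a chance to finish
    return True


if __name__ == "__main__":
    # Hypothetical stand-ins for entries on the dead list.
    dead = ["ah0", "ah1", "ah2"]
    print(drain_with_deadline(dead, dead.pop), len(dead))
```

The clock and sleep parameters exist only so the timeout path can be
exercised without actually waiting.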

From hal.rosenstock at gmail.com  Tue Nov 25 06:56:41 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 25 Nov 2008 09:56:41 -0500
Subject: Re: [ofa-general] Mellanox Gen3,
	Linux and ibpanic - "Resource Temporarily unavailable"
In-Reply-To: <C1EAC9C5E752D24C968FF091D446D8232DE806@ALTERNATEREALIT>
References: <C1EAC9C5E752D24C968FF091D446D8232DE802@ALTERNATEREALIT>
	<f0e08f230811250646r67fdb054qe70ff1876e8c74a7@mail.gmail.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE806@ALTERNATEREALIT>
Message-ID: <f0e08f230811250656n60831eb9hc7504b9c986cb4f9@mail.gmail.com>

Hi Rob,

On Tue, Nov 25, 2008 at 9:46 AM, Robert Dunkley <Robert at saq.co.uk> wrote:
> Hi Hal,
>
> Machine A is powered on. It was after powering down machine B and OpenSM
> with it that Machine A went weird.

> /sys/class/infiniband/mthca0 exists on Machine A, contents is:
> board_id  fw_ver    hw_rev     node_guid  ports      sys_image_guid
> device    hca_type  node_desc  node_type  subsystem  uevent

What about machine B? Do these files exist there too? Also, what is the
port state (down, init, or something else)?

-- Hal
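If it helps, the port state can also be read straight from sysfs without
going through ibstat. A minimal sketch in Python, assuming the usual layout
/sys/class/infiniband/<device>/ports/<n>/state and phys_state, whose values
look like "1: DOWN" and "2: Polling"; the function names are made up for
this example.

```python
from pathlib import Path


def read_port_state(device="mthca0", port=1,
                    sysfs_root="/sys/class/infiniband"):
    """Return (state, phys_state) for one port, or None if it cannot be read."""
    port_dir = Path(sysfs_root) / device / "ports" / str(port)
    try:
        state = (port_dir / "state").read_text().strip()
        phys_state = (port_dir / "phys_state").read_text().strip()
    except OSError:
        # Missing files and read errors from the driver both end up here.
        return None
    return state, phys_state


def split_state(value):
    """Split a sysfs value such as "1: DOWN" into (1, "DOWN")."""
    number, _, label = value.partition(":")
    return int(number), label.strip()


if __name__ == "__main__":
    print(read_port_state())
```

A None result on machine A would point at the driver rather than the link.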



From Robert at saq.co.uk  Tue Nov 25 06:54:07 2008
From: Robert at saq.co.uk (Robert Dunkley)
Date: Tue, 25 Nov 2008 14:54:07 -0000
Subject: [ofa-general] Mellanox Gen3,
	Linux and ibpanic - "Resource Temporarily unavailable"
References: <C1EAC9C5E752D24C968FF091D446D8232DE802@ALTERNATEREALIT>
	<4DCBAA39733E8048992FB7737126041910FFD96A@gsmbnbp23es.firmwide.corp.gs.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE805@ALTERNATEREALIT>
	<f0e08f230811250649p2562820dxff1f5ca7c175bb83@mail.gmail.com>
Message-ID: <C1EAC9C5E752D24C968FF091D446D8232DE808@ALTERNATEREALIT>

Hi Hal,

Thank you for your help.

Ibstat on MachineB:
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.2.0
        Hardware version: a0
        Node GUID: 0x0002c9020022d428
        System image GUID: 0x0002c9020022d42b
        Port 1:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510a6a
                Port GUID: 0x0002c9020022d429
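
For anyone who wants to check this from a script, the ibstat output above
can be turned into a dictionary with a few lines of Python. This is a rough
sketch that only assumes the "Port N:" / "Field: value" layout shown here;
parse_ibstat_ports and link_is_up are names invented for the example.

```python
SAMPLE_IBSTAT = """\
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Port 1:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                SM lid: 0
                Port GUID: 0x0002c9020022d429
"""


def parse_ibstat_ports(text):
    """Return {port_number: {field: value}} from ibstat-style text."""
    ports, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Port ") and line.endswith(":"):
            current = int(line[5:-1])          # "Port 1:" -> 1
            ports[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            ports[current][key.strip()] = value.strip()
    return ports


def link_is_up(port_fields):
    """True only when both the logical and the physical state look healthy."""
    return (port_fields.get("State") == "Active"
            and port_fields.get("Physical state") == "LinkUp")


if __name__ == "__main__":
    for number, fields in parse_ibstat_ports(SAMPLE_IBSTAT).items():
        print(number, fields["State"], fields["Physical state"],
              link_is_up(fields))
```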

Machine A is operating normally with the exception of Infiniband which
broke after powering down Machine B and did not recover once Machine B
was powered on again. An extract from the log of Machine A:
Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
Nov 25 14:30:41 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
Nov 25 14:30:51 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
Nov 25 14:31:01 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
Nov 25 14:31:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
Nov 25 14:32:01 mrhappy last message repeated 3 times
Nov 25 14:32:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)

Thanks again,

Rob



From hal.rosenstock at gmail.com  Tue Nov 25 07:00:22 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 25 Nov 2008 10:00:22 -0500
Subject: Re: [ofa-general] Mellanox Gen3,
	Linux and ibpanic - "Resource Temporarily unavailable"
In-Reply-To: <C1EAC9C5E752D24C968FF091D446D8232DE808@ALTERNATEREALIT>
References: <C1EAC9C5E752D24C968FF091D446D8232DE802@ALTERNATEREALIT>
	<4DCBAA39733E8048992FB7737126041910FFD96A@gsmbnbp23es.firmwide.corp.gs.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE805@ALTERNATEREALIT>
	<f0e08f230811250649p2562820dxff1f5ca7c175bb83@mail.gmail.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE808@ALTERNATEREALIT>
Message-ID: <f0e08f230811250700w6647d80fi9389fa2ff9a62cd5@mail.gmail.com>

Hi Rob,

On Tue, Nov 25, 2008 at 9:54 AM, Robert Dunkley <Robert at saq.co.uk> wrote:
> Hi Hal,
>
> Thank you for your help.
>
> Ibstat on MachineB:
> CA 'mthca0'
>        CA type: MT25204
>        Number of ports: 1
>        Firmware version: 1.2.0
>        Hardware version: a0
>        Node GUID: 0x0002c9020022d428
>        System image GUID: 0x0002c9020022d42b
>        Port 1:
>                State: Down

Is machine A on? Is mthca loaded there? If so, this port should at least
be in init, but the driver errors below may prevent that.

>                Physical state: Polling
>                Rate: 10
>                Base lid: 0
>                LMC: 0
>                SM lid: 0
>                Capability mask: 0x02510a6a
>                Port GUID: 0x0002c9020022d429
>
> Machine A is operating normally with the exception of Infiniband which
> broke after powering down Machine B and did not recover once Machine B
> was powered on again. An extract from the log of Machine A:
> Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed
> (-11)
> Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed
> (-11)
> Nov 25 14:30:41 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed
> (-11)
> Nov 25 14:30:51 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed
> (-11)
> Nov 25 14:31:01 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed
> (-11)
> Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed
> (-11)
> Nov 25 14:31:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed
> (-11)
> Nov 25 14:32:01 mrhappy last message repeated 3 times
> Nov 25 14:32:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed
> (-11)

-11 is -EAGAIN ("Resource temporarily unavailable"), the same error that
ibstat reports. I'm not sure what it is used for in the mthca driver.

Can you unload and reload the IB stack, especially the mthca driver?

-- Hal
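
For reference, the mapping from the number in the log to the text that
ibstat printed can be checked from user space. A small Python sketch,
assuming a Linux host (where EAGAIN is 11); describe_kernel_error is just a
name picked for this example, and the message text comes from the C library.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel-style return code to (symbolic name, message)."""
    number = abs(code)                    # drivers log -errno, e.g. -11
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # On Linux this is expected to name EAGAIN for the (-11) in the log.
    print(describe_kernel_error(-11))
```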



From Robert at saq.co.uk  Tue Nov 25 07:21:10 2008
From: Robert at saq.co.uk (Robert Dunkley)
Date: Tue, 25 Nov 2008 15:21:10 -0000
Subject: [ofa-general] Mellanox Gen3,
	Linux and ibpanic - "Resource Temporarily unavailable"
References: <C1EAC9C5E752D24C968FF091D446D8232DE802@ALTERNATEREALIT>
	<4DCBAA39733E8048992FB7737126041910FFD96A@gsmbnbp23es.firmwide.corp.gs.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE805@ALTERNATEREALIT>
	<f0e08f230811250649p2562820dxff1f5ca7c175bb83@mail.gmail.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE808@ALTERNATEREALIT>
	<f0e08f230811250700w6647d80fi9389fa2ff9a62cd5@mail.gmail.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE80B@ALTERNATEREALIT>
	<f0e08f230811250719r6076a781k8531f97dcb360071@mail.gmail.com>
Message-ID: <C1EAC9C5E752D24C968FF091D446D8232DE80C@ALTERNATEREALIT>

Hi Hal,

Thanks again, I will try this in a minute. I think I have found the
moment it went bad on Machine A, using dmesg:
ib_mthca 0000:87:00.0: Catastrophic error detected: unknown error
ib_mthca 0000:87:00.0:   buf[00]: ffffffff
ib_mthca 0000:87:00.0:   buf[01]: ffffffff
ib_mthca 0000:87:00.0:   buf[02]: ffffffff
ib_mthca 0000:87:00.0:   buf[03]: ffffffff
ib_mthca 0000:87:00.0:   buf[04]: ffffffff
ib_mthca 0000:87:00.0:   buf[05]: ffffffff
ib_mthca 0000:87:00.0:   buf[06]: ffffffff
ib_mthca 0000:87:00.0:   buf[07]: ffffffff
ib_mthca 0000:87:00.0:   buf[08]: ffffffff
ib_mthca 0000:87:00.0:   buf[09]: ffffffff
ib_mthca 0000:87:00.0:   buf[0a]: ffffffff
ib_mthca 0000:87:00.0:   buf[0b]: ffffffff
ib_mthca 0000:87:00.0:   buf[0c]: ffffffff
ib_mthca 0000:87:00.0:   buf[0d]: ffffffff
ib_mthca 0000:87:00.0:   buf[0e]: ffffffff
ib_mthca 0000:87:00.0:   buf[0f]: ffffffff
ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
ib0: ib_query_gid() failed
ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
ib0: ib_query_port failed
ib0: Failed to modify QP to ERROR state
ib0: timing out; 1 sends 250 receives not completed
ib0: Failed to modify QP to RESET state
ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)

Does this help to pinpoint what might have caused this?

Thanks,

Rob


-----Original Message-----
From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
Sent: 25 November 2008 15:19
To: Robert Dunkley
Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource
Temporarily unavailable"

Hi Rob,

On Tue, Nov 25, 2008 at 10:01 AM, Robert Dunkley <Robert at saq.co.uk> wrote:
> Hi Hal,
>
> Machine A is definitely on and I have had the cable connection checked.
> I'm afraid I'm not much of a techy, how do I unload and reload the IB
> stack?

It depends on what you have running... Is it just OpenSM and IPoIB?

Kill off opensm.

Use modprobe -r to remove all the ib_ modules. You can find them via
lsmod | grep ib_. There is a dependency order: a module can only be
removed once nothing listed in its "Used by" column is still loaded.

If you can get them all unloaded, reload them in the reverse order and
hopefully things will be better...

-- Hal
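
The dependency order mentioned above can be worked out from lsmod output
rather than by trial and error. Below is a hedged Python sketch: it reads
the "Used by" column and lists each module only after everything that uses
it. The sample table is hypothetical (sizes and the module set are invented
for illustration, not taken from Rob's machine), and unload_order is a name
chosen for this example.

```python
SAMPLE_LSMOD = """\
Module                  Size  Used by
ib_ipoib               80000  0
ib_sa                  40000  1 ib_ipoib
ib_mthca              130000  0
ib_mad                 40000  2 ib_sa,ib_mthca
ib_core                65000  4 ib_ipoib,ib_sa,ib_mthca,ib_mad
"""


def unload_order(lsmod_text, prefix="ib_"):
    """Return module names in an order that should be safe to remove.

    Every module appears only after all loaded modules that use it.
    """
    users = {}
    for line in lsmod_text.splitlines()[1:]:          # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        used_by = fields[3].split(",") if len(fields) > 3 else []
        users[fields[0]] = [u for u in used_by if u]

    # Start from the prefix-matching modules and pull in whatever uses them.
    wanted, stack = set(), [m for m in users if m.startswith(prefix)]
    while stack:
        module = stack.pop()
        if module not in wanted:
            wanted.add(module)
            stack.extend(u for u in users[module] if u in users)

    order, seen = [], set()

    def visit(module):
        if module in seen:
            return
        seen.add(module)
        for user in users[module]:
            if user in wanted:
                visit(user)                           # users go first
        order.append(module)

    for module in sorted(wanted):
        visit(module)
    return order


if __name__ == "__main__":
    order = unload_order(SAMPLE_LSMOD)
    print("unload:", " ".join(order))
    print("reload:", " ".join(reversed(order)))
```

Feeding it the real output of lsmod on machine A would give the order to
pass to modprobe -r, one module at a time.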

> Thanks,
>
> Rob
>
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> Sent: 25 November 2008 15:00
> To: Robert Dunkley
> Cc: Baur, Eric; general at lists.openfabrics.org
> Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic -
"Resource
> Temporarily unavailable"
>
> Hi Rob,
>
> On Tue, Nov 25, 2008 at 9:54 AM, Robert Dunkley <Robert at saq.co.uk>
> wrote:
>> Hi Hal,
>>
>> Thank you for your help.
>>
>> Ibstat on MachineB:
>> CA 'mthca0'
>>        CA type: MT25204
>>        Number of ports: 1
>>        Firmware version: 1.2.0
>>        Hardware version: a0
>>        Node GUID: 0x0002c9020022d428
>>        System image GUID: 0x0002c9020022d42b
>>        Port 1:
>>                State: Down
>
> Is machine A on ? Is mthca loaded there ? If so, this should at least
> be init but the driver errors below may preclude this from occurring.
>
>>                Physical state: Polling
>>                Rate: 10
>>                Base lid: 0
>>                LMC: 0
>>                SM lid: 0
>>                Capability mask: 0x02510a6a
>>                Port GUID: 0x0002c9020022d429
>>
>> Machine A is operating normally with the exception of Infiniband which
>> broke after powering down Machine B and did not recover once Machine B
>> was powered on again. An extract from the log of Machine A:
>> Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>> Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
>> Nov 25 14:30:41 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>> Nov 25 14:30:51 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
>> Nov 25 14:31:01 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>> Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
>> Nov 25 14:31:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>> Nov 25 14:32:01 mrhappy last message repeated 3 times
>> Nov 25 14:32:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>
> -11 is EAGAIN. Not sure what this is used for in the mthca driver.
>
> Can you unload and reload the IB stack especially mthca driver ?
>
> -- Hal
>
>> Thanks again,
>>
>> Rob
>>
>> -----Original Message-----
>> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
>> Sent: 25 November 2008 14:49
>> To: Robert Dunkley
>> Cc: Baur, Eric; general at lists.openfabrics.org
>> Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic -
> "Resource
>> Temporarily unavailable"
>>
>> On Tue, Nov 25, 2008 at 9:39 AM, Robert Dunkley <Robert at saq.co.uk>
>> wrote:
>>> Hi Eric,
>>>
>>> Thanks for the response. OpenSM is running and set to start on bootup on
>>> MachineB:
>>> ps aux | grep open
>>> root      5616  0.0  0.1 142004  1396 ?        Sl   13:39   0:00
>>> /usr/sbin/opensm -t 200 -f /var/log/opensm.log -g 0
>>>
>>> The log on Machine B just logs this every 10 seconds:
>>> Nov 25 14:34:21 148541 [477A7940] 0x01 ->
>>> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal
>>> OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
>>> Nov 25 14:34:31 153173 [477A7940] 0x80 -> SM port is down
>>>
>>> Ibstat confirms port is in polling state on MachineB.
>>
>> Is the port in init or down ?
>>
>>> MachineA however is in a bad state,
>>
>> Any additional details on this ?
>>
>> Can you kill/unload all the ib stuff and reload it ? That would be
>> gentler than rebooting.
>>
>> -- Hal
>>
>>> I tried the openibd restart command, it accepted the
>>> command but after 5 minutes shows no progress of doing anything and is
>>> just at the cursor. Is some sort of forced restart of openibd possible?
>>>
>>> Thanks,
>>>
>>> Rob
>>>
>>>
>>> -----Original Message-----
>>> From: Baur, Eric [mailto:Eric.Baur at gs.com]
>>> Sent: 25 November 2008 14:31
>>> To: Robert Dunkley
>>> Subject: RE: [ofa-general] Mellanox Gen3,Linux and ibpanic -
> "Resource
>>> Temporarily unavailable"
>>>
>>> Robert-
>>>
>>> Is OpenSM set to start on boot?
>>>                chkconfig --list | grep opensmd
>>>
>>> If not:         chkconfig opensmd on
>>> and:            /etc/init.d/opensmd start
>>>
>>> You can also restart openib without rebooting the machines.
>>>                /etc/init.d/openibd restart
>>>
>>> -Eric
>>>
>>> -----Original Message-----
>>> From: general-bounces at lists.openfabrics.org
>>> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Robert
>>> Dunkley
>>> Sent: Tuesday, November 25, 2008 9:21 AM
>>> To: general at lists.openfabrics.org
>>> Subject: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
>>> Temporarily unavailable"
>>>
>>> Hi everyone,
>>>
>>> I'm using a setup of two machines (Lets call them A and B) directly
>>> connected by 1 cable. Each machine has a Mellanox MT25204 (Gen3 Mellanox
>>> PCI-E Infiniband card) and uses IPOIB, they run Centos 5.2 with OFED 1.3
>>> installed, Machine B runs OpenSM.
>>>
>>> All was working fine. I shutdown Machine A did some maintenance and then
>>> powered it on again, everything is OK again. I then shutdown Machine B
>>> (The one running OpenSM), this seemed to really upset Machine A. After
>>> booting Machine B again, Machine B looks OK with the port down and in
>>> polling state. Machine A however gives the following error if I run
>>> ibstat: ibpanic: [11406] main: stat of IB device 'mthca0' failed:
>>> (Resource temporarily unavailable)
>>>
>>> I don't want to reboot Machine A as it must synch data with Machine B
>>> over the Infiniband link first. Does anyone have any idea how to fix
>>> machine A?
>>>
>>> Thanks,
>>>
>>> Rob
>>>
>>> The SAQ Group
>>>
>>> Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
>>> SEMTEC Limited Trading as SAQ is Registered in England & Wales
>>> Company Number: 06481952
>>>
>>>
>>>
>>> http://www.saqnet.co.uk AS29219
>>>
>>> SAQ Group Delivers high quality, honestly priced communication and
>> I.T.
>>> services to UK Business.
>>>
>>> DSL : Domains : Email : Hosting : CoLo : Servers : Racks : Transit :
>>> Backups : Managed Networks : Remote Support.
>>>
>>> Find us in http://www.thebestof.co.uk/petersfield
>>>
>>> _______________________________________________
>>> general mailing list
>>> general at lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>
>>> To unsubscribe, please visit
>>> http://openib.org/mailman/listinfo/openib-general
>>>
>>
>


From hal.rosenstock at gmail.com  Tue Nov 25 07:30:39 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 25 Nov 2008 10:30:39 -0500
Subject: Re: [ofa-general] Mellanox Gen3,
	Linux and ibpanic - "Resource Temporarily unavailable"
In-Reply-To: <C1EAC9C5E752D24C968FF091D446D8232DE80C@ALTERNATEREALIT>
References: <C1EAC9C5E752D24C968FF091D446D8232DE802@ALTERNATEREALIT>
	<4DCBAA39733E8048992FB7737126041910FFD96A@gsmbnbp23es.firmwide.corp.gs.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE805@ALTERNATEREALIT>
	<f0e08f230811250649p2562820dxff1f5ca7c175bb83@mail.gmail.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE808@ALTERNATEREALIT>
	<f0e08f230811250700w6647d80fi9389fa2ff9a62cd5@mail.gmail.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE80B@ALTERNATEREALIT>
	<f0e08f230811250719r6076a781k8531f97dcb360071@mail.gmail.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE80C@ALTERNATEREALIT>
Message-ID: <f0e08f230811250730w22a839f5jd6fd556e67756a04@mail.gmail.com>

Hi Rob,

On Tue, Nov 25, 2008 at 10:21 AM, Robert Dunkley <Robert at saq.co.uk> wrote:
> Hi Hal,
>
> Thanks again, I will try this in a minute. I think I have found the
> moment it went bad on Machine A using Dmesg:
> ib_mthca 0000:87:00.0: Catastrophic error detected: unknown error

Definitely need to reset mthca after this.

> ib_mthca 0000:87:00.0:   buf[00]: ffffffff
> ib_mthca 0000:87:00.0:   buf[01]: ffffffff
> ib_mthca 0000:87:00.0:   buf[02]: ffffffff
> ib_mthca 0000:87:00.0:   buf[03]: ffffffff
> ib_mthca 0000:87:00.0:   buf[04]: ffffffff
> ib_mthca 0000:87:00.0:   buf[05]: ffffffff
> ib_mthca 0000:87:00.0:   buf[06]: ffffffff
> ib_mthca 0000:87:00.0:   buf[07]: ffffffff
> ib_mthca 0000:87:00.0:   buf[08]: ffffffff
> ib_mthca 0000:87:00.0:   buf[09]: ffffffff
> ib_mthca 0000:87:00.0:   buf[0a]: ffffffff
> ib_mthca 0000:87:00.0:   buf[0b]: ffffffff
> ib_mthca 0000:87:00.0:   buf[0c]: ffffffff
> ib_mthca 0000:87:00.0:   buf[0d]: ffffffff
> ib_mthca 0000:87:00.0:   buf[0e]: ffffffff
> ib_mthca 0000:87:00.0:   buf[0f]: ffffffff
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib0: ib_query_gid() failed
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib0: ib_query_port failed
> ib0: Failed to modify QP to ERROR state
> ib0: timing out; 1 sends 250 receives not completed
> ib0: Failed to modify QP to RESET state
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>
> Does this help to pinpoint what might have caused this?

Maybe Mellanox can comment. What firmware version are you using ?

-- Hal

>
> Thanks,
>
> Rob
>
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> Sent: 25 November 2008 15:19
> To: Robert Dunkley
> Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource
> Temporarily unavailable"
>
> Hi Rob,
>
> On Tue, Nov 25, 2008 at 10:01 AM, Robert Dunkley <Robert at saq.co.uk>
> wrote:
>> Hi Hal,
>>
>> Machine A is definitely on and I have had the cable connection
> checked.
>> I'm afraid I'm not much of a techy, how do I unload and reload the IB
>> stack?
>
> It depends on what you have running... Is it just OpenSM and IPoIB ?
>
> Kill off opensm
>
> Use modprobe -r to remove all the ib_ modules. You can find them via
> lsmod | grep ib_. There is a dependency order.
>
> If you can get them all unloaded, reload them in the reverse order and
> hopefully things will be better...
>
> -- Hal
>
>> Thanks,
>>
>> Rob
>>
>>
>> -----Original Message-----
>> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
>> Sent: 25 November 2008 15:00
>> To: Robert Dunkley
>> Cc: Baur, Eric; general at lists.openfabrics.org
>> Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic -
> "Resource
>> Temporarily unavailable"
>>
>> Hi Rob,
>>
>> On Tue, Nov 25, 2008 at 9:54 AM, Robert Dunkley <Robert at saq.co.uk>
>> wrote:
>>> Hi Hal,
>>>
>>> Thank you for your help.
>>>
>>> Ibstat on MachineB:
>>> CA 'mthca0'
>>>        CA type: MT25204
>>>        Number of ports: 1
>>>        Firmware version: 1.2.0
>>>        Hardware version: a0
>>>        Node GUID: 0x0002c9020022d428
>>>        System image GUID: 0x0002c9020022d42b
>>>        Port 1:
>>>                State: Down
>>
>> Is machine A on ? Is mthca loaded there ? If so, this should at least
>> be init but the driver errors below may preclude this from occurring.
>>
>>>                Physical state: Polling
>>>                Rate: 10
>>>                Base lid: 0
>>>                LMC: 0
>>>                SM lid: 0
>>>                Capability mask: 0x02510a6a
>>>                Port GUID: 0x0002c9020022d429
>>>
>>> Machine A is operating normally with the exception of Infiniband which
>>> broke after powering down Machine B and did not recover once Machine B
>>> was powered on again. An extract from the log of Machine A:
>>> Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>> Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
>>> Nov 25 14:30:41 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>> Nov 25 14:30:51 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
>>> Nov 25 14:31:01 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>> Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
>>> Nov 25 14:31:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>> Nov 25 14:32:01 mrhappy last message repeated 3 times
>>> Nov 25 14:32:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>
>> -11 is EAGAIN. Not sure what this is used for in the mthca driver.
>>
>> Can you unload and reload the IB stack especially mthca driver ?
>>
>> -- Hal
>>
>>> Thanks again,
>>>
>>> Rob
>>>
>>> -----Original Message-----
>>> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
>>> Sent: 25 November 2008 14:49
>>> To: Robert Dunkley
>>> Cc: Baur, Eric; general at lists.openfabrics.org
>>> Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic -
>> "Resource
>>> Temporarily unavailable"
>>>
>>> On Tue, Nov 25, 2008 at 9:39 AM, Robert Dunkley <Robert at saq.co.uk>
>>> wrote:
>>>> Hi Eric,
>>>>
>>>> Thanks for the response. OpenSM is running and set to start on
> bootup
>>> on
>>>> MachineB:
>>>> ps aux | grep open
>>>> root      5616  0.0  0.1 142004  1396 ?        Sl   13:39   0:00
>>>> /usr/sbin/opensm -t 200 -f /var/log/opensm.log -g 0
>>>>
>>>> The log on Machine B just logs this every 10 seconds:
>>>> Nov 25 14:34:21 148541 [477A7940] 0x01 ->
>>>> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal
>>>> OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
>>>> Nov 25 14:34:31 153173 [477A7940] 0x80 -> SM port is down
>>>>
>>>> Ibstat confirms port is in polling state on MachineB.
>>>
>>> Is the port in init or down ?
>>>
>>>> MachineA however is in a bad state,
>>>
>>> Any additional details on this ?
>>>
>>> Can you kill/unload all the ib stuff and reload it ? That would be
>>> gentler than rebooting.
>>>
>>> -- Hal
>>>
>>>>I tried the openibd restart command, it accepted the
>>>> command but after 5 minutes shows no progress of doing anything and
>> is
>>>> just at the cursor. Is some sort of forced restart of openibd
>>> possible?
>>>>
>>>> Thanks,
>>>>
>>>> Rob
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Baur, Eric [mailto:Eric.Baur at gs.com]
>>>> Sent: 25 November 2008 14:31
>>>> To: Robert Dunkley
>>>> Subject: RE: [ofa-general] Mellanox Gen3,Linux and ibpanic -
>> "Resource
>>>> Temporarily unavailable"
>>>>
>>>> Robert-
>>>>
>>>> Is OpenSM set to start on boot?
>>>>                chkconfig --list | grep opensmd
>>>>
>>>> If not:         chkconfig opensmd on
>>>> and:            /etc/init.d/opensmd start
>>>>
>>>> You can also restart openib without rebooting the machines.
>>>>                /etc/init.d/openibd restart
>>>>
>>>> -Eric
>>>>
>>>> -----Original Message-----
>>>> From: general-bounces at lists.openfabrics.org
>>>> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Robert
>>>> Dunkley
>>>> Sent: Tuesday, November 25, 2008 9:21 AM
>>>> To: general at lists.openfabrics.org
>>>> Subject: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
>>>> Temporarily unavailable"
>>>>
>>>> Hi everyone,
>>>>
>>>> I'm using a setup of two machines (Lets call them A and B) directly
>>>> connected by 1 cable. Each machine has a Mellanox MT25204 (Gen3
>>> Mellanox
>>>> PCI-E Infiniband card) and uses IPOIB, they run Centos 5.2 with OFED
>>> 1.3
>>>> installed, Machine B runs OpenSM.
>>>>
>>>> All was working fine. I shutdown Machine A did some maintenance and
>>> then
>>>> powered it on again, everything is OK again. I then shutdown Machine
>> B
>>>> (The one running OpenSM), this seemed to really upset Machine A.
>> After
>>>> booting Machine B again, Machine B looks OK with the port down and
> in
>>>> polling state. Machine A however gives the following error if I run
>>>> ibstat: ibpanic: [11406] main: stat of IB device 'mthca0' failed:
>>>> (Resource temporarily unavailable)
>>>>
>>>> I don't want to reboot Machine A as it must synch data with Machine
> B
>>>> over the Infiniband link first. Does anyone have any idea how to fix
>>>> machine A?
>>>>
>>>> Thanks,
>>>>
>>>> Rob
>>>>
>>>> The SAQ Group
>>>>
>>>> Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
>>>> SEMTEC Limited Trading as SAQ is Registered in England & Wales
>>>> Company Number: 06481952
>>>>
>>>>
>>>>
>>>> http://www.saqnet.co.uk AS29219
>>>>
>>>> SAQ Group Delivers high quality, honestly priced communication and
>>> I.T.
>>>> services to UK Business.
>>>>
>>>> DSL : Domains : Email : Hosting : CoLo : Servers : Racks : Transit :
>>>> Backups : Managed Networks : Remote Support.
>>>>
>>>> Find us in http://www.thebestof.co.uk/petersfield
>>>>
>>>>
>>>
>>
>


From tziporet at dev.mellanox.co.il  Tue Nov 25 07:55:34 2008
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 25 Nov 2008 17:55:34 +0200
Subject: Re: [ofa-general] Mellanox Gen3,	Linux and ibpanic
	- "Resource Temporarily unavailable"
In-Reply-To: <f0e08f230811250730w22a839f5jd6fd556e67756a04@mail.gmail.com>
References: <C1EAC9C5E752D24C968FF091D446D8232DE802@ALTERNATEREALIT>	<4DCBAA39733E8048992FB7737126041910FFD96A@gsmbnbp23es.firmwide.corp.gs.com>	<C1EAC9C5E752D24C968FF091D446D8232DE805@ALTERNATEREALIT>	<f0e08f230811250649p2562820dxff1f5ca7c175bb83@mail.gmail.com>	<C1EAC9C5E752D24C968FF091D446D8232DE808@ALTERNATEREALIT>	<f0e08f230811250700w6647d80fi9389fa2ff9a62cd5@mail.gmail.com>	<C1EAC9C5E752D24C968FF091D446D8232DE80B@ALTERNATEREALIT>	<f0e08f230811250719r6076a781k8531f97dcb360071@mail.gmail.com>	<C1EAC9C5E752D24C968FF091D446D8232DE80C@ALTERNATEREALIT>
	<f0e08f230811250730w22a839f5jd6fd556e67756a04@mail.gmail.com>
Message-ID: <492C1FF6.8070403@mellanox.co.il>

Hal Rosenstock wrote:
> Hi Rob,
>
> On Tue, Nov 25, 2008 at 10:21 AM, Robert Dunkley <Robert at saq.co.uk> wrote:
>   
>> Hi Hal,
>>
>> Thanks again, I will try this in a minute. I think I have found the
>> moment it went bad on Machine A using Dmesg:
>> ib_mthca 0000:87:00.0: Catastrophic error detected: unknown error
>>     
>
> Definitely need to reset mthca after this.
>
>   
>> ib_mthca 0000:87:00.0:   buf[00]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[01]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[02]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[03]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[04]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[05]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[06]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[07]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[08]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[09]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[0a]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[0b]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[0c]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[0d]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[0e]: ffffffff
>> ib_mthca 0000:87:00.0:   buf[0f]: ffffffff
>> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>> ib0: ib_query_gid() failed
>> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>> ib0: ib_query_port failed
>> ib0: Failed to modify QP to ERROR state
>> ib0: timing out; 1 sends 250 receives not completed
>> ib0: Failed to modify QP to RESET state
>> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>> ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
>> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>> ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
>> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>> ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
>> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>
>> Does this help to pinpoint what might have caused this?
>>     
>
>   
The ffffffff values in the buf dump show that you have some PCI bus error.
The mthca driver then moved to error mode, and no commands will be executed.
I suggest you check that the card has not moved in the system, and you
had better reboot the system again.

Tziporet


From monis at Voltaire.COM  Tue Nov 25 08:06:05 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Tue, 25 Nov 2008 18:06:05 +0200
Subject: [ofa-general] [PATCH] mlx4_ib/mthca: Fix dispatch of
	IB_EVENT_LID_CHANGE
Message-ID: <492C226D.7040009@Voltaire.COM>


When snooping a portinfo MAD, its client_reregister bit is checked.
If the bit is ON then a CLIENT_REREGISTER event is dispatched, otherwise
a LID_CHANGE event is dispatched. This way of deciding ignores the cases
where the MAD changes the LID along with an instruction to reregister (so a
necessary LID_CHANGE event won't be dispatched) or the MAD does neither of
these (and an unnecessary LID_CHANGE event will be dispatched).
This patch compares the LID in the MAD to the current LID. A LID_CHANGE
event is dispatched if and only if they are not identical.

Signed-off-by: Moni Shoua <monis at voltaire.com>
---
 drivers/infiniband/hw/mlx4/mad.c        |   21 +++++++++++++++------
 drivers/infiniband/hw/mthca/mthca_mad.c |   20 +++++++++++++++-----
 2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/mad.c b/drivers/infiniband/hw/mlx4/mad.c
index 606f1e2..ca5fa9e 100644
--- a/drivers/infiniband/hw/mlx4/mad.c
+++ b/drivers/infiniband/hw/mlx4/mad.c
@@ -147,7 +147,8 @@ static void update_sm_ah(struct mlx4_ib_dev *dev, u8 port_num, u16 lid, u8 sl)
  * Snoop SM MADs for port info and P_Key table sets, so we can
  * synthesize LID change and P_Key change events.
  */
-static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad)
+static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad,
+				u16 prev_lid)
 {
 	struct ib_event event;
 
@@ -157,6 +158,7 @@ static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad)
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) {
 			struct ib_port_info *pinfo =
 				(struct ib_port_info *) ((struct ib_smp *) mad)->data;
+			u16 lid = be16_to_cpu(pinfo->lid);
 
 			update_sm_ah(to_mdev(ibdev), port_num,
 				     be16_to_cpu(pinfo->sm_lid),
@@ -165,12 +167,15 @@ static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad)
 			event.device	       = ibdev;
 			event.element.port_num = port_num;
 
-			if (pinfo->clientrereg_resv_subnetto & 0x80)
+			if (pinfo->clientrereg_resv_subnetto & 0x80) {
 				event.event    = IB_EVENT_CLIENT_REREGISTER;
-			else
+				ib_dispatch_event(&event);
+			}
+			if ((prev_lid != 0) && (prev_lid != lid)) {
 				event.event    = IB_EVENT_LID_CHANGE;
+				ib_dispatch_event(&event);
+			}
 
-			ib_dispatch_event(&event);
 		}
 
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) {
@@ -228,8 +233,9 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,	u8 port_num,
 			struct ib_wc *in_wc, struct ib_grh *in_grh,
 			struct ib_mad *in_mad, struct ib_mad *out_mad)
 {
-	u16 slid;
+	u16 slid, prev_lid = 0;
 	int err;
+	struct ib_port_attr pattr;
 
 	slid = in_wc ? in_wc->slid : be16_to_cpu(IB_LID_PERMISSIVE);
 
@@ -263,6 +269,9 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,	u8 port_num,
 	} else
 		return IB_MAD_RESULT_SUCCESS;
 
+	if (!ib_query_port(ibdev, port_num, &pattr))
+		prev_lid = pattr.lid;
+
 	err = mlx4_MAD_IFC(to_mdev(ibdev),
 			   mad_flags & IB_MAD_IGNORE_MKEY,
 			   mad_flags & IB_MAD_IGNORE_BKEY,
@@ -271,7 +280,7 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,	u8 port_num,
 		return IB_MAD_RESULT_FAILURE;
 
 	if (!out_mad->mad_hdr.status) {
-		smp_snoop(ibdev, port_num, in_mad);
+		smp_snoop(ibdev, port_num, in_mad, prev_lid);
 		node_desc_override(ibdev, out_mad);
 	}
 
diff --git a/drivers/infiniband/hw/mthca/mthca_mad.c b/drivers/infiniband/hw/mthca/mthca_mad.c
index 6404495..6ac114a 100644
--- a/drivers/infiniband/hw/mthca/mthca_mad.c
+++ b/drivers/infiniband/hw/mthca/mthca_mad.c
@@ -104,7 +104,8 @@ static void update_sm_ah(struct mthca_dev *dev,
  */
 static void smp_snoop(struct ib_device *ibdev,
 		      u8 port_num,
-		      struct ib_mad *mad)
+		      struct ib_mad *mad,
+		      u16 prev_lid)
 {
 	struct ib_event event;
 
@@ -114,6 +115,7 @@ static void smp_snoop(struct ib_device *ibdev,
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) {
 			struct ib_port_info *pinfo =
 				(struct ib_port_info *) ((struct ib_smp *) mad)->data;
+			u16 lid = be16_to_cpu(pinfo->lid);
 
 			mthca_update_rate(to_mdev(ibdev), port_num);
 			update_sm_ah(to_mdev(ibdev), port_num,
@@ -123,12 +125,15 @@ static void smp_snoop(struct ib_device *ibdev,
 			event.device           = ibdev;
 			event.element.port_num = port_num;
 
-			if (pinfo->clientrereg_resv_subnetto & 0x80)
+			if (pinfo->clientrereg_resv_subnetto & 0x80) {
 				event.event    = IB_EVENT_CLIENT_REREGISTER;
-			else
+				ib_dispatch_event(&event);
+			}
+			if ((prev_lid != 0) && (prev_lid != lid)) {
 				event.event    = IB_EVENT_LID_CHANGE;
+				ib_dispatch_event(&event);
+			}
 
-			ib_dispatch_event(&event);
 		}
 
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) {
@@ -196,6 +201,8 @@ int mthca_process_mad(struct ib_device *ibdev,
 	int err;
 	u8 status;
 	u16 slid = in_wc ? in_wc->slid : be16_to_cpu(IB_LID_PERMISSIVE);
+	u16 prev_lid = 0;
+	struct ib_port_attr pattr;
 
 	/* Forward locally generated traps to the SM */
 	if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP &&
@@ -234,6 +241,9 @@ int mthca_process_mad(struct ib_device *ibdev,
 	} else
 		return IB_MAD_RESULT_SUCCESS;
 
+	if (!ib_query_port(ibdev, port_num, &pattr))
+		prev_lid = pattr.lid;
+
 	err = mthca_MAD_IFC(to_mdev(ibdev),
 			    mad_flags & IB_MAD_IGNORE_MKEY,
 			    mad_flags & IB_MAD_IGNORE_BKEY,
@@ -252,7 +262,7 @@ int mthca_process_mad(struct ib_device *ibdev,
 	}
 
 	if (!out_mad->mad_hdr.status) {
-		smp_snoop(ibdev, port_num, in_mad);
+		smp_snoop(ibdev, port_num, in_mad, prev_lid);
 		node_desc_override(ibdev, out_mad);
 	}
 

From weiny2 at llnl.gov  Tue Nov 25 10:59:37 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 25 Nov 2008 10:59:37 -0800
Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic -
	"Resource Temporarily unavailable"
In-Reply-To: <f0e08f230811250700w6647d80fi9389fa2ff9a62cd5@mail.gmail.com>
References: <C1EAC9C5E752D24C968FF091D446D8232DE802@ALTERNATEREALIT>
	<4DCBAA39733E8048992FB7737126041910FFD96A@gsmbnbp23es.firmwide.corp.gs.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE805@ALTERNATEREALIT>
	<f0e08f230811250649p2562820dxff1f5ca7c175bb83@mail.gmail.com>
	<C1EAC9C5E752D24C968FF091D446D8232DE808@ALTERNATEREALIT>
	<f0e08f230811250700w6647d80fi9389fa2ff9a62cd5@mail.gmail.com>
Message-ID: <20081125105937.7b12508b.weiny2@llnl.gov>

On Tue, 25 Nov 2008 10:00:22 -0500
"Hal Rosenstock" <hal.rosenstock at gmail.com> wrote:

> Hi Rob,
> 
> On Tue, Nov 25, 2008 at 9:54 AM, Robert Dunkley <Robert at saq.co.uk> wrote:
> > Hi Hal,
> >
> > Thank you for your help.
> >
> > Ibstat on MachineB:
> > CA 'mthca0'
> >        CA type: MT25204
> >        Number of ports: 1
> >        Firmware version: 1.2.0
> >        Hardware version: a0
> >        Node GUID: 0x0002c9020022d428
> >        System image GUID: 0x0002c9020022d42b
> >        Port 1:
> >                State: Down
> 
> Is machine A on ? Is mthca loaded there ? If so, this should at least
> be init but the driver errors below may preclude this from occurring.
> 
> >                Physical state: Polling
> >                Rate: 10
> >                Base lid: 0
> >                LMC: 0
> >                SM lid: 0
> >                Capability mask: 0x02510a6a
> >                Port GUID: 0x0002c9020022d429
> >
> > Machine A is operating normally with the exception of Infiniband which
> > broke after powering down Machine B and did not recover once Machine B
> > was powered on again. An extract from the log of Machine A:
> > Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed
> > (-11)
> > Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed
> > (-11)
> > Nov 25 14:30:41 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed
> > (-11)
> > Nov 25 14:30:51 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed
> > (-11)
> > Nov 25 14:31:01 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed
> > (-11)
> > Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed
> > (-11)
> > Nov 25 14:31:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed
> > (-11)
> > Nov 25 14:32:01 mrhappy last message repeated 3 times
> > Nov 25 14:32:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed
> > (-11)
> 
> -11 is EAGAIN. Not sure what this is used for in the mthca driver.

When we have seen these errors, it has meant the firmware is in a bad state and
is not responsive.  Unfortunately for you, in this situation we have been
forced to reboot to correct the problem.  (If rebooting is problematic for you
perhaps Mellanox has a way around this.)

For the future, speak with Mellanox to ensure you have the latest firmware, as
that has fixed a number of issues for us.

Ira

> 
> Can you unload and reload the IB stack especially mthca driver ?
> 
> -- Hal
> 
> > Thanks again,
> >
> > Rob
> >
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> > Sent: 25 November 2008 14:49
> > To: Robert Dunkley
> > Cc: Baur, Eric; general at lists.openfabrics.org
> > Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource
> > Temporarily unavailable"
> >
> > On Tue, Nov 25, 2008 at 9:39 AM, Robert Dunkley <Robert at saq.co.uk>
> > wrote:
> >> Hi Eric,
> >>
> >> Thanks for the response. OpenSM is running and set to start on bootup
> > on
> >> MachineB:
> >> ps aux | grep open
> >> root      5616  0.0  0.1 142004  1396 ?        Sl   13:39   0:00
> >> /usr/sbin/opensm -t 200 -f /var/log/opensm.log -g 0
> >>
> >> The log on Machine B just logs this every 10 seconds:
> >> Nov 25 14:34:21 148541 [477A7940] 0x01 ->
> >> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal
> >> OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
> >> Nov 25 14:34:31 153173 [477A7940] 0x80 -> SM port is down
> >>
> >> Ibstat confirms port is in polling state on MachineB.
> >
> > Is the port in init or down ?
> >
> >> MachineA however is in a bad state,
> >
> > Any additional details on this ?
> >
> > Can you kill/unload all the ib stuff and reload it ? That would be
> > gentler than rebooting.
> >
> > -- Hal
> >
> >>I tried the openibd restart command, it accepted the
> >> command but after 5 minutes shows no progress of doing anything and is
> >> just at the cursor. Is some sort of forced restart of openibd
> > possible?
> >>
> >> Thanks,
> >>
> >> Rob
> >>
> >>
> >> -----Original Message-----
> >> From: Baur, Eric [mailto:Eric.Baur at gs.com]
> >> Sent: 25 November 2008 14:31
> >> To: Robert Dunkley
> >> Subject: RE: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
> >> Temporarily unavailable"
> >>
> >> Robert-
> >>
> >> Is OpenSM set to start on boot?
> >>                chkconfig --list | grep opensmd
> >>
> >> If not:         chkconfig opensmd on
> >> and:            /etc/init.d/opensmd start
> >>
> >> You can also restart openib without rebooting the machines.
> >>                /etc/init.d/openibd restart
> >>
> >> -Eric
> >>
> >> -----Original Message-----
> >> From: general-bounces at lists.openfabrics.org
> >> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Robert
> >> Dunkley
> >> Sent: Tuesday, November 25, 2008 9:21 AM
> >> To: general at lists.openfabrics.org
> >> Subject: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
> >> Temporarily unavailable"
> >>
> >> Hi everyone,
> >>
> >> I'm using a setup of two machines (Lets call them A and B) directly
> >> connected by 1 cable. Each machine has a Mellanox MT25204 (Gen3
> > Mellanox
> >> PCI-E Infiniband card) and uses IPOIB, they run Centos 5.2 with OFED
> > 1.3
> >> installed, Machine B runs OpenSM.
> >>
> >> All was working fine. I shutdown Machine A did some maintenance and
> > then
> >> powered it on again, everything is OK again. I then shutdown Machine B
> >> (The one running OpenSM), this seemed to really upset Machine A. After
> >> booting Machine B again, Machine B looks OK with the port down and in
> >> polling state. Machine A however gives the following error if I run
> >> ibstat: ibpanic: [11406] main: stat of IB device 'mthca0' failed:
> >> (Resource temporarily unavailable)
> >>
> >> I don't want to reboot Machine A as it must synch data with Machine B
> >> over the Infiniband link first. Does anyone have any idea how to fix
> >> machine A?
> >>
> >> Thanks,
> >>
> >> Rob
> >>
> >> The SAQ Group
> >>
> >> Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
> >> SEMTEC Limited Trading as SAQ is Registered in England & Wales
> >> Company Number: 06481952
> >>
> >>
> >>
> >> http://www.saqnet.co.uk AS29219
> >>
> >> SAQ Group Delivers high quality, honestly priced communication and
> > I.T.
> >> services to UK Business.
> >>
> >> DSL : Domains : Email : Hosting : CoLo : Servers : Racks : Transit :
> >> Backups : Managed Networks : Remote Support.
> >>
> >> Find us in http:// www. thebestof.co.uk/petersfield
> >>
> >> _______________________________________________
> >> general mailing list
> >> general at lists.openfabrics.org
> >> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>
> >> To unsubscribe, please visit
> >> http:// openib.org/mailman/listinfo/openib-general
> >>
> >
> 


From rdreier at cisco.com  Tue Nov 25 14:54:37 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 25 Nov 2008 14:54:37 -0800
Subject: [ofa-general] Re: [PATCH 1 of 2 V2] libmlx4: Fix race condition in
	create/destroy QP
In-Reply-To: <200811250840.07944.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Tue, 25 Nov 2008 08:40:07 +0200")
References: <200811250840.07944.jackm@dev.mellanox.co.il>
Message-ID: <adaod03e7rm.fsf@cisco.com>

Thanks for double checking and resending... I applied and pushed out.

Kind of amazing that my bleary eyes noticed the one bug :)


From rdreier at cisco.com  Tue Nov 25 14:55:59 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 25 Nov 2008 14:55:59 -0800
Subject: [ofa-general] Re: [PATCH 2 of 2] libmthca: Fix race condition in
	create/destroy QP
In-Reply-To: <200811221154.02427.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Sat, 22 Nov 2008 11:54:01 +0200")
References: <200811221154.02427.jackm@dev.mellanox.co.il>
Message-ID: <adak5are7pc.fsf@cisco.com>

thanks, applied


From rdreier at cisco.com  Tue Nov 25 14:56:52 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 25 Nov 2008 14:56:52 -0800
Subject: [ofa-general] RE: [PATCH 03/10] RDMA/nes: Remove tx_free_list
In-Reply-To: <60BEFF3FBD4C6047B0F13F205CAFA3830310DC721E@azsmsx501.amr.corp.intel.com>
	(Chien Tin Tung's message of "Mon, 24 Nov 2008 15:14:30 -0700")
References: <20081121205044.GA7424@ctung-MOBL> <adawseseqjh.fsf@cisco.com>
	<60BEFF3FBD4C6047B0F13F205CAFA3830310DC721E@azsmsx501.amr.corp.intel.com>
Message-ID: <adafxlfe7nv.fsf@cisco.com>

 > We were trying to make the minimum change to the code.  There is no reason left
 > for get_free_pkt().  I can rework the patch to remove it.

Please do.  Since you changed the signature of get_free_pkt(), you have
to touch every call site anyway, so may as well call dev_alloc_skb()
directly and delete a few more lines of code.

 - R.


From rdreier at cisco.com  Tue Nov 25 15:13:51 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 25 Nov 2008 15:13:51 -0800
Subject: [ofa-general] Re: [PATCH] IB/ehca: Change misleading error message
In-Reply-To: <200811251358.06729.fenkes@de.ibm.com> (Joachim Fenkes's message
	of "Tue, 25 Nov 2008 13:58:06 +0100")
References: <200806061835.43802.fenkes@de.ibm.com>
	<48499C11.7030504@gmail.com> <200811251358.06729.fenkes@de.ibm.com>
Message-ID: <aday6z7csb4.fsf@cisco.com>

 > The error message printed when the eHCA driver prevents memory hotplug is
 > misleading -- the user might think that hot-removing the lhca, hotplugging
 > memory, then hot-adding the lhca again will work, but it doesn't.

That's too bad... I applied this patch, but out of curiosity, why
doesn't the hot-remove/hot-add work?  I would have thought that
re-registering all of memory after the hot-add would do the right thing.


From rdreier at cisco.com  Tue Nov 25 15:15:57 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 25 Nov 2008 15:15:57 -0800
Subject: [ofa-general] [PATCH 3/3] IB/ipath - improve UD loopback
	performance by allocating temp array once
In-Reply-To: <20081023195017.10020.33878.stgit@eng-46.mv.qlogic.com> (Ralph
	Campbell's message of "Thu, 23 Oct 2008 12:50:17 -0700")
References: <20081023195001.10020.96260.stgit@eng-46.mv.qlogic.com>
	<20081023195017.10020.33878.stgit@eng-46.mv.qlogic.com>
Message-ID: <adatz9vcs7m.fsf@cisco.com>

thanks, applied


From ddiss at sgi.com  Tue Nov 25 21:12:13 2008
From: ddiss at sgi.com (David Disseldorp)
Date: Wed, 26 Nov 2008 16:12:13 +1100
Subject: [ofa-general] [PATCH] iser: avoid recv buf exhaustion
In-Reply-To: <49292119.9080105@voltaire.com>
References: <1227247845-16023-1-git-send-email-ddiss@sgi.com>
	<49292119.9080105@voltaire.com>
Message-ID: <20081126161213.000065c3@snort.melbourne.sgi.com>

Thanks for the feedback Or, comments below.

On Sun, 23 Nov 2008 11:23:37 +0200
Or Gerlitz <ogerlitz at voltaire.com> wrote:

> David Disseldorp wrote:
> > iSCSI/iSER targets may send PDUs without a prior request from the initiator; RFC 5046 refers to these PDUs as "unexpected". NOP-In PDUs with itt=RESERVED and Asynchronous Message PDUs fall into this category. Currently, when an iSER target sends an "unexpected" PDU, the initiator's recv buffer consumed by the PDU is not replaced. If more than initial_post_recv_bufs_num "unexpected" PDUs are received, the receive queue will run out of receive work requests.
> Assuming these target initiated NOP-Ins are echoed back by the 
> initiator, the current code of iser_send_control would post a receive 
> buffer when sending the NOP-Out which will account for the buffer 
> consumed by the NOP-In. So we are left with the Asynchronous PDUs  
> for which your patch indeed seems to fix a hole in the implementation.

Yes, target initiated "ping" NOP-Ins with a valid TTT do not currently
result in receive buffer depletion, however targets may use a NOP-In PDU
with both ITT and TTT set to RESERVED for the sole purpose of advertising
the command window counters (ExpCmdSN and MaxCmdSN). These PDUs do not
require a NOP-Out PDU from the initiator.

Likewise the Initiator may send a NOP-Out with both ITT and TTT set to
RESERVED, in this case a recv buf for a target response should not be
posted.

> >
> > This patch ensures recv buffers consumed by "unexpected" PDUs are replaced prior to sending the next control-type PDU.
> The practice used by the patch is to account for unexpected receives and
> refill the receive buffer queue whenever possible with as many buffers as
> unexpected receives took place since the last refill attempt. To ease
> future maintenance and debugging, and to keep the code simple, I would
> prefer a patch with zero footprint in the iser_send_xxx functions,
> something like accounting for --async-- receives and filling in the
> missing buffers when calling iser_post_receive_control.

No problem, I'll rework the patch to post "unexpected" buffers along
with the response buffer in iser_post_receive_control().
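
The accounting discussed above can be sketched in user space roughly as
follows. This is only a model under stated assumptions, not the driver code:
the sk_ names are made up for the sketch, the opcode values are the ones
RFC 3720 assigns to NOP-In and Asynchronous Message, and a C11 atomic stands
in for the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Opcode values as assigned in RFC 3720; the SK_ names are local to this sketch. */
#define SK_OPCODE_MASK   0x3f
#define SK_OP_NOOP_IN    0x20
#define SK_OP_ASYNC_MSG  0x32
#define SK_RESERVED_ITT  0xffffffffu

/* Hypothetical stand-in for a per-connection counter of unreplaced buffers. */
static atomic_int sk_unexpected_pdus;

/* True if the PDU arrived without a matching request from the initiator. */
static bool sk_is_unexpected(uint8_t raw_opcode, uint32_t itt)
{
	uint8_t opcode = raw_opcode & SK_OPCODE_MASK;

	if (opcode == SK_OP_NOOP_IN && itt == SK_RESERVED_ITT)
		return true;
	return opcode == SK_OP_ASYNC_MSG;
}

/* Receive side: remember that one more recv buffer needs replacing. */
static void sk_on_receive(uint8_t raw_opcode, uint32_t itt)
{
	if (sk_is_unexpected(raw_opcode, itt))
		atomic_fetch_add(&sk_unexpected_pdus, 1);
}

/* Post side: one buffer for the response plus every outstanding replacement. */
static int sk_buffers_to_post(void)
{
	int pending = atomic_exchange(&sk_unexpected_pdus, 0);

	assert(pending >= 0);
	return 1 + pending;
}
```

Reading and clearing the counter in one exchange means a PDU that completes
while buffers are being posted is simply picked up by the next call.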

Cheers, Dave


From ddiss at sgi.com  Tue Nov 25 21:19:22 2008
From: ddiss at sgi.com (David Disseldorp)
Date: Wed, 26 Nov 2008 16:19:22 +1100
Subject: [ofa-general] [PATCH] iser: avoid recv buf exhaustion v2
In-Reply-To: <20081126161213.000065c3@snort.melbourne.sgi.com>
References: <20081126161213.000065c3@snort.melbourne.sgi.com>
Message-ID: <1227676762-23505-1-git-send-email-ddiss@sgi.com>

iSCSI/iSER targets may send PDUs without a prior request from the initiator;
RFC 5046 refers to these PDUs as "unexpected". NOP-In PDUs with itt=RESERVED
and Asynchronous Message PDUs fall into this category.

The number of active "unexpected" PDUs an iSER target may have at any time is
governed by the MaxOutstandingUnexpectedPDUs key, which is not yet supported.

Currently, when an iSER target sends an "unexpected" PDU, the initiator's recv
buffer consumed by the PDU is not replaced. If more than initial_post_recv_bufs_num
"unexpected" PDUs are received, the receive queue will run out of receive
work requests.

This patch ensures recv buffers consumed by "unexpected" PDUs are replaced
in the next iser_post_receive_control() call.

Version 2:
o replace unexpected recv bufs in iser_post_receive_control, transparent
  to iser_send_* functions.

Signed-off-by: David Disseldorp <ddiss at sgi.com>
Signed-off-by: Ken Sandars <ksandars at sgi.com>
---
 drivers/infiniband/ulp/iser/iscsi_iser.h     |    3 +
 drivers/infiniband/ulp/iser/iser_initiator.c |  134 ++++++++++++++++++--------
 drivers/infiniband/ulp/iser/iser_verbs.c     |    1 +
 3 files changed, 97 insertions(+), 41 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 81a8262..8611195 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -252,6 +252,9 @@ struct iser_conn {
 	wait_queue_head_t	     wait;          /* waitq for conn/disconn  */
 	atomic_t                     post_recv_buf_count; /* posted rx count   */
 	atomic_t                     post_send_buf_count; /* posted tx count   */
+	atomic_t                     unexpected_pdu_count;/* count of received *
+							   * unexpected pdus   *
+							   * not yet retired   */
 	char 			     name[ISER_OBJECT_NAME_SIZE];
 	struct iser_page_vec         *page_vec;     /* represents SG to fmr maps*
 						     * maps serialized as tx is*/
diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index cdd2831..a0c56a4 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -183,14 +183,8 @@ static int iser_post_receive_control(struct iscsi_conn *conn)
 	struct iser_regd_buf *regd_data;
 	struct iser_dto      *recv_dto = NULL;
 	struct iser_device  *device = iser_conn->ib_conn->device;
-	int rx_data_size, err = 0;
-
-	rx_desc = kmem_cache_alloc(ig.desc_cache, GFP_NOIO);
-	if (rx_desc == NULL) {
-		iser_err("Failed to alloc desc for post recv\n");
-		return -ENOMEM;
-	}
-	rx_desc->type = ISCSI_RX;
+	int rx_data_size, err;
+	int posts, outstanding_unexp_pdus;
 
 	/* for the login sequence we must support rx of upto 8K; login is done
 	 * after conn create/bind (connect) and conn stop/bind (reconnect),
@@ -201,46 +195,80 @@ static int iser_post_receive_control(struct iscsi_conn *conn)
 	else /* FIXME till user space sets conn->max_recv_dlength correctly */
 		rx_data_size = 128;
 
-	rx_desc->data = kmalloc(rx_data_size, GFP_NOIO);
-	if (rx_desc->data == NULL) {
-		iser_err("Failed to alloc data buf for post recv\n");
-		err = -ENOMEM;
-		goto post_rx_kmalloc_failure;
-	}
+	outstanding_unexp_pdus =
+		atomic_xchg(&iser_conn->ib_conn->unexpected_pdu_count, 0);
 
-	recv_dto = &rx_desc->dto;
-	recv_dto->ib_conn = iser_conn->ib_conn;
-	recv_dto->regd_vector_len = 0;
+	/*
+	 * in addition to the response buffer, replace those consumed by
+	 * unexpected pdus.
+	 */
+	for (posts = 0; posts < 1 + outstanding_unexp_pdus; posts++) {
+		rx_desc = kmem_cache_alloc(ig.desc_cache, GFP_NOIO);
+		if (rx_desc == NULL) {
+			iser_err("Failed to alloc desc for post recv %d\n",
+				 posts);
+			err = -ENOMEM;
+			goto post_rx_cache_alloc_failure;
+		}
+		rx_desc->type = ISCSI_RX;
+		rx_desc->data = kmalloc(rx_data_size, GFP_NOIO);
+		if (rx_desc->data == NULL) {
+			iser_err("Failed to alloc data buf for post recv %d\n",
+				 posts);
+			err = -ENOMEM;
+			goto post_rx_kmalloc_failure;
+		}
 
-	regd_hdr = &rx_desc->hdr_regd_buf;
-	memset(regd_hdr, 0, sizeof(struct iser_regd_buf));
-	regd_hdr->device  = device;
-	regd_hdr->virt_addr  = rx_desc; /* == &rx_desc->iser_header */
-	regd_hdr->data_size  = ISER_TOTAL_HEADERS_LEN;
+		recv_dto = &rx_desc->dto;
+		recv_dto->ib_conn = iser_conn->ib_conn;
+		recv_dto->regd_vector_len = 0;
 
-	iser_reg_single(device, regd_hdr, DMA_FROM_DEVICE);
+		regd_hdr = &rx_desc->hdr_regd_buf;
+		memset(regd_hdr, 0, sizeof(struct iser_regd_buf));
+		regd_hdr->device  = device;
+		regd_hdr->virt_addr  = rx_desc; /* == &rx_desc->iser_header */
+		regd_hdr->data_size  = ISER_TOTAL_HEADERS_LEN;
 
-	iser_dto_add_regd_buff(recv_dto, regd_hdr, 0, 0);
+		iser_reg_single(device, regd_hdr, DMA_FROM_DEVICE);
 
-	regd_data = &rx_desc->data_regd_buf;
-	memset(regd_data, 0, sizeof(struct iser_regd_buf));
-	regd_data->device  = device;
-	regd_data->virt_addr  = rx_desc->data;
-	regd_data->data_size  = rx_data_size;
+		iser_dto_add_regd_buff(recv_dto, regd_hdr, 0, 0);
 
-	iser_reg_single(device, regd_data, DMA_FROM_DEVICE);
+		regd_data = &rx_desc->data_regd_buf;
+		memset(regd_data, 0, sizeof(struct iser_regd_buf));
+		regd_data->device  = device;
+		regd_data->virt_addr  = rx_desc->data;
+		regd_data->data_size  = rx_data_size;
 
-	iser_dto_add_regd_buff(recv_dto, regd_data, 0, 0);
+		iser_reg_single(device, regd_data, DMA_FROM_DEVICE);
 
-	err = iser_post_recv(rx_desc);
-	if (!err)
-		return 0;
+		iser_dto_add_regd_buff(recv_dto, regd_data, 0, 0);
 
-	/* iser_post_recv failed */
+		err = iser_post_recv(rx_desc);
+		if (err) {
+			iser_err("Failed iser_post_recv for post %d\n", posts);
+			goto post_rx_post_recv_failure;
+		}
+	}
+	/* all posts successful */
+	return 0;
+
+post_rx_post_recv_failure:
 	iser_dto_buffs_release(recv_dto);
 	kfree(rx_desc->data);
 post_rx_kmalloc_failure:
 	kmem_cache_free(ig.desc_cache, rx_desc);
+post_rx_cache_alloc_failure:
+	if (posts > 0) {
+		/*
+		 * response buffer posted, but did not replace all unexpected
+		 * pdu recv bufs. Ignore error, retry occurs next send
+		 */
+		outstanding_unexp_pdus -= (posts - 1);
+		err = 0;
+	}
+	atomic_add(outstanding_unexp_pdus,
+		   &iser_conn->ib_conn->unexpected_pdu_count);
+
 	return err;
 }
 
@@ -274,8 +302,10 @@ int iser_conn_set_full_featured_mode(struct iscsi_conn *conn)
 	struct iscsi_iser_conn *iser_conn = conn->dd_data;
 
 	int i;
-	/* no need to keep it in a var, we are after login so if this should
-	 * be negotiated, by now the result should be available here */
+	/*
+	 * FIXME this value should be declared to the target during login with
+	 * the MaxOutstandingUnexpectedPDUs key when supported
+	 */
 	int initial_post_recv_bufs_num = ISER_MAX_RX_MISC_PDUS;
 
 	iser_dbg("Initially post: %d\n", initial_post_recv_bufs_num);
@@ -478,6 +508,7 @@ int iser_send_control(struct iscsi_conn *conn,
 	int err = 0;
 	struct iser_regd_buf *regd_buf;
 	struct iser_device *device;
+	unsigned char opcode;
 
 	if (!iser_conn_state_comp(iser_conn->ib_conn, ISER_CONN_UP)) {
 		iser_err("Failed to send, conn: 0x%p is not up\n", iser_conn->ib_conn);
@@ -512,10 +543,16 @@ int iser_send_control(struct iscsi_conn *conn,
 				       data_seg_len);
 	}
 
-	if (iser_post_receive_control(conn) != 0) {
-		iser_err("post_rcv_buff failed!\n");
-		err = -ENOMEM;
-		goto send_control_error;
+	opcode = task->hdr->opcode & ISCSI_OPCODE_MASK;
+
+	/* post recv buffer for response if one is expected */
+	if (!((opcode == ISCSI_OP_NOOP_OUT)
+	 && (task->hdr->itt == RESERVED_ITT))) {
+		if (iser_post_receive_control(conn) != 0) {
+			iser_err("post_rcv_buff failed!\n");
+			err = -ENOMEM;
+			goto send_control_error;
+		}
 	}
 
 	err = iser_post_send(mdesc);
@@ -586,6 +623,21 @@ void iser_rcv_completion(struct iser_desc *rx_desc,
 	 * parallel to the execution of iser_conn_term. So the code that waits *
 	 * for the posted rx bufs refcount to become zero handles everything   */
 	atomic_dec(&conn->ib_conn->post_recv_buf_count);
+
+	/*
+	 * if an unexpected PDU was received then the recv wr consumed must
+	 * be replaced, this is done in the next send of a control-type PDU
+	 */
+	if ((opcode == ISCSI_OP_NOOP_IN)
+	 && (hdr->itt == RESERVED_ITT)) {
+		/* nop-in with itt = 0xffffffff */
+		atomic_inc(&conn->ib_conn->unexpected_pdu_count);
+	}
+	else if (opcode == ISCSI_OP_ASYNC_EVENT) {
+		/* asynchronous message */
+		atomic_inc(&conn->ib_conn->unexpected_pdu_count);
+	}
+	/* a reject PDU consumes the recv buf posted for the response */
 }
 
 void iser_snd_completion(struct iser_desc *tx_desc)
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 26ff621..6dc6b17 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -498,6 +498,7 @@ void iser_conn_init(struct iser_conn *ib_conn)
 	init_waitqueue_head(&ib_conn->wait);
 	atomic_set(&ib_conn->post_recv_buf_count, 0);
 	atomic_set(&ib_conn->post_send_buf_count, 0);
+	atomic_set(&ib_conn->unexpected_pdu_count, 0);
 	atomic_set(&ib_conn->refcount, 1);
 	INIT_LIST_HEAD(&ib_conn->conn_list);
 	spin_lock_init(&ib_conn->lock);
-- 
1.5.4.5
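
The partial-failure accounting in iser_post_receive_control() above can be
modelled in user space as in the sketch below. It assumes a caller-supplied
post_one() callback in place of the real allocation and iser_post_recv()
steps, and a C11 atomic in place of atomic_t; all sk_ names are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for ib_conn->unexpected_pdu_count. */
static atomic_int sk_unexp_count;

/* Demo stand-in for "allocate and post one recv buffer": fails at one index. */
static int sk_fail_at = -1;

static int sk_fake_post(int idx)
{
	return idx == sk_fail_at ? -1 : 0;
}

/*
 * Post one response buffer plus one buffer per outstanding unexpected PDU.
 * post_one() returns 0 on success and a negative value on failure.
 */
static int sk_post_receive_control(int (*post_one)(int idx))
{
	int outstanding = atomic_exchange(&sk_unexp_count, 0);
	int posts, err = 0;

	assert(outstanding >= 0);

	for (posts = 0; posts < 1 + outstanding; posts++) {
		err = post_one(posts);
		if (err)
			break;
	}
	if (!err)
		return 0;	/* all posts successful */

	if (posts > 0) {
		/* response buffer was posted; only replacements are missing */
		outstanding -= posts - 1;
		err = 0;
	}
	/* hand the unposted replacements back for the next attempt */
	atomic_fetch_add(&sk_unexp_count, outstanding);
	return err;
}
```

With three PDUs outstanding and a failure on the third post, the call still
reports success (the response buffer is in place) and leaves two
replacements on the counter for the next control PDU.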


From jackm at dev.mellanox.co.il  Tue Nov 25 23:42:46 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 26 Nov 2008 09:42:46 +0200
Subject: [ofa-general] Re: [PATCH 1 of 2 V2] libmlx4: Fix race condition in
	create/destroy QP
In-Reply-To: <adaod03e7rm.fsf@cisco.com>
References: <200811250840.07944.jackm@dev.mellanox.co.il>
	<adaod03e7rm.fsf@cisco.com>
Message-ID: <200811260942.47110.jackm@dev.mellanox.co.il>

On Wednesday 26 November 2008 00:54, Roland Dreier wrote:
> Kind of amazing that my bleary eyes noticed the one bug :)

Thank heaven you did!

- Jack


From vlad at dev.mellanox.co.il  Wed Nov 26 00:53:28 2008
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 26 Nov 2008 10:53:28 +0200
Subject: [ofa-general] [PATCH] ipoib: do not join broadcast group if
	interface is brought down
In-Reply-To: <49246EB7.3070607@Voltaire.COM>
References: <49246EB7.3070607@Voltaire.COM>
Message-ID: <492D0E88.6080009@dev.mellanox.co.il>

Yossi Etigin wrote:
> Because ipoib_workqueue is not flushed when ipoib interface is brought 
> down,
> ipoib_mcast_join() may trigger a join to the broadcast group after 
> priv->broadcast
> was set to NULL (during cleanup). This will cause ipoib to be joined 
> to the
> broadcast group when interface is down.
> As a side effect, this breaks the optimization of setting qkey only 
> when joining
> the broadcast group.
>
> Signed-off-by: Yossi Etigin <yosefe at voltaire.com>
>
> -- 
>
> Fix bugzilla 1370.
>
> Index: b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> ===================================================================
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c    2008-11-19 
> 21:33:54.000000000 +0200
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c    2008-11-19 
> 21:40:12.000000000 +0200
> @@ -565,7 +565,8 @@ void ipoib_mcast_join_task(struct work_s
>             ipoib_warn(priv, "ib_query_port failed\n");
>     }
>
> -    if (!priv->broadcast) {
> +    rtnl_lock();
> +    if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags) && 
> !priv->broadcast) {
>         struct ipoib_mcast *broadcast;
>
>         broadcast = ipoib_mcast_alloc(dev, 1);
> @@ -576,6 +577,7 @@ void ipoib_mcast_join_task(struct work_s
>                 queue_delayed_work(ipoib_workqueue,
>                            &priv->mcast_join_task, HZ);
>             mutex_unlock(&mcast_mutex);
> +            rtnl_unlock();
>             return;
>         }
>
> @@ -587,6 +589,7 @@ void ipoib_mcast_join_task(struct work_s
>         __ipoib_mcast_add(dev, priv->broadcast);
>         spin_unlock_irq(&priv->lock);
>     }
> +    rtnl_unlock();
>
>     if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
>         if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))

Hi Yossi,
I got the following kernel oops on SLES 10 (2.6.16.21-0.8-smp) using the 
patch above.

To reproduce, run:
rmmod ib_ipoib


Unable to handle kernel NULL pointer dereference at virtual address 00000068
 printing eip:
f8c5e3c4
*pde = 7a0e8067
Oops: 0000 [#1]
SMP
last sysfs file: /class/infiniband/mthca0/node_desc
Modules linked in: ib_ipoib ib_cm ib_sa ib_uverbs ib_umad mlx4_ib 
mlx4_core ib_mthca ib_mad ib_core memtrack autofs4 nfs lockd nfs_acl 
sunrpc ipv6 af_packe
CPU:    0
EIP:    0060:[<f8c5e3c4>]    Tainted: G     U VLI
EFLAGS: 00010202   (2.6.16.21-0.8-smp #1)
EIP is at ipoib_mcast_join_task+0x134/0x24d [ib_ipoib]
eax: 00000000   ebx: f6a2c3e8   ecx: 00000000   edx: 00000000
esi: f6a2c56c   edi: f6a2c12c   ebp: f6a2c380   esp: f6a2bf0c
ds: 007b   es: 007b   ss: 0068
Process ipoib (pid: 7858, threadinfo=f6a2a000 task=f7e3c0f0)
Stack: <0>f6a2c000 00000004 00000004 00000004 00000020 02510a68 80000000 
00000000
       00000000 00020040 0400000f 02001200 00000501 f6a2c3e8 f6a2c3ec 
f73447c0
       00000292 c012d85e f8c5e290 f6a2c3e8 f73447cc f73447c0 f73447d4 
c012e052
Call Trace:
 [<c012d85e>] run_workqueue+0x7f/0xba
 [<f8c5e290>] ipoib_mcast_join_task+0x0/0x24d [ib_ipoib]
 [<c012e052>] worker_thread+0x0/0x11e
 [<c012e13f>] worker_thread+0xed/0x11e
 [<c011a067>] default_wake_function+0x0/0xc
 [<c0130895>] kthread+0x9d/0xc9
 [<c01307f8>] kthread+0x0/0xc9
 [<c0102005>] kernel_thread_helper+0x5/0xb
Code: 21 63 c7 8b 75 04 81 c6 3c 01 00 00 a5 a5 a5 a5 89 5d 28 8b 04 24 
89 da e8 b3 f5 ff ff b0 01 86 45 00 fb e8 62 92 5e c7 8b 55 28 <8b> 42 
68 a8 08 75

Regards,
Vladimir


From sashak at voltaire.com  Wed Nov 26 02:13:05 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 26 Nov 2008 12:13:05 +0200
Subject: [ofa-general] Re: [PATCH] Opensm: main exit codes
In-Reply-To: <492B24BA.40303@llnl.gov>
References: <4923678D.3080701@llnl.gov>
	<20081123185836.GU21967@sashak.voltaire.com>
	<492AF3AE.3060605@llnl.gov>
	<20081124190251.GT6183@sashak.voltaire.com>
	<492B24BA.40303@llnl.gov>
Message-ID: <20081126101305.GE12270@sashak.voltaire.com>

Hi Tim,

On 14:03 Mon 24 Nov     , Timothy A. Meier wrote:
> > 
> > And are there any of such tools? Or any *real* use?
> >
> 
> Chicken/Egg?  Currently, we depend on only ZERO or non-zero.  Although OpenSM returns "other" values
> on exit, they aren't really formalized or documented.  Hence the patch. ;^)

And after this patch it is still not formalized - there are other
places in OpenSM where exit(N) is called. For example, what could you do
with exit(YY_EXIT_FAILURE)?

> Personally, I have (and create) several different versions of opensm with small customizations,
> and test them on our cluster testbeds.  I often will start/stop them in a variety of configurations
> (with and without plugins, more than one sm on a node, etc.) and if and when opensm doesn't
> startup normally, it would be nice to have a meaningful exit code.
> 
> Perhaps others might find it useful as well, or for some future use.

Maybe, but for this, clear rules should be defined and applied, not just
several exit codes. Ideally such work could be done in parallel - OpenSM
and the analyzing tool (not a Chicken/Egg :)).

> But again, I originally considered this more as code cleanup.  Converting the exits, returns, and aborts
> to provide a more consistent interface to the system.

Ok, if that is the only purpose, we can do something like this (assuming
all exit(), abort(), etc. are converted, and not only those in main.c), but in
this case I would suggest starting with a very limited set of error codes, and
not adding OSM_EXIT_TYPE_NORMAL - "0" looks better and it is fine for the
system too. And in any case I don't see this as OFED material.
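
A very limited set could look something like the sketch below. The names and
values are hypothetical, picked only to illustrate the idea; nothing like
this exists in OpenSM today, and success stays a plain 0.

```c
#include <assert.h>

/*
 * Hypothetical names and values, chosen for this sketch only; OpenSM does
 * not define them. Success stays a plain 0.
 */
enum sk_exit_code {
	SK_EXIT_CONFIG = 1,	/* bad command line or configuration */
	SK_EXIT_INIT   = 2,	/* start-up failed */
	SK_EXIT_FATAL  = 3	/* unrecoverable error while running */
};

/* Map an exit status to a short description for a wrapper script or tool. */
static const char *sk_exit_str(int code)
{
	assert(code >= 0 && code <= 255);	/* range of a process exit status */

	switch (code) {
	case 0:			return "success";
	case SK_EXIT_CONFIG:	return "configuration error";
	case SK_EXIT_INIT:	return "initialization failure";
	case SK_EXIT_FATAL:	return "fatal runtime error";
	default:		return "undocumented exit code";
	}
}
```

Keeping the default case visible makes it obvious when some exit(N) call
elsewhere in the code has not been converted yet.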

Sasha


From sashak at voltaire.com  Wed Nov 26 02:19:19 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 26 Nov 2008 12:19:19 +0200
Subject: [ofa-general] Re: [PATCH 0/3] ibnetdiscover library "libibnetdisc"
In-Reply-To: <20081124134938.61c345e0.weiny2@llnl.gov>
References: <20081120163809.26a3c499.weiny2@llnl.gov>
	<20081123182741.GS21967@sashak.voltaire.com>
	<20081124094243.4dbcff51.weiny2@llnl.gov>
	<20081124191050.GU6183@sashak.voltaire.com>
	<20081124113005.4261cfd1.weiny2@llnl.gov>
	<20081124200151.GX6183@sashak.voltaire.com>
	<20081124134938.61c345e0.weiny2@llnl.gov>
Message-ID: <20081126101919.GF12270@sashak.voltaire.com>

Hi Ira,

On 13:49 Mon 24 Nov     , Ira Weiny wrote:
> 
> As long as the library exists any dependent package can of course use the
> library from whatever package we choose (libibnetdisc or infiniband-diags).  We
> have some code which is prototyped against ibnetdiscover but we plan on using
> this library instead.  This would be separate from infiniband-diags.  But we
> can just as easily put a dependency on infiniband-diags as on libibnetdisc.

Yes, it is possible to add a dependency. But I'm getting complaints about
too many in-management dependencies even now.

> The fact is that it was actually easier to put this in a new package rather
> than try and integrate with infiniband-diags.

I would disagree. In the latter case we need to deal with only one logical
change and not with all the "new package" issues.

> I thought it made for a very
> clean conversion by putting the library in as a new patch and then we could
> convert the diags as appropriate.
> 
> Anyway, I will integrate it as you say and resubmit the patch.

Thanks.

Sasha


From sashak at voltaire.com  Wed Nov 26 03:00:49 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 26 Nov 2008 13:00:49 +0200
Subject: [ofa-general] [PATCH] infiniband-diags/grouping: add 10G IP router
	devid
Message-ID: <20081126110049.GJ12270@sashak.voltaire.com>


Add 10G IP router device id for grouping.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 infiniband-diags/include/grouping.h |    1 +
 infiniband-diags/src/grouping.c     |    3 ++-
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/include/grouping.h b/infiniband-diags/include/grouping.h
index 3ba872c..e54efef 100644
--- a/infiniband-diags/include/grouping.h
+++ b/infiniband-diags/include/grouping.h
@@ -91,6 +91,7 @@ struct AllChassisList {
 #define VTR_DEVID_ISR2012		0x5a39
 #define VTR_DEVID_SFB2004		0x5a40
 #define VTR_DEVID_ISR2004		0x5a41
+#define VTR_DEVID_SRB2004		0x5a42
 
 enum ChassisType { UNRESOLVED_CT, ISR9288_CT, ISR9096_CT, ISR2012_CT, ISR2004_CT };
 enum ChassisSlot { UNRESOLVED_CS, LINE_CS, SPINE_CS, SRBD_CS };
diff --git a/infiniband-diags/src/grouping.c b/infiniband-diags/src/grouping.c
index e2b4488..f1a996f 100644
--- a/infiniband-diags/src/grouping.c
+++ b/infiniband-diags/src/grouping.c
@@ -242,7 +242,8 @@ static int is_spine(Node *node)
 static int is_line_24(Node *node)
 {
 	return (node->devid == VTR_DEVID_SLB24 ||
-		node->devid == VTR_DEVID_SLB24_DDR);
+		node->devid == VTR_DEVID_SLB24_DDR ||
+		node->devid == VTR_DEVID_SRB2004);
 }
 
 static int is_line_8(Node *node)
-- 
1.6.0.4.766.g6fc4a


From vlad at lists.openfabrics.org  Wed Nov 26 03:23:50 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 26 Nov 2008 03:23:50 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081126-0200 daily build status
Message-ID: <20081126112350.84B44E60B89@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From FENKES at de.ibm.com  Wed Nov 26 05:44:56 2008
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Wed, 26 Nov 2008 14:44:56 +0100
Subject: [ofa-general] Re: [PATCH] IB/ehca: Change misleading error message
In-Reply-To: <aday6z7csb4.fsf@cisco.com>
References: <200806061835.43802.fenkes@de.ibm.com>	<48499C11.7030504@gmail.com>
	<200811251358.06729.fenkes@de.ibm.com> <aday6z7csb4.fsf@cisco.com>
Message-ID: <OFDC616681.53A831E8-ONC125750D.0046F124-C125750D.004B8673@de.ibm.com>

Roland Dreier <rdreier at cisco.com> wrote on 26.11.2008 00:13:51:

> That's too bad... I applied this patch, but out of curiosity, why
> doesn't the hot-remove/hot-add work?  I would have thought that
> re-registering all of memory after the hot-add would do the right thing.

That's right, but right now, we simply try to register all of memory from 
KERNELBASE to high_memory, which works right until we have memory holes in 
the middle; then the hypervisor will reject our page registrations. Same 
goes for huge (16GB) pages, by the way. We're working on a solution to 
this.
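
As a purely illustrative user-space model of the problem (not how the driver
or the hypervisor actually represent memory; the sk_ names and the range
list are assumptions of the sketch), a single base-to-top registration is
only safe when the populated ranges cover that span without a gap:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one populated memory range: [start, end). */
struct sk_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Returns true if the sorted, non-overlapping ranges cover [base, top)
 * without a gap, i.e. one base..top registration would touch only
 * populated memory.
 */
static bool sk_covers_without_holes(const struct sk_range *r, size_t n,
				    uint64_t base, uint64_t top)
{
	uint64_t next = base;
	size_t i;

	assert(base <= top);

	for (i = 0; i < n && next < top; i++) {
		if (r[i].start > next)
			return false;	/* hole between next and r[i].start */
		if (r[i].end > next)
			next = r[i].end;
	}
	return next >= top;
}
```

When this check fails, each contiguous chunk would have to be registered
separately instead of the whole span at once.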

Cheers,
  Joachim


From marinal at voltaire.com  Wed Nov 26 08:07:41 2008
From: marinal at voltaire.com (Marina Lipshteyn)
Date: Wed, 26 Nov 2008 18:07:41 +0200
Subject: [ofa-general] documentation on Fat-Tree algorithm in OpenSM
Message-ID: <D2E118A1E7268F449C469EA2F301D929022FAAA1@exil.voltaire.com>

Hi,

 
I believe that the Fat-Tree algorithm has very poor documentation in
opensm. There is no description of how the port balancing is done. The
other algorithms, on the other hand, do have an explanation of their
balancing concept. I would like to ask if it is possible to add such a
description, at least at the level of a general concept explanation. This
would help in understanding the algorithm.

 
Thanks,

Marina.


From yosefe at Voltaire.COM  Wed Nov 26 08:27:27 2008
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Wed, 26 Nov 2008 18:27:27 +0200
Subject: [ofa-general] [PATCH v2] ipoib: do not join broadcast group if
 interface is brought down
In-Reply-To: <492D0E88.6080009@dev.mellanox.co.il>
References: <49246EB7.3070607@Voltaire.COM>
	<492D0E88.6080009@dev.mellanox.co.il>
Message-ID: <492D78EF.4010703@Voltaire.COM>

Because ipoib_workqueue is not flushed when ipoib interface is brought down,
ipoib_mcast_join() may trigger a join to the broadcast group after priv->broadcast
was set to NULL (during cleanup). This will cause ipoib to be joined to the
broadcast group when interface is down.
As a side effect, this breaks the optimization of setting qkey only when joining
the broadcast group.

Signed-off-by: Yossi Etigin <yosefe at voltaire.com>

--

Changes from v1:
 - Add checks in places that assumed priv->broadcast != NULL.

Fix bugzilla 1370.

---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

Index: b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2008-11-19 21:33:54.000000000 +0200
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2008-11-26 18:08:48.000000000 +0200
@@ -497,7 +497,7 @@ static void ipoib_mcast_join(struct net_
 		IB_SA_MCMEMBER_REC_PKEY		|
 		IB_SA_MCMEMBER_REC_JOIN_STATE;
 
-	if (create) {
+	if (create && priv->broadcast) {
 		comp_mask |=
 			IB_SA_MCMEMBER_REC_QKEY			|
 			IB_SA_MCMEMBER_REC_MTU_SELECTOR		|
@@ -565,7 +565,8 @@ void ipoib_mcast_join_task(struct work_s
 			ipoib_warn(priv, "ib_query_port failed\n");
 	}
 
-	if (!priv->broadcast) {
+	rtnl_lock();
+	if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags) && !priv->broadcast) {
 		struct ipoib_mcast *broadcast;
 
 		broadcast = ipoib_mcast_alloc(dev, 1);
@@ -576,6 +577,7 @@ void ipoib_mcast_join_task(struct work_s
 				queue_delayed_work(ipoib_workqueue,
 						   &priv->mcast_join_task, HZ);
 			mutex_unlock(&mcast_mutex);
+			rtnl_unlock();
 			return;
 		}
 
@@ -587,8 +589,10 @@ void ipoib_mcast_join_task(struct work_s
 		__ipoib_mcast_add(dev, priv->broadcast);
 		spin_unlock_irq(&priv->lock);
 	}
+	rtnl_unlock();
 
-	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
+	if (priv->broadcast &&
+	    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
 		if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
 			ipoib_mcast_join(dev, priv->broadcast, 0);
 		return;
@@ -617,7 +621,8 @@ void ipoib_mcast_join_task(struct work_s
 		return;
 	}
 
-	priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
+	if (priv->broadcast)
+		priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
 
 	if (!ipoib_cm_admin_enabled(dev)) {
 		rtnl_lock();


From alekseys at voltaire.com  Wed Nov 26 09:51:39 2008
From: alekseys at voltaire.com (Aleksey Senin)
Date: Wed, 26 Nov 2008 19:51:39 +0200
Subject: [ofa-general] [RMDA CM IPv6 support. PATCHv4 1/6] AF_INET6 support
	for rdma_bind_addr
Message-ID: <1227721899.3121.18.camel@alst60.voltaire.com>

Changes from v3:

This set of patches is based on the latest 2.6.28-rc4 kernel.


>From 3fd066360f33d4083e183c14b991ed6408d68726 Mon Sep 17 00:00:00 2001
From: Aleksey Senin <alekseys at voltaire.com>
Date: Wed, 13 Aug 2008 09:55:33 +0300
Subject: [PATCH] AF_INET6 support for rdma_bind_addr

Signed-off-by: Aleksey Senin <alekseys at voltaire.com>
---
 drivers/infiniband/core/cma.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index d951896..4728265 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2073,7 +2073,7 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr)
 	struct rdma_id_private *id_priv;
 	int ret;
 
-	if (addr->sa_family != AF_INET)
+	if (addr->sa_family != AF_INET && addr->sa_family != AF_INET6)
 		return -EAFNOSUPPORT;
 
 	id_priv = container_of(id, struct rdma_id_private, id);
-- 
1.5.6.dirty


From alekseys at voltaire.com  Wed Nov 26 09:55:30 2008
From: alekseys at voltaire.com (Aleksey Senin)
Date: Wed, 26 Nov 2008 17:55:30 +0000
Subject: [ofa-general] [RMDA CM IPv6 support. PATCHv4 2/6] AF_INET6
	case to cma_format_hdr function
In-Reply-To: <1227721899.3121.18.camel@alst60.voltaire.com>
References: <1227721899.3121.18.camel@alst60.voltaire.com>
Message-ID: <1227722130.3121.20.camel@alst60.voltaire.com>

>From 46f9e4ae3fadb174d26816df8932f97561479307 Mon Sep 17 00:00:00 2001
From: Aleksey Senin <alekseys at voltaire.com>
Date: Wed, 13 Aug 2008 10:01:05 +0300
Subject: [PATCH] AF_INET6 case to cma_format_hdr function

Signed-off-by: Aleksey Senin <alekseys at voltaire.com>
---
 drivers/infiniband/core/cma.c |   73 ++++++++++++++++++++++++++++------------
 1 files changed, 51 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 4728265..31f2aa2 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2113,32 +2113,61 @@ EXPORT_SYMBOL(rdma_bind_addr);
 static int cma_format_hdr(void *hdr, enum rdma_port_space ps,
 			  struct rdma_route *route)
 {
-	struct sockaddr_in *src4, *dst4;
 	struct cma_hdr *cma_hdr;
 	struct sdp_hh *sdp_hdr;
 
-	src4 = (struct sockaddr_in *) &route->addr.src_addr;
-	dst4 = (struct sockaddr_in *) &route->addr.dst_addr;
-
-	switch (ps) {
-	case RDMA_PS_SDP:
-		sdp_hdr = hdr;
-		if (sdp_get_majv(sdp_hdr->sdp_version) != SDP_MAJ_VERSION)
-			return -EINVAL;
-		sdp_set_ip_ver(sdp_hdr, 4);
-		sdp_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr;
-		sdp_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr;
-		sdp_hdr->port = src4->sin_port;
-		break;
-	default:
-		cma_hdr = hdr;
-		cma_hdr->cma_version = CMA_VERSION;
-		cma_set_ip_ver(cma_hdr, 4);
-		cma_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr;
-		cma_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr;
-		cma_hdr->port = src4->sin_port;
-		break;
+	if (route->addr.src_addr.ss_family == AF_INET) {
+		struct sockaddr_in *src4, *dst4;
+
+		src4 = (struct sockaddr_in *) &route->addr.src_addr;
+		dst4 = (struct sockaddr_in *) &route->addr.dst_addr;
+
+		switch (ps) {
+		case RDMA_PS_SDP:
+			sdp_hdr = hdr;
+			if (sdp_get_majv(sdp_hdr->sdp_version) != SDP_MAJ_VERSION)
+				return -EINVAL;
+			sdp_set_ip_ver(sdp_hdr, 4);
+			sdp_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr;
+			sdp_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr;
+			sdp_hdr->port = src4->sin_port;
+			break;
+		default:
+			cma_hdr = hdr;
+			cma_hdr->cma_version = CMA_VERSION;
+			cma_set_ip_ver(cma_hdr, 4);
+			cma_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr;
+			cma_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr;
+			cma_hdr->port = src4->sin_port;
+			break;
+		}
+	} else {
+		struct sockaddr_in6 *src6, *dst6;
+
+		src6 = (struct sockaddr_in6 *) &route->addr.src_addr;
+		dst6 = (struct sockaddr_in6 *) &route->addr.dst_addr;
+
+		switch (ps) {
+		case RDMA_PS_SDP:
+			sdp_hdr = hdr;
+			if (sdp_get_majv(sdp_hdr->sdp_version) != SDP_MAJ_VERSION)
+				return -EINVAL;
+			sdp_set_ip_ver(sdp_hdr, 6);
+			sdp_hdr->src_addr.ip6 = src6->sin6_addr;
+			sdp_hdr->dst_addr.ip6 = dst6->sin6_addr;
+			sdp_hdr->port = src6->sin6_port;
+			break;
+		default:
+			cma_hdr = hdr;
+			cma_hdr->cma_version = CMA_VERSION;
+			cma_set_ip_ver(cma_hdr, 6);
+			cma_hdr->src_addr.ip6 = src6->sin6_addr;
+			cma_hdr->dst_addr.ip6 = dst6->sin6_addr;
+			cma_hdr->port = src6->sin6_port;
+			break;
+		}
 	}
+
 	return 0;
 }
 
-- 
1.5.6.dirty


From alekseys at voltaire.com  Wed Nov 26 09:56:09 2008
From: alekseys at voltaire.com (Aleksey Senin)
Date: Wed, 26 Nov 2008 19:56:09 +0200
Subject: [ofa-general] [RMDA CM IPv6 support. PATCHv4 3/6] IPv6 support
	in cma_bind_any
In-Reply-To: <1227721899.3121.18.camel@alst60.voltaire.com>
References: <1227721899.3121.18.camel@alst60.voltaire.com>
Message-ID: <1227722169.3121.22.camel@alst60.voltaire.com>

>From 16579a6bd3da5d2f7fd46bc71261bf87f0baa6ae Mon Sep 17 00:00:00 2001
From: Aleksey Senin <alekseys at voltaire.com>
Date: Wed, 13 Aug 2008 10:03:16 +0300
Subject: [PATCH] IPv6 support in cma_bind_any

Use the sockaddr_storage structure instead of sockaddr_in so that
the wildcard address can also carry the IPv6 family

Signed-off-by: Aleksey Senin <alekseys at voltaire.com>
---
 drivers/infiniband/core/cma.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 31f2aa2..df22c5c 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1467,10 +1467,10 @@ static void cma_listen_on_all(struct rdma_id_private *id_priv)
 
 static int cma_bind_any(struct rdma_cm_id *id, sa_family_t af)
 {
-	struct sockaddr_in addr_in;
+	struct sockaddr_storage addr_in;
 
 	memset(&addr_in, 0, sizeof addr_in);
-	addr_in.sin_family = af;
+	addr_in.ss_family = af;
 	return rdma_bind_addr(id, (struct sockaddr *) &addr_in);
 }
 
-- 
1.5.6.dirty
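
The following is a userspace sketch of why the wider type matters. It assumes
POSIX socket headers rather than kernel ones, and make_wildcard() and
is_wildcard() are hypothetical helpers, not functions from cma.c. A zeroed
sockaddr_storage with only the family set can be read as the wildcard address
of either family, whereas a sockaddr_in is too small to be read as a
sockaddr_in6.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper: build a wildcard ("any") address for the given family. */
void make_wildcard(struct sockaddr_storage *ss, sa_family_t af)
{
	memset(ss, 0, sizeof *ss);
	ss->ss_family = af;
}

/* Returns 1 if the storage holds the all-zero wildcard address of its family. */
int is_wildcard(const struct sockaddr_storage *ss)
{
	if (ss->ss_family == AF_INET) {
		const struct sockaddr_in *a4 = (const struct sockaddr_in *) ss;

		return a4->sin_addr.s_addr == htonl(INADDR_ANY);
	}
	if (ss->ss_family == AF_INET6) {
		const struct sockaddr_in6 *a6 = (const struct sockaddr_in6 *) ss;

		return IN6_IS_ADDR_UNSPECIFIED(&a6->sin6_addr) ? 1 : 0;
	}
	return 0;	/* unknown family: not a wildcard we understand */
}
```

Because the storage is large enough for either family, the consumer can cast
it according to ss_family without reading past the end of the object.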


From alekseys at voltaire.com  Wed Nov 26 09:56:39 2008
From: alekseys at voltaire.com (Aleksey Senin)
Date: Wed, 26 Nov 2008 19:56:39 +0200
Subject: [ofa-general] [RMDA CM IPv6 support. PATCHv4 4/6]  IPv6 local
	address resolution
In-Reply-To: <1227721899.3121.18.camel@alst60.voltaire.com>
References: <1227721899.3121.18.camel@alst60.voltaire.com>
Message-ID: <1227722199.3121.25.camel@alst60.voltaire.com>

>From 8465a7d33a36cf8a9a92fbeea5d8f3b89f30e632 Mon Sep 17 00:00:00 2001
From: Aleksey Senin <alekseys at voltaire.com>
Date: Wed, 26 Nov 2008 16:16:09 +0200
Subject: [PATCH] IPv6 local address resolution

RDMA CM support on local machine

Signed-off-by: Aleksey Senin <alekseys at voltaire.com>
---
 drivers/infiniband/core/addr.c |   65 +++++++++++++++++++++++++++++-----------
 1 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index f95d21f..1d785d7 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -279,29 +279,58 @@ static int addr_resolve_local(struct sockaddr *src_in,
 			      struct rdma_dev_addr *addr)
 {
 	struct net_device *dev;
-	__be32 src_ip = ((struct sockaddr_in *)src_in)->sin_addr.s_addr;
-	__be32 dst_ip = ((struct sockaddr_in *)dst_in)->sin_addr.s_addr;
-	int ret;
+	int ret = -EADDRNOTAVAIL;
 
-	dev = ip_dev_find(&init_net, dst_ip);
-	if (!dev)
-		return -EADDRNOTAVAIL;
+	if (dst_in->sa_family == AF_INET) {
+		__be32 src_ip = ((struct sockaddr_in *)src_in)->sin_addr.s_addr;
+		__be32 dst_ip = ((struct sockaddr_in *)dst_in)->sin_addr.s_addr;
 
-	if (ipv4_is_zeronet(src_ip)) {
-		src_in->sa_family = dst_in->sa_family;
-		((struct sockaddr_in *)src_in)->sin_addr.s_addr = dst_ip;
-		ret = rdma_copy_addr(addr, dev, dev->dev_addr);
-	} else if (ipv4_is_loopback(src_ip)) {
-		ret = rdma_translate_ip(dst_in, addr);
-		if (!ret)
-			memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+		dev = ip_dev_find(&init_net, dst_ip);
+		if (!dev)
+			return -EADDRNOTAVAIL;
+
+		if (ipv4_is_zeronet(src_ip)) {
+			src_in->sa_family = dst_in->sa_family;
+			((struct sockaddr_in *)src_in)->sin_addr.s_addr = dst_ip;
+			ret = rdma_copy_addr(addr, dev, dev->dev_addr);
+		} else if (ipv4_is_loopback(src_ip)) {
+			ret = rdma_translate_ip(dst_in, addr);
+			if (!ret)
+				memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+		} else {
+			ret = rdma_translate_ip(src_in, addr);
+			if (!ret)
+				memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+		}
+		dev_put(dev);
 	} else {
-		ret = rdma_translate_ip(src_in, addr);
-		if (!ret)
-			memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+		struct in6_addr *a = &((struct sockaddr_in6 *)dst_in)->sin6_addr;
+
+		for_each_netdev(&init_net, dev)
+			if (ipv6_chk_addr(&init_net, a, dev, 1))
+				break;
+
+		if (!dev)
+			return -EADDRNOTAVAIL;
+
+		a = &((struct sockaddr_in6 *)src_in)->sin6_addr;
+
+		if (ipv6_addr_any(a)) {
+			src_in->sa_family = dst_in->sa_family;
+			((struct sockaddr_in6 *)src_in)->sin6_addr =
+				((struct sockaddr_in6 *)dst_in)->sin6_addr;
+			ret = rdma_copy_addr(addr, dev, dev->dev_addr);
+		} else if (ipv6_addr_loopback(a)) {
+			ret = rdma_translate_ip(dst_in, addr);
+			if (!ret)
+				memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+		} else  {
+			ret = rdma_translate_ip(src_in, addr);
+			if (!ret)
+				memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+		}
 	}
 
-	dev_put(dev);
 	return ret;
 }
 
-- 
1.5.6.dirty


From alekseys at voltaire.com  Wed Nov 26 09:58:57 2008
From: alekseys at voltaire.com (Aleksey Senin)
Date: Wed, 26 Nov 2008 19:58:57 +0200
Subject: [ofa-general] [RMDA CM IPv6 support. PATCHv4 5/6] IPv6 support
	for network discovery
In-Reply-To: <1227721899.3121.18.camel@alst60.voltaire.com>
References: <1227721899.3121.18.camel@alst60.voltaire.com>
Message-ID: <1227722337.21512.0.camel@alst60.voltaire.com>

>From 14290555a2c58906214deb44423277fffd77fc4c Mon Sep 17 00:00:00 2001
From: Aleksey Senin <alekseys at voltaire.com>
Date: Wed, 13 Aug 2008 10:19:13 +0300
Subject: [PATCH] IPv6 support for network discovery

Added support for network discovery in addr_send_arp function

Signed-off-by: Aleksey Senin <alekseys at voltaire.com>
---
 drivers/infiniband/core/addr.c |   32 ++++++++++++++++++++++++--------
 1 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 1d785d7..460dcc2 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -43,6 +43,7 @@
 #include <net/netevent.h>
 #include <net/addrconf.h>
 #include <rdma/ib_addr.h>
+#include <net/ip6_route.h>
 
 MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("IB Address Translation");
@@ -172,19 +173,34 @@ static void queue_req(struct addr_req *req)
 	mutex_unlock(&lock);
 }
 
-static void addr_send_arp(struct sockaddr_in *dst_in)
+static void addr_send_arp(struct sockaddr *dst_in)
 {
 	struct rtable *rt;
 	struct flowi fl;
-	__be32 dst_ip = dst_in->sin_addr.s_addr;
+	struct dst_entry *dst;
 
 	memset(&fl, 0, sizeof fl);
-	fl.nl_u.ip4_u.daddr = dst_ip;
-	if (ip_route_output_key(&init_net, &rt, &fl))
-		return;
+	if (dst_in->sa_family == AF_INET)  {
+		fl.nl_u.ip4_u.daddr =
+			((struct sockaddr_in *)dst_in)->sin_addr.s_addr;
 
-	neigh_event_send(rt->u.dst.neighbour, NULL);
-	ip_rt_put(rt);
+		if (ip_route_output_key(&init_net, &rt, &fl))
+			return;
+
+		neigh_event_send(rt->u.dst.neighbour, NULL);
+		ip_rt_put(rt);
+
+	} else {
+		fl.nl_u.ip6_u.daddr =
+			((struct sockaddr_in6 *)dst_in)->sin6_addr;
+
+		dst = ip6_route_output(&init_net, NULL, &fl);
+		if (!dst)
+			return;
+
+		neigh_event_send(dst->neighbour, NULL);
+		dst_release(dst);
+	}
 }
 
 static int addr_resolve_remote(struct sockaddr *src_in,
@@ -373,7 +389,7 @@ int rdma_resolve_ip(struct rdma_addr_client *client,
 	case -ENODATA:
 		req->timeout = msecs_to_jiffies(timeout_ms) + jiffies;
 		queue_req(req);
-		addr_send_arp((struct sockaddr_in *)dst_in);
+		addr_send_arp(dst_in);
 		break;
 	default:
 		ret = req->status;
-- 
1.5.6.dirty


From alekseys at voltaire.com  Wed Nov 26 09:59:31 2008
From: alekseys at voltaire.com (Aleksey Senin)
Date: Wed, 26 Nov 2008 19:59:31 +0200
Subject: [ofa-general] [RMDA CM IPv6 support. PATCHv4 6/6] Remote IPv6
	resolution
In-Reply-To: <1227721899.3121.18.camel@alst60.voltaire.com>
References: <1227721899.3121.18.camel@alst60.voltaire.com>
Message-ID: <1227722371.21512.2.camel@alst60.voltaire.com>

>From 34464092a263d339c919432b1e4495dce36ee568 Mon Sep 17 00:00:00 2001
From: Aleksey Senin <alekseys at voltaire.com>
Date: Wed, 26 Nov 2008 18:24:35 +0200
Subject: [PATCH] Remote IPv6 resolution

Added remote address resolution for RDMA CM.
The function addr_resolve_remote is now a wrapper for two other functions:
addr4_resolve_remote ( original addr_resolve_remote )
addr6_resolve_remote ( new function )

Signed-off-by: Aleksey Senin <alekseys at voltaire.com>
---
 drivers/infiniband/core/addr.c |   53 +++++++++++++++++++++++++++++++++++----
 1 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 460dcc2..16ffd49 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -203,12 +203,12 @@ static void addr_send_arp(struct sockaddr *dst_in)
 	}
 }
 
-static int addr_resolve_remote(struct sockaddr *src_in,
-			       struct sockaddr *dst_in,
+static int addr4_resolve_remote(struct sockaddr_in *src_in,
+			       struct sockaddr_in *dst_in,
 			       struct rdma_dev_addr *addr)
 {
-	__be32 src_ip = ((struct sockaddr_in *)src_in)->sin_addr.s_addr;
-	__be32 dst_ip = ((struct sockaddr_in *)dst_in)->sin_addr.s_addr;
+	__be32 src_ip = src_in->sin_addr.s_addr;
+	__be32 dst_ip = dst_in->sin_addr.s_addr;
 	struct flowi fl;
 	struct rtable *rt;
 	struct neighbour *neigh;
@@ -239,8 +239,8 @@ static int addr_resolve_remote(struct sockaddr *src_in,
 	}
 
 	if (!src_ip) {
-		src_in->sa_family = dst_in->sa_family;
-		((struct sockaddr_in *)src_in)->sin_addr.s_addr = rt->rt_src;
+		src_in->sin_family = dst_in->sin_family;
+		src_in->sin_addr.s_addr = rt->rt_src;
 	}
 
 	ret = rdma_copy_addr(addr, neigh->dev, neigh->ha);
@@ -252,6 +252,47 @@ out:
 	return ret;
 }
 
+static int addr6_resolve_remote(struct sockaddr_in6 *src_in,
+			       struct sockaddr_in6 *dst_in,
+			       struct rdma_dev_addr *addr)
+{
+	struct flowi fl;
+	struct neighbour *neigh;
+	struct dst_entry *dst;
+	int ret = -ENODATA;
+
+	memset(&fl, 0, sizeof fl);
+	fl.nl_u.ip6_u.daddr = dst_in->sin6_addr;
+	fl.nl_u.ip6_u.saddr = src_in->sin6_addr;
+
+	dst = ip6_route_output(&init_net, NULL, &fl);
+	if (!dst)
+		return ret;
+
+	if (dst->dev->flags & IFF_NOARP) {
+		ret = rdma_copy_addr(addr, dst->dev, NULL);
+	} else {
+		neigh = dst->neighbour;
+		if (neigh && (neigh->nud_state & NUD_VALID))
+			ret = rdma_copy_addr(addr, neigh->dev, neigh->ha);
+	}
+
+	dst_release(dst);
+	return ret;
+}
+
+static int addr_resolve_remote(struct sockaddr *src_in,
+				struct sockaddr *dst_in,
+				struct rdma_dev_addr *addr)
+{
+	if (src_in->sa_family == AF_INET) {
+		return addr4_resolve_remote((struct sockaddr_in *)src_in,
+			(struct sockaddr_in *)dst_in, addr);
+	} else
+		return addr6_resolve_remote((struct sockaddr_in6 *)src_in,
+			(struct sockaddr_in6 *)dst_in, addr);
+}
+
 static void process_req(struct work_struct *work)
 {
 	struct addr_req *req, *temp_req;
-- 
1.5.6.dirty
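
The wrapper pattern described in the changelog can be sketched in userspace as
follows. This is an illustration under assumptions, not the code from addr.c:
resolve4_stub(), resolve6_stub() and resolve_remote_sketch() are hypothetical
names, the stubs only report which path was taken, and the explicit default
case returning -EAFNOSUPPORT is a variation that the posted wrapper does not
have (it treats every non-AF_INET family as IPv6).

```c
#include <errno.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical stand-in for the IPv4 resolver; returns 4 to mark the path. */
int resolve4_stub(const struct sockaddr_in *src, const struct sockaddr_in *dst)
{
	(void) src;
	(void) dst;
	return 4;
}

/* Hypothetical stand-in for the IPv6 resolver; returns 6 to mark the path. */
int resolve6_stub(const struct sockaddr_in6 *src, const struct sockaddr_in6 *dst)
{
	(void) src;
	(void) dst;
	return 6;
}

/* Dispatch on the source address family, rejecting anything unknown. */
int resolve_remote_sketch(const struct sockaddr *src, const struct sockaddr *dst)
{
	switch (src->sa_family) {
	case AF_INET:
		return resolve4_stub((const struct sockaddr_in *) src,
				     (const struct sockaddr_in *) dst);
	case AF_INET6:
		return resolve6_stub((const struct sockaddr_in6 *) src,
				     (const struct sockaddr_in6 *) dst);
	default:
		return -EAFNOSUPPORT;
	}
}
```

Keeping the per-family functions typed on sockaddr_in and sockaddr_in6 confines
the casts to the wrapper, which is the same split the patch makes.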


From YJia at tmriusa.com  Wed Nov 26 14:18:48 2008
From: YJia at tmriusa.com (Yicheng Jia)
Date: Wed, 26 Nov 2008 16:18:48 -0600
Subject: [ofa-general] set up QPs with different transfer rate
Message-ID: <OFC298460B.3F451131-ON8625750D.00792EE6-8625750D.007A9277@TMRIUSA.COM>

Hi Folks,

I have two applications which require different IB transfer rates. I am 
using a Mellanox 25204 HCA. Can I achieve this by setting up two QPs with 
different service levels? Can I set the "SL" field in the QP context, or is it 
controlled by the SM? Thanks!

Best,

Yicheng

Software Engineer
Toshiba Medical Research Institute USA, Inc.



From gmpc at sanger.ac.uk  Thu Nov 27 02:36:15 2008
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Thu, 27 Nov 2008 10:36:15 +0000
Subject: [ofa-general] Anyone working on debian  packages?
Message-ID: <492E781F.40709@sanger.ac.uk>

Hi all,

We have recently been experimenting with IB on Debian, and in the course of this
work I have built a basic set of OFED-1.3.1 Debian packages for our internal use.

With a bit more effort the packages could be worked into a state suitable for
inclusion in the main Debian archives. Before I start on this I would like to ensure
that I am not re-inventing the wheel; is anyone else working on this?

Cheers,

Guy

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802




From vlad at lists.openfabrics.org  Thu Nov 27 03:24:29 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 27 Nov 2008 03:24:29 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081127-0200 daily build status
Message-ID: <20081127112430.1844AE60D44@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From olga.shern at gmail.com  Thu Nov 27 04:52:44 2008
From: olga.shern at gmail.com (Olga Shern (Voltaire))
Date: Thu, 27 Nov 2008 14:52:44 +0200
Subject: [ofa-general] Re: [ewg] OFED Nov 24, 2008 meeting minutes
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD0FE7A8@mtlexch01.mtl.com>
References: <458BC6B0F287034F92FE78908BD01CE84EF35EF0@mtlexch01.mtl.com>
	<5D49E7A8952DC44FB38C38FA0D758EAD0FE7A8@mtlexch01.mtl.com>
Message-ID: <bc457d660811270452u46ea0231xdac09f829a5348c2@mail.gmail.com>

>
> OFED 1.4 release: RC6 on Nov 28, GA on Dec 8

Hi,

Are you going to build RC6 today/tomorrow?
I see that there are still a lot of major bugs. Maybe we should wait?

Olga


From jackm at dev.mellanox.co.il  Thu Nov 27 04:57:32 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 27 Nov 2008 14:57:32 +0200
Subject: [ofa-general] [PATCH] mlx4_ib/mthca: Fix dispatch of
	IB_EVENT_LID_CHANGE
In-Reply-To: <492C226D.7040009@Voltaire.COM>
References: <492C226D.7040009@Voltaire.COM>
Message-ID: <200811271457.32510.jackm@dev.mellanox.co.il>

On Tuesday 25 November 2008 18:06, Moni Shoua wrote:
> @@ -263,6 +269,9 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,     u8 port_num,
>         } else
>                 return IB_MAD_RESULT_SUCCESS;
>  
> +       if (!ib_query_port(ibdev, port_num, &pattr))
> +               prev_lid = pattr.lid;
> +
> 

Why do ib_query_port for each MAD that is handled?  query_port involves a firmware access.
Events are generated only for SMP SET packets.

I think the condition should read:

	if ((in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
	     in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
	    in_mad->mad_hdr.method == IB_MGMT_METHOD_SET &&
	    !ib_query_port(ibdev, port_num, &pattr))
		prev_lid = pattr.lid;

so that the query_port will be performed only for appropriate packets.

- Jack


From monis at Voltaire.COM  Thu Nov 27 05:26:23 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Thu, 27 Nov 2008 15:26:23 +0200
Subject: [ofa-general] [PATCH] mlx4_ib/mthca: Fix dispatch
	of	IB_EVENT_LID_CHANGE
In-Reply-To: <200811271457.32510.jackm@dev.mellanox.co.il>
References: <492C226D.7040009@Voltaire.COM>
	<200811271457.32510.jackm@dev.mellanox.co.il>
Message-ID: <492E9FFF.9070107@Voltaire.COM>

Jack Morgenstein wrote:
> On Tuesday 25 November 2008 18:06, Moni Shoua wrote:
>> @@ -263,6 +269,9 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,     u8 port_num,
>>         } else
>>                 return IB_MAD_RESULT_SUCCESS;
>>  
>> +       if (!ib_query_port(ibdev, port_num, &pattr))
>> +               prev_lid = pattr.lid;
>> +
>>
> 
> Why do ib_query_port for each MAD that is handled?  query_port involves a firmware access.
> Events are generated only for SMP SET packets.
I agree. Thanks.

I'm also changing the action in case ib_query_port() fails.
Instead of ignoring the failure, I now assume the worst (i.e., LID_CHANGE).


From monis at Voltaire.COM  Thu Nov 27 05:31:18 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Thu, 27 Nov 2008 15:31:18 +0200
Subject: [ofa-general] [PATCH] mlx4_ib/mthca: Fix dispatch of
	IB_EVENT_LID_CHANGE
In-Reply-To: <200811271457.32510.jackm@dev.mellanox.co.il>
References: <492C226D.7040009@Voltaire.COM>
	<200811271457.32510.jackm@dev.mellanox.co.il>
Message-ID: <492EA126.1060104@Voltaire.COM>

New patch according to Jack's comment and the other change (credit to Yossi E.)
Same change log applies here.

--
 mlx4/mad.c        |   24 ++++++++++++++++++------
 mthca/mthca_mad.c |   22 +++++++++++++++++-----
 2 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/mad.c b/drivers/infiniband/hw/mlx4/mad.c
index 606f1e2..9528459 100644
--- a/drivers/infiniband/hw/mlx4/mad.c
+++ b/drivers/infiniband/hw/mlx4/mad.c
@@ -147,7 +147,8 @@ static void update_sm_ah(struct mlx4_ib_dev *dev, u8 port_num, u16 lid, u8 sl)
  * Snoop SM MADs for port info and P_Key table sets, so we can
  * synthesize LID change and P_Key change events.
  */
-static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad)
+static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad,
+				u16 prev_lid)
 {
 	struct ib_event event;
 
@@ -157,6 +158,7 @@ static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad)
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) {
 			struct ib_port_info *pinfo =
 				(struct ib_port_info *) ((struct ib_smp *) mad)->data;
+			u16 lid = be16_to_cpu(pinfo->lid);
 
 			update_sm_ah(to_mdev(ibdev), port_num,
 				     be16_to_cpu(pinfo->sm_lid),
@@ -165,12 +167,15 @@ static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad)
 			event.device	       = ibdev;
 			event.element.port_num = port_num;
 
-			if (pinfo->clientrereg_resv_subnetto & 0x80)
+			if (pinfo->clientrereg_resv_subnetto & 0x80) {
 				event.event    = IB_EVENT_CLIENT_REREGISTER;
-			else
+				ib_dispatch_event(&event);
+			}
+			if (prev_lid != lid) {
 				event.event    = IB_EVENT_LID_CHANGE;
+				ib_dispatch_event(&event);
+			}
 
-			ib_dispatch_event(&event);
 		}
 
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) {
@@ -228,8 +233,9 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,	u8 port_num,
 			struct ib_wc *in_wc, struct ib_grh *in_grh,
 			struct ib_mad *in_mad, struct ib_mad *out_mad)
 {
-	u16 slid;
+	u16 slid, prev_lid = 0;
 	int err;
+	struct ib_port_attr pattr;
 
 	slid = in_wc ? in_wc->slid : be16_to_cpu(IB_LID_PERMISSIVE);
 
@@ -263,6 +269,12 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,	u8 port_num,
 	} else
 		return IB_MAD_RESULT_SUCCESS;
 
+	if ((in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
+		in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
+		in_mad->mad_hdr.method == IB_MGMT_METHOD_SET &&
+		!ib_query_port(ibdev, port_num, &pattr))
+			prev_lid = pattr.lid;
+
 	err = mlx4_MAD_IFC(to_mdev(ibdev),
 			   mad_flags & IB_MAD_IGNORE_MKEY,
 			   mad_flags & IB_MAD_IGNORE_BKEY,
@@ -271,7 +283,7 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,	u8 port_num,
 		return IB_MAD_RESULT_FAILURE;
 
 	if (!out_mad->mad_hdr.status) {
-		smp_snoop(ibdev, port_num, in_mad);
+		smp_snoop(ibdev, port_num, in_mad, prev_lid);
 		node_desc_override(ibdev, out_mad);
 	}
 
diff --git a/drivers/infiniband/hw/mthca/mthca_mad.c b/drivers/infiniband/hw/mthca/mthca_mad.c
index 6404495..d872aeb 100644
--- a/drivers/infiniband/hw/mthca/mthca_mad.c
+++ b/drivers/infiniband/hw/mthca/mthca_mad.c
@@ -104,7 +104,8 @@ static void update_sm_ah(struct mthca_dev *dev,
  */
 static void smp_snoop(struct ib_device *ibdev,
 		      u8 port_num,
-		      struct ib_mad *mad)
+		      struct ib_mad *mad,
+		      u16 prev_lid)
 {
 	struct ib_event event;
 
@@ -114,6 +115,7 @@ static void smp_snoop(struct ib_device *ibdev,
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) {
 			struct ib_port_info *pinfo =
 				(struct ib_port_info *) ((struct ib_smp *) mad)->data;
+			u16 lid = be16_to_cpu(pinfo->lid);
 
 			mthca_update_rate(to_mdev(ibdev), port_num);
 			update_sm_ah(to_mdev(ibdev), port_num,
@@ -123,12 +125,15 @@ static void smp_snoop(struct ib_device *ibdev,
 			event.device           = ibdev;
 			event.element.port_num = port_num;
 
-			if (pinfo->clientrereg_resv_subnetto & 0x80)
+			if (pinfo->clientrereg_resv_subnetto & 0x80) {
 				event.event    = IB_EVENT_CLIENT_REREGISTER;
-			else
+				ib_dispatch_event(&event);
+			}
+			if (prev_lid != lid) {
 				event.event    = IB_EVENT_LID_CHANGE;
+				ib_dispatch_event(&event);
+			}
 
-			ib_dispatch_event(&event);
 		}
 
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) {
@@ -196,6 +201,8 @@ int mthca_process_mad(struct ib_device *ibdev,
 	int err;
 	u8 status;
 	u16 slid = in_wc ? in_wc->slid : be16_to_cpu(IB_LID_PERMISSIVE);
+	u16 prev_lid = 0;
+	struct ib_port_attr pattr;
 
 	/* Forward locally generated traps to the SM */
 	if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP &&
@@ -233,6 +240,11 @@ int mthca_process_mad(struct ib_device *ibdev,
 			return IB_MAD_RESULT_SUCCESS;
 	} else
 		return IB_MAD_RESULT_SUCCESS;
+	if ((in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
+		in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
+		in_mad->mad_hdr.method == IB_MGMT_METHOD_SET &&
+		!ib_query_port(ibdev, port_num, &pattr))
+			prev_lid = pattr.lid;
 
 	err = mthca_MAD_IFC(to_mdev(ibdev),
 			    mad_flags & IB_MAD_IGNORE_MKEY,
@@ -252,7 +264,7 @@ int mthca_process_mad(struct ib_device *ibdev,
 	}
 
 	if (!out_mad->mad_hdr.status) {
-		smp_snoop(ibdev, port_num, in_mad);
+		smp_snoop(ibdev, port_num, in_mad, prev_lid);
 		node_desc_override(ibdev, out_mad);
 	}
 

From jackm at dev.mellanox.co.il  Thu Nov 27 05:43:10 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 27 Nov 2008 15:43:10 +0200
Subject: [ofa-general] [PATCH] mlx4_ib/mthca: Fix dispatch of
	IB_EVENT_LID_CHANGE
In-Reply-To: <492E9FFF.9070107@Voltaire.COM>
References: <492C226D.7040009@Voltaire.COM>
	<200811271457.32510.jackm@dev.mellanox.co.il>
	<492E9FFF.9070107@Voltaire.COM>
Message-ID: <200811271543.10613.jackm@dev.mellanox.co.il>

On Thursday 27 November 2008 15:26, Moni Shoua wrote:
> I'm also changing the action in case ib_query_port() fails. 
> Instead of ignoring the failure I now assume the worst (i.e: LID_CHANGE)
> 
OK.  This will not be worse than the current situation.
(Actually, in certain cases it may be, because you can now generate both a LID_CHANGE event and
 a CLIENT_REREGISTER event, where before only one was generated.  However,
 if query_port fails, we will probably see other failures as well, so what
 the heck).
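
To make the possible outcomes concrete, here is a small userspace sketch of
the decision, not driver code. The function name, the event bit values and the
prev_lid_known flag are all hypothetical; the flag is just one way to model
"assume the worst" when the port query fails, whereas the posted patch leaves
prev_lid at its initial value of 0 in that case.

```c
#include <stdint.h>

/* Hypothetical bit flags standing in for the two IB events. */
enum {
	EV_CLIENT_REREGISTER = 1 << 0,
	EV_LID_CHANGE        = 1 << 1,
};

/*
 * Decide which events a PortInfo Set should produce.
 * clientrereg_byte: byte whose top bit (0x80) requests client reregistration.
 * prev_lid_known:   0 models a failed port query (assumption of this sketch).
 */
unsigned events_for_port_info(uint8_t clientrereg_byte, int prev_lid_known,
			      uint16_t prev_lid, uint16_t new_lid)
{
	unsigned ev = 0;

	if (clientrereg_byte & 0x80)
		ev |= EV_CLIENT_REREGISTER;

	/* Unknown previous LID is treated as a change. */
	if (!prev_lid_known || prev_lid != new_lid)
		ev |= EV_LID_CHANGE;

	return ev;
}
```

With the two tests independent, a single Set can yield zero, one or both
events, which is the behaviour change relative to the old if/else.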

BTW,
The condition I sent in my last post is not enough.
It should be:
        if ((in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
             in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
            in_mad->mad_hdr.method == IB_MGMT_METHOD_SET &&
	    in_mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO &&
            !ib_query_port(ibdev, port_num, &pattr))
                prev_lid = pattr.lid;

since the query_port response data is only relevant for the IB_SMP_ATTR_PORT_INFO
path in smp_snoop.

- Jack
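
The condition above can be read as a single predicate over the MAD header.
Below is a minimal userspace sketch of that predicate, not the driver code:
struct mad_hdr_sketch, needs_prev_lid() and the enum names are hypothetical
stand-ins, the numeric values are assumed to match the kernel constants of
similar name, and attr_id is kept in host byte order for simplicity (the
kernel field is big-endian).

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel's IB_MGMT_* / IB_SMP_ATTR_*
 * constants; the values are assumptions made for this sketch. */
enum {
	MGMT_CLASS_SUBN_LID_ROUTED     = 0x01,
	MGMT_CLASS_SUBN_DIRECTED_ROUTE = 0x81,
	MGMT_METHOD_GET                = 0x01,
	MGMT_METHOD_SET                = 0x02,
	SMP_ATTR_PORT_INFO             = 0x0015,
	SMP_ATTR_PKEY_TABLE            = 0x0016,
};

/* Hypothetical, reduced view of a MAD header (host byte order). */
struct mad_hdr_sketch {
	uint8_t  mgmt_class;
	uint8_t  method;
	uint16_t attr_id;
};

/* True only for a subnet-management Set(PortInfo), the one case in which
 * the previous LID is later compared against the LID carried in the MAD.
 * For every other MAD the port query can be skipped. */
static bool needs_prev_lid(const struct mad_hdr_sketch *h)
{
	return (h->mgmt_class == MGMT_CLASS_SUBN_LID_ROUTED ||
		h->mgmt_class == MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
	       h->method == MGMT_METHOD_SET &&
	       h->attr_id == SMP_ATTR_PORT_INFO;
}
```

With the attr_id test in place, a Set(P_KeyTable) or a Get(PortInfo) no longer
triggers a port query whose result would go unused.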


From jackm at dev.mellanox.co.il  Thu Nov 27 05:51:46 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 27 Nov 2008 15:51:46 +0200
Subject: [ofa-general] [PATCH] mlx4_ib/mthca: Fix dispatch of
	IB_EVENT_LID_CHANGE
In-Reply-To: <492EA126.1060104@Voltaire.COM>
References: <492C226D.7040009@Voltaire.COM>
	<200811271457.32510.jackm@dev.mellanox.co.il>
	<492EA126.1060104@Voltaire.COM>
Message-ID: <200811271551.47076.jackm@dev.mellanox.co.il>

On Thursday 27 November 2008 15:31, Moni Shoua wrote:
> @@ -263,6 +269,12 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,    u8 port_num,
>         } else
>                 return IB_MAD_RESULT_SUCCESS;
>  
> +       if ((in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
> +               in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
> +               in_mad->mad_hdr.method == IB_MGMT_METHOD_SET &&
> +               !ib_query_port(ibdev, port_num, &pattr))
> +                       prev_lid = pattr.lid;
> +
>         err = mlx4_MAD_IFC(to_mdev(ibdev),
>                            mad_flags & IB_MAD_IGNORE_MKEY,
>                            mad_flags & IB_MAD_IGNORE_BKEY,
> 

Per my last post, this should be:

@@ -263,6 +269,13 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,    u8 port_num,
        } else
                return IB_MAD_RESULT_SUCCESS;
 
+       if ((in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
+               in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
+               in_mad->mad_hdr.method == IB_MGMT_METHOD_SET &&
+               in_mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO &&
+               !ib_query_port(ibdev, port_num, &pattr))
+                       prev_lid = pattr.lid;
+
        err = mlx4_MAD_IFC(to_mdev(ibdev),
                           mad_flags & IB_MAD_IGNORE_MKEY,
                           mad_flags & IB_MAD_IGNORE_BKEY,


From monis at Voltaire.COM  Thu Nov 27 06:14:45 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Thu, 27 Nov 2008 16:14:45 +0200
Subject: [ofa-general] [PATCH] mlx4_ib/mthca: Fix dispatch
	of	IB_EVENT_LID_CHANGE
In-Reply-To: <200811271551.47076.jackm@dev.mellanox.co.il>
References: <492C226D.7040009@Voltaire.COM>	<200811271457.32510.jackm@dev.mellanox.co.il>	<492EA126.1060104@Voltaire.COM>
	<200811271551.47076.jackm@dev.mellanox.co.il>
Message-ID: <492EAB55.7020003@Voltaire.COM>

Thanks again.
Now with your other fix

--
 drivers/infiniband/hw/mlx4/mad.c        |   25 +++++++++++++++++++------
 drivers/infiniband/hw/mthca/mthca_mad.c |   23 ++++++++++++++++++-----
 2 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/mad.c b/drivers/infiniband/hw/mlx4/mad.c
index 606f1e2..d5971a1 100644
--- a/drivers/infiniband/hw/mlx4/mad.c
+++ b/drivers/infiniband/hw/mlx4/mad.c
@@ -147,7 +147,8 @@ static void update_sm_ah(struct mlx4_ib_dev *dev, u8 port_num, u16 lid, u8 sl)
  * Snoop SM MADs for port info and P_Key table sets, so we can
  * synthesize LID change and P_Key change events.
  */
-static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad)
+static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad,
+				u16 prev_lid)
 {
 	struct ib_event event;
 
@@ -157,6 +158,7 @@ static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad)
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) {
 			struct ib_port_info *pinfo =
 				(struct ib_port_info *) ((struct ib_smp *) mad)->data;
+			u16 lid = be16_to_cpu(pinfo->lid);
 
 			update_sm_ah(to_mdev(ibdev), port_num,
 				     be16_to_cpu(pinfo->sm_lid),
@@ -165,12 +167,15 @@ static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad)
 			event.device	       = ibdev;
 			event.element.port_num = port_num;
 
-			if (pinfo->clientrereg_resv_subnetto & 0x80)
+			if (pinfo->clientrereg_resv_subnetto & 0x80) {
 				event.event    = IB_EVENT_CLIENT_REREGISTER;
-			else
+				ib_dispatch_event(&event);
+			}
+			if (prev_lid != lid) {
 				event.event    = IB_EVENT_LID_CHANGE;
+				ib_dispatch_event(&event);
+			}
 
-			ib_dispatch_event(&event);
 		}
 
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) {
@@ -228,8 +233,9 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,	u8 port_num,
 			struct ib_wc *in_wc, struct ib_grh *in_grh,
 			struct ib_mad *in_mad, struct ib_mad *out_mad)
 {
-	u16 slid;
+	u16 slid, prev_lid = 0;
 	int err;
+	struct ib_port_attr pattr;
 
 	slid = in_wc ? in_wc->slid : be16_to_cpu(IB_LID_PERMISSIVE);
 
@@ -263,6 +269,13 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,	u8 port_num,
 	} else
 		return IB_MAD_RESULT_SUCCESS;
 
+	if ((in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
+		in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
+		in_mad->mad_hdr.method == IB_MGMT_METHOD_SET &&
+		in_mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO &&
+		!ib_query_port(ibdev, port_num, &pattr))
+			prev_lid = pattr.lid;
+
 	err = mlx4_MAD_IFC(to_mdev(ibdev),
 			   mad_flags & IB_MAD_IGNORE_MKEY,
 			   mad_flags & IB_MAD_IGNORE_BKEY,
@@ -271,7 +284,7 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,	u8 port_num,
 		return IB_MAD_RESULT_FAILURE;
 
 	if (!out_mad->mad_hdr.status) {
-		smp_snoop(ibdev, port_num, in_mad);
+		smp_snoop(ibdev, port_num, in_mad, prev_lid);
 		node_desc_override(ibdev, out_mad);
 	}
 
diff --git a/drivers/infiniband/hw/mthca/mthca_mad.c b/drivers/infiniband/hw/mthca/mthca_mad.c
index 6404495..45ac68e 100644
--- a/drivers/infiniband/hw/mthca/mthca_mad.c
+++ b/drivers/infiniband/hw/mthca/mthca_mad.c
@@ -104,7 +104,8 @@ static void update_sm_ah(struct mthca_dev *dev,
  */
 static void smp_snoop(struct ib_device *ibdev,
 		      u8 port_num,
-		      struct ib_mad *mad)
+		      struct ib_mad *mad,
+		      u16 prev_lid)
 {
 	struct ib_event event;
 
@@ -114,6 +115,7 @@ static void smp_snoop(struct ib_device *ibdev,
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) {
 			struct ib_port_info *pinfo =
 				(struct ib_port_info *) ((struct ib_smp *) mad)->data;
+			u16 lid = be16_to_cpu(pinfo->lid);
 
 			mthca_update_rate(to_mdev(ibdev), port_num);
 			update_sm_ah(to_mdev(ibdev), port_num,
@@ -123,12 +125,15 @@ static void smp_snoop(struct ib_device *ibdev,
 			event.device           = ibdev;
 			event.element.port_num = port_num;
 
-			if (pinfo->clientrereg_resv_subnetto & 0x80)
+			if (pinfo->clientrereg_resv_subnetto & 0x80) {
 				event.event    = IB_EVENT_CLIENT_REREGISTER;
-			else
+				ib_dispatch_event(&event);
+			}
+			if (prev_lid != lid) {
 				event.event    = IB_EVENT_LID_CHANGE;
+				ib_dispatch_event(&event);
+			}
 
-			ib_dispatch_event(&event);
 		}
 
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) {
@@ -196,6 +201,8 @@ int mthca_process_mad(struct ib_device *ibdev,
 	int err;
 	u8 status;
 	u16 slid = in_wc ? in_wc->slid : be16_to_cpu(IB_LID_PERMISSIVE);
+	u16 prev_lid = 0;
+	struct ib_port_attr pattr;
 
 	/* Forward locally generated traps to the SM */
 	if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP &&
@@ -233,6 +240,12 @@ int mthca_process_mad(struct ib_device *ibdev,
 			return IB_MAD_RESULT_SUCCESS;
 	} else
 		return IB_MAD_RESULT_SUCCESS;
+	if ((in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
+		in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
+		in_mad->mad_hdr.method == IB_MGMT_METHOD_SET &&
+		in_mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO &&
+		!ib_query_port(ibdev, port_num, &pattr))
+			prev_lid = pattr.lid;
 
 	err = mthca_MAD_IFC(to_mdev(ibdev),
 			    mad_flags & IB_MAD_IGNORE_MKEY,
@@ -252,7 +265,7 @@ int mthca_process_mad(struct ib_device *ibdev,
 	}
 
 	if (!out_mad->mad_hdr.status) {
-		smp_snoop(ibdev, port_num, in_mad);
+		smp_snoop(ibdev, port_num, in_mad, prev_lid);
 		node_desc_override(ibdev, out_mad);
 	}
 

From tziporet at dev.mellanox.co.il  Thu Nov 27 06:29:08 2008
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 27 Nov 2008 16:29:08 +0200
Subject: [ofa-general] Re: ***SPAM*** Re: [ewg] OFED Nov 24,
	2008 meeting minutes
In-Reply-To: <bc457d660811270452u46ea0231xdac09f829a5348c2@mail.gmail.com>
References: <458BC6B0F287034F92FE78908BD01CE84EF35EF0@mtlexch01.mtl.com>	<5D49E7A8952DC44FB38C38FA0D758EAD0FE7A8@mtlexch01.mtl.com>
	<bc457d660811270452u46ea0231xdac09f829a5348c2@mail.gmail.com>
Message-ID: <492EAEB4.4000301@mellanox.co.il>

Olga Shern (Voltaire) wrote:
>> OFED 1.4 release: RC6 on Nov 28, GA on Dec 8
>>     
>
> Hi,
>
> Are you going to build RC6 today/tomorrow?
> I see that there are still a lot of major bugs. Maybe we should wait?
>
>   

We have already built it, and I will publish it later today after some sanity
checks that we run here.
We should not wait, since UNH must run their Logo program tests.

We can fix a few more critical bugs next week too.

Tziporet


From tziporet at mellanox.co.il  Thu Nov 27 08:38:55 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 27 Nov 2008 18:38:55 +0200
Subject: [ofa-general] OFED-1.4-rc6 release is available
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD010F752D@mtlexch01.mtl.com>


Hi, 
OFED-1.4-rc6 release is available on 
http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-rc6.tgz 


To get BUILD_ID run ofed_info 

Please report any issues in bugzilla https://bugs.openfabrics.org/ for
OFED 1.4 

Vladimir & Tziporet

========================================================================


Release information: 
------------------------------ 
Linux Operating Systems: 
       - RedHat EL4 up4:     2.6.9-42.ELsmp        *
       - RedHat EL4 up5:     2.6.9-55.ELsmp
       - RedHat EL4 up6:     2.6.9-67.ELsmp
       - RedHat EL4 up7:     2.6.9-78.ELsmp
       - RedHat EL5:         2.6.18-8.el5
       - RedHat EL5 up1:     2.6.18-53.el5
       - RedHat EL5 up2:     2.6.18-92.el5
       - OEL 4.5:            2.6.9-55.ELsmp
       - OEL 5.2:            2.6.18-92.el5
       - CentOS 5.2:         2.6.18-92.el5
       - Fedora C9:          2.6.25-14.fc9         *
       - SLES10:             2.6.16.21-0.8-smp
       - SLES10 SP1:         2.6.16.46-0.12-smp
       - SLES10 SP1 up1:     2.6.16.53-0.16-smp
       - SLES10 SP2:         2.6.16.60-0.21-smp
       - OpenSuSE 10.3:      2.6.22.5-31           *
       - kernel.org:         2.6.26 and 2.6.27

     * Minimal QA for these versions 

Systems: 
       * x86_64 
       * x86 
       * ia64 
       * ppc64 


Main Changes from OFED-1.4-rc4
==============================
- Updated MPI packages: mvapich-1.1.0-3143
- Updated bonding package: ib-bonding-0.9.0-36
- Updated opensm version to opensm-3.2.4
- Updated diags package version to infiniband-diags-1.4.3
- 19 bugs fixed (see attached for details) 

- Attached kernel git tree changes

Tasks that should be completed for the release: 
===================================
1. High priority bug fixes
2. UNH Logo program testing
3. Documentation update

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4-rc6-fixed-bugs.csv
Type: application/octet-stream
Size: 2083 bytes
Desc: ofed-1.4-rc6-fixed-bugs.csv
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081127/f67d94bf/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rc5_rc6_commits
Type: application/octet-stream
Size: 10513 bytes
Desc: rc5_rc6_commits
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081127/f67d94bf/attachment-0001.obj>

From vlad at lists.openfabrics.org  Fri Nov 28 03:28:01 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 28 Nov 2008 03:28:01 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081128-0200 daily build status
Message-ID: <20081128112802.14E7CE60CB2@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From dotanba at gmail.com  Fri Nov 28 08:01:02 2008
From: dotanba at gmail.com (Dotan Barak)
Date: Fri, 28 Nov 2008 18:01:02 +0200
Subject: ***SPAM*** Re: [ofa-general] set up QPs with different transfer rate
In-Reply-To: <OFC298460B.3F451131-ON8625750D.00792EE6-8625750D.007A9277@TMRIUSA.COM>
References: <OFC298460B.3F451131-ON8625750D.00792EE6-8625750D.007A9277@TMRIUSA.COM>
Message-ID: <493015BE.8050702@gmail.com>

Yicheng Jia wrote:
>
> Hi Folks,
>
> I have two applications which require different IB transfer rates. I 
> am using Mellanox 25204 HCA. Can I achieve it by setting up two QPs 
> with different service levels? Can I set "SL" field in QP context, or 
> it is controlled by SM? Thanks!
You can set the SL value in the QP, but the SM controls the SL2VL 
mapping + VL_arbitration table.

Dotan
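
The split of responsibilities described above can be modelled in a few lines.
This is only a sketch under stated assumptions: port_sketch, qp_sketch and
vl_for_qp() are hypothetical names, and the 16-entry table stands in for the
SL2VL mapping that the SM programs on a port. In libibverbs the SL of a
connected QP is, to my knowledge, the ah_attr.sl field of the struct
ibv_qp_attr passed to ibv_modify_qp(); that call is not used here so that the
example stays self-contained.

```c
#include <stdint.h>

#define NUM_SL 16	/* the SL is a 4-bit value */

/* Hypothetical model of the per-port state owned by the SM. */
struct port_sketch {
	uint8_t sl2vl[NUM_SL];	/* SL -> VL mapping, written by the SM */
};

/* Hypothetical model of the per-QP state owned by the application. */
struct qp_sketch {
	uint8_t sl;		/* service level chosen by the application */
};

/* The VL a QP's traffic ends up on: the application picks the SL,
 * but the lookup table it indexes is not under its control. */
static uint8_t vl_for_qp(const struct port_sketch *port,
			 const struct qp_sketch *qp)
{
	return port->sl2vl[qp->sl & (NUM_SL - 1)];
}
```

Two QPs with different SLs get different treatment only if the SM maps those
SLs to different VLs and gives the VLs different weights in the VL arbitration
table; choosing distinct SLs in the QPs is necessary but not sufficient.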
>
> Best,
>
> Yicheng
>
> Software Engineer
> Toshiba Medical Research Institute USA, Inc.


From rdreier at cisco.com  Fri Nov 28 21:14:09 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 28 Nov 2008 21:14:09 -0800
Subject: [ofa-general] [RMDA CM IPv6 support. PATCHv4 1/6] AF_INET6
	support for rdma_bind_addr
In-Reply-To: <1227721899.3121.18.camel@alst60.voltaire.com> (Aleksey Senin's
	message of "Wed, 26 Nov 2008 19:51:39 +0200")
References: <1227721899.3121.18.camel@alst60.voltaire.com>
Message-ID: <ada8wr3cdwe.fsf@cisco.com>

I would like to get some input from Sean before proceeding on this, but
one thing does jump out at me: the order of the patches seems strange to
me (or maybe it's the way the patches are split).  Starting with this
change only:

 > @@ -2073,7 +2073,7 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr)
 > -	if (addr->sa_family != AF_INET)
 > +	if (addr->sa_family != AF_INET && addr->sa_family != AF_INET6)
 >  		return -EAFNOSUPPORT;

seems wrong to me.  If I just have this patch applied (eg if I'm doing a
bisection to track down a bug) then it seems I'll get some very strange
results if I try to bind an IPv6 address.

It seems to me we would want all the prep work like using
sockaddr_storage where needed, etc. before we actually enable IPv6 in
the API.

- R.


From rdreier at cisco.com  Fri Nov 28 21:48:29 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 28 Nov 2008 21:48:29 -0800
Subject: [ofa-general]
	Re: [PATCH V2] mlx4: save default port ib capabilities,
	and use when setting port type to IB.
In-Reply-To: <200811051444.02306.jackm@dev.mellanox.co.il> (Jack Morgenstein's
	message of "Wed, 5 Nov 2008 14:44:01 +0200")
References: <200811041214.39085.jackm@dev.mellanox.co.il>
	<adaprlbtjaz.fsf@cisco.com>
	<200811051444.02306.jackm@dev.mellanox.co.il>
Message-ID: <adavdu7axqq.fsf@cisco.com>

thanks, applied


From aostvold at platform.com  Sat Nov 29 01:00:42 2008
From: aostvold at platform.com (Asmund Ostvold)
Date: Sat, 29 Nov 2008 10:00:42 +0100
Subject: [ofa-general] reviving wrong data after trying to allocation a too
 large memory chunck
Message-ID: <493104BA.9090607@platform.com>


We discovered a strange problem running OFED. We're not sure whether it is an
OFED problem, but we are posting it here anyway.


Short description:
We have a program that allocates a set of buffers with valloc, sends
them with ibv_post_send, and frees them.
This runs in a loop.
We have a "caching" algorithm, so we register memory only the first
time we come across a buffer address.
We start getting wrong data for parts of the sends after a couple of
iterations.

There are a few things worth mentioning:
- We must use valloc to trigger the problem; the test works with malloc
- We must have a malloc that tries to allocate a too-large chunk before
starting the loop (the malloc fails)

We have modified the "rdma_lat.c" program to show the error (attached).

Regards
Asmund
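
The "register only the first time we see an address" scheme described above
can be sketched as follows. This is not the attached bug.c and not a diagnosis
of the reported problem: reg_cache, cache_get(), cache_invalidate() and
fake_register() are hypothetical names, and fake_register() merely stands in
for a real registration call. The sketch shows one general hazard of caches
keyed on the virtual address alone: if a buffer is freed and a later
allocation returns the same address, the cache hands back the old
registration unless it is invalidated on free.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8

struct reg_entry {
	uintptr_t addr;
	size_t    len;
	int       handle;	/* stand-in for a registration handle */
	int       valid;
};

struct reg_cache {
	struct reg_entry slot[CACHE_SLOTS];
	int next_handle;
};

/* Stand-in for a real registration: returns a fresh handle each time. */
static int fake_register(struct reg_cache *c)
{
	return ++c->next_handle;
}

/* Return the cached handle for addr, registering on a miss. */
static int cache_get(struct reg_cache *c, uintptr_t addr, size_t len)
{
	int i, free_slot = -1;

	for (i = 0; i < CACHE_SLOTS; i++) {
		struct reg_entry *e = &c->slot[i];

		if (e->valid && e->addr == addr && e->len >= len)
			return e->handle;	/* hit, keyed on address only */
		if (!e->valid && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		free_slot = 0;			/* naive eviction */

	c->slot[free_slot].addr   = addr;
	c->slot[free_slot].len    = len;
	c->slot[free_slot].handle = fake_register(c);
	c->slot[free_slot].valid  = 1;
	return c->slot[free_slot].handle;
}

/* Drop any entry for addr; meant to be called when the buffer is freed. */
static void cache_invalidate(struct reg_cache *c, uintptr_t addr)
{
	int i;

	for (i = 0; i < CACHE_SLOTS; i++)
		if (c->slot[i].valid && c->slot[i].addr == addr)
			c->slot[i].valid = 0;
}
```

Without the cache_invalidate() call on free, a second cache_get() for a reused
address returns the first handle, which may no longer describe the memory now
living at that address. Whether this is what happens in the reported test is
not established in this message.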


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: bug.c
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081129/07c42ddb/attachment.c>

From vlad at lists.openfabrics.org  Sat Nov 29 03:20:09 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 29 Nov 2008 03:20:09 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081129-0200 daily build status
Message-ID: <20081129112009.CEFB8E60324@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From alekseys at voltaire.com  Sun Nov 30 00:24:40 2008
From: alekseys at voltaire.com (Aleksey Senin)
Date: Sun, 30 Nov 2008 10:24:40 +0200
Subject: [ofa-general] [RMDA CM IPv6 support. PATCHv4 1/6] AF_INET6
	support for rdma_bind_addr
In-Reply-To: <ada8wr3cdwe.fsf@cisco.com>
References: <1227721899.3121.18.camel@alst60.voltaire.com>
	<ada8wr3cdwe.fsf@cisco.com>
Message-ID: <1228033480.3621.5.camel@alst60.voltaire.com>

You are right, this one should probably be applied last in the
series. And the first should be this one:

> static int cma_bind_any(struct rdma_cm_id *id, sa_family_t af)
>  {
> -       struct sockaddr_in addr_in;
> +       struct sockaddr_storage addr_in;
> 
>         memset(&addr_in, 0, sizeof addr_in);
> -       addr_in.sin_family = af;
> +       addr_in.ss_family = af;
>         return rdma_bind_addr(id, (struct sockaddr *) &addr_in);
>  }
> 

But all the other patches depend on one another, and in my opinion it is
better to apply them all together; otherwise, when separated, all those 'if'
statements make no sense.
So I'll be waiting for Sean's input too.
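
The reason the cma_bind_any() change has to come before IPv6 is enabled is
that a struct sockaddr_in is too small to hold an IPv6 address. Below is a
minimal userspace sketch of the same idea, using the standard socket headers;
fill_any_addr() and addr_len_for() are hypothetical helper names, not
functions from the RDMA CM.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Build a wildcard ("any") address for the given family in storage that
 * is large enough for both IPv4 and IPv6, mirroring the quoted change. */
static void fill_any_addr(struct sockaddr_storage *ss, sa_family_t af)
{
	memset(ss, 0, sizeof *ss);	/* all-zero address == wildcard */
	ss->ss_family = af;
}

/* Number of bytes of the storage that are meaningful for a family;
 * 0 means the family is not handled by this sketch. */
static socklen_t addr_len_for(sa_family_t af)
{
	switch (af) {
	case AF_INET:
		return sizeof(struct sockaddr_in);
	case AF_INET6:
		return sizeof(struct sockaddr_in6);
	default:
		return 0;
	}
}
```

A caller would then pass (struct sockaddr *)&ss together with
addr_len_for(af) to whatever consumes the address, instead of assuming the
IPv4 size everywhere.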


From jackm at dev.mellanox.co.il  Sun Nov 30 00:28:18 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 30 Nov 2008 10:28:18 +0200
Subject: [ofa-general] [PATCH] mlx4_ib/mthca: Fix dispatch of
	IB_EVENT_LID_CHANGE
In-Reply-To: <492EAB55.7020003@Voltaire.COM>
References: <492C226D.7040009@Voltaire.COM>
	<200811271551.47076.jackm@dev.mellanox.co.il>
	<492EAB55.7020003@Voltaire.COM>
Message-ID: <200811301028.19171.jackm@dev.mellanox.co.il>

I've split this patch into two separate patches,
one for ib_mthca, and another for mlx4_ib (for better trackability).
I've committed both of them to OFED 1.4 (so that they will be in
tomorrow's daily)

I'll post them shortly to the list as a 2-patch sequence.

- Jack



From jackm at dev.mellanox.co.il  Sun Nov 30 00:28:59 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 30 Nov 2008 10:28:59 +0200
Subject: [ofa-general] [PATCH 1 of 2] mlx4_ib: Fix dispatch
	of	IB_EVENT_LID_CHANGE
In-Reply-To: <492EAB55.7020003@Voltaire.COM>
References: <492C226D.7040009@Voltaire.COM>
	<200811271551.47076.jackm@dev.mellanox.co.il>
	<492EAB55.7020003@Voltaire.COM>
Message-ID: <200811301028.59757.jackm@dev.mellanox.co.il>

mlx4_ib: Fix dispatch of IB_EVENT_LID_CHANGE

When snooping a portinfo MAD, its client_reregister bit is checked. 
If the bit is ON then a CLIENT_REREGISTER event is dispatched, otherwise
a LID_CHANGE event is dispatched. This ignores the cases where the MAD
changes the LID along with an instruction to reregister (so a
necessary LID_CHANGE event won't be dispatched), or the MAD is neither of
these (and an unnecessary LID_CHANGE event is dispatched). 

This patch dispatches an event if the client_reregister bit is set.
In addition, the patch compares the LID in the MAD to the current LID.
If and only if they are not identical, a LID_CHANGE event is dispatched.

From: Moni Shoua <monis at voltaire.com>
Signed-off-by: Moni Shoua <monis at voltaire.com>
Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
Signed-off-by: Yossi Etigin <yosefe at voltaire.com>

---
Roland,
Here is Moni's patch separated into two patches, one for mlx4_ib and one for ib_mthca.

Jack
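
For readers who want the behavioural change at a glance, the decision made in
smp_snoop() after this patch can be written as a pure function. This is a
sketch, not the driver code: events_for_portinfo_set() and the EV_* flags are
hypothetical, while the 0x80 mask is the client-reregister bit tested in the
patch itself.

```c
#include <stdint.h>

#define EV_CLIENT_REREGISTER 0x1u	/* hypothetical flag bits, sketch only */
#define EV_LID_CHANGE        0x2u
#define CLIENT_REREG_BIT     0x80u	/* bit tested in clientrereg_resv_subnetto */

/* Which events a successful Set(PortInfo) produces. The two checks are
 * independent, so the result may be none, either one, or both events.
 * Before the patch it was always exactly one of the two. */
static unsigned events_for_portinfo_set(uint8_t clientrereg_resv_subnetto,
					uint16_t prev_lid, uint16_t new_lid)
{
	unsigned events = 0;

	if (clientrereg_resv_subnetto & CLIENT_REREG_BIT)
		events |= EV_CLIENT_REREGISTER;
	if (prev_lid != new_lid)
		events |= EV_LID_CHANGE;
	return events;
}
```

Because prev_lid is initialized to 0 and only overwritten when ib_query_port()
succeeds, a failed query behaves as "LID changed" for any non-zero LID in the
MAD, which matches the "assume the worst" choice discussed earlier in the
thread.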

Index: infiniband/drivers/infiniband/hw/mlx4/mad.c
===================================================================
--- infiniband.orig/drivers/infiniband/hw/mlx4/mad.c	2008-11-04 10:21:02.000000000 +0200
+++ infiniband/drivers/infiniband/hw/mlx4/mad.c	2008-11-30 09:47:39.000000000 +0200
@@ -147,7 +147,8 @@ static void update_sm_ah(struct mlx4_ib_
  * Snoop SM MADs for port info and P_Key table sets, so we can
  * synthesize LID change and P_Key change events.
  */
-static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad)
+static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad,
+				u16 prev_lid)
 {
 	struct ib_event event;
 
@@ -157,6 +158,7 @@ static void smp_snoop(struct ib_device *
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) {
 			struct ib_port_info *pinfo =
 				(struct ib_port_info *) ((struct ib_smp *) mad)->data;
+			u16 lid = be16_to_cpu(pinfo->lid);
 
 			update_sm_ah(to_mdev(ibdev), port_num,
 				     be16_to_cpu(pinfo->sm_lid),
@@ -165,12 +167,15 @@ static void smp_snoop(struct ib_device *
 			event.device	       = ibdev;
 			event.element.port_num = port_num;
 
-			if (pinfo->clientrereg_resv_subnetto & 0x80)
+			if (pinfo->clientrereg_resv_subnetto & 0x80) {
 				event.event    = IB_EVENT_CLIENT_REREGISTER;
-			else
+				ib_dispatch_event(&event);
+			}
+			if (prev_lid != lid) {
 				event.event    = IB_EVENT_LID_CHANGE;
+				ib_dispatch_event(&event);
+			}
 
-			ib_dispatch_event(&event);
 		}
 
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) {
@@ -228,8 +233,9 @@ int mlx4_ib_process_mad(struct ib_device
 			struct ib_wc *in_wc, struct ib_grh *in_grh,
 			struct ib_mad *in_mad, struct ib_mad *out_mad)
 {
-	u16 slid;
+	u16 slid, prev_lid = 0;
 	int err;
+	struct ib_port_attr pattr;
 
 	slid = in_wc ? in_wc->slid : be16_to_cpu(IB_LID_PERMISSIVE);
 
@@ -263,6 +269,13 @@ int mlx4_ib_process_mad(struct ib_device
 	} else
 		return IB_MAD_RESULT_SUCCESS;
 
+	if ((in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
+	     in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
+	    in_mad->mad_hdr.method == IB_MGMT_METHOD_SET &&
+	    in_mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO &&
+	    !ib_query_port(ibdev, port_num, &pattr))
+		prev_lid = pattr.lid;
+
 	err = mlx4_MAD_IFC(to_mdev(ibdev),
 			   mad_flags & IB_MAD_IGNORE_MKEY,
 			   mad_flags & IB_MAD_IGNORE_BKEY,
@@ -271,7 +284,7 @@ int mlx4_ib_process_mad(struct ib_device
 		return IB_MAD_RESULT_FAILURE;
 
 	if (!out_mad->mad_hdr.status) {
-		smp_snoop(ibdev, port_num, in_mad);
+		smp_snoop(ibdev, port_num, in_mad, prev_lid);
 		node_desc_override(ibdev, out_mad);
 	}
 

From jackm at dev.mellanox.co.il  Sun Nov 30 00:29:01 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 30 Nov 2008 10:29:01 +0200
Subject: [ofa-general] [PATCH 2 of 2] ib_mthca: Fix dispatch of IB_EVENT_LID_CHANGE
Message-ID: <200811301029.02196.jackm@dev.mellanox.co.il>

ib_mthca: Fix dispatch of IB_EVENT_LID_CHANGE

When snooping a PortInfo MAD, its client_reregister bit is checked.
If the bit is ON then a CLIENT_REREGISTER event is dispatched, otherwise
a LID_CHANGE event is dispatched. This ignores the cases where the MAD
changes the LID along with an instruction to reregister (so a
necessary LID_CHANGE event won't be dispatched), or where the MAD does
neither (and an unnecessary LID_CHANGE event is dispatched).

This patch dispatches a CLIENT_REREGISTER event if the client_reregister
bit is set. In addition, the patch compares the LID in the MAD to the
port's LID as queried before the MAD is processed. If and only if they
are not identical, a LID_CHANGE event is dispatched.

From: Moni Shoua <monis at voltaire.com>
Signed-off-by: Moni Shoua <monis at voltaire.com>
Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
Signed-off-by: Yossi Etigin <yosefe at voltaire.com>

---

Roland,
Here is Moni's patch separated into two patches, one for mlx4_ib and one for ib_mthca.

Jack
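
For illustration only, the dispatch rule described in the changelog above
can be sketched in user space as follows. The enum values and the function
name are hypothetical and are not kernel identifiers; only the 0x80 mask
and the prev_lid comparison are taken from the patch.

```c
#include <stdint.h>

/* Hypothetical event flags, used only by this sketch. */
enum {
	EV_NONE = 0,
	EV_CLIENT_REREGISTER = 1 << 0,
	EV_LID_CHANGE = 1 << 1
};

/*
 * Return the set of events to dispatch after a successful PortInfo Set.
 * rereg_byte mirrors pinfo->clientrereg_resv_subnetto; bit 0x80 is the
 * client reregister bit, as tested in the patch.
 */
static unsigned events_for_portinfo(uint8_t rereg_byte, uint16_t prev_lid,
				    uint16_t new_lid)
{
	unsigned events = EV_NONE;

	if (rereg_byte & 0x80)
		events |= EV_CLIENT_REREGISTER;
	if (prev_lid != new_lid)
		events |= EV_LID_CHANGE;

	return events;
}
```

The two tests are independent, so a MAD can produce both events, one of
them, or none, which covers the two cases the old if/else missed.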

Index: infiniband/drivers/infiniband/hw/mthca/mthca_mad.c
===================================================================
--- infiniband.orig/drivers/infiniband/hw/mthca/mthca_mad.c	2008-11-04 10:21:02.000000000 +0200
+++ infiniband/drivers/infiniband/hw/mthca/mthca_mad.c	2008-11-30 09:48:35.000000000 +0200
@@ -104,7 +104,8 @@ static void update_sm_ah(struct mthca_de
  */
 static void smp_snoop(struct ib_device *ibdev,
 		      u8 port_num,
-		      struct ib_mad *mad)
+		      struct ib_mad *mad,
+		      u16 prev_lid)
 {
 	struct ib_event event;
 
@@ -114,6 +115,7 @@ static void smp_snoop(struct ib_device *
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) {
 			struct ib_port_info *pinfo =
 				(struct ib_port_info *) ((struct ib_smp *) mad)->data;
+			u16 lid = be16_to_cpu(pinfo->lid);
 
 			mthca_update_rate(to_mdev(ibdev), port_num);
 			update_sm_ah(to_mdev(ibdev), port_num,
@@ -123,12 +125,15 @@ static void smp_snoop(struct ib_device *
 			event.device           = ibdev;
 			event.element.port_num = port_num;
 
-			if (pinfo->clientrereg_resv_subnetto & 0x80)
+			if (pinfo->clientrereg_resv_subnetto & 0x80) {
 				event.event    = IB_EVENT_CLIENT_REREGISTER;
-			else
+				ib_dispatch_event(&event);
+			}
+			if (prev_lid != lid) {
 				event.event    = IB_EVENT_LID_CHANGE;
+				ib_dispatch_event(&event);
+			}
 
-			ib_dispatch_event(&event);
 		}
 
 		if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) {
@@ -196,6 +201,8 @@ int mthca_process_mad(struct ib_device *
 	int err;
 	u8 status;
 	u16 slid = in_wc ? in_wc->slid : be16_to_cpu(IB_LID_PERMISSIVE);
+	u16 prev_lid = 0;
+	struct ib_port_attr pattr;
 
 	/* Forward locally generated traps to the SM */
 	if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP &&
@@ -233,6 +240,12 @@ int mthca_process_mad(struct ib_device *
 			return IB_MAD_RESULT_SUCCESS;
 	} else
 		return IB_MAD_RESULT_SUCCESS;
+	if ((in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
+	     in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) &&
+	    in_mad->mad_hdr.method == IB_MGMT_METHOD_SET &&
+	    in_mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO &&
+	    !ib_query_port(ibdev, port_num, &pattr))
+		prev_lid = pattr.lid;
 
 	err = mthca_MAD_IFC(to_mdev(ibdev),
 			    mad_flags & IB_MAD_IGNORE_MKEY,
@@ -252,7 +265,7 @@ int mthca_process_mad(struct ib_device *
 	}
 
 	if (!out_mad->mad_hdr.status) {
-		smp_snoop(ibdev, port_num, in_mad);
+		smp_snoop(ibdev, port_num, in_mad, prev_lid);
 		node_desc_override(ibdev, out_mad);
 	}
 

From vlad at lists.openfabrics.org  Sun Nov 30 03:20:52 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 30 Nov 2008 03:20:52 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20081130-0200 daily build status
Message-ID: <20081130112052.7C30AE60D46@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:


From sashak at voltaire.com  Sun Nov 30 05:30:26 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 15:30:26 +0200
Subject: [ofa-general] Re: [PATCH] opensm: skeleton for toroidal mesh
	analysis
In-Reply-To: <000001c943c8$fef921f0$fceb65d0$@com>
References: <000001c943c8$fef921f0$fceb65d0$@com>
Message-ID: <20081130133026.GE9338@sashak.voltaire.com>

Hi Bob,

On 00:44 Tue 11 Nov     , Robert Pearson wrote:
> Sasha, 
> 
> Here is the first patch in a series to implement the algorithm described in
> the file lash_changes.doc.
> 
> This patch
>       - creates a new command line flag --do_mesh_analysis and a new Boolean
> that is set if the flag is used.
>       - adds code to main to implement the flag and option.

This also requires an addition to the OpenSM man page and ideally some
explanation in the opensm/doc/current-routing.txt document. This can be
done as a separate patch if you like.

>       - creates a new file osm_mesh.c to hold the algorithm code
>       - moves declarations from osm_ucast_lash.c and osm_mesh.c into header
> files
>       - adds these files to Makefile.am
>       - adds a stub do_mesh_analysis() that is called from lash_core.
> 
> Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
> 
> -----
> 
> diff --git a/opensm/include/opensm/osm_mesh.h
> b/opensm/include/opensm/osm_mesh.h
> new file mode 100644
> index 0000000..1467440
> --- /dev/null
> +++ b/opensm/include/opensm/osm_mesh.h
> @@ -0,0 +1,46 @@
> +/*
> + * Copyright (c) 2008      System Fabric Works, Inc.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + */
> +
> +/*
> + * Abstract:
> + *      Declarations for mesh analysis
> + */
> +
> +#ifndef OSM_UCAST_MESH_H
> +#define OSM_UCAST_MESH_H
> +
> +struct _lash;
> +
> +int do_mesh_analysis(struct _lash *p_lash);
> +
> +#endif
> diff --git a/opensm/include/opensm/osm_subnet.h
> b/opensm/include/opensm/osm_subnet.h
> index 7259587..2abe36d 100644
> --- a/opensm/include/opensm/osm_subnet.h
> +++ b/opensm/include/opensm/osm_subnet.h
> @@ -215,6 +215,7 @@ typedef struct osm_subn_opt {
>  	char *node_name_map_name;
>  	char *prefix_routes_file;
>  	boolean_t consolidate_ipv6_snm_req;
> +	boolean_t do_mesh_analysis;
>  } osm_subn_opt_t;
>  /*
>  * FIELDS
> diff --git a/opensm/include/opensm/osm_ucast_lash.h
> b/opensm/include/opensm/osm_ucast_lash.h
> new file mode 100644
> index 0000000..646e9a3
> --- /dev/null
> +++ b/opensm/include/opensm/osm_ucast_lash.h
> @@ -0,0 +1,100 @@
> +/*
> + * Copyright (c) 2008      System Fabric Works, Inc.
> + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> + * Copyright (c) 2007      Simula Research Laboratory. All rights reserved.
> + * Copyright (c) 2007      Silicon Graphics Inc. All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + */
> +
> +/*
> + * Abstract:
> + *      Declarations for LASH algorithm
> + */
> +
> +#ifndef OSM_UCAST_LASH_H
> +#define OSM_UCAST_LASH_H
> +
> +enum {
> +	UNQUEUED,
> +	Q_MEMBER,
> +	MST_MEMBER,
> +	MAX_INT = 9999,
> +	NONE = MAX_INT
> +};
> +
> +typedef struct _cdg_vertex {
> +	int num_dependencies;
> +	struct _cdg_vertex **dependency;
> +	int from;
> +	int to;
> +	int seen;
> +	int temp;
> +	int visiting_number;
> +	struct _cdg_vertex *next;
> +	int num_temp_depend;
> +	int num_using_vertex;
> +	int *num_using_this_depend;
> +} cdg_vertex_t;
> +
> +typedef struct _reachable_dest {
> +	int switch_id;
> +	struct _reachable_dest *next;
> +} reachable_dest_t;
> +
> +typedef struct _switch {
> +	osm_switch_t *p_sw;
> +	int *dij_channels;
> +	int id;
> +	int used_channels;
> +	int q_state;
> +	struct routing_table {
> +		unsigned out_link;
> +		unsigned lane;
> +	} *routing_table;
> +	unsigned int num_connections;
> +	int *virtual_physical_port_table;
> +	int *phys_connections;
> +} switch_t;
> +
> +typedef struct _lash {
> +	osm_opensm_t *p_osm;
> +	int num_switches;
> +	uint8_t vl_min;
> +	int balance_limit;
> +	switch_t **switches;
> +	cdg_vertex_t ****cdg_vertex_matrix;
> +	int *num_mst_in_lane;
> +	int ***virtual_location;
> +} lash_t;
> +
> +#endif
> diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am
> index 01573d2..7b9da18 100644
> --- a/opensm/opensm/Makefile.am
> +++ b/opensm/opensm/Makefile.am
> @@ -31,7 +31,7 @@ opensm_SOURCES = main.c osm_console_io.c osm_console.c
> osm_db_files.c \
>  		 osm_inform.c osm_lid_mgr.c osm_lin_fwd_rcv.c \
>  		 osm_link_mgr.c osm_mcast_fwd_rcv.c \
>  		 osm_mcast_mgr.c osm_mcast_tbl.c osm_mcm_info.c \
> -		 osm_mcm_port.c osm_mtree.c osm_multicast.c osm_node.c \
> +		 osm_mcm_port.c osm_mesh.c osm_mtree.c osm_multicast.c
> osm_node.c \
>  		 osm_node_desc_rcv.c osm_node_info_rcv.c \
>  		 osm_opensm.c osm_pkey.c osm_pkey_mgr.c osm_pkey_rcv.c \
>  		 osm_port.c osm_port_info_rcv.c \
> @@ -76,6 +76,7 @@ opensminclude_HEADERS = \
>  	$(srcdir)/../include/opensm/osm_errors.h \
>  	$(srcdir)/../include/opensm/osm_helper.h \
>  	$(srcdir)/../include/opensm/osm_inform.h \
> +	$(srcdir)/../include/opensm/osm_ucast_lash.h \
>  	$(srcdir)/../include/opensm/osm_lid_mgr.h \
>  	$(srcdir)/../include/opensm/osm_log.h \
>  	$(srcdir)/../include/opensm/osm_mad_pool.h \
> @@ -83,6 +84,7 @@ opensminclude_HEADERS = \
>  	$(srcdir)/../include/opensm/osm_mcast_tbl.h \
>  	$(srcdir)/../include/opensm/osm_mcm_info.h \
>  	$(srcdir)/../include/opensm/osm_mcm_port.h \
> +	$(srcdir)/../include/opensm/osm_mesh.h \
>  	$(srcdir)/../include/opensm/osm_mtree.h \
>  	$(srcdir)/../include/opensm/osm_multicast.h \
>  	$(srcdir)/../include/opensm/osm_msgdef.h \
> diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
> index 53648d6..63bd5a6 100644
> --- a/opensm/opensm/main.c
> +++ b/opensm/opensm/main.c
> @@ -585,6 +585,7 @@ int main(int argc, char *argv[])
>  #endif
>  		{"prefix_routes_file", 1, NULL, 3},
>  		{"consolidate_ipv6_snm_req", 0, NULL, 4},
> +		{"do_mesh_analysis", 0, NULL, 5},

A new command line option requires an addition (with a short explanation)
to the usage() function (invoked on 'opensm --help') and to the OpenSM man
page.

Also I suppose this option should be added to the OpenSM config file rather
than being "command line only".
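
A rough sketch of what pairing the long option with a usage() entry might
look like, using plain getopt_long(). The struct, the function names and
the help wording are hypothetical; only the option name and its value 5
come from the patch. Config file support is not covered here.

```c
#include <getopt.h>
#include <stdio.h>

/* Hypothetical option record; OpenSM keeps this flag in osm_subn_opt_t. */
struct sketch_opt {
	int do_mesh_analysis;
};

/* Help text for the new option; the wording is illustrative only. */
static void sketch_usage(FILE *out)
{
	fprintf(out, "--do_mesh_analysis\n"
		"          Run mesh analysis before LASH routing.\n");
}

/* Returns 0 on success, 1 if help was requested or parsing failed. */
static int sketch_parse(int argc, char *argv[], struct sketch_opt *opt)
{
	static const struct option long_option[] = {
		{"do_mesh_analysis", 0, NULL, 5},
		{"help", 0, NULL, 'h'},
		{NULL, 0, NULL, 0}	/* Required at the end of the array */
	};
	int c;

	optind = 1;	/* restart scanning so the function can be reused */
	while ((c = getopt_long(argc, argv, "h", long_option, NULL)) != -1) {
		switch (c) {
		case 5:
			opt->do_mesh_analysis = 1;
			break;
		default:	/* 'h', '?' and anything unexpected */
			sketch_usage(stderr);
			return 1;
		}
	}
	return 0;
}
```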

>  		{NULL, 0, NULL, 0}	/* Required at the end of the array
> */
>  	};
>  
> @@ -922,6 +923,9 @@ int main(int argc, char *argv[])
>  		case 4:
>  			opt.consolidate_ipv6_snm_req = TRUE;
>  			break;
> +		case 5:
> +			opt.do_mesh_analysis = TRUE;
> +			break;
>  		case 'h':
>  		case '?':
>  		case ':':
> diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
> new file mode 100644
> index 0000000..7943274
> --- /dev/null
> +++ b/opensm/opensm/osm_mesh.c
> @@ -0,0 +1,65 @@
> +/*
> + * Copyright (c) 2008      System Fabric Works, Inc.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + */
> +
> +/*
> + * Abstract:
> + *      routines to analyze certain meshes
> + */
> +
> +#if HAVE_CONFIG_H
> +#  include <config.h>
> +#endif				/* HAVE_CONFIG_H */
> +
> +#include <stdio.h>
> +#include <opensm/osm_switch.h>
> +#include <opensm/osm_opensm.h>
> +#include <opensm/osm_log.h>
> +#include <opensm/osm_mesh.h>
> +#include <opensm/osm_ucast_lash.h>
> +
> +/*
> + * do_mesh_analysis
> + */
> +int do_mesh_analysis(lash_t *p_lash)
> +{
> +	int ret = 0;
> +	osm_log_t *p_log = &p_lash->p_osm->log;
> +
> +	OSM_LOG_ENTER(p_log);
> +
> +	printf("lash: do_mesh_analysis stub called\n");
> +
> +	OSM_LOG_EXIT(p_log);
> +
> +	return ret;
> +}
> diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
> index c082798..e10371c 100644
> --- a/opensm/opensm/osm_ucast_lash.c
> +++ b/opensm/opensm/osm_ucast_lash.c
> @@ -52,64 +52,13 @@
>  #include <opensm/osm_switch.h>
>  #include <opensm/osm_opensm.h>
>  #include <opensm/osm_log.h>
> +#include <opensm/osm_mesh.h>
> +#include <opensm/osm_ucast_lash.h>
>  
>  /* //////////////////////////// */
>  /*  Local types                 */
>  /* //////////////////////////// */
>  
> -enum {
> -	UNQUEUED,
> -	Q_MEMBER,
> -	MST_MEMBER,
> -	MAX_INT = 9999,
> -	NONE = MAX_INT
> -};
> -
> -typedef struct _cdg_vertex {
> -	int num_dependencies;
> -	struct _cdg_vertex **dependency;
> -	int from;
> -	int to;
> -	int seen;
> -	int temp;
> -	int visiting_number;
> -	struct _cdg_vertex *next;
> -	int num_temp_depend;
> -	int num_using_vertex;
> -	int *num_using_this_depend;
> -} cdg_vertex_t;
> -
> -typedef struct _reachable_dest {
> -	int switch_id;
> -	struct _reachable_dest *next;
> -} reachable_dest_t;
> -
> -typedef struct _switch {
> -	osm_switch_t *p_sw;
> -	int *dij_channels;
> -	int id;
> -	int used_channels;
> -	int q_state;
> -	struct routing_table {
> -		unsigned out_link;
> -		unsigned lane;
> -	} *routing_table;
> -	unsigned int num_connections;
> -	int *virtual_physical_port_table;
> -	int *phys_connections;
> -} switch_t;
> -
> -typedef struct _lash {
> -	osm_opensm_t *p_osm;
> -	int num_switches;
> -	uint8_t vl_min;
> -	int balance_limit;
> -	switch_t **switches;
> -	cdg_vertex_t ****cdg_vertex_matrix;
> -	int *num_mst_in_lane;
> -	int ***virtual_location;
> -} lash_t;
> -
>  static cdg_vertex_t *create_cdg_vertex(unsigned num_switches)
>  {
>  	cdg_vertex_t *cdg_vertex = (cdg_vertex_t *)
> malloc(sizeof(cdg_vertex_t));
> @@ -872,10 +821,15 @@ static int lash_core(lash_t * p_lash)
>  	int output_link2, i_next_switch2;
>  	int cycle_found2 = 0;
>  	int status = 0;
> -	int *switch_bitmap;	/* Bitmap to check if we have processed this
> pair */
> +	int *switch_bitmap = NULL;	/* Bitmap to check if we have
> processed this pair */

Why is this initialization needed?
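
One plausible reason, assuming the Exit label in lash_core() frees
switch_bitmap (that part is not visible in the quoted hunk): the new
"goto Exit" runs before the bitmap is allocated, so the pointer must
already hold a defined value. A minimal sketch with hypothetical names:

```c
#include <stdlib.h>

/*
 * Hypothetical stand-in for lash_core(): fail_early models the new
 * "goto Exit" that is taken before the buffer is allocated.
 */
static int sketch_core(int fail_early, size_t n)
{
	int status = -1;
	int *bitmap = NULL;	/* free(NULL) is a no-op, so Exit is safe */

	if (fail_early)
		goto Exit;

	bitmap = calloc(n, sizeof(int));
	if (!bitmap)
		goto Exit;

	status = 0;
Exit:
	free(bitmap);
	return status;
}
```

Without the NULL initializer the early exit would pass an indeterminate
pointer to free().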

>  
>  	OSM_LOG_ENTER(p_log);
>  
> +	if (p_lash->p_osm->subn.opt.do_mesh_analysis &&
> do_mesh_analysis(p_lash)) {
> +		OSM_LOG(p_log, OSM_LOG_ERROR, "Mesh analysis failed\n");
> +		goto Exit;
> +	}
> +
>  	for (i = 0; i < num_switches; i++) {
>  
>  		shortest_path(p_lash, i);


From sashak at voltaire.com  Sun Nov 30 05:48:57 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 15:48:57 +0200
Subject: [ofa-general] Re: [PATCH][3] opensm: per mesh node information
In-Reply-To: <000501c943d4$57b3f8f0$071bead0$@com>
References: <000501c943d4$57b3f8f0$071bead0$@com>
Message-ID: <20081130134857.GF9338@sashak.voltaire.com>

On 02:06 Tue 11 Nov     , Robert Pearson wrote:
> Sasha,
> 
> This is the third patch implementing the mesh analysis algorithm
> 
> This patch
>       - creates per mesh node (e.g. switch) data structure mesh_node_t
>       - adds a pointer to mesh_node_t in the switch_t structure
>       - implements create and cleanup methods for mesh_node_t
>       - calls these in switch_create and switch_delete in *lash.c
> 
> Regards,
> 
> Bob Pearson
> 
> Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
> ----
> diff --git a/opensm/include/opensm/osm_mesh.h
> b/opensm/include/opensm/osm_mesh.h
> index 8313614..78af086 100644
> --- a/opensm/include/opensm/osm_mesh.h
> +++ b/opensm/include/opensm/osm_mesh.h
> @@ -40,6 +40,39 @@
>  #define OSM_UCAST_MESH_H
>  
>  struct _lash;
> +struct _switch;
> +
> +enum mesh_node_type {
> +	mesh_type_none,
> +	mesh_type_cartesian,
> +};
> +
> +/*
> + * per switch to switch link info
> + */
> +typedef struct _link {
> +	int switch_id;
> +	int link_id;
> +	int *ports;
> +	int num_ports;
> +	int next_port;
> +} link_t;
> +
> +/*
> + * per switch node mesh info
> + */
> +typedef struct _mesh_node {
> +	unsigned int num_links;		/* number of 'links' to adjacent
> switches */
> +	link_t **links;			/* per link information */
> +	int *axes;			/* used to hold and reorder assigned
> axes */
> +	int *coord;			/* mesh coordinates of switch */
> +	int **matrix;			/* distances between adjacent
> switches */
> +	int *poly;			/* characteristic polynomial of
> matrix */
> +					/* used as an invariant
> classification */
> +	enum mesh_node_type type;
> +	int dimension;			/* apparent dimension of mesh around
> node */
> +	int temp;			/* temporary holder for distance
> info */
> +} mesh_node_t;
>  
>  /*
>   * per fabric mesh info
> @@ -55,4 +88,7 @@ typedef struct _mesh {
>  void osm_mesh_cleanup(struct _lash *p_lash);
>  int osm_do_mesh_analysis(struct _lash *p_lash);
>  
> +void osm_mesh_node_cleanup(struct _switch *sw);
> +int osm_mesh_node_create(struct _lash *p_lash, struct _switch *sw);
> +
>  #endif
> diff --git a/opensm/include/opensm/osm_ucast_lash.h
> b/opensm/include/opensm/osm_ucast_lash.h
> index 1ae3bb6..c037571 100644
> --- a/opensm/include/opensm/osm_ucast_lash.h
> +++ b/opensm/include/opensm/osm_ucast_lash.h
> @@ -81,6 +81,7 @@ typedef struct _switch {
>  		unsigned out_link;
>  		unsigned lane;
>  	} *routing_table;
> +	mesh_node_t *node;
>  	unsigned int num_connections;
>  	int *virtual_physical_port_table;
>  	int *phys_connections;
> diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
> index c97925b..6ef397c 100644
> --- a/opensm/opensm/osm_mesh.c
> +++ b/opensm/opensm/osm_mesh.c
> @@ -98,7 +98,7 @@ static int mesh_create(lash_t *p_lash)
>  }
>  
>  /*
> - * do_mesh_analysis
> + * osm_do_mesh_analysis
>   */
>  int osm_do_mesh_analysis(lash_t *p_lash)
>  {
> @@ -121,3 +121,83 @@ int osm_do_mesh_analysis(lash_t *p_lash)
>  
>  	return ret;
>  }
> +
> +/*
> + * osm_mesh_node_cleanup - cleanup per switch resources
> + */
> +void osm_mesh_node_cleanup(switch_t *sw)
> +{
> +	int i;
> +	mesh_node_t *node = sw->node;
> +	unsigned num_ports = sw->p_sw->num_ports;
> +
> +	if (node) {
> +		if (node->links) {
> +			for (i = 0; i < num_ports; i++) {
> +				if (node->links[i]) {
> +					if (node->links[i]->ports)
> +						free(node->links[i]->ports);
> +					free(node->links[i]);
> +				}
> +			}
> +			free(node->links);
> +		}
> +
> +		if (node->poly)
> +			free(node->poly);
> +
> +		if (node->matrix) {
> +			for (i = 0; i < node->num_links; i++) {
> +				if (node->matrix[i])
> +					free(node->matrix[i]);
> +			}
> +			free(node->matrix);
> +		}
> +
> +		if (node->axes)
> +			free(node->axes);
> +
> +		free(node);
> +
> +		sw->node = NULL;
> +	}
> +}
> +
> +/*
> + * osm_mesh_node_create - allocate per switch resources
> + */
> +int osm_mesh_node_create(lash_t *p_lash, switch_t *sw)
> +{
> +	osm_log_t *p_log = &p_lash->p_osm->log;
> +	int i;
> +	mesh_node_t *node;
> +	unsigned num_ports = sw->p_sw->num_ports;
> +
> +	if (!(node = sw->node = calloc(1, sizeof(mesh_node_t)))) {
> +		OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating mesh node -
> out of memory\n");
> +		return -1;
> +	}
> +
> +	if (!(node->links = calloc(num_ports, sizeof(link_t *))))
> +		goto err;
> +
> +	for (i = 0; i < num_ports; i++) {
> +		if (!(node->links[i] = calloc(1, sizeof(link_t))) ||
> +		    !(node->links[i]->ports = calloc(num_ports,
> sizeof(int))))
> +			goto err;
> +	}

Assuming that the ports array is preallocated, wouldn't it be simpler to
define link as:

typedef struct _link {
	int switch_id;
	int link_id;
	int num_ports;
	int next_port;
	int ports[0];
} link_t;

, and then:

	node->links[i] = calloc(1, sizeof(link_t) + num_ports * sizeof(int))

?

(Similar optimizations are probably relevant in other places).
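
A self-contained sketch of that layout, for illustration. The type and
helper names are hypothetical, and it uses the C99 flexible array member
spelling (ports[]) instead of the GNU zero-length array (ports[0]):

```c
#include <stdlib.h>

/* Same fields as link_t, with the ports array stored inline. */
typedef struct sketch_link {
	int switch_id;
	int link_id;
	int num_ports;
	int next_port;
	int ports[];	/* flexible array member, sized at allocation */
} sketch_link_t;

/*
 * One zeroed allocation holds the header and num_ports port slots, so
 * a single free() releases both and no separate ports pointer exists.
 */
static sketch_link_t *sketch_link_alloc(unsigned num_ports)
{
	return calloc(1, sizeof(sketch_link_t) + num_ports * sizeof(int));
}
```

With this layout the cleanup path needs one free() per link and no
separate check for the ports pointer.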

Sasha

> +
> +	if (!(node->axes = calloc(num_ports, sizeof(int))))
> +		goto err;
> +
> +	for (i = 0; i < num_ports; i++) {
> +		node->links[i]->switch_id = NONE;
> +	}
> +
> +	return 0;
> +
> +err:
> +	OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating mesh node - out of
> memory\n");
> +	osm_mesh_node_cleanup(sw);
> +	return -1;
> +}
> diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
> index 3577cca..b9394af 100644
> --- a/opensm/opensm/osm_ucast_lash.c
> +++ b/opensm/opensm/osm_ucast_lash.c
> @@ -651,6 +651,9 @@ static switch_t *switch_create(lash_t * p_lash, unsigned
> id, osm_switch_t * p_sw
>  		sw->phys_connections[i] = NONE;
>  	}
>  
> +	if (osm_mesh_node_create(p_lash, sw))
> +		return -1;
> +
>  	sw->p_sw = p_sw;
>  	if (p_sw)
>  		p_sw->priv = sw;
> @@ -660,6 +663,8 @@ static switch_t *switch_create(lash_t * p_lash, unsigned
> id, osm_switch_t * p_sw
>  
>  static void switch_delete(switch_t * sw)
>  {
> +	osm_mesh_node_cleanup(sw);
> +
>  	if (sw->dij_channels)
>  		free(sw->dij_channels);
>  	if (sw->virtual_physical_port_table)
> 
> 


From sashak at voltaire.com  Sun Nov 30 05:50:04 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 15:50:04 +0200
Subject: [ofa-general] opensm support for toroidal meshes
In-Reply-To: <000501c9437d$ffa7cd90$fef768b0$@com>
References: <000501c9437d$ffa7cd90$fef768b0$@com>
Message-ID: <20081130135004.GG9338@sashak.voltaire.com>

Hi Bob,

On 15:47 Mon 10 Nov     , Robert Pearson wrote:
> We have been involved in a project to deliver a large system based on a
> toroidal mesh fabric. One of the requirements for this system is to be able
> to guarantee a deadlock free routing of the fabric. The lash routing engine
> in opensm did not work in this case because the required number of VLs for
> the machine as configured was 12, which exceeded the number of VLs
> supported by Mellanox switch ASICs. It turns out that if one has the
> freedom to optimally reorder the port assignments used by lash, then lash
> can successfully route the fabric, but that is impractical in the hardware.
> The attached note describes an algorithm for automatically recognizing when
> a Cartesian mesh fabric is a torus, determining its size and optimally
> reordering the ports in opensm so that lash can generate a route with the
> smallest number of VLs.
> 
> We have implemented a set of changes to opensm that implement this algorithm
> and will submit the changes as patches. This note will help to understand
> the code.

Thanks for the great work! I'm sending some initial comments (still
learning the code).

Sasha



From sashak at voltaire.com  Sun Nov 30 05:54:40 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 15:54:40 +0200
Subject: [ofa-general] Re: [PATCH][4] opensm: vector and matrix utilities
In-Reply-To: <003201c9441c$d23ce8f0$76b6bad0$@com>
References: <003201c9441c$d23ce8f0$76b6bad0$@com>
Message-ID: <20081130135440.GH9338@sashak.voltaire.com>

On 10:44 Tue 11 Nov     , Robert Pearson wrote:
> Sasha,
> 
> Here is the fourth patch in a series implementing the mesh analysis
> algorithm.
> 
> This patch implements
>       - create and cleanup methods for polynomial with integer coefficients
>       - create and cleanup methods for square matrix with integer
> coefficients
>       - create and cleanup methods for square matrix with polynomial
> coefficients
>       - routine to compute the determinant of a matrix with polynomial
> coefficients
> 
> (Note the determinant is restricted to computing the characteristic
> polynomial)
> 
> Regards,
> 
> Bob Pearson
> 
> Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
> ----
> diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
> index 6ef397c..5dee1d0 100644
> --- a/opensm/opensm/osm_mesh.c
> +++ b/opensm/opensm/osm_mesh.c
> @@ -49,6 +49,295 @@
>  #include <opensm/osm_ucast_lash.h>
>  
>  /*
> + * poly_alloc
> + * 
> + * allocate a polynomial of degree n
> + */
> +static int *poly_alloc(lash_t *p_lash, int n)
> +{
> +	osm_log_t *p_log = &p_lash->p_osm->log;
> +	int *p;
> +
> +	if (!(p = calloc(n+1, sizeof(int)))) {
> +		OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating poly - out
> of memory\n");
> +	}
> +
> +	return p;
> +}
> +
> +/*
> + * poly_diff
> + *
> + * return a nonzero value if polynomials differ else 0
> + */
> +static int poly_diff(int n, int *p, switch_t *s)
> +{
> +	int i;
> +
> +	if (s->node->num_links != n)
> +		return 1;
> +
> +	for (i = 0; i <= n; i++) {
> +		if (s->node->poly[i] != p[i])
> +			return 1;
> +	}

	memcmp(s->node->poly, p, (n + 1) * sizeof(int))?
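
Note that memcmp() takes a length in bytes, and a polynomial of degree n
has n + 1 coefficients. A minimal sketch of the comparison (the helper
name is hypothetical):

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if two degree-n polynomials (n + 1 int coefficients) differ. */
static int coeffs_differ(const int *a, const int *b, int n)
{
	return memcmp(a, b, (size_t)(n + 1) * sizeof(int)) != 0;
}
```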

> +
> +	return 0;
> +}
> +
> +/*
> + * m_free
> + *
> + * free a square matrix of rank l
> + */
> +static void m_free(int **m, int l)
> +{
> +	int i;
> +
> +	if (m) {
> +		for (i = 0; i < l; i++) {
> +			if (m[i])
> +				free(m[i]);
> +		}
> +		free(m);
> +	}
> +}
> +
> +/*
> + * m_alloc
> + *
> + * allocate a square matrix of rank l
> + */
> +static int **m_alloc(lash_t *p_lash, int l)
> +{
> +	osm_log_t *p_log = &p_lash->p_osm->log;
> +	int i;
> +	int **m = NULL;
> +
> +	do {
> +		if (!(m = calloc(l, sizeof(int *))))
> +			break;
> +
> +		for (i = 0; i < l; i++) {
> +			if (!(m[i] = calloc(l, sizeof(int))))
> +				break;
> +		}
> +		if (i != l)
> +			break;
> +
> +		return m;
> +	} while(0);

Maybe just m = calloc(l*l, sizeof(int))?
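
A sketch of that flat layout, with hypothetical helper names. It assumes
callers can switch from m[i][j] to an index computed from the row and
column, which the quoted patch does not do:

```c
#include <stdlib.h>

/* One zeroed block for an l x l matrix; released by a single free(). */
static int *flat_m_alloc(int l)
{
	return calloc((size_t)l * (size_t)l, sizeof(int));
}

/* Element (i, j) lives at offset i * l + j (row-major order). */
static int *flat_m_at(int *m, int l, int i, int j)
{
	return &m[(size_t)i * (size_t)l + (size_t)j];
}
```

The trade-off is that the int ** interface used by the rest of the patch
would have to change, or a separate row-pointer table would still be
needed on top of the single block.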

> +
> +	OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating matrix - out of
> memory\n");
> +
> +	m_free(m, l);
> +	return NULL;
> +}
> +
> +/*
> + * pm_free
> + *
> + * free a square matrix of rank l of polynomials
> + */
> +static void pm_free(int ***m, int l)
> +{
> +	int i, j;
> +
> +	if (m) {
> +		for (i = 0; i < l; i++) {
> +			if (m[i]) {
> +				for (j = 0; j < l; j++) {
> +					if (m[i][j])
> +						free(m[i][j]);
> +				}
> +				free(m[i]);
> +			}
> +		}
> +		free(m);
> +	}
> +}
> +
> +/*
> + * pm_alloc
> + *
> + * allocate a square matrix of rank l of polynomials of degree n
> + */
> +static int ***pm_alloc(lash_t *p_lash, int l, int n)
> +{
> +	osm_log_t *p_log = &p_lash->p_osm->log;
> +	int i, j;
> +	int ***m = NULL;
> +
> +	do {
> +		if (!(m = calloc(l, sizeof(int **))))
> +			break;
> +
> +		for (i = 0; i < l; i++) {
> +			if (!(m[i] = calloc(l, sizeof(int *))))
> +				break;
> +
> +			for (j = 0; j < l; j++) {
> +				if (!(m[i][j] = calloc(n+1, sizeof(int))))
> +					break;
> +			}
> +			if (j != l)
> +				break;
> +		}
> +		if (i != l)
> +			break;
> +
> +		return m;
> +	} while(0);

Ditto.

> +
> +	OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating matrix - out of memory\n");
> +
> +	pm_free(m, l);
> +	return NULL;
> +}
> +
> +static int determinant(lash_t *p_lash, int n, int rank, int ***m, int *p);
> +
> +/*
> + * sub_determinant
> + *
> + * compute the determinant of a submatrix of matrix of rank l of polynomials of degree n
> + * with row and col removed in poly. caller must free poly
> + */
> +static int sub_determinant(lash_t *p_lash, int n, int l, int row, int col, int ***matrix, int **poly)
> +{
> +	int ret = -1;
> +	int ***m = NULL;
> +	int *p = NULL;
> +	int i, j, k, x, y;
> +	int rank = l - 1;
> +
> +	do {
> +		if (!(p = poly_alloc(p_lash, n))) {
> +			break;
> +		}
> +
> +		if (rank <= 0) {
> +			p[0] = 1;
> +			ret = 0;
> +			break;
> +		}
> +
> +		if (!(m = pm_alloc(p_lash, rank, n))) {
> +			free(p);
> +			p = NULL;
> +			break;
> +		}
> +
> +		x = 0;
> +		for (i = 0; i < l; i++) {
> +			if (i == row)
> +				continue;
> +
> +			y = 0;
> +			for (j = 0; j < l; j++) {
> +				if (j == col)
> +					continue;
> +
> +				for (k = 0; k <= n; k++)
> +					m[x][y][k] = matrix[i][j][k];
> +
> +				y++;
> +			}
> +			x++;
> +		}
> +
> +		if (determinant(p_lash, n, rank, m, p)) {
> +			free(p);
> +			p = NULL;
> +			break;
> +		}
> +
> +		ret = 0;
> +	} while(0);
> +
> +	pm_free(m, rank);
> +	*poly = p;
> +	return ret;
> +}
> +
> +/*
> + * determinant
> + *
> + * compute the determinant of matrix m of rank of polynomials of degree deg
> + * and add the result to polynomial p allocated by caller
> + */
> +static int determinant(lash_t *p_lash, int deg, int rank, int ***m, int *p)
> +{
> +	int i, j, k;
> +	int *q;
> +	int sign = 1;
> +
> +	/*
> +	 * handle simple case of 1x1 matrix
> +	 */
> +	if (rank == 1) {
> +		for (i = 0; i <= deg; i++)
> +			p[i] += m[0][0][i];
> +	}
> +
> +	/*
> +	 * handle simple case of 2x2 matrix
> +	 */
> +	else if (rank == 2) {
> +		for (i = 0; i <= deg; i++) {
> +			if (m[0][0][i] == 0)
> +				continue;
> +
> +			for (j = 0; j <= deg; j++) {
> +				if (m[1][1][j] == 0)
> +					continue;
> +
> +				p[i+j] += m[0][0][i]*m[1][1][j];
> +			}
> +		}
> +
> +		for (i = 0; i <= deg; i++) {
> +			if (m[0][1][i] == 0)
> +				continue;
> +
> +			for (j = 0; j <= deg; j++) {
> +				if (m[1][0][j] == 0)
> +					continue;
> +
> +				p[i+j] -= m[0][1][i]*m[1][0][j];
> +			}
> +		}
> +	}
> +
> +	/*
> +	 * handle the general case
> +	 */
> +	else {
> +		for (i = 0; i < rank; i++) {
> +			if (sub_determinant(p_lash, deg, rank, 0, i, m, &q))
> +				return -1;
> +
> +			for (j = 0; j <= deg; j++) {
> +				if (m[0][i][j] == 0)
> +					continue;
> +
> +				for (k = 0; k <= deg; k++) {
> +					if (q[k] == 0)
> +						continue;
> +
> +					p[j+k] += sign*m[0][i][j]*q[k];
> +				}
> +			}
> +
> +			free(q);
> +			sign = -sign;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/*
>   * osm_mesh_cleanup - free per mesh resources
>   */
>  void osm_mesh_cleanup(lash_t *p_lash)
> 
> 


From sashak at voltaire.com  Sun Nov 30 07:28:18 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 17:28:18 +0200
Subject: [ofa-general] Re: [PATCH][5] opensm: compute local geometry
In-Reply-To: <003301c9441e$eed2f480$cc78dd80$@com>
References: <003301c9441e$eed2f480$cc78dd80$@com>
Message-ID: <20081130152818.GI9338@sashak.voltaire.com>

On 10:59 Tue 11 Nov     , Robert Pearson wrote:
> Sasha,
> 
> Here is the fifth patch implementing the mesh analysis algorithm.
> 
> This patch implements
>       - routine to compute characteristics polynomial of a matrix
>       - routine to compute the local 'metric' around each switch
>       - routine to classify switches into a histogram of local geometry classes
> 
> Regards,
> 
> Bob Pearson
> 
> Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
> ----
> diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
> index 7434fee..9254de3 100644
> --- a/opensm/opensm/osm_mesh.c
> +++ b/opensm/opensm/osm_mesh.c
> @@ -338,6 +338,172 @@ static int determinant(lash_t *p_lash, int deg, int rank, int ***m, int *p)
>  }
>  
>  /*
> + * char_poly
> + *
> + * compute the characteristic polynomial of matrix of rank
> + * by computing the determinant of m-x*I and return in poly
> + * as an array. caller must free poly
> + */
> +static int char_poly(lash_t *p_lash, int rank, int **matrix, int **poly)
> +{
> +	int ret = -1;
> +	int i, j;
> +	int ***m = NULL;
> +	int *p = NULL;
> +	int deg = rank;
> +
> +	do {
> +		if (!(p = poly_alloc(p_lash, deg))) {
> +			break;
> +		}
> +
> +		if (!(m = pm_alloc(p_lash, rank, deg))) {
> +			free(p);
> +			p = NULL;
> +			break;
> +		}
> +
> +		for (i = 0; i < rank; i++) {
> +			for (j = 0; j < rank; j++) {
> +				m[i][j][0] = matrix[i][j];
> +			}
> +			m[i][i][1] = -1;
> +		}
> +
> +		if (determinant(p_lash, deg, rank, m, p)) {
> +			free(p);
> +			p = NULL;
> +			break;
> +		}
> +
> +		ret = 0;
> +	} while(0);
> +
> +	pm_free(m, rank);
> +	*poly = p;
> +	return ret;
> +}
> +
> +/*
> + * get_switch_metric
> + *
> + * compute the matrix of minimum distances between each of
> + * the adjacent switch nodes to sw along paths
> + * that do not go through sw. do calculation by
> + * relaxation method
> + * allocate space for the matrix and save in node_t structure
> + */
> +static int get_switch_metric(lash_t *p_lash, int sw)
> +{
> +	int ret = -1;
> +	int i, j, change;
> +	int sw1, sw2, sw3;
> +	switch_t *s = p_lash->switches[sw];
> +	switch_t *s1, *s2, *s3;
> +	int **m;
> +	mesh_node_t *node = s->node;
> +	int num_links = node->num_links;
> +
> +	do {
> +		if (!(m = m_alloc(p_lash, num_links)))
> +			break;
> +
> +		for (i = 0; i < num_links; i++) {
> +			sw1 = node->links[i]->switch_id;
> +			s1 = p_lash->switches[sw1];
> +
> +			/* make all distances big except s1 to itself */
> +			for (sw2 = 0; sw2 < p_lash->num_switches; sw2++)
> +				p_lash->switches[sw2]->node->temp = 0x7fffffff;
> +
> +			s1->node->temp = 0;
> +
> +			do {
> +				change = 0;
> +
> +				for (sw2 = 0; sw2 < p_lash->num_switches; sw2++) {
> +					s2 = p_lash->switches[sw2];
> +					if (s2->node->temp == 0x7fffffff)
> +						continue;
> +					for (j = 0; j < s2->node->num_links; j++) {
> +						sw3 = s2->node->links[j]->switch_id;
> +						s3 = p_lash->switches[sw3];
> +
> +						if (sw3 == sw)
> +							continue;
> +
> +						if ((s2->node->temp + 1) < s3->node->temp) {
> +							s3->node->temp = s2->node->temp + 1;
> +							change++;
> +						}
> +					}
> +				}
> +			} while(change);

As far as I can understand, this is a minimal hops calculation.

We already have this information in the OpenSM switches' lmx matrices. Using
them, matrix 'm' could be created as:

	for (i = 0; i < num_links; i++) {
		sw1 = node->links[i]->switch_id;
		s1 = p_lash->switches[sw1];

		for (j = 0; j < num_links; j++) {
			unsigned lid;
			sw2 = node->links[j]->switch_id;
			s2 = p_lash->switches[sw2];
			lid = cl_ntoh16(osm_node_get_base_lid(s2->p_sw->p_node, 0));

			m[i][j] = osm_switch_get_least_hops(s1->p_sw, lid);
		}
	}

> +
> +			for (j = 0; j < num_links; j++) {
> +				sw2 = node->links[j]->switch_id;
> +				s2 = p_lash->switches[sw2];
> +				m[i][j] = s2->node->temp;
> +			}
> +		}
> +
> +		if (char_poly(p_lash, num_links, m, &node->poly)) {
> +			m_free(m, num_links);
> +			m = NULL;
> +			break;
> +		}
> +
> +		ret = 0;
> +	} while(0);
> +
> +	node->matrix = m;
> +	return ret;
> +}
> +
> +/*
> + * classify_switch
> + *
> + * add switch to histogram of switch types
> + */
> +static void classify_switch(lash_t *p_lash, int sw)
> +{
> +	int i;
> +	switch_t *s = p_lash->switches[sw];
> +	switch_t *s1;
> +	mesh_t *mesh = p_lash->mesh;
> +
> +	for (i = 0; i < mesh->num_class; i++) {
> +		s1 = p_lash->switches[mesh->class_type[i]];
> +	
> +		if (poly_diff(s->node->num_links, s->node->poly, s1))
> +			continue;
> +
> +		mesh->class_count[i]++;
> +		return;
> +	}
> +
> +	mesh->class_type[mesh->num_class] = sw;
> +	mesh->class_count[mesh->num_class] = 1;
> +	mesh->num_class++;
> +	return;
> +}
> +
> +/*
> + * get_local_geometry
> + *
> + * analyze the local geometry around each switch
> + */
> +static void get_local_geometry(lash_t *p_lash)
> +{
> +	int sw;
> +
> +	for (sw = 0; sw < p_lash->num_switches; sw++) {
> +		get_switch_metric(p_lash, sw);
> +		classify_switch(p_lash, sw);
> +	}
> +}
> +
> +/*
>   * osm_mesh_cleanup - free per mesh resources
>   */
>  void osm_mesh_cleanup(lash_t *p_lash)
> @@ -404,6 +570,12 @@ int osm_do_mesh_analysis(lash_t *p_lash)
>  		return -1;
>  	}
>  
> +	/*
> +	 * get local metric and invariant for each switch
> +	 * also classify each switch
> +	 */
> +	get_local_geometry(p_lash);
> +
>  	printf("lash: do_mesh_analysis stub called\n");
>  
>  	OSM_LOG_EXIT(p_log);
> 
> 


From sashak at voltaire.com  Sun Nov 30 08:36:37 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 18:36:37 +0200
Subject: [ofa-general] Re: [PATCH][5] opensm: compute local geometry
In-Reply-To: <003301c9441e$eed2f480$cc78dd80$@com>
References: <003301c9441e$eed2f480$cc78dd80$@com>
Message-ID: <20081130163637.GJ9338@sashak.voltaire.com>

On 10:59 Tue 11 Nov     , Robert Pearson wrote:
> Sasha,
> 
> Here is the fifth patch implementing the mesh analysis algorithm.
> 
> This patch implements
>       - routine to compute characteristics polynomial of a matrix
>       - routine to compute the local 'metric' around each switch

I checked performance of determinant calculation - when switch has 8
links it takes 11-12 seconds per switch, with 10 links - 2177 seconds.

Is it possible to improve performance there?
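
For reference, the cofactor expansion in determinant() recurses on minors of
rank-1, so its cost grows roughly like rank!, which would fit the jump in
the timings above. One alternative is the Faddeev-LeVerrier recurrence,
which needs on the order of rank^4 integer operations. The sketch below is
not the method used in the patch; the name char_poly_fl, the row-major
input layout and the long long coefficients are assumptions made for the
example. It returns the coefficients of det(x*I - A), which differs from
the patch's det(A - x*I) by a factor of (-1)^rank:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * char_poly_fl (hypothetical name)
 *
 * compute the coefficients c[0..n] of det(x*I - A) for the n x n
 * integer matrix a (row-major, a[i*n + j]) using the
 * Faddeev-LeVerrier recurrence:
 *
 *   M_1 = I,  c[n] = 1
 *   c[n-k]  = -trace(A * M_k) / k
 *   M_(k+1) = A * M_k + c[n-k] * I
 *
 * the division is exact for integer matrices. returns 0 on success,
 * -1 if out of memory. overflow is not checked.
 */
static int char_poly_fl(int n, const int *a, long long *c)
{
	long long *m, *am;
	long long trace;
	int i, j, k, t;

	assert(n > 0 && a && c);

	m = calloc((size_t)n * n, sizeof(*m));
	am = calloc((size_t)n * n, sizeof(*am));
	if (!m || !am) {
		free(m);
		free(am);
		return -1;
	}

	c[n] = 1;
	for (i = 0; i < n; i++)
		m[i * n + i] = 1;	/* M_1 = I */

	for (k = 1; k <= n; k++) {
		/* am = A * M_k, accumulating its trace on the way */
		trace = 0;
		for (i = 0; i < n; i++) {
			for (j = 0; j < n; j++) {
				long long sum = 0;

				for (t = 0; t < n; t++)
					sum += (long long)a[i * n + t] * m[t * n + j];
				am[i * n + j] = sum;
			}
			trace += am[i * n + i];
		}

		c[n - k] = -trace / k;

		/* M_(k+1) = A * M_k + c[n-k] * I */
		memcpy(m, am, (size_t)n * n * sizeof(*m));
		for (i = 0; i < n; i++)
			m[i * n + i] += c[n - k];
	}

	free(m);
	free(am);
	return 0;
}
```

For the 2x2 matrix {2, 1, 1, 2} this gives x^2 - 4x + 3, stored as
c[0] = 3, c[1] = -4, c[2] = 1.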

Sasha


From sashak at voltaire.com  Sun Nov 30 08:39:38 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 18:39:38 +0200
Subject: [ofa-general] Re: [PATCH][5] opensm: compute local geometry
In-Reply-To: <20081130163637.GJ9338@sashak.voltaire.com>
References: <003301c9441e$eed2f480$cc78dd80$@com>
	<20081130163637.GJ9338@sashak.voltaire.com>
Message-ID: <20081130163938.GK9338@sashak.voltaire.com>

On 18:36 Sun 30 Nov     , Sasha Khapyorsky wrote:
> On 10:59 Tue 11 Nov     , Robert Pearson wrote:
> > Sasha,
> > 
> > Here is the fifth patch implementing the mesh analysis algorithm.
> > 
> > This patch implements
> >       - routine to compute characteristics polynomial of a matrix
> >       - routine to compute the local 'metric' around each switch
> 
> I checked performance of determinant calculation - when switch has 8
> links it takes 11-12 seconds per switch, with 10 links - 2177 seconds.

Oops, sorry. The results above are for 10 and 12 links.

Sasha


From rpearson at systemfabricworks.com  Sun Nov 30 08:56:36 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Sun, 30 Nov 2008 10:56:36 -0600
Subject: [ofa-general] RE: [PATCH][5] opensm: compute local geometry
In-Reply-To: <20081130163938.GK9338@sashak.voltaire.com>
References: <003301c9441e$eed2f480$cc78dd80$@com>
	<20081130163637.GJ9338@sashak.voltaire.com>
	<20081130163938.GK9338@sashak.voltaire.com>
Message-ID: <00f401c9530c$9bb4a530$d31def90$@com>

I am looking at the earlier posts.

I had thought about this one before. All the cases where this algorithm
applies have low port counts. I can fix this by just not doing the
determinant if the port count is larger than the highest order polynomial in
the table since none of them will match.



From rdreier at cisco.com  Sun Nov 30 09:28:38 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 30 Nov 2008 09:28:38 -0800
Subject: [ofa-general] [RMDA CM IPv6 support. PATCHv4 1/6] AF_INET6
	support for rdma_bind_addr
In-Reply-To: <1228033480.3621.5.camel@alst60.voltaire.com> (Aleksey Senin's
	message of "Sun, 30 Nov 2008 10:24:40 +0200")
References: <1227721899.3121.18.camel@alst60.voltaire.com>
	<ada8wr3cdwe.fsf@cisco.com>
	<1228033480.3621.5.camel@alst60.voltaire.com>
Message-ID: <adafxl9azsp.fsf@cisco.com>

 > But.. All other patches depends one on another, and in my opinion better
 > to apply it all together, otherwise, when separated, all those 'if'
 > statements have no sense.

Umm, OK.  So why did you send 6 separate patches?


From sashak at voltaire.com  Sun Nov 30 09:59:29 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 19:59:29 +0200
Subject: [ofa-general] Re: [PATCH][5] opensm: compute local geometry
In-Reply-To: <20081130152818.GI9338@sashak.voltaire.com>
References: <003301c9441e$eed2f480$cc78dd80$@com>
	<20081130152818.GI9338@sashak.voltaire.com>
Message-ID: <20081130175929.GL9338@sashak.voltaire.com>

On 17:28 Sun 30 Nov     , Sasha Khapyorsky wrote:
> > +
> > +	do {
> > +		if (!(m = m_alloc(p_lash, num_links)))
> > +			break;
> > +
> > +		for (i = 0; i < num_links; i++) {
> > +			sw1 = node->links[i]->switch_id;
> > +			s1 = p_lash->switches[sw1];
> > +
> > +			/* make all distances big except s1 to itself */
> > +			for (sw2 = 0; sw2 < p_lash->num_switches; sw2++)
> > +				p_lash->switches[sw2]->node->temp = 0x7fffffff;
> > +
> > +			s1->node->temp = 0;
> > +
> > +			do {
> > +				change = 0;
> > +
> > +				for (sw2 = 0; sw2 < p_lash->num_switches; sw2++) {
> > +					s2 = p_lash->switches[sw2];
> > +					if (s2->node->temp == 0x7fffffff)
> > +						continue;
> > +					for (j = 0; j < s2->node->num_links; j++) {
> > +						sw3 = s2->node->links[j]->switch_id;
> > +						s3 = p_lash->switches[sw3];
> > +
> > +						if (sw3 == sw)
> > +							continue;
> > +
> > +						if ((s2->node->temp + 1) < s3->node->temp) {
> > +							s3->node->temp = s2->node->temp + 1;
> > +							change++;
> > +						}
> > +					}
> > +				}
> > +			} while(change);
> 
> As far as I can understand, this is a minimal hops calculation.
> 
> We already have this information in the OpenSM switches' lmx matrices. Using
> them, matrix 'm' could be created as:
> 
> 	for (i = 0; i < num_links; i++) {
> 		sw1 = node->links[i]->switch_id;
> 		s1 = p_lash->switches[sw1];
> 
> 		for (j = 0; j < num_links; j++) {
> 			unsigned lid;
> 			sw2 = node->links[j]->switch_id;
> 			s2 = p_lash->switches[sw2];
> 			lid = cl_ntoh16(osm_node_get_base_lid(s2->p_sw->p_node, 0));
> 
> 			m[i][j] = osm_switch_get_least_hops(s1->p_sw, lid);
> 		}
> 	}

Actually, this assumption of mine is wrong. The 'm' matrix contains min hops
counted only along paths that do not cross the original switch. So it should
be done differently, maybe something like this:


	for (i = 0; i < num_links; i++) {
		sw1 = node->links[i]->switch_id;
		s1 = p_lash->switches[sw1];

		for (j = 0; j < num_links; j++) {
			unsigned lid, p, h, hops = 0xff;

			sw2 = node->links[j]->switch_id;

			if (sw1 == sw2) {
				m1[i][j] = 0;
				continue;
			}

			s2 = p_lash->switches[sw2];
			lid = cl_ntoh16(osm_node_get_base_lid(s2->p_sw->p_node, 0));
			for (p = 1 ; p < s1->p_sw->num_ports; p++) {
				h = osm_switch_get_hop_count(s1->p_sw, lid, p);
				osm_physp_t *physp = osm_node_get_physp_ptr(s1->p_sw->p_node, p);
				if (h < hops &&
				    physp->p_remote_physp->p_node->sw != s->p_sw)
					hops = h;
			}
			m1[i][j] = hops;
		}
	}
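
To illustrate what the 'm' matrix holds, here is a self-contained sketch of
the same relaxation on a plain adjacency matrix, independent of the OpenSM
data structures. The name hops_avoiding and the adjacency-matrix input are
assumptions made for the example:

```c
#include <assert.h>
#include <limits.h>

#define DIST_INF INT_MAX	/* "not reached yet", like 0x7fffffff in the patch */

/*
 * hops_avoiding (hypothetical name)
 *
 * fill dist[0..n-1] with the minimum hop count from src to every node
 * along paths that never enter node 'avoid'. adj is an n x n
 * row-major adjacency matrix (adj[u*n + v] != 0 means a link u-v).
 * same relaxation idea as get_switch_metric(): repeat until no
 * distance improves.
 */
static void hops_avoiding(int n, const int *adj, int src, int avoid, int *dist)
{
	int u, v, change;

	assert(n > 0 && adj && dist && src != avoid);

	for (u = 0; u < n; u++)
		dist[u] = DIST_INF;
	dist[src] = 0;

	do {
		change = 0;
		for (u = 0; u < n; u++) {
			if (dist[u] == DIST_INF)
				continue;
			for (v = 0; v < n; v++) {
				if (!adj[u * n + v] || v == avoid)
					continue;
				if (dist[u] + 1 < dist[v]) {
					dist[v] = dist[u] + 1;
					change++;
				}
			}
		}
	} while (change);
}
```

On a ring of five switches 0-1-2-3-4-0, the two neighbours of switch 0 are
two hops apart through switch 0 but three hops apart when switch 0 is
excluded, and the latter is the value that belongs in 'm'.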

Sasha


From sashak at voltaire.com  Sun Nov 30 10:03:07 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 20:03:07 +0200
Subject: [ofa-general] Re: [PATCH][5] opensm: compute local geometry
In-Reply-To: <00f401c9530c$9bb4a530$d31def90$@com>
References: <003301c9441e$eed2f480$cc78dd80$@com>
	<20081130163637.GJ9338@sashak.voltaire.com>
	<20081130163938.GK9338@sashak.voltaire.com>
	<00f401c9530c$9bb4a530$d31def90$@com>
Message-ID: <20081130180307.GM9338@sashak.voltaire.com>

On 10:56 Sun 30 Nov     , Robert Pearson wrote:
> 
> I had thought about this one before. All the cases where this algorithm
> applies have low port counts. I can fix this by just not doing the
> determinant if the port count is larger than the highest order polynomial in
> the table since none of them will match.

I think it would be a nice fix.

Sasha


From rpearson at systemfabricworks.com  Sun Nov 30 10:39:53 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Sun, 30 Nov 2008 12:39:53 -0600
Subject: [ofa-general] [PATCH][11] opensm: add descriptions to docs and man
	page
Message-ID: <00f601c9531b$096c88a0$1c4599e0$@com>

Sasha,

This patch adds some descriptive language to current_routing.txt and
opensm.8.in.

Regards,

Bob Pearson

Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch11
Type: application/octet-stream
Size: 2680 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081130/9c59018b/attachment.obj>

From rpearson at systemfabricworks.com  Sun Nov 30 10:51:53 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Sun, 30 Nov 2008 12:51:53 -0600
Subject: [ofa-general] [PATCH][12] opensm: add descriptions to show_usage
Message-ID: <010001c9531c$b636e390$22a4aab0$@com>

Sasha,

This patch adds language to show_usage for the --do_mesh_analysis flag.

Regards,

Bob Pearson

Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch12
Type: application/octet-stream
Size: 952 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20081130/43f09329/attachment.obj>

From rpearson at systemfabricworks.com  Sun Nov 30 10:58:49 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Sun, 30 Nov 2008 12:58:49 -0600
Subject: [ofa-general] RE: [PATCH] opensm: skeleton for toroidal mesh
	analysis
In-Reply-To: <20081130133026.GE9338@sashak.voltaire.com>
References: <000001c943c8$fef921f0$fceb65d0$@com>
	<20081130133026.GE9338@sashak.voltaire.com>
Message-ID: <010a01c9531d$ae434150$0ac9c3f0$@com>

You wrote:
> @@ -872,10 +821,15 @@ static int lash_core(lash_t * p_lash)
>  	int output_link2, i_next_switch2;
>  	int cycle_found2 = 0;
>  	int status = 0;
> -	int *switch_bitmap;	/* Bitmap to check if we have processed this pair */
> +	int *switch_bitmap = NULL;	/* Bitmap to check if we have processed this pair */

> Why is this initialization needed?

The added code can fail which will cause a goto to Exit. At Exit
switch_bitmap is freed if it is not zero. The added initialization makes
sure it is zero.
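
A stripped-down sketch of the pattern, not taken from osm_ucast_lash.c. The
function build_tables and its fail_early flag are hypothetical and only
stand in for the added code that can fail before switch_bitmap is
allocated:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * build_tables (hypothetical function, not from osm_ucast_lash.c)
 *
 * every pointer released at Exit starts out NULL, so an early failure
 * can jump there before the later allocations have happened.
 * fail_early simulates the added code failing.
 */
static int build_tables(int n, int fail_early)
{
	int *first = NULL;
	int *bitmap = NULL;	/* without "= NULL" an early goto frees garbage */
	int status = -1;

	assert(n > 0);

	if (!(first = calloc(n, sizeof(int))))
		goto Exit;

	if (fail_early)
		goto Exit;	/* bitmap has not been allocated yet */

	if (!(bitmap = calloc(n, sizeof(int))))
		goto Exit;

	status = 0;
Exit:
	free(bitmap);	/* free(NULL) is a no-op */
	free(first);
	return status;
}
```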


From sashak at voltaire.com  Sun Nov 30 11:07:52 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 21:07:52 +0200
Subject: [ofa-general] Re: [PATCH] opensm: skeleton for toroidal mesh
	analysis
In-Reply-To: <010a01c9531d$ae434150$0ac9c3f0$@com>
References: <000001c943c8$fef921f0$fceb65d0$@com>
	<20081130133026.GE9338@sashak.voltaire.com>
	<010a01c9531d$ae434150$0ac9c3f0$@com>
Message-ID: <20081130190752.GN9338@sashak.voltaire.com>

On 12:58 Sun 30 Nov     , Robert Pearson wrote:
> You wrote:
> > @@ -872,10 +821,15 @@ static int lash_core(lash_t * p_lash)
> >  	int output_link2, i_next_switch2;
> >  	int cycle_found2 = 0;
> >  	int status = 0;
> > -	int *switch_bitmap;	/* Bitmap to check if we have processed this pair */
> > +	int *switch_bitmap = NULL;	/* Bitmap to check if we have processed this pair */
> 
> > Why is this initialization needed?
> 
> The added code can fail which will cause a goto to Exit. At Exit
> switch_bitmap is freed if it is not zero. The added initialization makes
> sure it is zero.

Ok. I missed that.

Sasha


From rpearson at systemfabricworks.com  Sun Nov 30 11:24:45 2008
From: rpearson at systemfabricworks.com (Robert Pearson)
Date: Sun, 30 Nov 2008 13:24:45 -0600
Subject: [ofa-general] RE: [PATCH][3] opensm: per mesh node information
In-Reply-To: <20081130134857.GF9338@sashak.voltaire.com>
References: <000501c943d4$57b3f8f0$071bead0$@com>
	<20081130134857.GF9338@sashak.voltaire.com>
Message-ID: <010b01c95321$4e962a20$ebc27e60$@com>

Hi Sasha

You wrote:

> +	if (!(node->links = calloc(num_ports, sizeof(link_t *))))
> +		goto err;
> +
> +	for (i = 0; i < num_ports; i++) {
> +		if (!(node->links[i] = calloc(1, sizeof(link_t))) ||
> +		    !(node->links[i]->ports = calloc(num_ports, sizeof(int))))
> +			goto err;
> +	}

Assuming that ports array is preallocated, wouldn't it be simpler to
define link as:

typedef struct _link {
	int switch_id;
	int link_id;
	int num_ports;
	int next_port;
	int ports[0];
} link_t;

, and then:

	node->links[i] = calloc(1, sizeof(link_t) + num_ports * sizeof(int))

?

(Similar optimizations are probably relevant in other places).
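
A compilable sketch of that single-allocation layout, using the C99
flexible array member spelling ports[] in place of ports[0]. The struct tag
mesh_link and the helper link_alloc are made up for the example; the field
names are the ones quoted above:

```c
#include <assert.h>
#include <stdlib.h>

/* mirrors the fields quoted above; C99 spells the trailing array ports[] */
typedef struct mesh_link {
	int switch_id;
	int link_id;
	int num_ports;
	int next_port;
	int ports[];	/* flexible array member, sized at allocation time */
} link_t;

/*
 * link_alloc (hypothetical helper)
 *
 * allocate a zeroed link with room for num_ports port numbers in the
 * same block. the size is based on sizeof(link_t), the struct itself,
 * not on the size of a pointer to it.
 */
static link_t *link_alloc(int num_ports)
{
	assert(num_ports >= 0);

	return calloc(1, sizeof(link_t) + (size_t)num_ports * sizeof(int));
}
```

Freeing a link is then a single free(), with no separate free() for the
ports array.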

I agree they accomplish the same goal. It is a tradeoff between code that is
a little shorter and faster and ease of understanding. I don't have strong
feelings. (For the same reason I tend to use 'x = calloc(1, foo)' instead of
'x = malloc(foo); memset(x, 0, foo);' which is a very common usage pattern.)

The same applies to your later note. We can represent a two dimensional
array as

int **array;

followed by array = calloc(1, n*sizeof(int *));
		array[i] = calloc(1, m*sizeof(int)); ...

and then you get to type

		array[i][j] = xxx;

vs

int *array;

array = calloc(1, m*n*sizeof(int));

and then

array[i*m+j] = xxx;

You can't use array[i][j] here because the compiler doesn't know the size of
the array until run time.
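
The single-block layout with explicit indexing looks like the first
function below. If the build can assume a C99 compiler with variable length
array support (an assumption, not something checked against the OpenSM
build), a pointer to a row of run-time length is a third option that keeps
the [i][j] notation on one allocation. The helper names fill_flat and
fill_rows are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * fill_flat / fill_rows (hypothetical helpers)
 *
 * store i*10 + j at position (i, j) of an n x m array, once with a
 * single block and explicit indexing, once through a C99 pointer to
 * rows of run-time length m. both return one block to free().
 */
static int *fill_flat(int n, int m)
{
	int *array;
	int i, j;

	assert(n > 0 && m > 0);

	if (!(array = calloc((size_t)n * m, sizeof(int))))
		return NULL;

	for (i = 0; i < n; i++)
		for (j = 0; j < m; j++)
			array[i * m + j] = i * 10 + j;

	return array;
}

static void *fill_rows(int n, int m)
{
	int (*array)[m];	/* pointer to a row of m ints (C99 VLA type) */
	int i, j;

	assert(n > 0 && m > 0);

	if (!(array = calloc(n, sizeof(*array))))
		return NULL;

	for (i = 0; i < n; i++)
		for (j = 0; j < m; j++)
			array[i][j] = i * 10 + j;

	return array;
}
```

Both functions produce the same memory layout, so the choice is only about
how the indexing reads in the source.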


If the code is at all complex I prefer the [][] notation because it is
easier to read and understand. The optimizer in the compiler will take the
pointer dereference or the multiply out of inner loops so there is not
normally a big performance difference.

I guess that this code is complex enough that at least for now it is
preferable to err on the side of keeping everything as straightforward as
possible until we are sure that it is correct. Then if performance is an
issue we can optimize it.

I am happy either way. Let me know what you want me to do.

Regards,

Bob


From sashak at voltaire.com  Sun Nov 30 12:57:53 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 22:57:53 +0200
Subject: [ofa-general] Re: [PATCH][8] opensm: measure size and reorder links
In-Reply-To: <004501c94424$23551620$69ff4260$@com>
References: <004501c94424$23551620$69ff4260$@com>
Message-ID: <20081130205753.GO9338@sashak.voltaire.com>

On 11:37 Tue 11 Nov     , Robert Pearson wrote:
> Sasha,
> 
> Here is the eighth patch implementing the mesh analysis algorithm.
> 
> This patch implements
>       - routine to reorder links and measure the size of the mesh
> 
> Regards,
> 
> Bob Pearson
> 
> Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
> ----
> diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
> index 65afae6..a248522 100644
> --- a/opensm/opensm/osm_mesh.c
> +++ b/opensm/opensm/osm_mesh.c
> @@ -832,6 +832,183 @@ next_j:
>  }
>  
>  /*
> + * return |a| < |b|
> + */
> +static inline int ltmag(int a, int b)
> +{
> +     int a1 = (a >= 0)? a : -a;
> +     int b1 = (b >= 0)? b : -b;
> +
> +     return (a1 < b1) || (a1 == b1 && a > b);
> +}
> +
> +/*
> + * reorder_links
> + *
> + * reorder the links out of a switch in sign/dimension order
> + */
> +static int reorder_links(lash_t *p_lash, int sw)
> +{
> +     osm_log_t *p_log = &p_lash->p_osm->log;
> +     switch_t *s = p_lash->switches[sw];
> +     mesh_node_t *node = s->node;
> +     int n = node->num_links;
> +     link_t **links;
> +     int *axes;
> +     int i, j;
> +     int c;
> +     int next = 0;
> +
> +     if (!(links = calloc(n, sizeof(link_t *)))) {
> +           OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating temp array - out of memory\n");
> +           return -1;
> +     }
> +
> +     if (!(axes = calloc(n, sizeof(int)))) {
> +           free(links);
> +           OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating temp array - out of memory\n");
> +           return -1;
> +     }
> +
> +     /*
> +     * find the links with axes
> +     */
> +     for (j = 1; j <= 2*node->dimension; j++) {
> +           c = j;
> +           if (node->coord[(c-1)/2] > 0)
> +                 c = opposite(s, c);
> +
> +           for (i = 0; i < n; i++) {
> +                 if (!node->links[i])
> +                       continue;
> +                 if (node->axes[i] == c) {
> +                       links[next] = node->links[i];
> +                       axes[next] = node->axes[i];
> +                       node->links[i] = NULL;
> +                       next++;
> +                 }
> +           }
> +     }
> +
> +     /*
> +     * get the rest
> +     */
> +     for (i = 0; i < n; i++) {
> +           if (!node->links[i])
> +                 continue;
> +
> +           links[next] = node->links[i];
> +           axes[next] = node->axes[i];
> +           node->links[i] = NULL;
> +           next++;
> +     }
> +
> +     for (i = 0; i < n; i++) {
> +           node->links[i] = links[i];
> +           node->axes[i] = axes[i];
> +     }
> +
> +     free(links);
> +     free(axes);
> +
> +     return 0;
> +}
> +
> +/*
> + * measure geometry
> + */
> +static int measure_geometry(lash_t *p_lash, int seed)
> +{
> +     int i, j, k;
> +     int sw;
> +     switch_t *s, *s1;
> +     int change;
> +     int dimension = p_lash->mesh->dimension;
> +     int num_switches = p_lash->num_switches;
> +     int assigned_axes = 0, unassigned_axes = 0;
> +     int *max, *min;
> +
> +     for (sw = 0; sw < num_switches; sw++) {
> +           s = p_lash->switches[sw];
> +
> +           s->node->coord = calloc(dimension, sizeof(int));

Is there a free() for this anywhere? I cannot find one.

> +           for (i = 0; i < dimension; i++)
> +                 s->node->coord[i] = (sw == seed)? 0 : 0x7fffffff;
> +
> +           for (i = 0; i < s->node->num_links; i++)
> +                 if (s->node->axes[i] == 0)
> +                       unassigned_axes++;
> +                 else
> +                       assigned_axes++;
> +     }
> +
> +     printf("lash: %d/%d unassigned/assigned axes\n", unassigned_axes, assigned_axes);
> +
> +     do {
> +           change = 0;
> +
> +           for (sw = 0; sw < num_switches; sw++) {
> +                 s = p_lash->switches[sw];
> +
> +                 if (s->node->coord[0] == 0x7fffffff)
> +                       continue;
> +
> +                 for (j = 0; j < s->node->num_links; j++) {
> +                       if (!s->node->axes[j])
> +                             continue;
> +
> +                       s1 = p_lash->switches[s->node->links[j]->switch_id];
> +
> +                       for (k = 0; k < dimension; k++) {
> +                             int coord = s->node->coord[k];
> +                             int axis = s->node->axes[j] - 1;
> +
> +                             if (k == axis/2)
> +                                   coord += (axis & 1)? -1 : +1;
> +
> +                             if (ltmag(coord, s1->node->coord[k])) {
> +                                   s1->node->coord[k] = coord;
> +                                   change++;
> +                             }
> +                       }
> +                 }
> +           }
> +     } while (change);
> +
> +     for (sw = 0; sw < num_switches; sw++) {
> +           if (reorder_links(p_lash, sw))
> +                 return -1;
> +     }
> +
> +     max = calloc(dimension, sizeof(int));
> +     min = calloc(dimension, sizeof(int));

Are min and max freed?
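
For example, the min/max part could be arranged so that the temporaries are
released on every path. This is only a sketch on plain arrays: the helper
name mesh_extent and the row-major coord input are assumptions, and the
0x7fffffff "unreached" check from the patch is left out:

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/*
 * mesh_extent (hypothetical helper)
 *
 * coord holds num_nodes * dimension coordinates, row-major. store
 * max - min + 1 per dimension in size[]. the temporary min/max
 * arrays are released on every path. returns 0, or -1 if out of memory.
 */
static int mesh_extent(int num_nodes, int dimension, const int *coord, int *size)
{
	int *max = NULL, *min = NULL;
	int ret = -1;
	int i, sw;

	assert(num_nodes > 0 && dimension > 0 && coord && size);

	max = calloc(dimension, sizeof(int));
	min = calloc(dimension, sizeof(int));
	if (!max || !min)
		goto out;

	for (i = 0; i < dimension; i++) {
		max[i] = INT_MIN;
		min[i] = INT_MAX;
	}

	for (sw = 0; sw < num_nodes; sw++) {
		for (i = 0; i < dimension; i++) {
			int c = coord[sw * dimension + i];

			if (c > max[i])
				max[i] = c;
			if (c < min[i])
				min[i] = c;
		}
	}

	for (i = 0; i < dimension; i++)
		size[i] = max[i] - min[i] + 1;

	ret = 0;
out:
	free(min);	/* free(NULL) is a no-op */
	free(max);
	return ret;
}
```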

Sasha

> +     p_lash->mesh->size = calloc(dimension, sizeof(int));
> +
> +     for (i = 0; i < dimension; i++) {
> +           max[i] = -0x7fffffff;
> +           min[i] = 0x7fffffff;
> +     }
> +
> +     for (sw = 0; sw < num_switches; sw++) {
> +           s = p_lash->switches[sw];
> +
> +           for (i = 0; i < dimension; i++) {
> +                 if (s->node->coord[i] == 0x7fffffff)
> +                       continue;
> +                 if (s->node->coord[i] > max[i])
> +                       max[i] = s->node->coord[i];
> +                 if (s->node->coord[i] < min[i])
> +                       min[i] = s->node->coord[i];
> +           }
> +     }
> +
> +     for (i = 0; i < dimension; i++)
> +           p_lash->mesh->size[i] = max[i] - min[i] + 1;
> +
> +     return 0;
> +}
> +
> +/*
>   * osm_mesh_cleanup - free per mesh resources
>   */
>  void osm_mesh_cleanup(lash_t *p_lash)
> @@ -941,6 +1118,14 @@ int osm_do_mesh_analysis(lash_t *p_lash)
>  
>       if (s->node->type) {
>             make_geometry(p_lash, max_class_type);
> +
> +           if (measure_geometry(p_lash, max_class_type))
> +                 return -1;
> +
> +           printf("lash: found ");
> +           for (i = 0; i < mesh->dimension; i++)
> +                 printf("%s%d", i? "X" : "", mesh->size[i]);
> +           printf(" mesh\n");
>       }
>  
>       OSM_LOG_EXIT(p_log);
>  
> 


From sashak at voltaire.com  Sun Nov 30 13:09:11 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 23:09:11 +0200
Subject: [ofa-general] Re: [PATCH][9] opensm: lash preparation
In-Reply-To: <008701c9443c$cfc1f050$6f45d0f0$@com>
References: <008701c9443c$cfc1f050$6f45d0f0$@com>
Message-ID: <20081130210911.GP9338@sashak.voltaire.com>

On 14:33 Tue 11 Nov     , Robert Pearson wrote:
> Sasha,
> 
> Here is the ninth patch implementing the mesh analysis algorithm.
> 
> This patch makes some minor cleanups in osm_ucast_lash.c in preparation for
> next steps.
> The main change is to minimize the occurrences of phys_connections.
> Also there are a few nits:
>       - delete banner for local variables that moved to ...lash.h
>       - fix bad return value of osm_mesh_node_create fails

I think it should be fixed in related patches (v2), so we will not have
broken code in our history.

>       - clear sw->p_sw->priv on switch cleanup
>       - fix spelling error in comment
>       - discover_network_properties returns an error which was not checked

Actually most of those (and maybe also the get_next_port() function) are not
really part of the mesh changes. I'm fine with getting them separately and
applying them even before GA, since they fix something.

Sasha

> 
> Regards,
> 
> Bob Pearson
> 
> Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
> ----
> diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
> index b9394af..95dbcc2 100644
> --- a/opensm/opensm/osm_ucast_lash.c
> +++ b/opensm/opensm/osm_ucast_lash.c
> @@ -55,10 +55,6 @@
>  #include <opensm/osm_mesh.h>
>  #include <opensm/osm_ucast_lash.h>
>  
> -/* //////////////////////////// */
> -/*  Local types                 */
> -/* //////////////////////////// */
> -
>  static cdg_vertex_t *create_cdg_vertex(unsigned num_switches)
>  {
>  	cdg_vertex_t *cdg_vertex = (cdg_vertex_t *)
> malloc(sizeof(cdg_vertex_t));
> @@ -150,6 +146,11 @@ static int cycle_exists(cdg_vertex_t * start,
> cdg_vertex_t * current,
>  	return cycle_found;
>  }
>  
> +static inline int get_next_switch(lash_t *p_lash, int sw, int link)
> +{
> +	return p_lash->switches[sw]->phys_connections[link];
> +}
> +
>  static void remove_semipermanent_depend_for_sp(lash_t * p_lash, int sw,
>  					       int dest_switch, int lane)
>  {
> @@ -161,7 +162,7 @@ static void remove_semipermanent_depend_for_sp(lash_t *
> p_lash, int sw,
>  	int found;
>  
>  	output_link = switches[sw]->routing_table[dest_switch].out_link;
> -	i_next_switch = switches[sw]->phys_connections[output_link];
> +	i_next_switch = get_next_switch(p_lash, sw, output_link);
>  
>  	while (sw != dest_switch) {
>  		v = cdg_vertex_matrix[lane][sw][i_next_switch];
> @@ -177,8 +178,7 @@ static void remove_semipermanent_depend_for_sp(lash_t *
> p_lash, int sw,
>  			if (i_next_switch != dest_switch) {
>  				next_link =
>  
> switches[i_next_switch]->routing_table[dest_switch].out_link;
> -				i_next_next_switch =
> -
> switches[i_next_switch]->phys_connections[next_link];
> +				i_next_next_switch = get_next_switch(p_lash,
> i_next_switch, next_link);
>  				found = 0;
>  
>  				for (i = 0; i < v->num_dependencies; i++)
> @@ -211,8 +211,7 @@ static void remove_semipermanent_depend_for_sp(lash_t *
> p_lash, int sw,
>  		output_link =
> switches[sw]->routing_table[dest_switch].out_link;
>  
>  		if (sw != dest_switch)
> -			i_next_switch =
> -			    switches[sw]->phys_connections[output_link];
> +			i_next_switch = get_next_switch(p_lash, sw,
> output_link);
>  	}
>  }
>  
> @@ -312,7 +311,7 @@ static void generate_cdg_for_sp(lash_t * p_lash, int sw,
> int dest_switch,
>  	cdg_vertex_t *v, *prev = NULL;
>  
>  	output_link = switches[sw]->routing_table[dest_switch].out_link;
> -	next_switch = switches[sw]->phys_connections[output_link];
> +	next_switch = get_next_switch(p_lash, sw, output_link);
>  
>  	while (sw != dest_switch) {
>  
> @@ -368,7 +367,7 @@ static void generate_cdg_for_sp(lash_t * p_lash, int sw,
> int dest_switch,
>  
>  		if (sw != dest_switch) {
>  			CL_ASSERT(output_link != NONE);
> -			next_switch =
> switches[sw]->phys_connections[output_link];
> +			next_switch = get_next_switch(p_lash, sw,
> output_link);
>  		}
>  
>  		prev = v;
> @@ -384,7 +383,7 @@ static void set_temp_depend_to_permanent_for_sp(lash_t *
> p_lash, int sw,
>  	cdg_vertex_t *v;
>  
>  	output_link = switches[sw]->routing_table[dest_switch].out_link;
> -	next_switch = switches[sw]->phys_connections[output_link];
> +	next_switch = get_next_switch(p_lash, sw, output_link);
>  
>  	while (sw != dest_switch) {
>  		v = cdg_vertex_matrix[lane][sw][next_switch];
> @@ -399,8 +398,7 @@ static void set_temp_depend_to_permanent_for_sp(lash_t *
> p_lash, int sw,
>  		output_link =
> switches[sw]->routing_table[dest_switch].out_link;
>  
>  		if (sw != dest_switch)
> -			next_switch =
> -			    switches[sw]->phys_connections[output_link];
> +			next_switch = get_next_switch(p_lash, sw,
> output_link);
>  	}
>  
>  }
> @@ -414,7 +412,7 @@ static void remove_temp_depend_for_sp(lash_t * p_lash,
> int sw, int dest_switch,
>  	cdg_vertex_t *v;
>  
>  	output_link = switches[sw]->routing_table[dest_switch].out_link;
> -	next_switch = switches[sw]->phys_connections[output_link];
> +	next_switch = get_next_switch(p_lash, sw, output_link);
>  
>  	while (sw != dest_switch) {
>  		v = cdg_vertex_matrix[lane][sw][next_switch];
> @@ -439,8 +437,7 @@ static void remove_temp_depend_for_sp(lash_t * p_lash,
> int sw, int dest_switch,
>  		output_link =
> switches[sw]->routing_table[dest_switch].out_link;
>  
>  		if (sw != dest_switch)
> -			next_switch =
> -			    switches[sw]->phys_connections[output_link];
> +			next_switch = get_next_switch(p_lash, sw,
> output_link);
>  
>  	}
>  }
> @@ -502,10 +499,10 @@ static void balance_virtual_lanes(lash_t * p_lash,
> unsigned lanes_needed)
>  		generate_cdg_for_sp(p_lash, dest, src, min_filled_lane);
>  
>  		output_link =
> p_lash->switches[src]->routing_table[dest].out_link;
> -		next_switch =
> p_lash->switches[src]->phys_connections[output_link];
> +		next_switch = get_next_switch(p_lash, src, output_link);
>  
>  		output_link2 =
> p_lash->switches[dest]->routing_table[src].out_link;
> -		next_switch2 =
> p_lash->switches[dest]->phys_connections[output_link2];
> +		next_switch2 = get_next_switch(p_lash, dest, output_link2);
>  
>  
> CL_ASSERT(cdg_vertex_matrix[min_filled_lane][src][next_switch] != NULL);
>  
> CL_ASSERT(cdg_vertex_matrix[min_filled_lane][dest][next_switch2] != NULL);
> @@ -652,7 +649,7 @@ static switch_t *switch_create(lash_t * p_lash, unsigned
> id, osm_switch_t * p_sw
>  	}
>  
>  	if (osm_mesh_node_create(p_lash, sw))
> -		return -1;
> +		return NULL;
>  
>  	sw->p_sw = p_sw;
>  	if (p_sw)
> @@ -673,6 +670,8 @@ static void switch_delete(switch_t * sw)
>  		free(sw->phys_connections);
>  	if (sw->routing_table)
>  		free(sw->routing_table);
> +	if (sw->p_sw)
> +		sw->p_sw->priv = NULL;
>  	free(sw);
>  }
>  
> @@ -875,9 +874,8 @@ static int lash_core(lash_t * p_lash)
>  					output_link2 =
>  
> switches[dest_switch]->routing_table[i].out_link;
>  
> -					i_next_switch =
> switches[i]->phys_connections[output_link];
> -					i_next_switch2 =
> -
> switches[dest_switch]->phys_connections[output_link2];
> +					i_next_switch =
> get_next_switch(p_lash, i, output_link);
> +					i_next_switch2 =
> get_next_switch(p_lash, dest_switch, output_link2);
>  
>  					CL_ASSERT(p_lash->
>  
> cdg_vertex_matrix[v_lane][i][i_next_switch] !=
> @@ -1205,7 +1203,7 @@ static void process_switches(lash_t * p_lash)
>  	osm_switch_t *p_sw, *p_next_sw;
>  	osm_subn_t *p_subn = &p_lash->p_osm->subn;
>  
> -	/* Go through each swithc and process it. i.e build the connection
> +	/* Go through each switch and process it. i.e build the connection
>  	   structure required by LASH */
>  	p_next_sw = (osm_switch_t *) cl_qmap_head(&p_subn->sw_guid_tbl);
>  	while (p_next_sw != (osm_switch_t *)
> cl_qmap_end(&p_subn->sw_guid_tbl)) {
> @@ -1229,7 +1227,9 @@ static int lash_process(void *context)
>  	// everything starts here
>  	lash_cleanup(p_lash);
>  
> -	discover_network_properties(p_lash);
> +	return_status = discover_network_properties(p_lash);
> +	if (return_status != IB_SUCCESS)
> +		goto Exit;
>  
>  	return_status = init_lash_structures(p_lash);
>  	if (return_status != IB_SUCCESS)
> 
> 


From alekseys at voltaire.com  Sun Nov 30 13:13:06 2008
From: alekseys at voltaire.com (Aleksey Senin)
Date: Sun, 30 Nov 2008 23:13:06 +0200
Subject: [ofa-general] [RMDA CM IPv6 support. PATCHv4 1/6] AF_INET6
	support for rdma_bind_addr
References: <1227721899.3121.18.camel@alst60.voltaire.com><ada8wr3cdwe.fsf@cisco.com><1228033480.3621.5.camel@alst60.voltaire.com>
	<adafxl9azsp.fsf@cisco.com>
Message-ID: <39C75744D164D948A170E9792AF8E7CA428B8F@exil.voltaire.com>


This is my first kernel patch, and I thought that smaller pieces would be
better, but now I'd say that one fat patch, where all parts must be applied
together in order to work, is more suitable for such a change.
-----Original Message-----
From: Roland Dreier [mailto:rdreier at cisco.com]
Sent: Sun 11/30/2008 7:28 PM
To: Aleksey Senin
Cc: general at lists.openfabrics.org
Subject: Re: [ofa-general] [RMDA CM IPv6 support. PATCHv4 1/6] AF_INET6 support for rdma_bind_addr
 
 > But all the other patches depend on one another, and in my opinion it is
 > better to apply them all together; otherwise, when separated, all those
 > 'if' statements make no sense.

Umm, OK.  So why did you send 6 separate patches?


From sashak at voltaire.com  Sun Nov 30 13:15:19 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 23:15:19 +0200
Subject: [ofa-general] Re: [PATCH][3] opensm: per mesh node information
In-Reply-To: <010b01c95321$4e962a20$ebc27e60$@com>
References: <000501c943d4$57b3f8f0$071bead0$@com>
	<20081130134857.GF9338@sashak.voltaire.com>
	<010b01c95321$4e962a20$ebc27e60$@com>
Message-ID: <20081130211519.GQ9338@sashak.voltaire.com>

On 13:24 Sun 30 Nov     , Robert Pearson wrote:
> Hi Sasha
> 
> You wrote:
> 
> > +	if (!(node->links = calloc(num_ports, sizeof(link_t *))))
> > +		goto err;
> > +
> > +	for (i = 0; i < num_ports; i++) {
> > +		if (!(node->links[i] = calloc(1, sizeof(link_t))) ||
> > +		    !(node->links[i]->ports = calloc(num_ports,
> > sizeof(int))))
> > +			goto err;
> > +	}
> 
> Assuming that ports array is preallocated, wouldn't it be simpler to
> define link as:
> 
> typedef struct _link {
> 	int switch_id;
> 	int link_id;
> 	int num_ports;
> 	int next_port;
> 	int ports[0];
> } link_t;
> 
> , and then:
> 
> 	node->links[i] = calloc(1, sizeof(link_t) + num_ports * sizeof(int));
> 
> ?
> 
> (Similar optimizations are probably relevant in other places).
> 
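
A minimal sketch of the single-allocation layout discussed above, assuming
the link_t fields quoted in this thread. link_alloc() is a hypothetical
helper, not part of the patch, and the sketch uses the C99 flexible array
member ports[] rather than the GNU ports[0] extension. The size passed to
calloc() is built from sizeof(link_t), the structure itself, not from the
size of a pointer to it.

```c
#include <stdlib.h>

/* Field names follow the link_t quoted in this thread. */
typedef struct _link {
	int switch_id;
	int link_id;
	int num_ports;
	int next_port;
	int ports[];		/* C99 flexible array member */
} link_t;

/*
 * link_alloc is a hypothetical helper: one zeroed block holds the
 * header plus room for max_ports port numbers. Returns NULL on failure.
 */
static link_t *link_alloc(size_t max_ports)
{
	return calloc(1, sizeof(link_t) + max_ports * sizeof(int));
}
```

With this layout a single free(link) releases both the header and the port
table, so the error path needs one call fewer per link.
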
> I agree they accomplish the same goal. It is a tradeoff between code that is
> a little shorter and faster and ease of understanding. I don't have strong
> feelings. (For the same reason I tend to use 'x = calloc(1, foo)' instead of
> 'x = malloc(foo); memset(x, 0, foo);' which is a very common usage pattern.)
> 
> The same applies to your later note. We can represent a two dimensional
> array as
> 
> int **array;
> 
> followed by array = calloc(1, n*sizeof(int *));
> 		array[i] = calloc(1, m*sizeof(int)); ...
> 
> and then you get to type
> 
> 		array[i][j] = xxx;
> 
> vs
> 
> int *array;
> 
> array = calloc(1, m*n*sizeof(int));
> 
> and then
> 
> array[i*m+j] = xxx;
> 
> You can't use array[i][j] here because the compiler doesn't know the size of
> the array until run time.
> 
> 
> If the code is at all complex I prefer the [][] notation because it is
> easier to read and understand. The optimizer in the compiler will take the
> pointer dereference or the multiply out of inner loops so there is not
> normally a big performance difference.
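
The two layouts compared above can be sketched side by side. This is only
an illustration under stated assumptions: rows_alloc(), rows_free() and
flat_alloc() are hypothetical names, not OpenSM functions.

```c
#include <stdlib.h>

/* Variant 1: array of row pointers, indexed as a[i][j]. */
static int **rows_alloc(size_t n, size_t m)
{
	int **a = calloc(n, sizeof(int *));
	size_t i;

	if (!a)
		return NULL;
	for (i = 0; i < n; i++) {
		if (!(a[i] = calloc(m, sizeof(int)))) {
			/* undo the rows allocated so far */
			while (i--)
				free(a[i]);
			free(a);
			return NULL;
		}
	}
	return a;
}

static void rows_free(int **a, size_t n)
{
	size_t i;

	if (!a)
		return;
	for (i = 0; i < n; i++)
		free(a[i]);
	free(a);
}

/* Variant 2: one flat block, indexed as a[i * m + j]. */
static int *flat_alloc(size_t n, size_t m)
{
	return calloc(n * m, sizeof(int));
}
```

The first variant costs n + 1 allocations and a matching cleanup loop but
keeps the a[i][j] notation; the second needs one calloc() and one free()
at the price of writing the index arithmetic by hand.
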
> 
> I guess that this code is complex enough that at least for now it is
> preferable to err on the side of keeping everything as straightforward as
> possible until we are sure that it is correct. Then if performance is an
> issue we can optimize it.
> 
> I am happy either way. Let me know what you want me to do.

OK. Let's leave it for now and look at it later.

Sasha


From rdreier at cisco.com  Sun Nov 30 13:20:37 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 30 Nov 2008 13:20:37 -0800
Subject: [ofa-general] [RMDA CM IPv6 support. PATCHv4 1/6] AF_INET6
	support for rdma_bind_addr
In-Reply-To: <39C75744D164D948A170E9792AF8E7CA428B8F@exil.voltaire.com>
	(Aleksey Senin's message of "Sun, 30 Nov 2008 23:13:06 +0200")
References: <1227721899.3121.18.camel@alst60.voltaire.com>
	<ada8wr3cdwe.fsf@cisco.com>
	<1228033480.3621.5.camel@alst60.voltaire.com>
	<adafxl9azsp.fsf@cisco.com>
	<39C75744D164D948A170E9792AF8E7CA428B8F@exil.voltaire.com>
Message-ID: <ada7i6kc3mi.fsf@cisco.com>

    Aleksey> This is my first kernel patch, and I thought that smaller
    Aleksey> pieces would be better, but now I'd say that one fat patch,
    Aleksey> where all parts must be applied together in order to work,
    Aleksey> is more suitable for such a change.

Yes, it's a tricky balance.  You don't want to combine multiple ideas in
one patch, because such patches are hard to review and hard to debug
later.  But splitting one idea into multiple patches also causes
similar problems (and you have to get the pieces in the right order
too).  And of course the whole question of what constitutes an "idea" is
rather subjective.  So we just do the best we can.

 - R.


From sashak at voltaire.com  Sun Nov 30 13:34:02 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 30 Nov 2008 23:34:02 +0200
Subject: [ofa-general] Re: [PATCH][10] opensm: hook mesh code into lash
	(updated)
In-Reply-To: <00ad01c9444e$96e5f300$c4b1d900$@com>
References: <00ad01c9444e$96e5f300$c4b1d900$@com>
Message-ID: <20081130213401.GR9338@sashak.voltaire.com>

On 16:41 Tue 11 Nov     , Robert Pearson wrote:
> Sasha,
> 
> Here is the tenth patch implementing the mesh analysis algorithm.
> I am resending it because I inadvertently left a bug in the last version.
> 
> This patch
>       - hooks mesh code into lash
>       - replaces sw->phys_connections by the equivalent switch->node->links
>       - replaces sw->num_connections by the equivalent
> switch->node->num_links
>       - replaces sw->virtual_physical_port_table by
> switch->node->links[]->ports
> 
> When the do_mesh_analysis flag is not set there is no change to the function
> except to replace the variables with variables in node that have the same
> size. In this case the port table in link_t will always have just one port.
> 
> When the do_mesh_analysis flag is set multiple physical links will collapse
> to a single logical link with a port list with more than one element.
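
The collapsing behaviour described above can be sketched roughly as
follows. This is a simplified illustration, not the patch code: the
sketch_link and sketch_node structures, the fixed array bounds and
sketch_connect() are all assumptions made for the example, and the
self-connection check from the patch is left out.

```c
#include <assert.h>

#define MAX_LINKS 8
#define MAX_PORTS 8

struct sketch_link {
	int switch_id;			/* neighbouring switch */
	int num_ports;			/* entries used in ports[] */
	int ports[MAX_PORTS];		/* physical ports that reach it */
};

struct sketch_node {
	int num_links;
	struct sketch_link links[MAX_LINKS];
};

/*
 * Record that physical port `port` leads to switch `to`. With `collapse`
 * set, parallel connections to the same neighbour share one logical link;
 * without it, every connection gets its own link with a single port.
 */
static void sketch_connect(struct sketch_node *n, int to, int port,
			   int collapse)
{
	struct sketch_link *l;
	int i;

	if (collapse) {
		for (i = 0; i < n->num_links; i++) {
			l = &n->links[i];
			if (l->switch_id == to) {
				assert(l->num_ports < MAX_PORTS);
				l->ports[l->num_ports++] = port;
				return;
			}
		}
	}

	assert(n->num_links < MAX_LINKS);
	l = &n->links[n->num_links++];
	l->switch_id = to;
	l->num_ports = 0;
	l->ports[l->num_ports++] = port;
}
```

Two cables to the same neighbour therefore yield one link with two ports
when collapsing, and two one-port links otherwise.
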
> 
>       - fixed bug, mesh not set in osm_do_mesh_analysis

I think it should be fixed in the related patch.

>       - rewrote connect switches to use variables in node
>       - in the log message "Lane requirements (%d) exceed available
>         lanes (%d)" the arguments were reversed; fixed

Nice finding.

>       - compute physical egress port in routine get_next_port,
>         which will use round robin if there is more than one
>         physical link between switches
>       - changed printf's to OSM_LOG's in mesh.c
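
The round-robin port selection mentioned in the list above can be
illustrated with a small stand-alone sketch. The rr_link structure and
rr_next_port() are hypothetical names chosen for the example; the patch
itself wraps the index with a comparison instead of the modulo used here,
which should be equivalent.

```c
#include <assert.h>

struct rr_link {
	int num_ports;		/* number of valid entries in ports[] */
	int next_port;		/* index of the port to hand out next */
	int ports[4];
};

/*
 * Return the next physical port for this logical link, cycling through
 * ports[] so that successive callers are spread over all of them.
 */
static int rr_next_port(struct rr_link *l)
{
	int port;

	assert(l->num_ports > 0);
	port = l->ports[l->next_port];
	l->next_port = (l->next_port + 1) % l->num_ports;
	return port;
}
```

With a single port per link (the case without mesh analysis) every call
returns the same port, so the forwarding tables come out as before.
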
> 
> Regards,
> 
> Bob Pearson
> 
> Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
> ----
> diff --git a/opensm/include/opensm/osm_ucast_lash.h
> b/opensm/include/opensm/osm_ucast_lash.h
> index c037571..f3bde5d 100644
> --- a/opensm/include/opensm/osm_ucast_lash.h
> +++ b/opensm/include/opensm/osm_ucast_lash.h
> @@ -82,9 +82,6 @@ typedef struct _switch {
>  		unsigned lane;
>  	} *routing_table;
>  	mesh_node_t *node;
> -	unsigned int num_connections;
> -	int *virtual_physical_port_table;
> -	int *phys_connections;
>  } switch_t;
>  
>  typedef struct _lash {
> diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
> index a248522..dbe3eeb 100644
> --- a/opensm/opensm/osm_mesh.c
> +++ b/opensm/opensm/osm_mesh.c
> @@ -750,7 +750,7 @@ static void make_geometry(lash_t *p_lash, int sw)
>  					continue;
>  
>  				if (l2 == -1) {
> -					printf("ERROR no reverse link\n");
> +					OSM_LOG(p_log, OSM_LOG_DEBUG, "ERROR
> no reverse link\n");
>  					continue;
>  				}
>  
> @@ -919,6 +919,7 @@ static int reorder_links(lash_t *p_lash, int sw)
>   */
>  static int measure_geometry(lash_t *p_lash, int seed)
>  {
> +	osm_log_t *p_log = &p_lash->p_osm->log;
>  	int i, j, k;
>  	int sw;
>  	switch_t *s, *s1;
> @@ -942,7 +943,7 @@ static int measure_geometry(lash_t *p_lash, int seed)
>  				assigned_axes++;
>  	}
>  
> -	printf("lash: %d/%d unassigned/assigned axes\n", unassigned_axes,
> assigned_axes);
> +	OSM_LOG(p_log, OSM_LOG_DEBUG, "%d/%d unassigned/assigned axes\n",
> unassigned_axes, assigned_axes);
>  
>  	do {
>  		change = 0;
> @@ -1069,8 +1070,7 @@ int osm_do_mesh_analysis(lash_t *p_lash)
>  	int i;
>  	mesh_t *mesh;
>  	switch_t *s;
> -
> -	OSM_LOG_ENTER(p_log);
> +	char buf[256], *p;
>  
>  	/*
>  	 * allocate per mesh data structures
> @@ -1080,6 +1080,8 @@ int osm_do_mesh_analysis(lash_t *p_lash)
>  		return -1;
>  	}
>  
> +	mesh = p_lash->mesh;
> +
>  	/*
>  	 * get local metric and invariant for each switch
>  	 * also classify each switch
> @@ -1099,36 +1101,41 @@ int osm_do_mesh_analysis(lash_t *p_lash)
>  
>  	s = p_lash->switches[max_class_type];
>  
> -	printf("lash: found %d node type%s\n", mesh->num_class,
> (mesh->num_class == 1)? "" : "s");
> -	printf("lash: %snode type is ", (mesh->num_class == 1)? "" : "most
> common ");
> +	OSM_LOG(p_log, OSM_LOG_INFO, "found %d node type%s\n",
> mesh->num_class, (mesh->num_class == 1)? "" : "s");
> +
> +	p = buf;
> +	p += sprintf( p, "%snode type is ", (mesh->num_class == 1)? "" :
> "most common ");
>  
>  	if (s->node->type) {
>  		struct _mesh_info *t = &mesh_info[s->node->type];
>  
>  		for (i = 0; i < t->dimension; i++) {
> -			printf("%s%d%s", i? "X" : "", t->size[i],
> +			p += sprintf(p, "%s%d%s", i? " x " : "", t->size[i],
>  				(t->size[i] == 6)? "+" : "");

Would snprintf() be more suitable here in order to prevent potential
overflow? (This is a nit - dimension value is limited now in mesh_info
structure).
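
To make the snprintf() question concrete, here is a minimal sketch of a
bounded append, assuming a fixed-size buffer like the buf[256] in the
patch. buf_append() is a hypothetical helper, not an OpenSM function; it
relies only on the standard vsnprintf() return value.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * buf_append is a hypothetical helper: append formatted text at offset
 * `off` without ever writing past `size` bytes. It returns the new
 * offset, which is capped at size - 1 when the output was truncated.
 */
static size_t buf_append(char *buf, size_t size, size_t off,
			 const char *fmt, ...)
{
	va_list ap;
	int n;

	if (size == 0 || off >= size - 1)
		return off;		/* no room left */

	va_start(ap, fmt);
	n = vsnprintf(buf + off, size - off, fmt, ap);
	va_end(ap);

	if (n < 0)
		return off;		/* encoding error, nothing appended */
	if ((size_t)n >= size - off)
		return size - 1;	/* output was truncated */
	return off + (size_t)n;
}
```

A loop such as off = buf_append(buf, sizeof(buf), off, "%s%d", ...) then
stays inside the buffer even if the dimension limit is raised later.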

>  		}
> -		printf(" mesh\n");
> +		p += sprintf(p, " mesh\n");
>  
>  		p_lash->mesh->dimension = t->dimension;
>  	} else {
> -		printf("unknown geometry\n");
> +		p += sprintf(p, "unknown geometry\n");
>  	}
>  
> +	OSM_LOG(p_log, OSM_LOG_INFO, "%s", buf);
> +
>  	if (s->node->type) {
>  		make_geometry(p_lash, max_class_type);
>  
>  		if (measure_geometry(p_lash, max_class_type))
>  			return -1;
>  
> -		printf("lash: found ");
> +		p = buf;
> +		p += sprintf(p, "found ");
>  		for (i = 0; i < mesh->dimension; i++)
> -			printf("%s%d", i? "X" : "", mesh->size[i]);
> -		printf(" mesh\n");
> -	}
> +			p += sprintf(p, "%s%d", i? " x " : "",
> mesh->size[i]);
> +		p += sprintf(p, " mesh\n");
>  
> -	OSM_LOG_EXIT(p_log);
> +		OSM_LOG(p_log, OSM_LOG_INFO, "%s", buf);
> +	}
>  
>  	return 0;
>  }
> diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
> index 95dbcc2..660ad56 100644
> --- a/opensm/opensm/osm_ucast_lash.c
> +++ b/opensm/opensm/osm_ucast_lash.c
> @@ -67,16 +67,53 @@ static cdg_vertex_t *create_cdg_vertex(unsigned
> num_switches)
>  static void connect_switches(lash_t * p_lash, int sw1, int sw2, int
> phy_port_1)
>  {
>  	osm_log_t *p_log = &p_lash->p_osm->log;
> -	unsigned num = p_lash->switches[sw1]->num_connections;
> +	unsigned num = p_lash->switches[sw1]->node->num_links;
> +	switch_t *s1 = p_lash->switches[sw1];
> +	mesh_node_t *node = s1->node;
> +	switch_t *s2;
> +	link_t *l;
> +	int i;
> +
> +	/*
> +	 * if doing mesh analysis:
> +	 *  - do not consider connections to self
> +	 *  - collapse multiple connections between
> +	 *    pair of switches to a single locical link
> +	 */
> +	if (p_lash->p_osm->subn.opt.do_mesh_analysis) {
> +		if (sw1 == sw2)
> +			return;

This 'if (sw1 == sw2)' check applies to the non-mesh case too, right?

Sasha

> +
> +		/* see if we are alredy linked to sw2 */
> +		for (i = 0; i < num; i++) {
> +			l = node->links[i];
> +
> +			if (node->links[i]->switch_id == sw2) {
> +				l->ports[l->num_ports++] = phy_port_1;
> +				return;
> +			}
> +		}
> +	}
> +
> +	l = node->links[num];
> +	l->switch_id = sw2;
> +	l->link_id = -1;
> +	l->ports[l->num_ports++] = phy_port_1;
> +
> +	s2 = p_lash->switches[sw2];
> +	for (i = 0; i < s2->node->num_links; i++) {
> +		if (s2->node->links[i]->switch_id == sw1) {
> +			s2->node->links[i]->link_id = num;
> +			l->link_id = i;
> +			break;
> +		}
> +	}
>  
> -	p_lash->switches[sw1]->phys_connections[num] = sw2;
> -	p_lash->switches[sw1]->virtual_physical_port_table[num] =
> phy_port_1;
> -	p_lash->switches[sw1]->num_connections++;
> +	node->num_links++;
>  
>  	OSM_LOG(p_log, OSM_LOG_VERBOSE,
>  		"LASH connect: %d, %d, %d\n", sw1, sw2,
>  		phy_port_1);
> -
>  }
>  
>  static osm_switch_t *get_osm_switch_from_port(osm_port_t * port)
> @@ -148,7 +185,7 @@ static int cycle_exists(cdg_vertex_t * start,
> cdg_vertex_t * current,
>  
>  static inline int get_next_switch(lash_t *p_lash, int sw, int link)
>  {
> -	return p_lash->switches[sw]->phys_connections[link];
> +	return p_lash->switches[sw]->node->links[link]->switch_id;
>  }
>  
>  static void remove_semipermanent_depend_for_sp(lash_t * p_lash, int sw,
> @@ -233,8 +270,8 @@ static int get_phys_connection(switch_t *sw, int
> switch_to)
>  {
>  	unsigned int i = 0;
>  
> -	for (i = 0; i < sw->num_connections; i++)
> -		if (sw->phys_connections[i] == switch_to)
> +	for (i = 0; i < sw->node->num_links; i++)
> +		if (sw->node->links[i]->switch_id == switch_to)
>  			return i;
>  	return i;
>  }
> @@ -252,8 +289,8 @@ static void shortest_path(lash_t * p_lash, int ir)
>  
>  	while (!cl_is_list_empty(&bfsq)) {
>  		dequeue(&bfsq, &sw);
> -		for (i = 0; i < sw->num_connections; i++) {
> -			swi = switches[sw->phys_connections[i]];
> +		for (i = 0; i < sw->node->num_links; i++) {
> +			swi = switches[sw->node->links[i]->switch_id];
>  			if (swi->q_state == UNQUEUED) {
>  				enqueue(&bfsq, swi);
>  				sw->dij_channels[sw->used_channels++] =
> swi->id;
> @@ -614,25 +651,8 @@ static switch_t *switch_create(lash_t * p_lash,
> unsigned id, osm_switch_t * p_sw
>  		return NULL;
>  	}
>  
> -	sw->virtual_physical_port_table = malloc(num_ports * sizeof(int));
> -	if (!sw->virtual_physical_port_table) {
> -		free(sw->dij_channels);
> -		free(sw);
> -		return NULL;
> -	}
> -
> -	sw->phys_connections = malloc(num_ports * sizeof(int));
> -	if (!sw->phys_connections) {
> -		free(sw->virtual_physical_port_table);
> -		free(sw->dij_channels);
> -		free(sw);
> -		return NULL;
> -	}
> -
>  	sw->routing_table = malloc(num_switches *
> sizeof(sw->routing_table[0]));
>  	if (!sw->routing_table) {
> -		free(sw->phys_connections);
> -		free(sw->virtual_physical_port_table);
>  		free(sw->dij_channels);
>  		free(sw);
>  		return NULL;
> @@ -643,18 +663,13 @@ static switch_t *switch_create(lash_t * p_lash,
> unsigned id, osm_switch_t * p_sw
>  		sw->routing_table[i].lane = NONE;
>  	}
>  
> -	for (i = 0; i < num_ports; i++) {
> -		sw->virtual_physical_port_table[i] = -1;
> -		sw->phys_connections[i] = NONE;
> -	}
> -
> -	if (osm_mesh_node_create(p_lash, sw))
> -		return NULL;
> -
>  	sw->p_sw = p_sw;
>  	if (p_sw)
>  		p_sw->priv = sw;
>  
> +	if (osm_mesh_node_create(p_lash, sw))
> +		return NULL;
> +
>  	return sw;
>  }
>  
> @@ -664,10 +679,6 @@ static void switch_delete(switch_t * sw)
>  
>  	if (sw->dij_channels)
>  		free(sw->dij_channels);
> -	if (sw->virtual_physical_port_table)
> -		free(sw->virtual_physical_port_table);
> -	if (sw->phys_connections)
> -		free(sw->phys_connections);
>  	if (sw->routing_table)
>  		free(sw->routing_table);
>  	if (sw->p_sw)
> @@ -972,7 +983,7 @@ Error_Not_Enough_Lanes:
>  	status = -1;
>  	OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D02: "
>  		"Lane requirements (%d) exceed available lanes (%d)\n",
> -		p_lash->vl_min, lanes_needed);
> +		lanes_needed, p_lash->vl_min);
>  Exit:
>  	if (switch_bitmap)
>  		free(switch_bitmap);
> @@ -985,6 +996,21 @@ static unsigned get_lash_id(osm_switch_t * p_sw)
>  	return ((switch_t *) p_sw->priv)->id;
>  }
>  
> +int get_next_port(switch_t *sw, int link)
> +{
> +	link_t *l = sw->node->links[link];
> +	int port = l->next_port++;
> +
> +	/*
> +	 * note if not doing mesh analysis
> +	 * then num_ports is always 1
> +	 */
> +	if (l->next_port >= l->num_ports)
> +		l->next_port = 0;
> +
> +	return l->ports[port];
> +}
> +
>  static void populate_fwd_tbls(lash_t * p_lash)
>  {
>  	osm_log_t *p_log = &p_lash->p_osm->log;
> @@ -1036,9 +1062,7 @@ static void populate_fwd_tbls(lash_t * p_lash)
>  				    (uint8_t) sw->
>  
> routing_table[dst_lash_switch_id].out_link;
>  				uint8_t physical_egress_port =
> -				    (uint8_t) sw->
> -				    virtual_physical_port_table
> -				    [lash_egress_port];
> +					get_next_port(sw, lash_egress_port);
>  
>  				p_sw->lft_buf[lid] = physical_egress_port;
>  				OSM_LOG(p_log, OSM_LOG_VERBOSE,
> 
> 


From sashak at voltaire.com  Sun Nov 30 15:54:14 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 1 Dec 2008 01:54:14 +0200
Subject: [ofa-general] Re: {PATCH] [2] opensm: per mesh data
In-Reply-To: <000101c943ce$d2707880$77516980$@com>
References: <000101c943ce$d2707880$77516980$@com>
Message-ID: <20081130235414.GS9338@sashak.voltaire.com>

On 01:26 Tue 11 Nov     , Robert Pearson wrote:
> Sasha,
> 
> Here is the second patch implementing the mesh analysis algorithm.
> 
> This patch:
>       - creates a data structure, mesh_t, that holds per mesh information
>       - adds a pointer to this structure in lash_t
>       - creates methods to allocate and free memory for mesh_t
>       - adds osm_ prefix to global routine names (oops)
>       - calls create and cleanup methods
> 
> Regards,
> 
> Bob Pearson
> 
> Signed-off-by: Bob Pearson <rpearson at systemfabricworks.com>
> ----
> diff --git a/opensm/include/opensm/osm_mesh.h
> b/opensm/include/opensm/osm_mesh.h
> index 1467440..8313614 100644
> --- a/opensm/include/opensm/osm_mesh.h
> +++ b/opensm/include/opensm/osm_mesh.h
> @@ -41,6 +41,18 @@
>  
>  struct _lash;
>  
> -int do_mesh_analysis(struct _lash *p_lash);
> +/*
> + * per fabric mesh info
> + */
> +typedef struct _mesh {
> +	int num_class;			/* number of switch classes */
> +	int *class_type;		/* index of first switch found for
> each class */
> +	int *class_count;		/* population of each class */
> +	int dimension;			/* mesh dimension */
> +	int *size;			/* an array to hold size of mesh */
> +} mesh_t;
> +
> +void osm_mesh_cleanup(struct _lash *p_lash);
> +int osm_do_mesh_analysis(struct _lash *p_lash);
>  
>  #endif
> diff --git a/opensm/include/opensm/osm_ucast_lash.h
> b/opensm/include/opensm/osm_ucast_lash.h
> index 646e9a3..1ae3bb6 100644
> --- a/opensm/include/opensm/osm_ucast_lash.h
> +++ b/opensm/include/opensm/osm_ucast_lash.h
> @@ -95,6 +95,7 @@ typedef struct _lash {
>  	cdg_vertex_t ****cdg_vertex_matrix;
>  	int *num_mst_in_lane;
>  	int ***virtual_location;
> +	mesh_t *mesh;
>  } lash_t;
>  
>  #endif
> diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c
> index 7943274..c97925b 100644
> --- a/opensm/opensm/osm_mesh.c
> +++ b/opensm/opensm/osm_mesh.c
> @@ -41,6 +41,7 @@
>  #endif				/* HAVE_CONFIG_H */
>  
>  #include <stdio.h>
> +#include <stdlib.h>
>  #include <opensm/osm_switch.h>
>  #include <opensm/osm_opensm.h>
>  #include <opensm/osm_log.h>
> @@ -48,15 +49,72 @@
>  #include <opensm/osm_ucast_lash.h>
>  
>  /*
> + * osm_mesh_cleanup - free per mesh resources
> + */
> +void osm_mesh_cleanup(lash_t *p_lash)
> +{
> +	mesh_t *mesh = p_lash->mesh;
> +
> +	if (mesh) {
> +		if (mesh->class_type)
> +			free(mesh->class_type);
> +
> +		if (mesh->class_count)
> +			free(mesh->class_count);
> +
> +		free(mesh);
> +
> +		p_lash->mesh = NULL;
> +	}
> +}
> +
> +/*
> + * mesh_create - allocate per mesh resources
> + */
> +static int mesh_create(lash_t *p_lash)
> +{
> +	osm_log_t *p_log = &p_lash->p_osm->log;
> +	mesh_t *mesh;
> +
> +	if(!(mesh = p_lash->mesh = calloc(1, sizeof(mesh_t)))) {
> +		OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating mesh - out
> of memory\n");
> +		return -1;
> +	}
> +
> +	if (!(mesh->class_type = calloc(p_lash->num_switches, sizeof(int))))
> {
> +		OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating
> mesh->class_type - out of memory\n");
> +		free(mesh);
> +		return -1;
> +	}
> +
> +	if (!(mesh->class_count = calloc(p_lash->num_switches,
> sizeof(int)))) {
> +		OSM_LOG(p_log, OSM_LOG_ERROR, "Failed allocating
> mesh->class_count - out of memory\n");
> +		free(mesh->class_type);
> +		free(mesh);
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
>   * do_mesh_analysis
>   */
> -int do_mesh_analysis(lash_t *p_lash)
> +int osm_do_mesh_analysis(lash_t *p_lash)
>  {
>  	int ret = 0;
>  	osm_log_t *p_log = &p_lash->p_osm->log;
>  
>  	OSM_LOG_ENTER(p_log);
>  
> +	/*
> +	 * allocate per mesh data structures
> +	 */
> +	if (mesh_create(p_lash)) {
> +		OSM_LOG_EXIT(p_log);
> +		return -1;
> +	}
> +
>  	printf("lash: do_mesh_analysis stub called\n");
>  
>  	OSM_LOG_EXIT(p_log);
> diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
> index e10371c..3577cca 100644
> --- a/opensm/opensm/osm_ucast_lash.c
> +++ b/opensm/opensm/osm_ucast_lash.c
> @@ -825,7 +825,7 @@ static int lash_core(lash_t * p_lash)
>  
>  	OSM_LOG_ENTER(p_log);
>  
> -	if (p_lash->p_osm->subn.opt.do_mesh_analysis &&
> do_mesh_analysis(p_lash)) {
> +	if (p_lash->p_osm->subn.opt.do_mesh_analysis &&
> osm_do_mesh_analysis(p_lash)) {
>  		OSM_LOG(p_log, OSM_LOG_ERROR, "Mesh analysis failed\n");
>  		goto Exit;
>  	}
> @@ -1124,6 +1124,8 @@ static void lash_cleanup(lash_t * p_lash)
>  		free(p_lash->switches);
>  	}
>  	p_lash->switches = NULL;
> +
> +	osm_mesh_cleanup(p_lash);
>  }

lash_cleanup() is called at the start of the LASH processor, so the mesh
will keep its allocated data between routing calculation cycles. But as far
as I can see it is not used there. Also, osm_mesh_cleanup() is not called on
lash deletion, so we have a memory leak.

Maybe osm_mesh_cleanup() should be a static function (mesh_cleanup()) and
be called somewhere at the end of osm_do_mesh_analysis()?
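
Something along these lines, as a rough sketch only. It assumes the mesh
data is not needed once the analysis returns; the sketch_* names and the
reduced structure are made up for the example and are not the patch code.

```c
#include <stdlib.h>

struct sketch_mesh {
	int *class_type;
	int *class_count;
};

static void sketch_mesh_cleanup(struct sketch_mesh *mesh)
{
	if (!mesh)
		return;
	free(mesh->class_type);		/* free(NULL) is a no-op */
	free(mesh->class_count);
	free(mesh);
}

static struct sketch_mesh *sketch_mesh_create(size_t num_switches)
{
	struct sketch_mesh *mesh = calloc(1, sizeof(*mesh));

	if (!mesh)
		return NULL;
	mesh->class_type = calloc(num_switches, sizeof(int));
	mesh->class_count = calloc(num_switches, sizeof(int));
	if (!mesh->class_type || !mesh->class_count) {
		sketch_mesh_cleanup(mesh);
		return NULL;
	}
	return mesh;
}

/* Per-run state is created on entry and released on every exit path. */
static int sketch_do_analysis(size_t num_switches)
{
	struct sketch_mesh *mesh;
	int ret = 0;

	if (num_switches == 0)
		return -1;
	if (!(mesh = sketch_mesh_create(num_switches)))
		return -1;

	/* placeholder for the real analysis, which fills the arrays */
	mesh->class_count[0] = 1;

	sketch_mesh_cleanup(mesh);
	return ret;
}
```

This way nothing survives between routing cycles and there is nothing
left to free on lash deletion.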

Sasha


From aostvold at platform.com  Sun Nov 30 23:01:09 2008
From: aostvold at platform.com (Asmund Ostvold)
Date: Mon, 01 Dec 2008 08:01:09 +0100
Subject: [ofa-general] receiving wrong data after trying to allocation
	a too large memory chunk
In-Reply-To: <493104BA.9090607@platform.com>
References: <493104BA.9090607@platform.com>
Message-ID: <49338BB5.6010709@platform.com>

I apologize for my dyslectic subject.  This should be better.  We would 
very much like to know if anybody else can reproduce the results?  If 
you need more info please contact us.

Regards,
Asmund (dyslectic programmer)


Asmund Ostvold wrote:
> We discovered a strange problem running OFED. We're not sure if it is an
> OFED problem, but we post it here anyway.
> 
> 
> Short description:
> We have a program that allocates a set of buffers with valloc, sends
> them with ibv_post_send and frees them.
> This is run in a loop.
> We have a "caching" algorithm so that we register memory only the first
> time we come across a buffer address.
> We start getting wrong data for parts of sends after a couple of
> iterations.
> 
> There are a few things worth mentioning:
> - The problem only appears with valloc; the test works with malloc
> - A malloc of a too-large chunk must be attempted before starting the
> loop (the malloc fails)
> 
> We have modified the "rdma_lat.c" program to show the error (attached)
> 
> Regards
> Asmund


From nicolas.morey-chaisemartin at ext.bull.net  Sun Nov 30 23:11:31 2008
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Mon, 01 Dec 2008 08:11:31 +0100
Subject: [ofa-general] GitWeb really slow
Message-ID: <49338E23.9000600@ext.bull.net>

Hi,

This is not necessarily the best place to post it, but I was wondering why
OFED's gitweb is so slow on the main page?
It takes only a few seconds to display all the repositories on kernel.org
(and there are a lot more), but it takes nearly a minute to display the
OFED git main page...

I know it's probably not the most critical issue you have to work on, but
I connect to this page quite often and it is starting to really bug me.
And I'm probably not the only one ;)

Thanks in advance


Nicolas Morey-Chaisemartin


From sean.hefty at intel.com  Sun Nov 30 23:41:37 2008
From: sean.hefty at intel.com (Sean Hefty)
Date: Sun, 30 Nov 2008 23:41:37 -0800
Subject: [ofa-general] [RMDA CM IPv6 support. PATCHv4 1/6]
	AF_INET6	support for rdma_bind_addr
In-Reply-To: <ada7i6kc3mi.fsf@cisco.com>
References: <1227721899.3121.18.camel@alst60.voltaire.com>	<ada8wr3cdwe.fsf@cisco.com>	<1228033480.3621.5.camel@alst60.voltaire.com>	<adafxl9azsp.fsf@cisco.com>	<39C75744D164D948A170E9792AF8E7CA428B8F@exil.voltaire.com>
	<ada7i6kc3mi.fsf@cisco.com>
Message-ID: <000101c95388$3e141150$dce0180a@amr.corp.intel.com>

>Yes, it's a tricky balance.  You don't want to combine multiple ideas in
>one patch, because such patches are hard to review and hard to debug
>later.  But splitting one idea into multiple patches also causes
>similar problems (and you have to get the pieces in the right order
>too).  And of course the whole question of what constitutes an "idea" is
>rather subjective.  So we just do the best we can.

The patch set taken collectively looks good to me.  I think it makes sense to
view the series as 2 patches, one for ib_addr (patches 4-6), and one for rdma_cm
(patches 1-3).  The ib_addr patch should come first.

- Sean