From vlad at lists.openfabrics.org Mon Sep 1 03:05:40 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 1 Sep 2008 03:05:40 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20080901-0200 daily build status
Message-ID: <20080901100540.6A9C2E608DC@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ia64 with linux-2.6.25
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.24

Failed:

From vlad at mellanox.co.il Mon Sep 1 07:11:03 2008
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Mon, 1 Sep 2008 17:11:03 +0300
Subject: [ofa-general] [PATCH] IB/mlx4: Set RAE and FRE flags, initialize mtt_sz field in the mpt entry.
Message-ID: <20080901141103.GA32171@mellanox.co.il>

rae - enable remote access on this fast memory region.
fre - enable Fast Registration Operations on this region.
mtt_sz - number of MTT entries allocated for this memory region.

Signed-off-by: Vladimir Sokolovsky
---
 drivers/net/mlx4/mr.c | 6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c
index 62071d9..9c026e1 100644
--- a/drivers/net/mlx4/mr.c
+++ b/drivers/net/mlx4/mr.c
@@ -67,7 +67,8 @@ struct mlx4_mpt_entry {
 #define MLX4_MPT_FLAG_PHYSICAL (1 << 9)
 #define MLX4_MPT_FLAG_REGION (1 << 8)
 
-#define MLX4_MPT_PD_FLAG_FAST_REG (1 << 26)
+#define MLX4_MPT_PD_FLAG_FAST_REG (1 << 27)
+#define MLX4_MPT_PD_FLAG_RAE (1 << 28)
 #define MLX4_MPT_PD_FLAG_EN_INV (3 << 24)
 
 #define MLX4_MTT_FLAG_PRESENT 1
@@ -349,6 +350,9 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr)
 		/* fast register MR in free state */
 		mpt_entry->flags |= cpu_to_be32(MLX4_MPT_FLAG_FREE);
 		mpt_entry->pd_flags |= cpu_to_be32(MLX4_MPT_PD_FLAG_FAST_REG);
+		mpt_entry->pd_flags |= cpu_to_be32(MLX4_MPT_PD_FLAG_RAE);
+		mpt_entry->mtt_sz = cpu_to_be32((1 << mr->mtt.order) *
+						MLX4_MTT_ENTRY_PER_SEG);
 	} else {
 		mpt_entry->flags |= cpu_to_be32(MLX4_MPT_FLAG_SW_OWNS);
 	}
-- 
1.6.0.1.90.g27a6e

From kliteyn at dev.mellanox.co.il Mon Sep 1 08:03:28 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 01 Sep 2008 18:03:28 +0300
Subject: [ofa-general] [PATCH] opensm/Makefile.am: adding yacc-generated .h file as dependency
Message-ID: <48BC0440.8050807@dev.mellanox.co.il>

Hi Sasha,

Adding header file that is produced by yacc/bison to the general
dependencies. W/o it compiling of lex-generated .c file sometimes fails.

Signed-off-by: Yevgeny Kliteynik
---
 opensm/opensm/Makefile.am | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am
index 7ca4c2a..f94842c 100644
--- a/opensm/opensm/Makefile.am
+++ b/opensm/opensm/Makefile.am
@@ -126,7 +126,7 @@ opensminclude_HEADERS = \
 	$(srcdir)/../include/opensm/osm_vl15intf.h \
 	$(top_builddir)/include/opensm/osm_version.h
 
-BUILT_SOURCES = osm_version
+BUILT_SOURCES = osm_version osm_qos_parser_y.h
 osm_version:
 	if [ -x $(top_srcdir)/../gen_ver.sh ] ; then \
 	ver_file=$(top_builddir)/include/opensm/osm_version.h ; \
-- 
1.5.1.4

From rdreier at cisco.com Mon Sep 1 08:48:09 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 01 Sep 2008 08:48:09 -0700
Subject: [ofa-general] Re: [PATCH] IB/mlx4: Set RAE and FRE flags, initialize mtt_sz field in the mpt entry.
In-Reply-To: <20080901141103.GA32171@mellanox.co.il> (Vladimir Sokolovsky's message of "Mon, 1 Sep 2008 17:11:03 +0300")
References: <20080901141103.GA32171@mellanox.co.il>
Message-ID:

I need help deciding whether to get this in 2.6.27 or not. With this
patch, how is send queue fast register working? If this is the last
fix then I think we can get it in 2.6.27. If you are still debugging
and it still doesn't work well, then I might want to wait and see how
big the required fixes end up being.

Thanks,
Roland

From arlin.r.davis at intel.com Mon Sep 1 19:25:59 2008
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Mon, 1 Sep 2008 19:25:59 -0700
Subject: [ofa-general] [PATCH 2/5][v1.2] dapl: fix compiler warnings in common code
Message-ID: <000101c90ca3$3ced0a60$5464fe0a@amr.corp.intel.com>

Cleanup uDAPL common code.
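
One recurring change in the diff that follows is to stop casting the address of a typed pointer to the hash table's generic out-parameter type (for example `(DAPL_HASH_DATA *) &lmr`). Instead, the lookup writes into a variable of the generic type, which is converted once afterwards. A minimal sketch of that pattern is below. The types and the lookup function are hypothetical stand-ins, not the uDAPL definitions, and treating the generic type as `void *` is an assumption. The warning being silenced is presumably GCC's strict-aliasing warning for type-punned pointers; the patch itself does not say.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the uDAPL types; not the real definitions. */
typedef void *HASH_DATA;               /* generic value stored in the table */
typedef struct { int context; } LMR;   /* typed object the caller wants */

/*
 * Hypothetical lookup: returns 0 and hands back the stored value through
 * a HASH_DATA out-parameter, or returns -1 if nothing is stored.
 */
static int hash_search(HASH_DATA stored, HASH_DATA *out)
{
    if (stored == NULL)
        return -1;
    *out = stored;
    return 0;
}

/*
 * Receive the result in a variable of the generic type, then convert the
 * value once. No pointer-to-pointer cast is needed at the call site.
 */
static LMR *find_lmr(HASH_DATA stored)
{
    HASH_DATA tmp = NULL;

    if (hash_search(stored, &tmp) != 0)
        return NULL;
    return (LMR *)tmp;
}
```

With `LMR l = { 42 };`, calling `find_lmr(&l)` gives back `&l`, and `find_lmr(NULL)` gives back `NULL`. `<assert.h>` is included only so that such checks can be written as assertions.
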
Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dapl/common/dapl_ep_get_status.c | 1 + dapl/common/dapl_ep_modify.c | 2 +- dapl/common/dapl_rmr_bind.c | 4 +++- dapl/udapl/dapl_evd_wait.c | 17 ++++++++++------- dapl/udapl/dapl_lmr_create.c | 5 +++-- 5 files changed, 18 insertions(+), 11 deletions(-) diff --git a/dapl/common/dapl_ep_get_status.c b/dapl/common/dapl_ep_get_status.c index a931355..3266134 100644 --- a/dapl/common/dapl_ep_get_status.c +++ b/dapl/common/dapl_ep_get_status.c @@ -38,6 +38,7 @@ #include "dapl.h" #include "dapl_ring_buffer_util.h" +#include "dapl_cookie.h" /* * dapl_ep_get_status diff --git a/dapl/common/dapl_ep_modify.c b/dapl/common/dapl_ep_modify.c index 74e3331..05b39db 100644 --- a/dapl/common/dapl_ep_modify.c +++ b/dapl/common/dapl_ep_modify.c @@ -84,7 +84,7 @@ dapl_ep_modify ( { DAPL_IA *ia; DAPL_EP *ep1, *ep2; - DAT_EP_ATTR ep_attr1, ep_attr2; + DAT_EP_ATTR ep_attr1 = {0}, ep_attr2 = {0}; DAPL_EP new_ep, copy_of_old_ep; DAPL_EP alloc_ep; /* Holder for resources. */ DAPL_PZ *tmp_pz; diff --git a/dapl/common/dapl_rmr_bind.c b/dapl/common/dapl_rmr_bind.c index 905ea2c..c9dc02f 100644 --- a/dapl/common/dapl_rmr_bind.c +++ b/dapl/common/dapl_rmr_bind.c @@ -84,15 +84,17 @@ dapli_rmr_bind_fuse ( DAPL_COOKIE *cookie; DAT_RETURN dat_status; DAT_BOOLEAN is_signaled; + DAPL_HASH_DATA hash_lmr; dat_status = dapls_hash_search (rmr->header.owner_ia->hca_ptr->lmr_hash_table, lmr_triplet->lmr_context, - (DAPL_HASH_DATA *) &lmr); + &hash_lmr); if ( DAT_SUCCESS != dat_status) { dat_status = DAT_ERROR (DAT_INVALID_PARAMETER, DAT_INVALID_ARG2); goto bail; } + lmr = (DAPL_LMR*)hash_lmr; /* if the ep in unconnected return an error. 
IB requires that the */ /* QP be connected to change a memory window binding since: */ diff --git a/dapl/udapl/dapl_evd_wait.c b/dapl/udapl/dapl_evd_wait.c index 966cef0..a03c5ea 100644 --- a/dapl/udapl/dapl_evd_wait.c +++ b/dapl/udapl/dapl_evd_wait.c @@ -141,27 +141,30 @@ DAT_RETURN dapl_evd_wait ( waitable = evd_ptr->evd_waitable; dapl_os_assert ( sizeof(DAT_COUNT) == sizeof(DAPL_EVD_STATE) ); - evd_state = dapl_os_atomic_assign ( (DAPL_ATOMIC *)&evd_ptr->evd_state, - (DAT_COUNT) DAPL_EVD_STATE_OPEN, - (DAT_COUNT) DAPL_EVD_STATE_WAITED ); - dapl_os_unlock ( &evd_ptr->header.lock ); + evd_state = evd_ptr->evd_state; + if (evd_ptr->evd_state == DAPL_EVD_STATE_OPEN) + evd_ptr->evd_state = DAPL_EVD_STATE_WAITED; if ( evd_state != DAPL_EVD_STATE_OPEN ) { /* Bogus state, bail out */ dat_status = DAT_ERROR (DAT_INVALID_STATE,0); + dapl_os_unlock ( &evd_ptr->header.lock ); goto bail; } if (!waitable) { /* This EVD is not waitable, reset the state and bail */ - (void) dapl_os_atomic_assign ((DAPL_ATOMIC *)&evd_ptr->evd_state, - (DAT_COUNT) DAPL_EVD_STATE_WAITED, - evd_state); + if (evd_ptr->evd_state == DAPL_EVD_STATE_WAITED) + evd_ptr->evd_state = evd_state; + dat_status = DAT_ERROR (DAT_INVALID_STATE, DAT_INVALID_STATE_EVD_UNWAITABLE); + dapl_os_unlock ( &evd_ptr->header.lock ); goto bail; } + dapl_os_unlock ( &evd_ptr->header.lock ); + /* * We now own the EVD, even though we don't have the lock anymore, diff --git a/dapl/udapl/dapl_lmr_create.c b/dapl/udapl/dapl_lmr_create.c index 2a71864..b2492ea 100644 --- a/dapl/udapl/dapl_lmr_create.c +++ b/dapl/udapl/dapl_lmr_create.c @@ -202,6 +202,7 @@ dapli_lmr_create_lmr ( DAPL_LMR *lmr; DAT_REGION_DESCRIPTION reg_desc; DAT_RETURN dat_status; + DAPL_HASH_DATA hash_lmr; dapl_dbg_log (DAPL_DBG_TYPE_API, "dapl_lmr_create_lmr (%p, %p, %p, %x, %p, %p, %p, %p)\n", @@ -215,13 +216,13 @@ dapli_lmr_create_lmr ( dat_status = dapls_hash_search (ia->hca_ptr->lmr_hash_table, original_lmr->param.lmr_context, - (DAPL_HASH_DATA *) &lmr); + 
&hash_lmr); if ( dat_status != DAT_SUCCESS ) { dat_status = DAT_ERROR (DAT_INVALID_PARAMETER,DAT_INVALID_ARG2); goto bail; } - + lmr = (DAPL_LMR*)hash_lmr; reg_desc.for_lmr_handle = (DAT_LMR_HANDLE) original_lmr; lmr = dapl_lmr_alloc (ia, -- 1.5.2.5 From arlin.r.davis at intel.com Mon Sep 1 19:25:54 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 1 Sep 2008 19:25:54 -0700 Subject: [ofa-general] [PATCH 1/5][v1.2] dtest/dapltest: fix compiler warnings, add _GNU_SOURCE to test application builds Message-ID: <000001c90ca3$3b3dd3c0$5464fe0a@amr.corp.intel.com> Patch set to cleanup all warnings and fix fedora build issues. dtest/dapltest: fix all compiler warnings, cleanup test code, build with -Wall, -D_GNU_SOURCE. Signed-off by: Arlin Davis ardavis at ichips.intel.com --- test/dapltest/Makefile.am | 2 +- test/dapltest/cmd/dapl_netaddr.c | 2 +- test/dapltest/test/dapl_limit.c | 125 ++++++++++++++++++-------------------- test/dtest/Makefile.am | 2 +- test/dtest/dtest.c | 62 ++++++++----------- 5 files changed, 89 insertions(+), 104 deletions(-) diff --git a/test/dapltest/Makefile.am b/test/dapltest/Makefile.am index 1a19c53..d826f80 100755 --- a/test/dapltest/Makefile.am +++ b/test/dapltest/Makefile.am @@ -3,7 +3,7 @@ INCLUDES = -I include \ -I $(srcdir)/../../dat/include bin_PROGRAMS = dapltest - +dapltest_CFLAGS = -g -Wall -D_GNU_SOURCE dapltest_SOURCES = \ cmd/dapl_main.c \ cmd/dapl_params.c \ diff --git a/test/dapltest/cmd/dapl_netaddr.c b/test/dapltest/cmd/dapl_netaddr.c index a306335..e1600d5 100644 --- a/test/dapltest/cmd/dapl_netaddr.c +++ b/test/dapltest/cmd/dapl_netaddr.c @@ -90,7 +90,7 @@ DT_NetAddrLookupHostAddress (DAT_IA_ADDRESS_PTR to_netaddr, whatzit = "service unavailable"; break; } -#if !defined(WIN32) +#if !defined(WIN32) && defined(__USE_GNU) case EAI_ADDRFAMILY: { whatzit = "node has no address in this family"; diff --git a/test/dapltest/test/dapl_limit.c b/test/dapltest/test/dapl_limit.c index f619edd..e308bef 100644 --- 
a/test/dapltest/test/dapl_limit.c +++ b/test/dapltest/test/dapl_limit.c @@ -36,13 +36,13 @@ static bool more_handles (DT_Tdep_Print_Head *phead, - DAT_HANDLE **old_ptrptr, /* pointer to current pointer */ + void **old_ptrptr, /* pointer to current pointer */ unsigned int *old_count, /* number pointed to */ unsigned int size) /* size of one datum */ { unsigned int count = *old_count; - DAT_HANDLE *old_handles = *old_ptrptr; - DAT_HANDLE *handle_tmp = DT_Mdep_Malloc (count * 2 * size); + void *old_handles = *old_ptrptr; + void *handle_tmp = DT_Mdep_Malloc (count * 2 * size); if (!handle_tmp) { @@ -166,9 +166,9 @@ limit_test ( DT_Tdep_Print_Head *phead, DAT_EVD_HANDLE ia_async_handle; } OneOpen; - unsigned int count = START_COUNT; - OneOpen *hdlptr = (OneOpen *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc (count * sizeof(OneOpen)); + OneOpen *hdlptr = (OneOpen *)hptr; /* IA Exhaustion test loop */ if (hdlptr) @@ -181,14 +181,13 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles (phead, &hptr, &count, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: IAs opened: %d\n", module, w); retval = true; break; } + hdlptr = (OneOpen *)hptr; /* Specify that we want to get back an async EVD. 
*/ hdlptr[w].ia_async_handle = DAT_HANDLE_NULL; ret = dat_ia_open (cmd->device_name, @@ -265,9 +264,9 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many PZs we can create */ - unsigned int count = START_COUNT; - DAT_PZ_HANDLE *hdlptr = (DAT_PZ_HANDLE *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc (count * sizeof(DAT_PZ_HANDLE)); + DAT_PZ_HANDLE *hdlptr = (DAT_PZ_HANDLE *)hptr; /* PZ Exhaustion test loop */ if (hdlptr) @@ -282,14 +281,13 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles(phead, &hptr, &count, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: PZs created: %d\n", module, w); retval = true; break; } + hdlptr = (DAT_PZ_HANDLE *)hptr; ret = dat_pz_create (hdl_sets[w % cmd->width].ia_handle, &hdlptr[w]); if (ret != DAT_SUCCESS) @@ -363,10 +361,10 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many CNOs we can create */ - unsigned int count = START_COUNT; - DAT_CNO_HANDLE *hdlptr = (DAT_CNO_HANDLE *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); - + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc (count * sizeof(DAT_CNO_HANDLE)); + DAT_CNO_HANDLE *hdlptr = (DAT_CNO_HANDLE *)hptr; + /* CNO Exhaustion test loop */ if (hdlptr) { @@ -380,14 +378,13 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles(phead, &hptr, &count, sizeof (*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: CNOs created: %d\n", module, w); retval = true; break; } + hdlptr = (DAT_CNO_HANDLE *)hptr; ret = dat_cno_create (hdl_sets[w % cmd->width].ia_handle, DAT_OS_WAIT_PROXY_AGENT_NULL, &hdlptr[w]); @@ -484,9 +481,10 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many EVDs we can create */ - unsigned int count = START_COUNT; - 
DAT_EVD_HANDLE *hdlptr = (DAT_EVD_HANDLE *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc(count * sizeof(DAT_EVD_HANDLE)); + DAT_EVD_HANDLE *hdlptr = (DAT_EVD_HANDLE *)hptr; + DAT_EVD_FLAGS flags = ( DAT_EVD_DTO_FLAG | DAT_EVD_RMR_BIND_FLAG | DAT_EVD_CR_FLAG); @@ -519,14 +517,13 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles(phead, &hptr, &count, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: EVDs created: %d\n", module, w); retval = true; break; } + hdlptr = (DAT_EVD_HANDLE *)hptr; ret = DT_Tdep_evd_create (hdl_sets[w % cmd->width].ia_handle, DFLT_QLEN, hdl_sets[w % cmd->width].cno_handle, @@ -603,9 +600,9 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many EPs we can create */ - unsigned int count = START_COUNT; - DAT_EP_HANDLE *hdlptr = (DAT_EP_HANDLE *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc(count * sizeof(DAT_EP_HANDLE)); + DAT_EP_HANDLE *hdlptr = (DAT_EP_HANDLE *)hptr; /* EP Exhaustion test loop */ if (hdlptr) @@ -618,14 +615,13 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles(phead, &hptr, &count, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: EPs created: %d\n", module, w); retval = true; break; } + hdlptr = (DAT_EP_HANDLE *)hptr; ret = dat_ep_create (hdl_sets[w % cmd->width].ia_handle, hdl_sets[w % cmd->width].pz_handle, hdl_sets[w % cmd->width].evd_handle, @@ -674,11 +670,11 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many RSPs we can create */ - unsigned int count = START_COUNT; - DAT_RSP_HANDLE *hdlptr = (DAT_RSP_HANDLE *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); - DAT_EP_HANDLE *epptr = (DAT_EP_HANDLE *) - DT_Mdep_Malloc 
(count * sizeof (*epptr)); + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc(count * sizeof (DAT_RSP_HANDLE)); + DAT_RSP_HANDLE *hdlptr = (DAT_RSP_HANDLE *)hptr; + void *eptr = DT_Mdep_Malloc(count * sizeof (DAT_EP_HANDLE)); + DAT_EP_HANDLE *epptr = (DAT_EP_HANDLE *)eptr; /* RSP Exhaustion test loop */ if (hdlptr) @@ -695,23 +691,21 @@ limit_test ( DT_Tdep_Print_Head *phead, unsigned int count1 = count; unsigned int count2 = count; - if (!more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count1, - sizeof (*hdlptr))) + if (!more_handles(phead, &hptr, &count1, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: RSPs created: %d\n", module, w); retval = true; break; } - if (!more_handles (phead, (DAT_HANDLE **) &epptr, - &count2, - sizeof (*epptr))) + hdlptr = (DAT_RSP_HANDLE *)hptr; + + if (!more_handles (phead, &eptr, &count2, sizeof(*epptr))) { DT_Tdep_PT_Printf (phead, "%s: RSPs created: %d\n", module, w); retval = true; break; } - + epptr = (DAT_EP_HANDLE *)eptr; if (count1 != count2) { DT_Tdep_PT_Printf (phead, "%s: Mismatch in allocation of handle arrays at point %d\n", @@ -810,9 +804,9 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many PSPs we can create */ - unsigned int count = START_COUNT; - DAT_PSP_HANDLE *hdlptr = (DAT_PSP_HANDLE *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc (count * sizeof (DAT_PSP_HANDLE)); + DAT_PSP_HANDLE *hdlptr = (DAT_PSP_HANDLE *)hptr; /* PSP Exhaustion test loop */ if (hdlptr) @@ -825,14 +819,13 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles (phead, &hptr, &count, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: PSPs created: %d\n", module, w); retval = true; break; } + hdlptr = (DAT_PSP_HANDLE *)hptr; ret = dat_psp_create (hdl_sets[w % cmd->width].ia_handle, CONN_QUAL0 + w, hdl_sets[w % 
cmd->width].evd_handle, @@ -936,10 +929,10 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many LMRs we can create */ - unsigned int count = START_COUNT; - Bpool **hdlptr = (Bpool **) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); - + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc (count * sizeof(Bpool*)); + Bpool **hdlptr = (Bpool **)hptr; + /* LMR Exhaustion test loop */ if (hdlptr) { @@ -951,9 +944,7 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles (phead, &hptr, &count, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: no memory for LMR handles\n", module); @@ -961,6 +952,7 @@ limit_test ( DT_Tdep_Print_Head *phead, retval = true; break; } + hdlptr = (Bpool **)hptr; /* * Let BpoolAlloc do the hard work; this means that * we're testing unique memory registrations rather @@ -1012,14 +1004,15 @@ limit_test ( DT_Tdep_Print_Head *phead, * We are posting the same buffer 'cnt' times, deliberately, * but that should be OK. 
*/ - unsigned int count = START_COUNT; - DAT_LMR_TRIPLET *hdlptr = (DAT_LMR_TRIPLET *) - DT_Mdep_Malloc (count * cmd->width * sizeof (*hdlptr)); + unsigned int count = START_COUNT; + void *hptr = + DT_Mdep_Malloc(count * cmd->width * sizeof(DAT_LMR_TRIPLET)); + DAT_LMR_TRIPLET *hdlptr = (DAT_LMR_TRIPLET *)hptr; /* Recv-Post Exhaustion test loop */ if (hdlptr) { - unsigned int w = 0; + unsigned int w = 0; unsigned int i = 0; unsigned int done = 0; @@ -1028,9 +1021,8 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - cmd->width * sizeof (*hdlptr))) + && !more_handles (phead, &hptr, &count, + cmd->width * sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: no memory for IOVs \n", module); @@ -1042,6 +1034,7 @@ limit_test ( DT_Tdep_Print_Head *phead, done = retval = true; break; } + hdlptr = (DAT_LMR_TRIPLET *)hptr; for (i = 0; i < cmd->width; i++) { DAT_LMR_TRIPLET *iovp = &hdlptr[w * cmd->width + i]; diff --git a/test/dtest/Makefile.am b/test/dtest/Makefile.am index fcb9b4e..fb605ba 100755 --- a/test/dtest/Makefile.am +++ b/test/dtest/Makefile.am @@ -1,5 +1,5 @@ bin_PROGRAMS = dtest dtest_SOURCES = dtest.c +dtest_CFLAGS = -g -Wall -D_GNU_SOURCE INCLUDES = -I $(srcdir)/../../dat/include dtest_LDADD = $(srcdir)/../../dat/udat/libdat.la - diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c index 039b6bf..a93f878 100755 --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -50,6 +50,7 @@ #define DAPL_PROVIDER "OpenIB-cma" #endif +#define F64x "%"PRIx64"" #define MAX_POLLING_CNT 50000 #define MAX_RDMA_RD 4 #define MAX_PROCS 1000 @@ -142,7 +143,6 @@ struct { } time; /* defaults */ -static int parent=1; static int connected=0; static int burst=10; static int server=1; @@ -151,17 +151,13 @@ static int polling=0; static int poll_count=0; static int rdma_wr_poll_count=0; static int rdma_rd_poll_count[MAX_RDMA_RD]={0}; -static int pin_memory=0; static int delay=0; static int 
buf_len=RDMA_BUFFER_SIZE; static int use_cno=0; -static int post_recv_count=MSG_BUF_COUNT; static int recv_msg_index=0; static int burst_msg_posted=0; static int burst_msg_index=0; -static pid_t child[MAX_PROCS+1]; - /* forward prototypes */ const char * DT_RetToString (DAT_RETURN ret_value); const char * DT_EventToSTr (DAT_EVENT_NUMBER event_code); @@ -188,7 +184,7 @@ DAT_RETURN do_ping_pong_msg( void ); #define LOGPRINTF(_format, _aa...) \ if (verbose) \ printf(_format, ##_aa) - +int main(int argc, char **argv) { int i,c; @@ -358,12 +354,12 @@ main(int argc, char **argv) inet_ntop(AF_INET, &((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_addr, addr_str, sizeof(addr_str)); - printf("\n%d Query EP: LOCAL addr %s port %d\n", getpid(), + printf("\n%d Query EP: LOCAL addr %s port "F64x"\n", getpid(), addr_str, ep_param.local_port_qual); inet_ntop(AF_INET, &((struct sockaddr_in *)ep_param.remote_ia_address_ptr)->sin_addr, addr_str, sizeof(addr_str)); - printf("%d Query EP: REMOTE addr %s port %d\n", getpid(), + printf("%d Query EP: REMOTE addr %s port "F64x"\n", getpid(), addr_str, ep_param.remote_port_qual); fflush(stdout); @@ -492,6 +488,7 @@ cleanup: /* free rdma buffers */ free(rbuf); free(sbuf); + return(0); } @@ -577,7 +574,7 @@ send_msg( void *data, if ((event.event_data.dto_completion_event_data.transfered_length != size ) || (event.event_data.dto_completion_event_data.user_cookie.as_64 != 0xaaaa )) { - fprintf(stderr, "%d: ERROR: DTO len %d or cookie " PRIx64 "\n", + fprintf(stderr, "%d: ERROR: DTO len "F64x" or cookie "F64x"\n", getpid(), event.event_data.dto_completion_event_data.transfered_length, event.event_data.dto_completion_event_data.user_cookie.as_64 ); @@ -599,7 +596,6 @@ DAT_RETURN connect_ep( char *hostname, int conn_id ) { DAT_SOCK_ADDR remote_addr; - DAT_EP_ATTR ep_attr; DAT_RETURN ret; DAT_REGION_DESCRIPTION region; DAT_EVENT event; @@ -611,7 +607,7 @@ connect_ep( char *hostname, int conn_id ) /* Register send message buffer */ 
LOGPRINTF("%d Registering send Message Buffer %p, len %d\n", - getpid(), &rmr_send_msg, sizeof(DAT_RMR_TRIPLET) ); + getpid(), &rmr_send_msg, (int)sizeof(DAT_RMR_TRIPLET)); region.for_va = &rmr_send_msg; ret = dat_lmr_create( h_ia, DAT_MEM_TYPE_VIRTUAL, @@ -800,7 +796,8 @@ connect_ep( char *hostname, int conn_id ) rmr_send_msg.target_address = (DAT_VADDR)(unsigned long)rbuf; rmr_send_msg.segment_length = RDMA_BUFFER_SIZE; - printf("%d Send RMR to remote: snd_msg: r_key_ctx=%x,pad=%x,va=%llx,len=0x%x\n", + printf("%d Send RMR to remote: snd_msg: r_key_ctx=%x,pad=%x, " + "va="F64x",len="F64x"\n", getpid(), rmr_send_msg.rmr_context, rmr_send_msg.pad, rmr_send_msg.target_address, rmr_send_msg.segment_length ); @@ -862,16 +859,17 @@ connect_ep( char *hostname, int conn_id ) sizeof( DAT_RMR_TRIPLET )) || (event.event_data.dto_completion_event_data.user_cookie.as_64 != recv_msg_index) ) { - fprintf(stderr,"ERR recv event: len=%d cookie=" PRIx64 " expected %d/%d\n", + fprintf(stderr,"ERR recv event: len=%d cookie="F64x" expected %d/%d\n", (int)event.event_data.dto_completion_event_data.transfered_length, - (int)event.event_data.dto_completion_event_data.user_cookie.as_64, - sizeof(DAT_RMR_TRIPLET), recv_msg_index ); + event.event_data.dto_completion_event_data.user_cookie.as_64, + (int)sizeof(DAT_RMR_TRIPLET), recv_msg_index ); return( DAT_ABORT ); } r_iov = rmr_recv_msg[ recv_msg_index ]; - printf("%d Received RMR from remote: r_iov: r_key_ctx=%x,pad=%x,va=%llx,len=0x%x\n", + printf("%d Received RMR from remote: r_iov: r_key_ctx=%x,pad=%x " + ",va="F64x",len="F64x"\n", getpid(), r_iov.rmr_context, r_iov.pad, r_iov.target_address, r_iov.segment_length ); @@ -887,7 +885,6 @@ disconnect_ep() DAT_RETURN ret; DAT_EVENT event; DAT_COUNT nmore; - int i,flush_cnt; if (connected) { @@ -962,13 +959,11 @@ disconnect_ep() DAT_RETURN do_rdma_write_with_msg( ) { - DAT_REGION_DESCRIPTION region; DAT_EVENT event; DAT_COUNT nmore; DAT_LMR_TRIPLET l_iov[MSG_IOV_COUNT]; DAT_RMR_TRIPLET 
r_iov; DAT_DTO_COOKIE cookie; - DAT_RMR_CONTEXT their_context; DAT_RETURN ret; int i; @@ -994,7 +989,7 @@ do_rdma_write_with_msg( ) l_iov[i].virtual_address = (DAT_VADDR)(unsigned long) (&sbuf[l_iov[i].segment_length*i]); - LOGPRINTF("%d rdma_write iov[%d] buf=%p,len=%d\n", + LOGPRINTF("%d rdma_write iov[%d] buf=%p,len="F64x"\n", getpid(), i, &sbuf[l_iov[i].segment_length*i], l_iov[i].segment_length); } @@ -1081,17 +1076,17 @@ do_rdma_write_with_msg( ) if ( (event.event_data.dto_completion_event_data.transfered_length != sizeof( DAT_RMR_TRIPLET )) || (event.event_data.dto_completion_event_data.user_cookie.as_64 != recv_msg_index) ) { + - fprintf(stderr,"unexpected event data for receive: len=%d cookie=" PRIx64 " exp %d/%d\n", + fprintf(stderr,"unexpected event data for receive: len=%d cookie="F64x" exp %d/%d\n", (int)event.event_data.dto_completion_event_data.transfered_length, - (int)event.event_data.dto_completion_event_data.user_cookie.as_64, - sizeof(DAT_RMR_TRIPLET), recv_msg_index ); + event.event_data.dto_completion_event_data.user_cookie.as_64, + (int)sizeof(DAT_RMR_TRIPLET), recv_msg_index ); return( DAT_ABORT ); } r_iov = rmr_recv_msg[ recv_msg_index ]; - printf("%d Received RMR from remote: r_iov: ctx=%x,pad=%x,va=%p,len=0x%x\n", + printf("%d Received RMR from remote: r_iov: ctx=%x,pad=%x,va=%p,len="F64x"\n", getpid(), r_iov.rmr_context, r_iov.pad, (void*)(unsigned long)r_iov.target_address, @@ -1112,13 +1107,11 @@ do_rdma_write_with_msg( ) DAT_RETURN do_rdma_read_with_msg( ) { - DAT_REGION_DESCRIPTION region; DAT_EVENT event; DAT_COUNT nmore; DAT_LMR_TRIPLET l_iov; DAT_RMR_TRIPLET r_iov; DAT_DTO_COOKIE cookie; - DAT_RMR_CONTEXT their_context; DAT_RETURN ret; int i; @@ -1191,9 +1184,9 @@ do_rdma_read_with_msg( ) } if ((event.event_data.dto_completion_event_data.transfered_length != buf_len ) || (event.event_data.dto_completion_event_data.user_cookie.as_64 != 0x9999 )) { - fprintf(stderr, "%d: ERROR: DTO len %d or cookie " PRIx64 "\n", + fprintf(stderr, 
"%d: ERROR: DTO len %d or cookie "F64x"\n", getpid(), - event.event_data.dto_completion_event_data.transfered_length, + (int)event.event_data.dto_completion_event_data.transfered_length, event.event_data.dto_completion_event_data.user_cookie.as_64 ); return( DAT_ABORT ); } @@ -1273,17 +1266,17 @@ do_rdma_read_with_msg( ) if ( (event.event_data.dto_completion_event_data.transfered_length != sizeof( DAT_RMR_TRIPLET )) || (event.event_data.dto_completion_event_data.user_cookie.as_64 != recv_msg_index) ) { - fprintf(stderr,"unexpected event data for receive: len=%d cookie=" PRIx64 " exp %d/%d\n", + fprintf(stderr,"unexpected event data for receive: len=%d cookie="F64x" exp %d/%d\n", (int)event.event_data.dto_completion_event_data.transfered_length, - (int)event.event_data.dto_completion_event_data.user_cookie.as_64, - sizeof(DAT_RMR_TRIPLET), recv_msg_index ); + event.event_data.dto_completion_event_data.user_cookie.as_64, + (int)sizeof(DAT_RMR_TRIPLET), recv_msg_index ); return( DAT_ABORT ); } r_iov = rmr_recv_msg[ recv_msg_index ]; - printf("%d Received RMR from remote: r_iov: ctx=%x,pad=%x,va=%p,len=0x%x\n", + printf("%d Received RMR from remote: r_iov: ctx=%x,pad=%x,va=%p,len="F64x"\n", getpid(), r_iov.rmr_context, r_iov.pad, (void*)(unsigned long)r_iov.target_address, r_iov.segment_length ); @@ -1425,9 +1418,9 @@ do_ping_pong_msg( ) != buf_len) || (event.event_data.dto_completion_event_data.user_cookie.as_64 != burst_msg_index) ) { - fprintf(stderr,"ERR: recv event: len=%d cookie=" PRIx64 " exp %d/%d\n", + fprintf(stderr,"ERR: recv event: len=%d cookie="F64x" exp %d/%d\n", (int)event.event_data.dto_completion_event_data.transfered_length, - (int)event.event_data.dto_completion_event_data.user_cookie.as_64, + event.event_data.dto_completion_event_data.user_cookie.as_64, buf_len, burst_msg_index ); return( DAT_ABORT ); @@ -1760,7 +1753,6 @@ const char * DT_RetToString (DAT_RETURN ret_value) { const char *major_msg, *minor_msg; - int sz; dat_strerror (ret_value, 
&major_msg, &minor_msg); -- 1.5.2.5 From arlin.r.davis at intel.com Mon Sep 1 19:26:00 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 1 Sep 2008 19:26:00 -0700 Subject: [ofa-general] [PATCH 3/5][v1.2] dat: fix compiler warnings in dat common code Message-ID: <000201c90ca3$3da95580$5464fe0a@amr.corp.intel.com> Cleanup uDAT common code Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dat/common/dat_dr.c | 26 +++++++++++--------------- dat/common/dat_sr.c | 19 ++++++++++++------- 2 files changed, 23 insertions(+), 22 deletions(-) diff --git a/dat/common/dat_dr.c b/dat/common/dat_dr.c index 89fc861..f40a94c 100644 --- a/dat/common/dat_dr.c +++ b/dat/common/dat_dr.c @@ -174,16 +174,16 @@ extern DAT_RETURN dat_dr_remove ( IN const DAT_PROVIDER_INFO *info ) { - DAT_DR_ENTRY *data; DAT_DICTIONARY_ENTRY dict_entry; DAT_RETURN status; + DAT_DICTIONARY_DATA data; dict_entry = NULL; dat_os_lock (&g_dr_lock); status = dat_dictionary_search ( g_dr_dictionary, info, - (DAT_DICTIONARY_DATA *) &data); + &data); if ( DAT_SUCCESS != status ) { @@ -191,7 +191,7 @@ dat_dr_remove ( goto bail; } - if ( 0 != data->ref_count ) + if ( 0 != ((DAT_DR_ENTRY*)data)->ref_count ) { status = DAT_ERROR (DAT_PROVIDER_IN_USE, 0); goto bail; @@ -200,7 +200,7 @@ dat_dr_remove ( status = dat_dictionary_remove ( g_dr_dictionary, &dict_entry, info, - (DAT_DICTIONARY_DATA *) &data); + &data); if ( DAT_SUCCESS != status ) { /* return status from dat_dictionary_remove() */ @@ -231,20 +231,18 @@ dat_dr_provider_open ( OUT DAT_IA_OPEN_FUNC *p_ia_open_func ) { DAT_RETURN status; - DAT_DR_ENTRY *data; + DAT_DICTIONARY_DATA data; dat_os_lock (&g_dr_lock); - status = dat_dictionary_search ( g_dr_dictionary, info, - (DAT_DICTIONARY_DATA *) &data); - + &data); dat_os_unlock (&g_dr_lock); if ( DAT_SUCCESS == status ) { - data->ref_count++; - *p_ia_open_func = data->ia_open_func; + ((DAT_DR_ENTRY*)data)->ref_count++; + *p_ia_open_func = ((DAT_DR_ENTRY*)data)->ia_open_func; } return status; @@ 
-260,19 +258,17 @@ dat_dr_provider_close ( IN const DAT_PROVIDER_INFO *info ) { DAT_RETURN status; - DAT_DR_ENTRY *data; + DAT_DICTIONARY_DATA data; dat_os_lock (&g_dr_lock); - status = dat_dictionary_search ( g_dr_dictionary, info, - (DAT_DICTIONARY_DATA *) &data); - + &data); dat_os_unlock (&g_dr_lock); if ( DAT_SUCCESS == status ) { - data->ref_count--; + ((DAT_DR_ENTRY*)data)->ref_count--; } return status; diff --git a/dat/common/dat_sr.c b/dat/common/dat_sr.c index d5d8666..e3b2a54 100644 --- a/dat/common/dat_sr.c +++ b/dat/common/dat_sr.c @@ -129,12 +129,13 @@ dat_sr_insert ( IN DAT_SR_ENTRY *entry ) { DAT_RETURN status; - DAT_SR_ENTRY *data, *prev_data; + DAT_SR_ENTRY *data; DAT_OS_SIZE lib_path_size; DAT_OS_SIZE lib_path_len; DAT_OS_SIZE ia_params_size; DAT_OS_SIZE ia_params_len; DAT_DICTIONARY_ENTRY dict_entry; + DAT_DICTIONARY_DATA prev_data; if ( NULL == (data = dat_os_alloc (sizeof (DAT_SR_ENTRY))) ) { @@ -184,7 +185,7 @@ dat_sr_insert ( status = dat_dictionary_search (g_sr_dictionary, info, - (DAT_DICTIONARY_DATA *) &prev_data); + &prev_data); if ( DAT_SUCCESS == status ) { /* We already have a dictionary entry, so we don't need a new one. 
@@ -196,12 +197,12 @@ dat_sr_insert ( dict_entry = NULL; /* Find the next available slot in this chain */ - while (NULL != prev_data->next) + while (NULL != ((DAT_SR_ENTRY*)prev_data)->next) { - prev_data = prev_data->next; + prev_data = ((DAT_SR_ENTRY*)prev_data)->next; } dat_os_assert (NULL != prev_data); - prev_data->next = data; + ((DAT_SR_ENTRY*)prev_data)->next = data; } else { @@ -350,15 +351,17 @@ dat_sr_provider_open ( { DAT_RETURN status; DAT_SR_ENTRY *data; + DAT_DICTIONARY_DATA dict_data; dat_os_lock (&g_sr_lock); status = dat_dictionary_search (g_sr_dictionary, info, - (DAT_DICTIONARY_DATA *) &data); + &dict_data); if ( DAT_SUCCESS == status ) { + data = (DAT_SR_ENTRY*)dict_data; while (data != NULL) { if ( 0 == data->ref_count ) @@ -428,15 +431,17 @@ dat_sr_provider_close ( { DAT_RETURN status; DAT_SR_ENTRY *data; + DAT_DICTIONARY_DATA dict_data; dat_os_lock (&g_sr_lock); status = dat_dictionary_search (g_sr_dictionary, info, - (DAT_DICTIONARY_DATA *) &data); + &dict_data); if ( DAT_SUCCESS == status ) { + data = (DAT_SR_ENTRY*)dict_data; while (data != NULL) { if ( 1 == data->ref_count ) -- 1.5.2.5 From arlin.r.davis at intel.com Mon Sep 1 19:26:04 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 1 Sep 2008 19:26:04 -0700 Subject: [ofa-general] [PATCH 5/5][v1.2] dapl providers: fix compiler warnings in cma and scm providers Message-ID: <000401c90ca3$40612280$5464fe0a@amr.corp.intel.com> dapl providers: fix compiler warnings in cma and scm providers Include provider definitions after some key definitions required by providers. Check results of writes and reads, print appropriate error message. 
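A minimal sketch of the write/read checking pattern this patch applies to the wake-up pipes, for readers who want it in isolation. The helper names wakeup_write and wakeup_drain are hypothetical and do not exist in dapl; fprintf(stderr, ...) stands in for dapl_log, which needs the dapl headers. Like the patch, the sketch only treats a return value of -1 as an error and does not handle short writes.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not in dapl): wake a thread polling on a pipe.
 * Returns 0 on success, -1 if write() failed. */
static int wakeup_write(int fd)
{
    /* sizeof "w" is 2: the character plus the terminating NUL. */
    if (write(fd, "w", sizeof "w") == -1) {
        fprintf(stderr, " thread wakeup error = %s\n", strerror(errno));
        return -1;
    }
    return 0;
}

/* Hypothetical helper (not in dapl): consume one wake-up token.
 * Returns the number of bytes read, or -1 if read() failed. */
static int wakeup_drain(int fd)
{
    char rbuf[2];
    ssize_t n = read(fd, rbuf, sizeof rbuf);

    if (n == -1) {
        fprintf(stderr, " pipe read error = %s\n", strerror(errno));
        return -1;
    }
    /* read() never returns more than the requested size. */
    assert((size_t)n <= sizeof rbuf);
    return (int)n;
}
```

A caller would create the pipe once with pipe(), call wakeup_write() on the write end from any thread that wants the worker to rescan its list, and call wakeup_drain() on the read end after poll() reports POLLIN.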
Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dapl/include/dapl.h | 38 +++++++++++++++++++------------------- dapl/openib_cma/dapl_ib_cq.c | 5 ++++- dapl/openib_cma/dapl_ib_util.c | 32 +++++++++++++++++++++++++------- dapl/openib_cma/dapl_ib_util.h | 14 +------------- dapl/openib_scm/dapl_ib_cm.c | 36 +++++++++++++++++++++++++++++------- dapl/openib_scm/dapl_ib_util.c | 9 ++++++++- dapl/openib_scm/dapl_ib_util.h | 14 +------------- 7 files changed, 87 insertions(+), 61 deletions(-) diff --git a/dapl/include/dapl.h b/dapl/include/dapl.h index 80c9ff3..9d3f546 100644 --- a/dapl/include/dapl.h +++ b/dapl/include/dapl.h @@ -50,19 +50,6 @@ #include "dapl_osd.h" #include "dapl_debug.h" -#ifdef IBAPI -#include "dapl_ibapi_util.h" -#elif VAPI -#include "dapl_vapi_util.h" -#elif __OPENIB__ -#include "dapl_openib_util.h" -#include "dapl_openib_cm.h" -#elif DUMMY -#include "dapl_dummy_util.h" -#elif OPENIB -#include "dapl_ib_util.h" -#endif - /********************************************************************* * * * Enumerations * @@ -215,12 +202,6 @@ typedef struct dapl_rmr_cookie DAPL_RMR_COOKIE; typedef struct dapl_private DAPL_PRIVATE; -typedef void (*DAPL_CONNECTION_STATE_HANDLER) ( - IN DAPL_EP *, - IN ib_cm_events_t, - IN const void *, - OUT DAT_EVENT *); - /********************************************************************* * * @@ -252,6 +233,19 @@ struct dapl_cookie_buffer DAPL_ATOMIC tail; }; +#ifdef IBAPI +#include "dapl_ibapi_util.h" +#elif VAPI +#include "dapl_vapi_util.h" +#elif __OPENIB__ +#include "dapl_openib_util.h" +#include "dapl_openib_cm.h" +#elif DUMMY +#include "dapl_dummy_util.h" +#elif OPENIB +#include "dapl_ib_util.h" +#endif + struct dapl_hca { DAPL_OS_LOCK lock; @@ -673,6 +667,12 @@ void dapls_io_trc_dump ( * * *********************************************************************/ +typedef void (*DAPL_CONNECTION_STATE_HANDLER) ( + IN DAPL_EP *, + IN ib_cm_events_t, + IN const void *, + OUT DAT_EVENT *); + /* * DAT Mandated 
functions */ diff --git a/dapl/openib_cma/dapl_ib_cq.c b/dapl/openib_cma/dapl_ib_cq.c index 25b4551..cf19f38 100644 --- a/dapl/openib_cma/dapl_ib_cq.c +++ b/dapl/openib_cma/dapl_ib_cq.c @@ -497,7 +497,10 @@ dapls_ib_wait_object_wakeup (IN ib_wait_obj_handle_t p_cq_wait_obj_handle) p_cq_wait_obj_handle ); /* write to pipe for wake up */ - write(p_cq_wait_obj_handle->pipe[1], "w", sizeof "w"); + if (write(p_cq_wait_obj_handle->pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " wait object wakeup write error = %s\n", + strerror(errno)); return DAT_SUCCESS; } diff --git a/dapl/openib_cma/dapl_ib_util.c b/dapl/openib_cma/dapl_ib_util.c index e76e319..afb7463 100755 --- a/dapl/openib_cma/dapl_ib_util.c +++ b/dapl/openib_cma/dapl_ib_util.c @@ -321,7 +321,10 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA *hca_ptr) dapl_llist_add_tail(&g_hca_list, (DAPL_LLIST_ENTRY*)&hca_ptr->ib_trans.entry, &hca_ptr->ib_trans.entry); - write(g_ib_pipe[1], "w", sizeof "w"); + if (write(g_ib_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " open_hca: thread wakeup error = %s\n", + strerror(errno)); dapl_os_unlock(&g_hca_lock); dapl_dbg_log( @@ -388,14 +391,20 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HCA *hca_ptr) * Wakeup work thread to remove from polling list */ hca_ptr->ib_trans.destroy = 1; - write(g_ib_pipe[1], "w", sizeof "w"); + if (write(g_ib_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " close_hca: thread wakeup error = %s\n", + strerror(errno)); /* wait for thread to remove HCA references */ while (hca_ptr->ib_trans.destroy != 2) { struct timespec sleep, remain; sleep.tv_sec = 0; sleep.tv_nsec = 10000000; /* 10 ms */ - write(g_ib_pipe[1], "w", sizeof "w"); + if (write(g_ib_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " close_hca: thread wakeup error = %s\n", + strerror(errno)); dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_destroy: wait on hca %p destroy\n"); nanosleep (&sleep, 
&remain); @@ -671,14 +680,20 @@ void dapli_ib_thread_destroy(void) goto bail; g_ib_thread_state = IB_THREAD_CANCEL; - write(g_ib_pipe[1], "w", sizeof "w"); + if (write(g_ib_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " destroy: thread wakeup error = %s\n", + strerror(errno)); while ((g_ib_thread_state != IB_THREAD_EXIT) && (retries--)) { struct timespec sleep, remain; sleep.tv_sec = 0; sleep.tv_nsec = 2000000; /* 2 ms */ dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_destroy: waiting for ib_thread\n"); - write(g_ib_pipe[1], "w", sizeof "w"); + if (write(g_ib_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " destroy: thread wakeup error = %s\n", + strerror(errno)); dapl_os_unlock( &g_hca_lock ); nanosleep(&sleep, &remain); dapl_os_lock( &g_hca_lock ); @@ -894,8 +909,11 @@ void dapli_thread(void *arg) /* check and process user events, PIPE */ if (ufds[0].revents == POLLIN) { - read(g_ib_pipe[0], rbuf, 2); - + if (read(g_ib_pipe[0], rbuf, 2) == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " ib_thread: pipe rd err= %s\n", + strerror(errno)); + /* cleanup any device on list marked for destroy */ for(idx=3;idxdestroy == 1) { diff --git a/dapl/openib_cma/dapl_ib_util.h b/dapl/openib_cma/dapl_ib_util.h index 1e464b2..1d919d9 100755 --- a/dapl/openib_cma/dapl_ib_util.h +++ b/dapl/openib_cma/dapl_ib_util.h @@ -164,18 +164,6 @@ typedef enum } ib_thread_state_t; -/* - * dapl_llist_entry in dapl.h but dapl.h depends on provider - * typedef's in this file first. 
move dapl_llist_entry out of dapl.h - */ -struct ib_llist_entry -{ - struct dapl_llist_entry *flink; - struct dapl_llist_entry *blink; - void *data; - struct dapl_llist_entry *list_head; -}; - struct dapl_cm_id { DAPL_OS_LOCK lock; int destroy; @@ -256,7 +244,7 @@ typedef void (*ib_async_handler_t)( /* ib_hca_transport_t, specific to this implementation */ typedef struct _ib_hca_transport { - struct ib_llist_entry entry; + struct dapl_llist_entry entry; int destroy; struct dapl_hca *d_hca; struct rdma_cm_id *cm_id; diff --git a/dapl/openib_scm/dapl_ib_cm.c b/dapl/openib_scm/dapl_ib_cm.c index 9f845b6..48aa82d 100644 --- a/dapl/openib_scm/dapl_ib_cm.c +++ b/dapl/openib_scm/dapl_ib_cm.c @@ -119,7 +119,10 @@ static void dapli_cm_destroy(struct ib_cm_handle *cm_ptr) dapl_os_unlock(&cm_ptr->lock); /* wakeup work thread */ - write(g_scm_pipe[1], "w", sizeof "w"); + if (write(g_scm_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " cm_destroy: thread wakeup error = %s\n", + strerror(errno)); } /* queue socket for processing CM work */ @@ -133,7 +136,10 @@ static void dapli_cm_queue(struct ib_cm_handle *cm_ptr) dapl_os_unlock(&cm_ptr->hca->ib_trans.lock); /* wakeup CM work thread */ - write(g_scm_pipe[1], "w", sizeof "w"); + if (write(g_scm_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " cm_queue: thread wakeup error = %s\n", + strerror(errno)); } static uint16_t dapli_get_lid(IN struct ibv_context *ctx, IN uint8_t port) @@ -167,7 +173,11 @@ dapli_socket_disconnect(ib_cm_handle_t cm_ptr) } else { /* send disc date, close socket, schedule destroy */ if (cm_ptr->socket >= 0) { - write(cm_ptr->socket, &disc_data, sizeof(disc_data)); + if (write(cm_ptr->socket, + &disc_data, sizeof(disc_data)) == -1) + dapl_log(DAPL_DBG_TYPE_WARN, + " cm_disc: write error = %s\n", + strerror(errno)); close(cm_ptr->socket); cm_ptr->socket = -1; } @@ -473,7 +483,10 @@ dapli_socket_connect_rtu(ib_cm_handle_t cm_ptr) dapl_dbg_log(DAPL_DBG_TYPE_EP," connect_rtu: 
send RTU\n"); /* complete handshake after final QP state change */ - write(cm_ptr->socket, &rtu_data, sizeof(rtu_data)); + if (write(cm_ptr->socket, &rtu_data, sizeof(rtu_data)) == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " CONN_RTU: write error = %s\n", + strerror(errno)); /* init cm_handle and post the event with private data */ ep_ptr->cm_handle = cm_ptr; @@ -1011,7 +1024,10 @@ dapls_ib_remove_conn_listener ( /* cr_thread will free */ cm_ptr->state = SCM_DESTROY; sp_ptr->cm_srvc_handle = NULL; - write(g_scm_pipe[1], "w", sizeof "w"); + if (write(g_scm_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " remove_listen: thread wakeup error = %s\n", + strerror(errno)); } return DAT_SUCCESS; } @@ -1106,7 +1122,10 @@ dapls_ib_reject_connection ( /* cr_thread will destroy CR */ cm_ptr->state = SCM_REJECTED; - write(g_scm_pipe[1], "w", sizeof "w"); + if (write(g_scm_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " reject_connection: thread wakeup error = %s\n", + strerror(errno)); return DAT_SUCCESS; } @@ -1442,7 +1461,10 @@ void cr_thread(void *arg) poll(ufds,idx+1,-1); /* infinite, all sockets and pipe */ /* if pipe used to wakeup, consume */ if (ufds[0].revents == POLLIN) - read(g_scm_pipe[0], rbuf, 2); + if (read(g_scm_pipe[0], rbuf, 2) == -1) + dapl_log(DAPL_DBG_TYPE_CM, + " cr_thread: read pipe error = %s\n", + strerror(errno)); dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: wakeup\n"); dapl_os_lock(&hca_ptr->ib_trans.lock); } diff --git a/dapl/openib_scm/dapl_ib_util.c b/dapl/openib_scm/dapl_ib_util.c index 76bde89..f1f6103 100644 --- a/dapl/openib_scm/dapl_ib_util.c +++ b/dapl/openib_scm/dapl_ib_util.c @@ -359,11 +359,18 @@ DAT_RETURN dapls_ib_close_hca ( IN DAPL_HCA *hca_ptr ) /* destroy cr_thread and lock */ hca_ptr->ib_trans.cr_state = IB_THREAD_CANCEL; - write(g_scm_pipe[1], "w", sizeof "w"); + if (write(g_scm_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " close_hca: thread wakeup error = %s\n", + 
strerror(errno)); while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) { struct timespec sleep, remain; sleep.tv_sec = 0; sleep.tv_nsec = 2000000; /* 2 ms */ + if (write(g_scm_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " close_hca: thread wakeup error = %s\n", + strerror(errno)); dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " close_hca: waiting for cr_thread\n"); nanosleep (&sleep, &remain); diff --git a/dapl/openib_scm/dapl_ib_util.h b/dapl/openib_scm/dapl_ib_util.h index 8ed4fac..91a4e67 100644 --- a/dapl/openib_scm/dapl_ib_util.h +++ b/dapl/openib_scm/dapl_ib_util.h @@ -85,18 +85,6 @@ typedef struct _ib_qp_cm union ibv_gid gid; } ib_qp_cm_t; -/* - * dapl_llist_entry in dapl.h but dapl.h depends on provider - * typedef's in this file first. move dapl_llist_entry out of dapl.h - */ -struct ib_llist_entry -{ - struct dapl_llist_entry *flink; - struct dapl_llist_entry *blink; - void *data; - struct dapl_llist_entry *list_head; -}; - typedef enum scm_state { SCM_INIT, @@ -114,7 +102,7 @@ typedef enum scm_state struct ib_cm_handle { - struct ib_llist_entry entry; + struct dapl_llist_entry entry; DAPL_OS_LOCK lock; SCM_STATE state; int socket; -- 1.5.2.5 From arlin.r.davis at intel.com Mon Sep 1 19:26:02 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 1 Sep 2008 19:26:02 -0700 Subject: [ofa-general] [PATCH 4/5][v1.2] dapl build: add correct CFLAGS for GNU Message-ID: <000301c90ca3$3ed0e590$5464fe0a@amr.corp.intel.com> Signed-off by: Arlin Davis ardavis at ichips.intel.com --- Makefile.am | 10 +++++----- 1 files changed, 5 insertions(+), 5 deletions(-) diff --git a/Makefile.am b/Makefile.am index bccc6ff..29e6b3b 100644 --- a/Makefile.am +++ b/Makefile.am @@ -12,9 +12,9 @@ OSFLAGS += -DREDHAT_EL5 endif if DEBUG -DBGFLAGS = -ggdb -DDAPL_DBG +AM_CFLAGS = -g -Wall -D_GNU_SOURCE -DDAPL_DBG else -DBGFLAGS = -g +AM_CFLAGS = -g -Wall -D_GNU_SOURCE endif datlibdir = $(libdir) @@ -25,17 +25,17 @@ datlib_LTLIBRARIES = dat/udat/libdat.la 
dapllibcma_LTLIBRARIES = dapl/udapl/libdaplcma.la dapllibscm_LTLIBRARIES = dapl/udapl/libdaplscm.la -dat_udat_libdat_la_CFLAGS = -Wall $(DBGFLAGS) -D_GNU_SOURCE $(OSFLAGS) \ +dat_udat_libdat_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) \ -I$(srcdir)/dat/include/ -I$(srcdir)/dat/udat/ \ -I$(srcdir)/dat/udat/linux -I$(srcdir)/dat/common/ -dapl_udapl_libdaplcma_la_CFLAGS = -Wall $(DBGFLAGS) -D_GNU_SOURCE $(OSFLAGS) \ +dapl_udapl_libdaplcma_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) \ -DOPENIB -DCQ_WAIT_OBJECT \ -I$(srcdir)/dat/include/ -I$(srcdir)/dapl/include/ \ -I$(srcdir)/dapl/common -I$(srcdir)/dapl/udapl/linux \ -I$(srcdir)/dapl/openib_cma -dapl_udapl_libdaplscm_la_CFLAGS = -Wall $(DBGFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAGS) \ +dapl_udapl_libdaplscm_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAGS) \ -DOPENIB -DCQ_WAIT_OBJECT \ -I$(srcdir)/dat/include/ -I$(srcdir)/dapl/include/ \ -I$(srcdir)/dapl/common -I$(srcdir)/dapl/udapl/linux \ -- 1.5.2.5 From arlin.r.davis at intel.com Mon Sep 1 19:33:25 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 1 Sep 2008 19:33:25 -0700 Subject: [ofa-general] [PATCH 1/5] [v2.0] dtest/dapltest: fix compiler warnings Message-ID: <000501c90ca4$47df0da0$5464fe0a@amr.corp.intel.com> Patch set for DAPL v2.0 to cleanup compiler warnings and fix fedora build issues. 
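A minimal sketch of the pointer-handling pattern the dapl_limit.c part of this patch adopts: instead of casting the address of a typed pointer to another pointer-to-pointer type, the caller keeps a void * alongside the typed pointer, passes the address of the void * to the grow helper, and re-derives the typed pointer afterwards. The names grow_buffer and sum_with_growth are hypothetical, malloc stands in for DT_Mdep_Malloc, and the assumption that the original casts triggered type-punning (strict-aliasing) warnings is mine, not stated in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a grow helper: doubles the buffer at *ptrptr.
 * On success the old buffer is freed and *ptrptr and *count are updated. */
static bool grow_buffer(void **ptrptr, unsigned int *count, unsigned int size)
{
    unsigned int old;
    void *tmp;

    assert(ptrptr && count && *count > 0);
    old = *count;
    tmp = malloc((size_t)old * 2 * size);
    if (!tmp)
        return false;

    memcpy(tmp, *ptrptr, (size_t)old * size);
    free(*ptrptr);
    *ptrptr = tmp;
    *count = old * 2;
    return true;
}

/* Stores 0..n-1 in a growing int array and returns their sum, or -1 on
 * allocation failure. The typed pointer is refreshed from the void *
 * after every possible reallocation. */
static long sum_with_growth(unsigned int n)
{
    unsigned int count = 2;
    void *raw = malloc(count * sizeof(int));
    int *vals = (int *)raw;
    long sum = 0;
    unsigned int w;

    if (!vals)
        return -1;

    for (w = 0; w < n; w++) {
        if (w == count && !grow_buffer(&raw, &count, sizeof(*vals))) {
            free(raw);
            return -1;
        }
        vals = (int *)raw;      /* re-derive after a possible move */
        vals[w] = (int)w;
    }
    for (w = 0; w < n; w++)
        sum += vals[w];

    free(raw);
    return sum;
}
```

The extra void * variable costs one line per call site, but it keeps the helper's signature honest about what it modifies and avoids the cast at every caller.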
Signed-off by: Arlin Davis ardavis at ichips.intel.com --- test/dapltest/Makefile.am | 4 +- test/dapltest/cmd/dapl_netaddr.c | 2 +- test/dapltest/test/dapl_limit.c | 125 ++++++++++++++++++-------------------- test/dtest/Makefile.am | 4 +- test/dtest/dtest.c | 35 +++++------ test/dtest/dtestx.c | 20 +++--- 6 files changed, 90 insertions(+), 100 deletions(-) diff --git a/test/dapltest/Makefile.am b/test/dapltest/Makefile.am index 18660c8..fe69d71 100755 --- a/test/dapltest/Makefile.am +++ b/test/dapltest/Makefile.am @@ -4,7 +4,9 @@ else XFLAGS = endif -dapltest_CFLAGS = $(XFLAGS) +AM_CFLAGS = -g -Wall -D_GNU_SOURCE + +dapltest_CFLAGS = $(AM_FLAGS) $(XFLAGS) INCLUDES = -I include \ -I mdep/linux \ diff --git a/test/dapltest/cmd/dapl_netaddr.c b/test/dapltest/cmd/dapl_netaddr.c index a306335..e1600d5 100644 --- a/test/dapltest/cmd/dapl_netaddr.c +++ b/test/dapltest/cmd/dapl_netaddr.c @@ -90,7 +90,7 @@ DT_NetAddrLookupHostAddress (DAT_IA_ADDRESS_PTR to_netaddr, whatzit = "service unavailable"; break; } -#if !defined(WIN32) +#if !defined(WIN32) && defined(__USE_GNU) case EAI_ADDRFAMILY: { whatzit = "node has no address in this family"; diff --git a/test/dapltest/test/dapl_limit.c b/test/dapltest/test/dapl_limit.c index adf1139..133b3e0 100644 --- a/test/dapltest/test/dapl_limit.c +++ b/test/dapltest/test/dapl_limit.c @@ -36,13 +36,13 @@ static bool more_handles (DT_Tdep_Print_Head *phead, - DAT_HANDLE **old_ptrptr, /* pointer to current pointer */ + void **old_ptrptr, /* pointer to current pointer */ unsigned int *old_count, /* number pointed to */ unsigned int size) /* size of one datum */ { unsigned int count = *old_count; - DAT_HANDLE *old_handles = *old_ptrptr; - DAT_HANDLE *handle_tmp = DT_Mdep_Malloc (count * 2 * size); + void *old_handles = *old_ptrptr; + void *handle_tmp = DT_Mdep_Malloc (count * 2 * size); if (!handle_tmp) { @@ -171,9 +171,9 @@ limit_test ( DT_Tdep_Print_Head *phead, DAT_EVD_HANDLE ia_async_handle; } OneOpen; - unsigned int count = START_COUNT; 
- OneOpen *hdlptr = (OneOpen *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc (count * sizeof(OneOpen)); + OneOpen *hdlptr = (OneOpen *)hptr; /* IA Exhaustion test loop */ if (hdlptr) @@ -186,14 +186,13 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles (phead, &hptr, &count, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: IAs opened: %d\n", module, w); retval = true; break; } + hdlptr = (OneOpen *)hptr; /* Specify that we want to get back an async EVD. */ hdlptr[w].ia_async_handle = DAT_HANDLE_NULL; ret = dat_ia_open (cmd->device_name, @@ -270,9 +269,9 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many PZs we can create */ - unsigned int count = START_COUNT; - DAT_PZ_HANDLE *hdlptr = (DAT_PZ_HANDLE *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc (count * sizeof(DAT_PZ_HANDLE)); + DAT_PZ_HANDLE *hdlptr = (DAT_PZ_HANDLE *)hptr; /* PZ Exhaustion test loop */ if (hdlptr) @@ -287,14 +286,13 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles(phead, &hptr, &count, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: PZs created: %d\n", module, w); retval = true; break; } + hdlptr = (DAT_PZ_HANDLE *)hptr; ret = dat_pz_create (hdl_sets[w % cmd->width].ia_handle, &hdlptr[w]); if (ret != DAT_SUCCESS) @@ -368,10 +366,10 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many CNOs we can create */ - unsigned int count = START_COUNT; - DAT_CNO_HANDLE *hdlptr = (DAT_CNO_HANDLE *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); - + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc (count * sizeof(DAT_CNO_HANDLE)); + DAT_CNO_HANDLE *hdlptr = (DAT_CNO_HANDLE *)hptr; + 
/* CNO Exhaustion test loop */ if (hdlptr) { @@ -385,14 +383,13 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles(phead, &hptr, &count, sizeof (*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: CNOs created: %d\n", module, w); retval = true; break; } + hdlptr = (DAT_CNO_HANDLE *)hptr; ret = dat_cno_create (hdl_sets[w % cmd->width].ia_handle, DAT_OS_WAIT_PROXY_AGENT_NULL, &hdlptr[w]); @@ -489,9 +486,10 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many EVDs we can create */ - unsigned int count = START_COUNT; - DAT_EVD_HANDLE *hdlptr = (DAT_EVD_HANDLE *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc(count * sizeof(DAT_EVD_HANDLE)); + DAT_EVD_HANDLE *hdlptr = (DAT_EVD_HANDLE *)hptr; + DAT_EVD_FLAGS flags = ( DAT_EVD_DTO_FLAG | DAT_EVD_RMR_BIND_FLAG | DAT_EVD_CR_FLAG); @@ -524,14 +522,13 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles(phead, &hptr, &count, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: EVDs created: %d\n", module, w); retval = true; break; } + hdlptr = (DAT_EVD_HANDLE *)hptr; ret = DT_Tdep_evd_create (hdl_sets[w % cmd->width].ia_handle, DFLT_QLEN, hdl_sets[w % cmd->width].cno_handle, @@ -608,9 +605,9 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many EPs we can create */ - unsigned int count = START_COUNT; - DAT_EP_HANDLE *hdlptr = (DAT_EP_HANDLE *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc(count * sizeof(DAT_EP_HANDLE)); + DAT_EP_HANDLE *hdlptr = (DAT_EP_HANDLE *)hptr; /* EP Exhaustion test loop */ if (hdlptr) @@ -623,14 +620,13 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, 
(DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles(phead, &hptr, &count, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: EPs created: %d\n", module, w); retval = true; break; } + hdlptr = (DAT_EP_HANDLE *)hptr; ret = dat_ep_create (hdl_sets[w % cmd->width].ia_handle, hdl_sets[w % cmd->width].pz_handle, hdl_sets[w % cmd->width].evd_handle, @@ -679,11 +675,11 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many RSPs we can create */ - unsigned int count = START_COUNT; - DAT_RSP_HANDLE *hdlptr = (DAT_RSP_HANDLE *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); - DAT_EP_HANDLE *epptr = (DAT_EP_HANDLE *) - DT_Mdep_Malloc (count * sizeof (*epptr)); + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc(count * sizeof (DAT_RSP_HANDLE)); + DAT_RSP_HANDLE *hdlptr = (DAT_RSP_HANDLE *)hptr; + void *eptr = DT_Mdep_Malloc(count * sizeof (DAT_EP_HANDLE)); + DAT_EP_HANDLE *epptr = (DAT_EP_HANDLE *)eptr; /* RSP Exhaustion test loop */ if (hdlptr) @@ -700,23 +696,21 @@ limit_test ( DT_Tdep_Print_Head *phead, unsigned int count1 = count; unsigned int count2 = count; - if (!more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count1, - sizeof (*hdlptr))) + if (!more_handles(phead, &hptr, &count1, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: RSPs created: %d\n", module, w); retval = true; break; } - if (!more_handles (phead, (DAT_HANDLE **) &epptr, - &count2, - sizeof (*epptr))) + hdlptr = (DAT_RSP_HANDLE *)hptr; + + if (!more_handles (phead, &eptr, &count2, sizeof(*epptr))) { DT_Tdep_PT_Printf (phead, "%s: RSPs created: %d\n", module, w); retval = true; break; } - + epptr = (DAT_EP_HANDLE *)eptr; if (count1 != count2) { DT_Tdep_PT_Printf (phead, "%s: Mismatch in allocation of handle arrays at point %d\n", @@ -815,9 +809,9 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many PSPs we can create */ - unsigned int count = START_COUNT; - DAT_PSP_HANDLE *hdlptr = (DAT_PSP_HANDLE *) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); + unsigned 
int count = START_COUNT; + void *hptr = DT_Mdep_Malloc (count * sizeof (DAT_PSP_HANDLE)); + DAT_PSP_HANDLE *hdlptr = (DAT_PSP_HANDLE *)hptr; /* PSP Exhaustion test loop */ if (hdlptr) @@ -830,14 +824,13 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles (phead, &hptr, &count, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: PSPs created: %d\n", module, w); retval = true; break; } + hdlptr = (DAT_PSP_HANDLE *)hptr; ret = dat_psp_create (hdl_sets[w % cmd->width].ia_handle, CONN_QUAL0 + w, hdl_sets[w % cmd->width].evd_handle, @@ -941,10 +934,10 @@ limit_test ( DT_Tdep_Print_Head *phead, /* * See how many LMRs we can create */ - unsigned int count = START_COUNT; - Bpool **hdlptr = (Bpool **) - DT_Mdep_Malloc (count * sizeof (*hdlptr)); - + unsigned int count = START_COUNT; + void *hptr = DT_Mdep_Malloc (count * sizeof(Bpool*)); + Bpool **hdlptr = (Bpool **)hptr; + /* LMR Exhaustion test loop */ if (hdlptr) { @@ -956,9 +949,7 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - sizeof (*hdlptr))) + && !more_handles (phead, &hptr, &count, sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: no memory for LMR handles\n", module); @@ -966,6 +957,7 @@ limit_test ( DT_Tdep_Print_Head *phead, retval = true; break; } + hdlptr = (Bpool **)hptr; /* * Let BpoolAlloc do the hard work; this means that * we're testing unique memory registrations rather @@ -1017,14 +1009,15 @@ limit_test ( DT_Tdep_Print_Head *phead, * We are posting the same buffer 'cnt' times, deliberately, * but that should be OK. 
*/ - unsigned int count = START_COUNT; - DAT_LMR_TRIPLET *hdlptr = (DAT_LMR_TRIPLET *) - DT_Mdep_Malloc (count * cmd->width * sizeof (*hdlptr)); + unsigned int count = START_COUNT; + void *hptr = + DT_Mdep_Malloc(count * cmd->width * sizeof(DAT_LMR_TRIPLET)); + DAT_LMR_TRIPLET *hdlptr = (DAT_LMR_TRIPLET *)hptr; /* Recv-Post Exhaustion test loop */ if (hdlptr) { - unsigned int w = 0; + unsigned int w = 0; unsigned int i = 0; unsigned int done = 0; @@ -1033,9 +1026,8 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, (DAT_HANDLE **) &hdlptr, - &count, - cmd->width * sizeof (*hdlptr))) + && !more_handles (phead, &hptr, &count, + cmd->width * sizeof(*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: no memory for IOVs \n", module); @@ -1047,6 +1039,7 @@ limit_test ( DT_Tdep_Print_Head *phead, done = retval = true; break; } + hdlptr = (DAT_LMR_TRIPLET *)hptr; for (i = 0; i < cmd->width; i++) { DAT_LMR_TRIPLET *iovp = &hdlptr[w * cmd->width + i]; diff --git a/test/dtest/Makefile.am b/test/dtest/Makefile.am index aabd026..90c9d95 100755 --- a/test/dtest/Makefile.am +++ b/test/dtest/Makefile.am @@ -1,11 +1,11 @@ bin_PROGRAMS = dtest dtest_SOURCES = dtest.c +dtest_CFLAGS = -g -Wall -D_GNU_SOURCE if EXT_TYPE_IB bin_PROGRAMS += dtestx dtestx_SOURCES = dtestx.c -dtest_CFLAGS = -DDAT_EXTENSIONS -dtestx_CFLAGS = -DDAT_EXTENSIONS +dtestx_CFLAGS = -g -Wall -D_GNU_SOURCE -DDAT_EXTENSIONS dtestx_LDADD = $(srcdir)/../../dat/udat/libdat2.la endif diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c index 095ff40..00d14e3 100755 --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -207,7 +207,6 @@ struct dt_time { struct dt_time time; /* defaults */ -static int parent=1; static int failed=0; static int performance_times=0; static int connected=0; @@ -218,17 +217,13 @@ static int polling=0; static int poll_count=0; static int rdma_wr_poll_count=0; static int rdma_rd_poll_count[MAX_RDMA_RD]={0}; -static int pin_memory=0; static int 
delay=0; static int buf_len=RDMA_BUFFER_SIZE; static int use_cno=0; -static int post_recv_count=MSG_BUF_COUNT; static int recv_msg_index=0; static int burst_msg_posted=0; static int burst_msg_index=0; -static int child[MAX_PROCS+1]; - /* forward prototypes */ const char * DT_RetToString (DAT_RETURN ret_value); const char * DT_EventToSTr (DAT_EVENT_NUMBER event_code); @@ -254,6 +249,7 @@ DAT_RETURN do_ping_pong_msg( void ); #define LOGPRINTF if (verbose) printf +int main(int argc, char **argv) { int i,c; @@ -446,7 +442,7 @@ main(int argc, char **argv) inet_ntop(AF_INET, &((struct sockaddr_in *)ep_param.local_ia_address_ptr)->sin_addr, addr_str, sizeof(addr_str)); - printf("\n%d Query EP: LOCAL addr %s port %lld\n", getpid(), + printf("\n%d Query EP: LOCAL addr %s port "F64x"\n", getpid(), addr_str, (ep_param.local_port_qual)); #endif #if defined(_WIN32) @@ -458,7 +454,7 @@ main(int argc, char **argv) inet_ntop(AF_INET, &((struct sockaddr_in *)ep_param.remote_ia_address_ptr)->sin_addr, addr_str, sizeof(addr_str)); - printf("%d Query EP: REMOTE addr %s port %lld\n", getpid(), + printf("%d Query EP: REMOTE addr %s port "F64x"\n", getpid(), addr_str, (ep_param.remote_port_qual)); #endif fflush(stdout); @@ -615,6 +611,7 @@ complete: #if defined(_WIN32) || defined(_WIN64) WSACleanup(); #endif + return(0); } #if defined(_WIN32) || defined(_WIN64) @@ -750,7 +747,7 @@ connect_ep( char *hostname, DAT_CONN_QUAL conn_id ) /* Register send message buffer */ LOGPRINTF("%d Registering send Message Buffer %p, len %d\n", - getpid(), &rmr_send_msg, sizeof(DAT_RMR_TRIPLET) ); + getpid(), &rmr_send_msg, (int)sizeof(DAT_RMR_TRIPLET) ); region.for_va = &rmr_send_msg; ret = dat_lmr_create( h_ia, DAT_MEM_TYPE_VIRTUAL, @@ -848,8 +845,8 @@ connect_ep( char *hostname, DAT_CONN_QUAL conn_id ) else LOGPRINTF("%d dat_psp_created for server listen\n", getpid()); - printf("%d Server waiting for connect request on port %lld\n", - getpid(),conn_id); + printf("%d Server waiting for connect request on 
port "F64x"\n", + getpid(), conn_id); ret = dat_evd_wait( h_cr_evd, SERVER_TIMEOUT, 1, &event, &nmore ); if(ret != DAT_SUCCESS) { @@ -936,7 +933,7 @@ connect_ep( char *hostname, DAT_CONN_QUAL conn_id ) rval = ((struct sockaddr_in *)target->ai_addr)->sin_addr.s_addr; #endif printf ("%d Server Name: %s \n", getpid(), hostname); - printf ("%d Server Net Address: %d.%d.%d.%d port %lld\n", getpid(), + printf ("%d Server Net Address: %d.%d.%d.%d port "F64x"\n", getpid(), (rval >> 0) & 0xff, (rval >> 8) & 0xff, (rval >> 16) & 0xff, (rval >> 24) & 0xff, conn_id); @@ -1099,8 +1096,8 @@ connect_ep( char *hostname, DAT_CONN_QUAL conn_id ) recv_msg_index) ) { fprintf(stderr,"ERR recv event: len=%d cookie="F64x" expected %d/%d\n", (int)event.event_data.dto_completion_event_data.transfered_length, - (int)event.event_data.dto_completion_event_data.user_cookie.as_64, - sizeof(DAT_RMR_TRIPLET), recv_msg_index ); + event.event_data.dto_completion_event_data.user_cookie.as_64, + (int)sizeof(DAT_RMR_TRIPLET), recv_msg_index ); return( DAT_ABORT ); } @@ -1322,8 +1319,8 @@ do_rdma_write_with_msg( void ) (event.event_data.dto_completion_event_data.user_cookie.as_64 != recv_msg_index) ) { fprintf(stderr,"unexpected event data for receive: len=%d cookie="F64x" exp %d/%d\n", (int)event.event_data.dto_completion_event_data.transfered_length, - (int)event.event_data.dto_completion_event_data.user_cookie.as_64, - sizeof(DAT_RMR_TRIPLET), recv_msg_index ); + event.event_data.dto_completion_event_data.user_cookie.as_64, + (int)sizeof(DAT_RMR_TRIPLET), recv_msg_index ); return( DAT_ABORT ); } @@ -1515,8 +1512,8 @@ do_rdma_read_with_msg( void ) fprintf(stderr,"unexpected event data for receive: len=%d cookie="F64x" exp %d/%d\n", (int)event.event_data.dto_completion_event_data.transfered_length, - (int)event.event_data.dto_completion_event_data.user_cookie.as_64, - sizeof(DAT_RMR_TRIPLET), recv_msg_index ); + event.event_data.dto_completion_event_data.user_cookie.as_64, + 
(int)sizeof(DAT_RMR_TRIPLET), recv_msg_index ); return( DAT_ABORT ); } @@ -1678,8 +1675,8 @@ do_ping_pong_msg( ) != burst_msg_index) ) { fprintf(stderr,"ERR: recv event: len=%d cookie="F64x" exp %d/%d\n", (int)event.event_data.dto_completion_event_data.transfered_length, - (int)event.event_data.dto_completion_event_data.user_cookie.as_64, - buf_len, burst_msg_index ); + event.event_data.dto_completion_event_data.user_cookie.as_64, + (int)buf_len, (int)burst_msg_index ); return( DAT_ABORT ); } diff --git a/test/dtest/dtestx.c b/test/dtest/dtestx.c index e568aac..fb89364 100755 --- a/test/dtest/dtestx.c +++ b/test/dtest/dtestx.c @@ -439,7 +439,7 @@ connect_ep(char *hostname) r_iov->virtual_address = hton64((DAT_VADDR)buf[RCV_RDMA_BUF_INDEX]); r_iov->segment_length = hton32(buf_size); - printf("%d Send RMR msg to remote: r_key_ctx=0x%x,va=%p,len=0x%x\n", + printf("%d Send RMR msg to remote: r_key_ctx=0x%x,va="F64x",len=0x%x\n", getpid(), hton32(r_iov->rmr_context), hton64(r_iov->virtual_address), hton32(r_iov->segment_length)); @@ -545,16 +545,14 @@ disconnect_ep(void) int do_immediate() { - DAT_REGION_DESCRIPTION region; DAT_EVENT event; DAT_COUNT nmore; DAT_LMR_TRIPLET iov; DAT_RMR_TRIPLET r_iov; DAT_DTO_COOKIE cookie; - DAT_RMR_CONTEXT their_context; DAT_RETURN status; DAT_UINT32 immed_data; - DAT_UINT32 immed_data_recv; + DAT_UINT32 immed_data_recv = 0; DAT_DTO_COMPLETION_EVENT_DATA *dto_event = &event.event_data.dto_completion_event_data; DAT_IB_EXTENSION_EVENT_DATA *ext_event = @@ -620,10 +618,10 @@ do_immediate() (dto_event->user_cookie.as_64 != RECV_BUF_INDEX+1)) { printf("unexpected event data of immediate write: len=%d " - "cookie=%d expected %d/%d\n", + "cookie="F64x" expected %d/%d\n", (int)dto_event->transfered_length, - (int)dto_event->user_cookie.as_64, - sizeof(int), RECV_BUF_INDEX+1); + dto_event->user_cookie.as_64, + (int)sizeof(int), RECV_BUF_INDEX+1); exit(1); } @@ -669,10 +667,10 @@ do_immediate() (dto_event->user_cookie.as_64 != 
RECV_BUF_INDEX+1)) { printf("unexpected event data of immediate write: len=%d " - "cookie=%d expected %d/%d\n", + "cookie="F64x" expected %d/%d\n", (int)dto_event->transfered_length, - (int)dto_event->user_cookie.as_64, - sizeof(int), RECV_BUF_INDEX+1); + dto_event->user_cookie.as_64, + (int)sizeof(int), RECV_BUF_INDEX+1); exit(1); } @@ -705,7 +703,7 @@ do_immediate() printf("Client received immed_data=0x%x\n",immed_data_recv); printf("rdma buffer %p contains: %s\n", - buf[ RCV_RDMA_BUF_INDEX ], buf[ RCV_RDMA_BUF_INDEX ]); + buf[RCV_RDMA_BUF_INDEX], (char*)buf[RCV_RDMA_BUF_INDEX]); printf("\n RDMA_WRITE_WITH_IMMEDIATE_DATA test - PASSED\n"); return (0); -- 1.5.2.5 From arlin.r.davis at intel.com Mon Sep 1 19:33:31 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 1 Sep 2008 19:33:31 -0700 Subject: [ofa-general] [PATCH 3/5] [v2.0] dat: fix compiler warnings in dat common code Message-ID: <000601c90ca4$4aedad30$5464fe0a@amr.corp.intel.com> dat: fix compiler warnings in dat common code Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dat/common/dat_dr.c | 26 +++++++++++--------------- dat/common/dat_sr.c | 19 ++++++++++++------- 2 files changed, 23 insertions(+), 22 deletions(-) diff --git a/dat/common/dat_dr.c b/dat/common/dat_dr.c index bda3002..f7e9ffd 100644 --- a/dat/common/dat_dr.c +++ b/dat/common/dat_dr.c @@ -173,16 +173,16 @@ DAT_RETURN dat_dr_remove ( IN const DAT_PROVIDER_INFO *info ) { - DAT_DR_ENTRY *data; DAT_DICTIONARY_ENTRY dict_entry; DAT_RETURN status; + DAT_DICTIONARY_DATA data; dict_entry = NULL; dat_os_lock (&g_dr_lock); status = dat_dictionary_search ( g_dr_dictionary, info, - (DAT_DICTIONARY_DATA *) &data); + &data); if ( DAT_SUCCESS != status ) { @@ -190,7 +190,7 @@ dat_dr_remove ( goto bail; } - if ( 0 != data->ref_count ) + if ( 0 != ((DAT_DR_ENTRY*)data)->ref_count ) { status = DAT_ERROR (DAT_PROVIDER_IN_USE, 0); goto bail; @@ -199,7 +199,7 @@ dat_dr_remove ( status = dat_dictionary_remove ( g_dr_dictionary, 
&dict_entry, info, - (DAT_DICTIONARY_DATA *) &data); + &data); if ( DAT_SUCCESS != status ) { /* return status from dat_dictionary_remove() */ @@ -230,20 +230,18 @@ dat_dr_provider_open ( OUT DAT_IA_OPEN_FUNC *p_ia_open_func ) { DAT_RETURN status; - DAT_DR_ENTRY *data; + DAT_DICTIONARY_DATA data; dat_os_lock (&g_dr_lock); - status = dat_dictionary_search ( g_dr_dictionary, info, - (DAT_DICTIONARY_DATA *) &data); - + &data); dat_os_unlock (&g_dr_lock); if ( DAT_SUCCESS == status ) { - data->ref_count++; - *p_ia_open_func = data->ia_open_func; + ((DAT_DR_ENTRY*)data)->ref_count++; + *p_ia_open_func = ((DAT_DR_ENTRY*)data)->ia_open_func; } return status; @@ -259,19 +257,17 @@ dat_dr_provider_close ( IN const DAT_PROVIDER_INFO *info ) { DAT_RETURN status; - DAT_DR_ENTRY *data; + DAT_DICTIONARY_DATA data; dat_os_lock (&g_dr_lock); - status = dat_dictionary_search ( g_dr_dictionary, info, - (DAT_DICTIONARY_DATA *) &data); - + &data); dat_os_unlock (&g_dr_lock); if ( DAT_SUCCESS == status ) { - data->ref_count--; + ((DAT_DR_ENTRY*)data)->ref_count--; } return status; diff --git a/dat/common/dat_sr.c b/dat/common/dat_sr.c index 05be499..10319b9 100755 --- a/dat/common/dat_sr.c +++ b/dat/common/dat_sr.c @@ -129,12 +129,13 @@ dat_sr_insert ( IN DAT_SR_ENTRY *entry ) { DAT_RETURN status; - DAT_SR_ENTRY *data, *prev_data; + DAT_SR_ENTRY *data; DAT_OS_SIZE lib_path_size; DAT_OS_SIZE lib_path_len; DAT_OS_SIZE ia_params_size; DAT_OS_SIZE ia_params_len; DAT_DICTIONARY_ENTRY dict_entry; + DAT_DICTIONARY_DATA prev_data; if ( NULL == (data = dat_os_alloc (sizeof (DAT_SR_ENTRY))) ) { @@ -184,7 +185,7 @@ dat_sr_insert ( status = dat_dictionary_search (g_sr_dictionary, info, - (DAT_DICTIONARY_DATA *) &prev_data); + &prev_data); if ( DAT_SUCCESS == status ) { /* We already have a dictionary entry, so we don't need a new one. 
@@ -196,12 +197,12 @@ dat_sr_insert ( dict_entry = NULL; /* Find the next available slot in this chain */ - while (NULL != prev_data->next) + while (NULL != ((DAT_SR_ENTRY*)prev_data)->next) { - prev_data = prev_data->next; + prev_data = ((DAT_SR_ENTRY*)prev_data)->next; } dat_os_assert (NULL != prev_data); - prev_data->next = data; + ((DAT_SR_ENTRY*)prev_data)->next = data; } else { @@ -350,15 +351,17 @@ dat_sr_provider_open ( { DAT_RETURN status; DAT_SR_ENTRY *data; + DAT_DICTIONARY_DATA dict_data; dat_os_lock (&g_sr_lock); status = dat_dictionary_search (g_sr_dictionary, info, - (DAT_DICTIONARY_DATA *) &data); + &dict_data); if ( DAT_SUCCESS == status ) { + data = (DAT_SR_ENTRY*)dict_data; while (data != NULL) { if ( 0 == data->ref_count ) @@ -450,15 +453,17 @@ dat_sr_provider_close ( { DAT_RETURN status; DAT_SR_ENTRY *data; + DAT_DICTIONARY_DATA dict_data; dat_os_lock (&g_sr_lock); status = dat_dictionary_search (g_sr_dictionary, info, - (DAT_DICTIONARY_DATA *) &data); + &dict_data); if ( DAT_SUCCESS == status ) { + data = (DAT_SR_ENTRY*)dict_data; while (data != NULL) { if ( 1 == data->ref_count ) -- 1.5.2.5 From arlin.r.davis at intel.com Mon Sep 1 19:33:31 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 1 Sep 2008 19:33:31 -0700 Subject: [ofa-general] [PATCH 2/5] [v2.0] dapl: fix compiler warnings in common code Message-ID: <000701c90ca4$4bad2ca0$5464fe0a@amr.corp.intel.com> dapl: fix compiler warnings in common code Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dapl/common/dapl_ep_get_status.c | 1 + dapl/common/dapl_ep_modify.c | 2 +- dapl/common/dapl_rmr_bind.c | 4 +++- dapl/udapl/dapl_evd_wait.c | 17 ++++++++++------- dapl/udapl/dapl_lmr_create.c | 5 +++-- 5 files changed, 18 insertions(+), 11 deletions(-) diff --git a/dapl/common/dapl_ep_get_status.c b/dapl/common/dapl_ep_get_status.c index 853afff..3af7f9a 100644 --- a/dapl/common/dapl_ep_get_status.c +++ b/dapl/common/dapl_ep_get_status.c @@ -38,6 +38,7 @@ #include "dapl.h" 
#include "dapl_ring_buffer_util.h" +#include "dapl_cookie.h" /* * dapl_ep_get_status diff --git a/dapl/common/dapl_ep_modify.c b/dapl/common/dapl_ep_modify.c index 05aa0ad..fff21a0 100644 --- a/dapl/common/dapl_ep_modify.c +++ b/dapl/common/dapl_ep_modify.c @@ -84,7 +84,7 @@ dapl_ep_modify ( { DAPL_IA *ia; DAPL_EP *ep1, *ep2; - DAT_EP_ATTR ep_attr1, ep_attr2; + DAT_EP_ATTR ep_attr1 = {0}, ep_attr2 = {0}; DAPL_EP new_ep, copy_of_old_ep; DAPL_EP alloc_ep; /* Holder for resources. */ DAPL_PZ *tmp_pz; diff --git a/dapl/common/dapl_rmr_bind.c b/dapl/common/dapl_rmr_bind.c index 12a98f6..e4b8ecb 100755 --- a/dapl/common/dapl_rmr_bind.c +++ b/dapl/common/dapl_rmr_bind.c @@ -84,15 +84,17 @@ dapli_rmr_bind_fuse ( DAPL_COOKIE *cookie; DAT_RETURN dat_status; DAT_BOOLEAN is_signaled; + DAPL_HASH_DATA hash_lmr; dat_status = dapls_hash_search (rmr->header.owner_ia->hca_ptr->lmr_hash_table, lmr_triplet->lmr_context, - (DAPL_HASH_DATA *)&lmr); + &hash_lmr); if ( DAT_SUCCESS != dat_status) { dat_status = DAT_ERROR (DAT_INVALID_PARAMETER, DAT_INVALID_ARG2); goto bail; } + lmr = (DAPL_LMR*)hash_lmr; /* if the ep in unconnected return an error. 
IB requires that the */ /* QP be connected to change a memory window binding since: */ diff --git a/dapl/udapl/dapl_evd_wait.c b/dapl/udapl/dapl_evd_wait.c index 42a51a7..578041a 100644 --- a/dapl/udapl/dapl_evd_wait.c +++ b/dapl/udapl/dapl_evd_wait.c @@ -141,27 +141,30 @@ DAT_RETURN DAT_API dapl_evd_wait ( waitable = evd_ptr->evd_waitable; dapl_os_assert ( sizeof(DAT_COUNT) == sizeof(DAPL_EVD_STATE) ); - evd_state = dapl_os_atomic_assign ( (DAPL_ATOMIC *)&evd_ptr->evd_state, - (DAT_COUNT) DAPL_EVD_STATE_OPEN, - (DAT_COUNT) DAPL_EVD_STATE_WAITED ); - dapl_os_unlock ( &evd_ptr->header.lock ); + evd_state = evd_ptr->evd_state; + if (evd_ptr->evd_state == DAPL_EVD_STATE_OPEN) + evd_ptr->evd_state = DAPL_EVD_STATE_WAITED; if ( evd_state != DAPL_EVD_STATE_OPEN ) { /* Bogus state, bail out */ dat_status = DAT_ERROR (DAT_INVALID_STATE,0); + dapl_os_unlock ( &evd_ptr->header.lock ); goto bail; } if (!waitable) { /* This EVD is not waitable, reset the state and bail */ - (void) dapl_os_atomic_assign ((DAPL_ATOMIC *)&evd_ptr->evd_state, - (DAT_COUNT) DAPL_EVD_STATE_WAITED, - evd_state); + if (evd_ptr->evd_state == DAPL_EVD_STATE_WAITED) + evd_ptr->evd_state = evd_state; + dat_status = DAT_ERROR (DAT_INVALID_STATE, DAT_INVALID_STATE_EVD_UNWAITABLE); + dapl_os_unlock ( &evd_ptr->header.lock ); goto bail; } + dapl_os_unlock ( &evd_ptr->header.lock ); + /* * We now own the EVD, even though we don't have the lock anymore, diff --git a/dapl/udapl/dapl_lmr_create.c b/dapl/udapl/dapl_lmr_create.c index 350abe0..99b184a 100644 --- a/dapl/udapl/dapl_lmr_create.c +++ b/dapl/udapl/dapl_lmr_create.c @@ -208,6 +208,7 @@ dapli_lmr_create_lmr ( DAPL_LMR *lmr; DAT_REGION_DESCRIPTION reg_desc; DAT_RETURN dat_status; + DAPL_HASH_DATA hash_lmr; dapl_dbg_log (DAPL_DBG_TYPE_API, "dapl_lmr_create_lmr (%p, %p, %p, %x, %x, %p, %p, %p, %p)\n", @@ -221,13 +222,13 @@ dapli_lmr_create_lmr ( dat_status = dapls_hash_search (ia->hca_ptr->lmr_hash_table, original_lmr->param.lmr_context, - (DAPL_HASH_DATA *) 
&lmr); + &hash_lmr); if ( dat_status != DAT_SUCCESS ) { dat_status = DAT_ERROR (DAT_INVALID_PARAMETER,DAT_INVALID_ARG2); goto bail; } - + lmr = (DAPL_LMR*)hash_lmr; reg_desc.for_lmr_handle = (DAT_LMR_HANDLE) original_lmr; lmr = dapl_lmr_alloc (ia, -- 1.5.2.5 From arlin.r.davis at intel.com Mon Sep 1 19:33:36 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 1 Sep 2008 19:33:36 -0700 Subject: [ofa-general] [PATCH 4/5] [v2.0] dapl providers: fix compiler warnings in cma and scm providers Message-ID: <000801c90ca4$4d6a45f0$5464fe0a@amr.corp.intel.com> dapl providers: cleanup all compiler warnings in cma and scm providers Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dapl/include/dapl.h | 41 +++++++++++++++++++++------------------ dapl/openib_cma/dapl_ib_cq.c | 5 +++- dapl/openib_cma/dapl_ib_dto.h | 2 +- dapl/openib_cma/dapl_ib_util.c | 33 ++++++++++++++++++++++++------- dapl/openib_cma/dapl_ib_util.h | 16 +------------- dapl/openib_scm/dapl_ib_cm.c | 38 +++++++++++++++++++++++++++++------- dapl/openib_scm/dapl_ib_dto.h | 2 +- dapl/openib_scm/dapl_ib_util.c | 9 +++++++- dapl/openib_scm/dapl_ib_util.h | 14 +------------ 9 files changed, 94 insertions(+), 66 deletions(-) diff --git a/dapl/include/dapl.h b/dapl/include/dapl.h index f0f2095..58af95d 100755 --- a/dapl/include/dapl.h +++ b/dapl/include/dapl.h @@ -53,20 +53,7 @@ #include "dapl_osd.h" #include "dapl_debug.h" -#ifdef IBAPI -#include "dapl_ibapi_util.h" -#elif VAPI -#include "dapl_vapi_util.h" -#elif __OPENIB__ -#include "dapl_openib_util.h" -#include "dapl_openib_cm.h" -#elif DUMMY -#include "dapl_dummy_util.h" -#elif OPENIB -#include "dapl_ib_util.h" -#else /* windows - IBAL and/or IBAL+Sock_CM */ -#include "dapl_ibal_util.h" -#endif + /********************************************************************* * * @@ -231,11 +218,6 @@ typedef struct dapl_rmr_cookie DAPL_RMR_COOKIE; typedef struct dapl_private DAPL_PRIVATE; -typedef void (*DAPL_CONNECTION_STATE_HANDLER) ( - IN DAPL_EP *, - 
IN ib_cm_events_t, - IN const void *, - OUT DAT_EVENT *); /********************************************************************* @@ -268,6 +250,21 @@ struct dapl_cookie_buffer DAPL_ATOMIC tail; }; +#ifdef IBAPI +#include "dapl_ibapi_util.h" +#elif VAPI +#include "dapl_vapi_util.h" +#elif __OPENIB__ +#include "dapl_openib_util.h" +#include "dapl_openib_cm.h" +#elif DUMMY +#include "dapl_dummy_util.h" +#elif OPENIB +#include "dapl_ib_util.h" +#else /* windows - IBAL and/or IBAL+Sock_CM */ +#include "dapl_ibal_util.h" +#endif + struct dapl_hca { DAPL_OS_LOCK lock; @@ -701,6 +698,12 @@ void dapls_io_trc_dump ( * * *********************************************************************/ +typedef void (*DAPL_CONNECTION_STATE_HANDLER) ( + IN DAPL_EP *, + IN ib_cm_events_t, + IN const void *, + OUT DAT_EVENT *); + /* * DAT Mandated functions */ diff --git a/dapl/openib_cma/dapl_ib_cq.c b/dapl/openib_cma/dapl_ib_cq.c index d7b3309..742c247 100755 --- a/dapl/openib_cma/dapl_ib_cq.c +++ b/dapl/openib_cma/dapl_ib_cq.c @@ -486,7 +486,10 @@ dapls_ib_wait_object_wakeup (IN ib_wait_obj_handle_t p_cq_wait_obj_handle) p_cq_wait_obj_handle ); /* write to pipe for wake up */ - write(p_cq_wait_obj_handle->pipe[1], "w", sizeof "w"); + if (write(p_cq_wait_obj_handle->pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " wait object wakeup write error = %s\n", + strerror(errno)); return DAT_SUCCESS; } diff --git a/dapl/openib_cma/dapl_ib_dto.h b/dapl/openib_cma/dapl_ib_dto.h index 334fa4b..2b01963 100644 --- a/dapl/openib_cma/dapl_ib_dto.h +++ b/dapl/openib_cma/dapl_ib_dto.h @@ -304,7 +304,7 @@ dapls_ib_post_ext_send ( remote_iov, completion_flags); ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; - ib_data_segment_t *ds_array_p, *ds_array_start_p; + ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL; struct ibv_send_wr wr; struct ibv_send_wr *bad_wr; DAT_COUNT i, total_len; diff --git a/dapl/openib_cma/dapl_ib_util.c b/dapl/openib_cma/dapl_ib_util.c index 4bbeb8b..a8e1fe3 
100755 --- a/dapl/openib_cma/dapl_ib_util.c +++ b/dapl/openib_cma/dapl_ib_util.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2005-2007 Intel Corporation. All rights reserved. + * Copyright (c) 2005-2008 Intel Corporation. All rights reserved. * * This Software is licensed under one of the following licenses: * @@ -317,7 +317,10 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA *hca_ptr) dapl_llist_add_tail(&g_hca_list, (DAPL_LLIST_ENTRY*)&hca_ptr->ib_trans.entry, &hca_ptr->ib_trans.entry); - write(g_ib_pipe[1], "w", sizeof "w"); + if (write(g_ib_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " open_hca: thread wakeup error = %s\n", + strerror(errno)); dapl_os_unlock(&g_hca_lock); dapl_dbg_log( @@ -384,14 +387,20 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HCA *hca_ptr) * Wakeup work thread to remove from polling list */ hca_ptr->ib_trans.destroy = 1; - write(g_ib_pipe[1], "w", sizeof "w"); + if (write(g_ib_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " destroy: thread wakeup error = %s\n", + strerror(errno)); /* wait for thread to remove HCA references */ while (hca_ptr->ib_trans.destroy != 2) { struct timespec sleep, remain; sleep.tv_sec = 0; sleep.tv_nsec = 10000000; /* 10 ms */ - write(g_ib_pipe[1], "w", sizeof "w"); + if (write(g_ib_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " destroy: thread wakeup error = %s\n", + strerror(errno)); dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_destroy: wait on hca %p destroy\n"); nanosleep (&sleep, &remain); @@ -670,14 +679,20 @@ void dapli_ib_thread_destroy(void) goto bail; g_ib_thread_state = IB_THREAD_CANCEL; - write(g_ib_pipe[1], "w", sizeof "w"); + if (write(g_ib_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " destroy: thread wakeup error = %s\n", + strerror(errno)); while ((g_ib_thread_state != IB_THREAD_EXIT) && (retries--)) { struct timespec sleep, remain; sleep.tv_sec = 0; sleep.tv_nsec = 2000000; /* 2 ms */ 
dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_destroy: waiting for ib_thread\n"); - write(g_ib_pipe[1], "w", sizeof "w"); + if (write(g_ib_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " destroy: thread wakeup error = %s\n", + strerror(errno)); dapl_os_unlock( &g_hca_lock ); nanosleep(&sleep, &remain); dapl_os_lock( &g_hca_lock ); @@ -890,9 +905,11 @@ void dapli_thread(void *arg) /* check and process user events, PIPE */ if (ufds[0].revents == POLLIN) { + if (read(g_ib_pipe[0], rbuf, 2) == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " cr_thread: pipe rd err= %s\n", + strerror(errno)); - read(g_ib_pipe[0], rbuf, 2); - /* cleanup any device on list marked for destroy */ for(idx=3;idxdestroy == 1) { diff --git a/dapl/openib_cma/dapl_ib_util.h b/dapl/openib_cma/dapl_ib_util.h index 3368180..1e13db9 100755 --- a/dapl/openib_cma/dapl_ib_util.h +++ b/dapl/openib_cma/dapl_ib_util.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2005-2007 Intel Corporation. All rights reserved. + * Copyright (c) 2005-2008 Intel Corporation. All rights reserved. * * This Software is licensed under one of the following licenses: * @@ -155,18 +155,6 @@ typedef enum } ib_thread_state_t; -/* - * dapl_llist_entry in dapl.h but dapl.h depends on provider - * typedef's in this file first. 
move dapl_llist_entry out of dapl.h - */ -struct ib_llist_entry -{ - struct dapl_llist_entry *flink; - struct dapl_llist_entry *blink; - void *data; - struct dapl_llist_entry *list_head; -}; - struct dapl_cm_id { DAPL_OS_LOCK lock; int destroy; @@ -247,7 +235,7 @@ typedef void (*ib_async_handler_t)( /* ib_hca_transport_t, specific to this implementation */ typedef struct _ib_hca_transport { - struct ib_llist_entry entry; + struct dapl_llist_entry entry; int destroy; struct dapl_hca *d_hca; struct rdma_cm_id *cm_id; diff --git a/dapl/openib_scm/dapl_ib_cm.c b/dapl/openib_scm/dapl_ib_cm.c index d2982c7..cf5891d 100644 --- a/dapl/openib_scm/dapl_ib_cm.c +++ b/dapl/openib_scm/dapl_ib_cm.c @@ -119,7 +119,10 @@ static void dapli_cm_destroy(struct ib_cm_handle *cm_ptr) dapl_os_unlock(&cm_ptr->lock); /* wakeup work thread */ - write(g_scm_pipe[1], "w", sizeof "w"); + if (write(g_scm_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_CM, + " cm_destroy: thread wakeup error = %s\n", + strerror(errno)); } /* queue socket for processing CM work */ @@ -133,7 +136,10 @@ static void dapli_cm_queue(struct ib_cm_handle *cm_ptr) dapl_os_unlock(&cm_ptr->hca->ib_trans.lock); /* wakeup CM work thread */ - write(g_scm_pipe[1], "w", sizeof "w"); + if (write(g_scm_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_CM, + " cm_queue: thread wakeup error = %s\n", + strerror(errno)); } static uint16_t dapli_get_lid(IN struct ibv_context *ctx, IN uint8_t port) @@ -167,7 +173,11 @@ dapli_socket_disconnect(dp_ib_cm_handle_t cm_ptr) } else { /* send disc date, close socket, schedule destroy */ if (cm_ptr->socket >= 0) { - write(cm_ptr->socket, &disc_data, sizeof(disc_data)); + if (write(cm_ptr->socket, + &disc_data, sizeof(disc_data)) == -1) + dapl_log(DAPL_DBG_TYPE_WARN, + " cm_disc: write error = %s\n", + strerror(errno)); close(cm_ptr->socket); cm_ptr->socket = -1; } @@ -483,8 +493,11 @@ dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr) dapl_dbg_log(DAPL_DBG_TYPE_EP," 
connect_rtu: send RTU\n"); /* complete handshake after final QP state change */ - write(cm_ptr->socket, &rtu_data, sizeof(rtu_data)); - + if (write(cm_ptr->socket, &rtu_data, sizeof(rtu_data)) == -1) { + dapl_log(DAPL_DBG_TYPE_ERR, + " CONN_RTU: write error = %s\n", strerror(errno)); + goto bail; + } /* init cm_handle and post the event with private data */ ep_ptr->cm_handle = cm_ptr; cm_ptr->state = SCM_CONNECTED; @@ -1097,7 +1110,10 @@ dapls_ib_remove_conn_listener ( /* cr_thread will free */ cm_ptr->state = SCM_DESTROY; sp_ptr->cm_srvc_handle = NULL; - write(g_scm_pipe[1], "w", sizeof "w"); + if (write(g_scm_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_CM, + " cm_destroy: thread wakeup error = %s\n", + strerror(errno)); } return DAT_SUCCESS; } @@ -1199,7 +1215,10 @@ dapls_ib_reject_connection( /* cr_thread will destroy CR */ cm_ptr->state = SCM_REJECTED; - write(g_scm_pipe[1], "w", sizeof "w"); + if (write(g_scm_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_CM, + " cm_destroy: thread wakeup error = %s\n", + strerror(errno)); return DAT_SUCCESS; } @@ -1536,7 +1555,10 @@ void cr_thread(void *arg) poll(ufds,idx+1,-1); /* infinite, all sockets and pipe */ /* if pipe used to wakeup, consume */ if (ufds[0].revents == POLLIN) - read(g_scm_pipe[0], rbuf, 2); + if (read(g_scm_pipe[0], rbuf, 2) == -1) + dapl_log(DAPL_DBG_TYPE_CM, + " cr_thread: read pipe error = %s\n", + strerror(errno)); dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread: wakeup\n"); dapl_os_lock(&hca_ptr->ib_trans.lock); } diff --git a/dapl/openib_scm/dapl_ib_dto.h b/dapl/openib_scm/dapl_ib_dto.h index b9826f5..45000b9 100644 --- a/dapl/openib_scm/dapl_ib_dto.h +++ b/dapl/openib_scm/dapl_ib_dto.h @@ -324,7 +324,7 @@ dapls_ib_post_ext_send ( remote_iov, completion_flags, remote_ah); ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; - ib_data_segment_t *ds_array_p, *ds_array_start_p; + ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL; struct ibv_send_wr wr; struct ibv_send_wr *bad_wr; 
DAT_COUNT i, total_len; diff --git a/dapl/openib_scm/dapl_ib_util.c b/dapl/openib_scm/dapl_ib_util.c index 11294fa..58c9943 100644 --- a/dapl/openib_scm/dapl_ib_util.c +++ b/dapl/openib_scm/dapl_ib_util.c @@ -359,13 +359,20 @@ DAT_RETURN dapls_ib_close_hca ( IN DAPL_HCA *hca_ptr ) /* destroy cr_thread and lock */ hca_ptr->ib_trans.cr_state = IB_THREAD_CANCEL; - write(g_scm_pipe[1], "w", sizeof "w"); + if (write(g_scm_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " thread_destroy: thread wakeup err = %s\n", + strerror(errno)); while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) { struct timespec sleep, remain; sleep.tv_sec = 0; sleep.tv_nsec = 2000000; /* 2 ms */ dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " close_hca: waiting for cr_thread\n"); + if (write(g_scm_pipe[1], "w", sizeof "w") == -1) + dapl_log(DAPL_DBG_TYPE_UTIL, + " thread_destroy: thread wakeup err = %s\n", + strerror(errno)); nanosleep (&sleep, &remain); } dapl_os_lock_destroy(&hca_ptr->ib_trans.lock); diff --git a/dapl/openib_scm/dapl_ib_util.h b/dapl/openib_scm/dapl_ib_util.h index 4e75d2c..f0230b8 100644 --- a/dapl/openib_scm/dapl_ib_util.h +++ b/dapl/openib_scm/dapl_ib_util.h @@ -90,18 +90,6 @@ typedef struct _ib_qp_cm uint16_t qp_type; } ib_qp_cm_t; -/* - * dapl_llist_entry in dapl.h but dapl.h depends on provider - * typedef's in this file first. 
move dapl_llist_entry out of dapl.h - */ -struct ib_llist_entry -{ - struct dapl_llist_entry *flink; - struct dapl_llist_entry *blink; - void *data; - struct dapl_llist_entry *list_head; -}; - typedef enum scm_state { SCM_INIT, @@ -119,7 +107,7 @@ typedef enum scm_state struct ib_cm_handle { - struct ib_llist_entry entry; + struct dapl_llist_entry entry; DAPL_OS_LOCK lock; SCM_STATE state; int socket; -- 1.5.2.5 From arlin.r.davis at intel.com Mon Sep 1 19:33:38 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 1 Sep 2008 19:33:38 -0700 Subject: [ofa-general] [PATCH 5/5] [v2.0] dapl build: add correct CFLAGS, set non-debug build by default for v2 Message-ID: <000901c90ca4$4ed87bf0$5464fe0a@amr.corp.intel.com> dapl build: add correct CFLAGS, set non-debug build by default for v2 Signed-off by: Arlin Davis ardavis at ichips.intel.com --- Makefile.am | 10 +++++----- dapl.spec.in | 5 +---- 2 files changed, 6 insertions(+), 9 deletions(-) diff --git a/Makefile.am b/Makefile.am index dfab5e8..4cb339f 100755 --- a/Makefile.am +++ b/Makefile.am @@ -22,9 +22,9 @@ XPROGRAMS_SCM = endif if DEBUG -DBGFLAGS = -ggdb -DDAPL_DBG +AM_CFLAGS = -g -Wall -D_GNU_SOURCE -DDAPL_DBG else -DBGFLAGS = -g +AM_CFLAGS = -g -Wall -D_GNU_SOURCE endif datlibdir = $(libdir) @@ -35,17 +35,17 @@ datlib_LTLIBRARIES = dat/udat/libdat2.la dapllibofa_LTLIBRARIES = dapl/udapl/libdaplofa.la daplliboscm_LTLIBRARIES = dapl/udapl/libdaploscm.la -dat_udat_libdat2_la_CFLAGS = -Wall $(DBGFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAGS) \ +dat_udat_libdat2_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAGS) \ -I$(srcdir)/dat/include/ -I$(srcdir)/dat/udat/ \ -I$(srcdir)/dat/udat/linux -I$(srcdir)/dat/common/ -dapl_udapl_libdaplofa_la_CFLAGS = -Wall $(DBGFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAGS) \ +dapl_udapl_libdaplofa_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAGS) \ -DOPENIB -DCQ_WAIT_OBJECT \ -I$(srcdir)/dat/include/ -I$(srcdir)/dapl/include/ \ -I$(srcdir)/dapl/common 
-I$(srcdir)/dapl/udapl/linux \ -I$(srcdir)/dapl/openib_cma -dapl_udapl_libdaploscm_la_CFLAGS = -Wall $(DBGFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAGS) \ +dapl_udapl_libdaploscm_la_CFLAGS = $(AM_CFLAGS) -D_GNU_SOURCE $(OSFLAGS) $(XFLAGS) \ -DOPENIB -DCQ_WAIT_OBJECT \ -I$(srcdir)/dat/include/ -I$(srcdir)/dapl/include/ \ -I$(srcdir)/dapl/common -I$(srcdir)/dapl/udapl/linux \ diff --git a/dapl.spec.in b/dapl.spec.in index e18f19a..4cb8860 100644 --- a/dapl.spec.in +++ b/dapl.spec.in @@ -75,7 +75,7 @@ Useful test suites to validate uDAPL library API's. %setup -q %build -%configure --enable-debug --enable-ext-type=ib +%configure --enable-ext-type=ib make %{?_smp_mflags} %install @@ -132,9 +132,6 @@ fi %{_mandir}/man5/*.5* %changelog -* Thu Aug 21 2008 Arlin Davis - 2.0.12 -- DAT/DAPL Version 2.0.12 Release 1, OFED 1.4 RC - * Sun Jul 20 2008 Arlin Davis - 2.0.11 - DAT/DAPL Version 2.0.11 Release 1, IB UD extensions in SCM provider -- 1.5.2.5 From arlin.r.davis at intel.com Mon Sep 1 20:12:48 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Mon, 1 Sep 2008 20:12:48 -0700 Subject: [ofa-general] [ANNOUNCE] compat-dapl-1.2.10 and dapl-2.0.13 Release Message-ID: New DAPL releases now available from OFA download page: http://www.openfabrics.org/downloads/dapl/ md5sum: 3998feecc43a66c979c3742c05c2bb62 compat-dapl-1.2.10.tar.gz md5sum: 0aa99a9f5a888cc554686d24c4f23369 dapl-2.0.13.tar.gz Summary of changes since last release: v1.2,v2.0 - cleanup all warnings in tests, common code, and providers v1.2,v2.0 - fix Fedora build Vlad, please pick up new packages and install following for OFED 1.4 rc1: compat-dapl-1.2.10-1 compat-dapl-devel-1.2.10-1 dapl-2.0.13-1 dapl-utils-2.0.13-1 dapl-devel-2.0.13-1 dapl-debuginfo-2.0.13-1 -arlin -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vlad at lists.openfabrics.org Tue Sep 2 03:01:53 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 2 Sep 2008 03:01:53 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20080902-0200 daily build status Message-ID: <20080902100153.A4645E60865@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.26 Passed on ia64 with linux-2.6.25 Passed on ppc64 with linux-2.6.16 
Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Failed: From sashak at voltaire.com Tue Sep 2 06:37:20 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 2 Sep 2008 16:37:20 +0300 Subject: [ofa-general] Re: [PATCH] opensm/Makefile.am: adding yacc-generated .h file as dependency In-Reply-To: <48BC0440.8050807@dev.mellanox.co.il> References: <48BC0440.8050807@dev.mellanox.co.il> Message-ID: <20080902133720.GK19828@sashak.voltaire.com> Hi Yevgeny, On 18:03 Mon 01 Sep , Yevgeny Kliteynik wrote: > > Adding header file that is produced by yacc/bison to the > general dependencies. W/o it compiling of lex-generated > .c file sometimes fails. Do you have a log of failure? Sasha > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/opensm/Makefile.am | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am > index 7ca4c2a..f94842c 100644 > --- a/opensm/opensm/Makefile.am > +++ b/opensm/opensm/Makefile.am > @@ -126,7 +126,7 @@ opensminclude_HEADERS = \ > $(srcdir)/../include/opensm/osm_vl15intf.h \ > $(top_builddir)/include/opensm/osm_version.h > > -BUILT_SOURCES = osm_version > +BUILT_SOURCES = osm_version osm_qos_parser_y.h > osm_version: > if [ -x $(top_srcdir)/../gen_ver.sh ] ; then \ > ver_file=$(top_builddir)/include/opensm/osm_version.h ; \ > -- > 1.5.1.4 > From kliteyn at dev.mellanox.co.il Tue Sep 2 06:55:45 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 02 Sep 2008 16:55:45 +0300 Subject: [ofa-general] Re: [PATCH] opensm/Makefile.am: adding yacc-generated .h file as dependency In-Reply-To: <20080902133720.GK19828@sashak.voltaire.com> References: <48BC0440.8050807@dev.mellanox.co.il> <20080902133720.GK19828@sashak.voltaire.com> Message-ID: <48BD45E1.3000900@dev.mellanox.co.il> Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 18:03 Mon 01 
Sep , Yevgeny Kliteynik wrote:
>> Adding header file that is produced by yacc/bison to the
>> general dependencies. W/o it compiling of lex-generated
>> .c file sometimes fails.
>
> Do you have a log of failure?

...
/bin/sh ../libtool --tag=CC --mode=link gcc -O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2 -I/usr/local/ofed/include -O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2 -L/usr/local/ofed/lib64 -L/usr/local/ofed/lib -o libopensm.la -rpath /usr/local/ofed/lib64 -version-info 2:2:0 -export-dynamic -Wl,--version-script=./libopensm.map libopensm_la-osm_log.lo libopensm_la-osm_mad_pool.lo libopensm_la-osm_helper.lo -libumad -ldl -lpthread
if gcc -DHAVE_CONFIG_H -I. -I. -I../include -I./../include -I./../../libibcommon/include -I./../../libibumad/include -I/usr/local/ofed/include -O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2 -I/usr/local/ofed/include -Wall -DOSM_VENDOR_INTF_OPENIB -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP -g -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -O2 -g -fmessage-length=0 -D_FORTIFY_SOURCE=2 -I/usr/local/ofed/include -MT opensm-osm_qos_parser_l.o -MD -MP -MF ".deps/opensm-osm_qos_parser_l.Tpo" -c -o opensm-osm_qos_parser_l.o `test -f 'osm_qos_parser_l.c' || echo './'`osm_qos_parser_l.c; \
then mv -f ".deps/opensm-osm_qos_parser_l.Tpo" ".deps/opensm-osm_qos_parser_l.Po"; else rm -f ".deps/opensm-osm_qos_parser_l.Tpo"; exit 1; fi
osm_qos_parser_l.l:49:30: error: osm_qos_parser_y.h: No such file or directory
osm_qos_parser_l.l: In function 'yylex':
osm_qos_parser_l.l:206: error: 'TK_TEXT' undeclared (first use in this function)
...

Full log attached.
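For anyone reading the log without the tree at hand: the lex-generated osm_qos_parser_l.c includes osm_qos_parser_y.h, but nothing tells make that the header has to exist before that file is compiled, so a clean build can reach the lexer first and fail as shown. Below is a rough sketch of the usual automake idiom. It assumes yacc/bison is invoked with -d, and the program name and source list are only illustrative; this is not the real opensm/opensm/Makefile.am.

```make
# Illustrative fragment only, not the actual opensm/opensm/Makefile.am.
# Assumption: yacc/bison is run with -d so that it writes the header
# next to the generated parser source.
AM_YFLAGS = -d

# Files listed in BUILT_SOURCES are generated before the other targets
# of 'make all', 'make check' and 'make install' are built.
BUILT_SOURCES = osm_qos_parser_y.h

# Hypothetical program and source list, kept short for the example.
sbin_PROGRAMS = opensm
opensm_SOURCES = osm_qos_parser_y.y osm_qos_parser_l.l main.c
```

Listing the header in BUILT_SOURCES only helps the top-level targets named in the comment; building a single object by hand (for example 'make opensm-osm_qos_parser_l.o') in a clean tree can still miss it.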
The problem and solution are described here:
http://www.gnu.org/software/libtool/manual/automake/Yacc-and-Lex.html

-- Yevgeny

> Sasha
>
>> Signed-off-by: Yevgeny Kliteynik
>> ---
>> opensm/opensm/Makefile.am | 2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am
>> index 7ca4c2a..f94842c 100644
>> --- a/opensm/opensm/Makefile.am
>> +++ b/opensm/opensm/Makefile.am
>> @@ -126,7 +126,7 @@ opensminclude_HEADERS = \
>> $(srcdir)/../include/opensm/osm_vl15intf.h \
>> $(top_builddir)/include/opensm/osm_version.h
>>
>> -BUILT_SOURCES = osm_version
>> +BUILT_SOURCES = osm_version osm_qos_parser_y.h
>> osm_version:
>> if [ -x $(top_srcdir)/../gen_ver.sh ] ; then \
>> ver_file=$(top_builddir)/include/opensm/osm_version.h ; \
>> --
>> 1.5.1.4
>>
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: opensm.rpmbuild.log
URL:

From sashak at voltaire.com Tue Sep 2 07:18:05 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 2 Sep 2008 17:18:05 +0300
Subject: [ofa-general] Re: [PATCH] opensm/Makefile.am: adding yacc-generated .h file as dependency
In-Reply-To: <48BC0440.8050807@dev.mellanox.co.il>
References: <48BC0440.8050807@dev.mellanox.co.il>
Message-ID: <20080902141805.GR19828@sashak.voltaire.com>

On 18:03 Mon 01 Sep , Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> Adding header file that is produced by yacc/bison to the
> general dependencies. W/o it compiling of lex-generated
> .c file sometimes fails.
>
> Signed-off-by: Yevgeny Kliteynik

Applied. Thanks.
Sasha From sashak at voltaire.com Tue Sep 2 07:24:00 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 2 Sep 2008 17:24:00 +0300 Subject: [ofa-general] [PATCH] opensm: move vendor specific compilation flags to config.h Message-ID: <20080902142400.GS19828@sashak.voltaire.com> From: Ira Weiny Move vendor specific compilation flags VENDOR_RMPP_SUPPORT, DUAL_SIDED_RMPP, and OSM_VENDOR_INTF_* to config.h. Signed-off-by: Ira Weiny Signed-off-by: Sasha Khapyorsky --- opensm/config/osmvsel.m4 | 11 ++++++----- opensm/libvendor/Makefile.am | 6 +----- opensm/opensm/Makefile.am | 12 ++---------- opensm/opensm/osm_sa_multipath_record.c | 4 ++-- opensm/osmtest/Makefile.am | 6 +----- 5 files changed, 12 insertions(+), 27 deletions(-) diff --git a/opensm/config/osmvsel.m4 b/opensm/config/osmvsel.m4 index 96208b2..74d5f79 100644 --- a/opensm/config/osmvsel.m4 +++ b/opensm/config/osmvsel.m4 @@ -65,7 +65,7 @@ with_sim="/usr") dnl based on the with_osmv we can try the vendor flag if test $with_osmv = "openib"; then - OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" + AC_DEFINE(OSM_VENDOR_INTF_OPENIB, 1, [Define as 1 for OpenIB vendor]) OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include -I\$(srcdir)/../../libibumad/include -I\$(includedir)" OSMV_LDADD="-L\$(abs_srcdir)/../../libibumad/.libs -L\$(abs_srcdir)/../../libibcommon/.libs -L\$(libdir) -libumad -libcommon" @@ -76,12 +76,13 @@ if test $with_osmv = "openib"; then if test "x$with_umad_includes" != "x"; then OSMV_INCLUDES="-I$with_umad_includes $OSMV_INCLUDES" fi + AC_DEFINE(DUAL_SIDED_RMPP, 1, [Define as 1 if you want Dual Sided RMPP Support]) elif test $with_osmv = "sim" ; then - OSMV_CFLAGS="-DOSM_VENDOR_INTF_SIM" + AC_DEFINE(OSM_VENDOR_INTF_SIM, 1, [Define as 1 for sim vendor]) OSMV_INCLUDES="-I$with_sim/include -I\$(srcdir)/../include" OSMV_LDADD="-L$with_sim/lib -libmscli" elif test $with_osmv = "gen1"; then - OSMV_CFLAGS="-DOSM_VENDOR_INTF_TS" + AC_DEFINE(OSM_VENDOR_INTF_TS, 1, [Define as 1 for 
ts vendor]) if test -z $MTHOME; then MTHOME=/usr/local/ibgd/driver/infinihost @@ -111,7 +112,7 @@ elif test $with_osmv = "gen1"; then fi OSMV_LDADD="-L/usr/local/ibgd/driver/infinihost/lib -lvapi -lmosal -lmtl_common -lmpga" elif test $with_osmv = "vapi"; then - OSMV_CFLAGS="-DOSM_VENDOR_INTF_MTL" + AC_DEFINE(OSM_VENDOR_INTF_MTL, 1, [Define as 1 for vapi vendor]) OSMV_INCLUDES="-I/usr/mellanox/include -I/usr/include -I\$(srcdir)/../include" OSMV_LDADD="-L/usr/lib -L/usr/mellanox/lib -lib_mgt -lvapi -lmosal -lmtl_common -lmpga" else @@ -122,9 +123,9 @@ AM_CONDITIONAL(OSMV_VAPI, test $with_osmv = "vapi") AM_CONDITIONAL(OSMV_GEN1, test $with_osmv = "gen1") AM_CONDITIONAL(OSMV_SIM, test $with_osmv = "sim") AM_CONDITIONAL(OSMV_OPENIB, test $with_osmv = "openib") +AC_DEFINE(VENDOR_RMPP_SUPPORT, 1, [Define as 1 if you want Vendor RMPP Support]) AC_SUBST(with_osmv) -AC_SUBST(OSMV_CFLAGS) AC_SUBST(OSMV_LDADD) AC_SUBST(OSMV_INCLUDES) diff --git a/opensm/libvendor/Makefile.am b/opensm/libvendor/Makefile.am index f72dbbe..f359dac 100644 --- a/opensm/libvendor/Makefile.am +++ b/opensm/libvendor/Makefile.am @@ -11,11 +11,7 @@ INCLUDES = $(OSMV_INCLUDES) lib_LTLIBRARIES = libosmvendor.la -if OSMV_OPENIB -libosmvendor_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -else -libosmvendor_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -endif +libosmvendor_la_CFLAGS = -Wall $(OSMV_CFLAGS) $(DBGFLAGS) if HAVE_LD_VERSION_SCRIPT libosmvendor_version_script = -Wl,--version-script=$(srcdir)/libosmvendor.map diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am index f94842c..522977c 100644 --- a/opensm/opensm/Makefile.am +++ b/opensm/opensm/Makefile.am @@ -9,11 +9,7 @@ else DBGFLAGS = -g endif -if OSMV_OPENIB -libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -else -libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) 
-D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -endif +libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 if HAVE_LD_VERSION_SCRIPT libopensm_version_script = -Wl,--version-script=$(srcdir)/libopensm.map @@ -63,11 +59,7 @@ opensm_SOURCES = main.c osm_console_io.c osm_console.c osm_db_files.c \ AM_YFLAGS:= -d -if OSMV_OPENIB -opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -else -opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -endif +opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 # we need to be able to load libraries from local build subtree before make install # we always give precedence to local tree libs and then use the pre-installed ones. diff --git a/opensm/opensm/osm_sa_multipath_record.c b/opensm/opensm/osm_sa_multipath_record.c index 2b8e00a..c0a4904 100644 --- a/opensm/opensm/osm_sa_multipath_record.c +++ b/opensm/opensm/osm_sa_multipath_record.c @@ -40,12 +40,12 @@ * This object is part of the opensm family of objects. 
*/ -#if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP) - #if HAVE_CONFIG_H # include #endif /* HAVE_CONFIG_H */ +#if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP) + #include #include #include diff --git a/opensm/osmtest/Makefile.am b/opensm/osmtest/Makefile.am index 236cdcf..785f1af 100644 --- a/opensm/osmtest/Makefile.am +++ b/opensm/osmtest/Makefile.am @@ -13,11 +13,7 @@ osmtest_SOURCES = main.c osmtest.c osmt_service.c osmt_slvl_vl_arb.c \ if OSMV_VAPI osmtest_SOURCES += osmt_mtl_regular_qp.c endif -if OSMV_OPENIB -osmtest_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -else -osmtest_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -endif +osmtest_CFLAGS = -Wall $(OSMV_CFLAGS) $(DBGFLAGS) osmtest_LDADD = -L../complib -losmcomp -L../libvendor -losmvendor -L../opensm -lopensm $(OSMV_LDADD) EXTRA_DIST = $(srcdir)/include/osmt_inform.h \ -- 1.5.4.rc2.60.gb2e62 From vlad at mellanox.co.il Tue Sep 2 09:41:40 2008 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 02 Sep 2008 19:41:40 +0300 Subject: [ofa-general] Re: [PATCH] IB/mlx4: Set RAE and FRE flags, initialize mtt_sz field in the mpt entry. In-Reply-To: References: <20080901141103.GA32171@mellanox.co.il> Message-ID: <1220373700.13477.26.camel@vlad-laptop> On Mon, 2008-09-01 at 08:48 -0700, Roland Dreier wrote: > I need help deciding whether to get this in 2.6.27 or not. With this > patch, how is send queue fast register working? If this is the last fix > then I think we can get it in 2.6.27. If you are still debugging and it > still doesn't work well, then I might want to wait and see how big the > required fixes end up being. > > Thanks, > Roland Hi Roland, I am still debugging it, there might be more fixes. 
Regards, Vladimir From jeff at splitrockpr.com Tue Sep 2 09:56:29 2008 From: jeff at splitrockpr.com (Jeffrey Scott) Date: Tue, 02 Sep 2008 09:56:29 -0700 Subject: [ofa-general] IBTA Technical Forum Message-ID: Hello OFA Members - the technical forum is just two weeks away! Please register and book your travel now. We have added more speakers to our agenda, including Jacob Hall from Wachovia. The full agenda is posted at http://www.infinibandta.org/events/IBTATechForum08_. Event: InfiniBand Trade Association's Annual Technical Forum Date: Monday, September 15, 2008 Time: 8am - 5pm with networking reception immediately following Location: Harrah's Las Vegas Register: www.regonline.com/IBTATechForum08 Rate: $299 We're inviting the entire InfiniBand Community! Please assist us in spreading the word about this event to the entire InfiniBand community. The IBTA has created a formal invitation which has been posted online at: http://www.infinibandta.org/events/IBTATechForum08_/Invite_FINAL_081108.pdf. Feel free to forward this invite within your company and to colleagues, vendors, partners, prospects and customers. If you have any questions, please contact: Cheri Winterberg, 978-660-6405, cheriw at owenmedia.com. We'll see you in Las Vegas! -------------- next part -------------- An HTML attachment was scrubbed... URL: From andy.grover at oracle.com Tue Sep 2 13:04:19 2008 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 02 Sep 2008 13:04:19 -0700 Subject: [ofa-general] [RFC] dropping RDS over TCP support Message-ID: <48BD9C43.5020400@oracle.com> We've been discussing dropping RDS's support for using TCP as a transport, and just focusing on RDS as a IB and iWARP-focused protocol. This would simplify the RDS codebase, allow easier inclusion of more IB-centric features, and also give RDS an easier path towards mainline Linux kernel inclusion. Also, the imminent RDS iWARP support will address non-IB use cases. Any objections? Anyone using it? 
Thanks -- Andy From chu11 at llnl.gov Tue Sep 2 13:12:27 2008 From: chu11 at llnl.gov (Al Chu) Date: Tue, 02 Sep 2008 13:12:27 -0700 Subject: [ofa-general] Re: [IBSIM] add ReLink command In-Reply-To: <20080831134503.GL27535@sashak.voltaire.com> References: <1219964487.29252.318.camel@cardanus.llnl.gov> <20080831134503.GL27535@sashak.voltaire.com> Message-ID: <1220386347.29252.358.camel@cardanus.llnl.gov> Hey Sasha, > So if one asked for ReLinking whole node? I think it should be > straightforward - restore links for all ports where previous_remote* > exists. What do you think? I didn't think of that before, but I think it's a good idea. So I tweaked it to handle this case when a port isn't specified. > Maybe "restore previously disconnected link(s)" help message? Actually it > is almost same :) Now that you mention it, I think "restore" is a better word to use than "reconnect". So I've now tweaked it to "restore previously unconnected". The new patch is attached. Thanks, Al On Sun, 2008-08-31 at 16:45 +0300, Sasha Khapyorsky wrote: > Hi Al, > > On 16:01 Thu 28 Aug , Al Chu wrote: > > Hey Sasha, > > > > This adds a "ReLink" command to ibsim. If a link was previously > > unlinked, you can run "ReLink" to reconnect it to whatever it was > > connected to before. It's easier than having to figure out what it was > > connected to previously and input both the local and remote ends under > > the "Link" command. > > > > The idea for this option came up when I was trying to simulate an entire > > cluster going down then going back up. Scripting the cluster to go down > > was easy ("Unlink" all CAs), but scripting it to come back up was a > > little harder since I had to figure out all the other end ports to input > > into "Link". 
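The save-on-Unlink, restore-on-ReLink idea described above, including the whole-node case, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch: the Node and Port structs are cut down to the four fields the quoted diff touches (remotenode, remoteport, previous_remotenode, previous_remoteport), MAX_PORTS is an arbitrary limit, and get_port, sketch_link, sketch_unlink and sketch_relink_node are hypothetical helper names rather than ibsim functions.

```c
#include <stddef.h>

#define MAX_PORTS 4 /* arbitrary limit for this sketch only */

typedef struct Node Node;

typedef struct Port {
	Node *remotenode;          /* current peer, NULL when unlinked */
	int remoteport;
	Node *previous_remotenode; /* peer remembered by the last unlink */
	int previous_remoteport;
} Port;

struct Node {
	int numports;
	Port ports[MAX_PORTS + 1]; /* ports are numbered from 1 */
};

/* Hypothetical accessor; ibsim has its own node_get_port(). */
static Port *get_port(Node *n, int portnum)
{
	return &n->ports[portnum];
}

/* Connect two free ports; fails if either end is already linked. */
static int sketch_link(Node *ln, int lp, Node *rn, int rp)
{
	Port *l = get_port(ln, lp);
	Port *r = get_port(rn, rp);

	if (l->remotenode || r->remotenode)
		return -1;
	l->remotenode = rn;
	l->remoteport = rp;
	r->remotenode = ln;
	r->remoteport = lp;
	/* a fresh link makes any remembered peer stale */
	l->previous_remotenode = NULL;
	r->previous_remotenode = NULL;
	return 0;
}

/* Disconnect a port, remembering the old peer on both ends. */
static void sketch_unlink(Node *ln, int lp)
{
	Port *l = get_port(ln, lp);
	Port *r;

	if (!l->remotenode)
		return;
	r = get_port(l->remotenode, l->remoteport);
	l->previous_remotenode = l->remotenode;
	l->previous_remoteport = l->remoteport;
	r->previous_remotenode = r->remotenode;
	r->previous_remoteport = r->remoteport;
	l->remotenode = r->remotenode = NULL;
	l->remoteport = r->remoteport = 0;
}

/*
 * Whole-node restore: walk every port and re-link those that still
 * have a remembered peer. Returns the number of links restored.
 */
static int sketch_relink_node(Node *n)
{
	int restored = 0;

	for (int p = 1; p <= n->numports; p++) {
		Port *l = get_port(n, p);

		if (!l->previous_remotenode)
			continue;
		if (sketch_link(n, p, l->previous_remotenode,
				l->previous_remoteport) == 0)
			restored++;
	}
	return restored;
}
```

With two nodes linked on ports 1 and 2, unlinking both ports and then calling sketch_relink_node() on one node should bring both links back, and a second call should restore nothing because a successful link clears the remembered peer.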
> > > > Al > > > > -- > > Albert Chu > > chu11 at llnl.gov > > 925-422-5311 > > Computer Scientist > > High Performance Systems Division > > Lawrence Livermore National Laboratory > > > From ec9cf72ac3dc5950337aa577f49ada6b8887d579 Mon Sep 17 00:00:00 2001 > > From: Albert Chu > > Date: Thu, 28 Aug 2008 15:25:14 -0700 > > Subject: [PATCH] add relink command > > > > > > Signed-off-by: Albert Chu > > --- > > ibsim/sim.h | 2 + > > ibsim/sim_cmd.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 70 insertions(+), 0 deletions(-) > > > > diff --git a/ibsim/sim.h b/ibsim/sim.h > > index f989252..32a4e20 100644 > > --- a/ibsim/sim.h > > +++ b/ibsim/sim.h > > @@ -206,6 +206,8 @@ struct Port { > > char alias[ALIASLEN + 1]; > > Node *remotenode; > > int remoteport; > > + Node *previous_remotenode; > > + int previous_remoteport; > > int errrate; > > uint16_t errattr; > > Node *node; > > diff --git a/ibsim/sim_cmd.c b/ibsim/sim_cmd.c > > index d55fb4c..39eb316 100644 > > --- a/ibsim/sim_cmd.c > > +++ b/ibsim/sim_cmd.c > > @@ -149,14 +149,79 @@ static int do_link(FILE * f, char *line) > > if (link_ports(lport, rport) < 0) > > return -fprintf(f, > > "# can't link: local/remote port are already connected\n"); > > + > > + lport->previous_remotenode = NULL; > > + rport->previous_remotenode = NULL; > > + > > + return 0; > > +} > > + > > +static int do_relink(FILE * f, char *line) > > +{ > > + Port *lport, *rport; > > + Node *lnode; > > + char *orig = 0; > > + char *lnodeid = 0; > > + char *s = line, name[NAMELEN], *sp; > > + int lportnum = -1; > > + > > + // parse local > > + if (strsep(&s, "\"")) > > + orig = strsep(&s, "\""); > > + > > + lnodeid = expand_name(orig, name, &sp); > > + if (!sp && s && *s == '[') > > + sp = s + 1; > > + > > + DEBUG("lnodeid %s port [%s", lnodeid, sp); > > + if (!(lnode = find_node(lnodeid))) { > > + fprintf(f, "# nodeid \"%s\" (%s) not found\n", orig, lnodeid); > > + return -1; > > + } > > + > > + if (sp) { > > + 
lportnum = strtoul(sp, &sp, 0); > > + if (lportnum < 1 || lportnum > lnode->numports) { > > + fprintf(f, "# nodeid \"%s\": bad port %d\n", > > + lnodeid, lportnum); > > + return -1; > > + } > > + } else { > > + fprintf(f, "# no local port\n"); > > + return -1; > > So if one asked for ReLinking whole node? I think it should be > straightforward - restore links for all ports where previous_remote* > exists. What do you think? > > > + } > > + > > + lport = node_get_port(lnode, lportnum); > > + > > + if (!lport->previous_remotenode) { > > + fprintf(f, "# no previous link stored\n"); > > + return -1; > > + } > > + > > + rport = node_get_port(lport->previous_remotenode, lport->previous_remoteport); > > + > > + if (link_ports(lport, rport) < 0) > > + return -fprintf(f, > > + "# can't link: local/remote port are already connected\n"); > > + > > + lport->previous_remotenode = NULL; > > + rport->previous_remotenode = NULL; > > + > > return 0; > > } > > > > + > > No need extra lines between functions. > > > static void unlink_port(Node * lnode, Port * lport, Node * rnode, int rportnum) > > { > > Port *rport = node_get_port(rnode, rportnum); > > Port *endport; > > > > + /* save current connection for potential relink later */ > > + lport->previous_remotenode = lport->remotenode; > > + lport->previous_remoteport = lport->remoteport; > > + rport->previous_remotenode = rport->remotenode; > > + rport->previous_remoteport = rport->remoteport; > > + > > lport->remotenode = rport->remotenode = 0; > > lport->remoteport = rport->remoteport = 0; > > lport->remotenodeid[0] = rport->remotenodeid[0] = 0; > > @@ -713,6 +778,7 @@ static int dump_help(FILE * f) > > fprintf(f, "\tDump [nodeid] (def all network)\n"); > > fprintf(f, "\tRoute

\n"); > > fprintf(f, "\tLink \"nodeid\"[port] \"remoteid\"[port]\n"); > > + fprintf(f, "\tReLink \"nodeid\"[port] : reconnect previously unconnected link\n"); > > Maybe "restore previously disconnected link(s)" help message? Actually it > is almost same :) > > Sasha > > > fprintf(f, "\tUnlink \"nodeid\" : remove all links of the node\n"); > > fprintf(f, "\tUnlink \"nodeid\"[port]\n"); > > fprintf(f, > > @@ -814,6 +880,8 @@ int do_cmd(char *buf, FILE *f) > > * > > * please specify new command support below this comment. > > */ > > + else if (!strncasecmp(line, "ReLink", cmd_len)) > > + r = do_relink(f, line); > > else if (*line != '\n' && *line != '\0') > > fprintf(f, "command \'%s\' unknown - skipped\n", line); > > > > -- > > 1.5.4.5 > > > -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-add-relink-command.patch Type: text/x-patch Size: 4285 bytes Desc: not available URL: From rdreier at cisco.com Tue Sep 2 13:20:10 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 02 Sep 2008 13:20:10 -0700 Subject: [ofa-general] [PATCH 2.6.27] ib/cm: free cm_device structure In-Reply-To: (Sean Hefty's message of "Mon, 25 Aug 2008 12:13:15 -0700") References: Message-ID: thanks, applied. From jon at opengridcomputing.com Tue Sep 2 13:24:15 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Tue, 2 Sep 2008 15:24:15 -0500 Subject: [ofa-general] [RFC] dropping RDS over TCP support In-Reply-To: <48BD9C43.5020400@oracle.com> References: <48BD9C43.5020400@oracle.com> Message-ID: <20080902202415.GG32022@opengridcomputing.com> On Tue, Sep 02, 2008 at 01:04:19PM -0700, Andy Grover wrote: > We've been discussing dropping RDS's support for using TCP as a > transport, and just focusing on RDS as a IB and iWARP-focused protocol. 
> > This would simplify the RDS codebase, allow easier inclusion of more > IB-centric features, and also give RDS an easier path towards mainline > Linux kernel inclusion. Also, the imminent RDS iWARP support will > address non-IB use cases. > > Any objections? Anyone using it? I found it useful for early development of RDS iWARP support. While I believe that mainline inclusion is much more important, I don't see any harm in keeping it around. Is there a thread on lkml listing it as an issue for mainline inclusion? Thanks, Jon > > Thanks -- Andy > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Tue Sep 2 13:24:42 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 02 Sep 2008 13:24:42 -0700 Subject: [ofa-general] [PATCH] Bug 988: BMA responses are discarded in kernel In-Reply-To: (Michael Brooks's message of "Wed, 27 Aug 2008 12:44:28 -0500") References: Message-ID: thanks, applied. (this patch was corrupted, because your mailer converted it to quoted-printable; in the future please use an MUA that can handle patches properly. 
I fixed this one up by hand) From rdreier at cisco.com Tue Sep 2 13:26:51 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 02 Sep 2008 13:26:51 -0700 Subject: [ofa-general] Re: [PATCH] drivers/infiniband/core: Use a NULL test rather than an IS_ERR test In-Reply-To: <200808281531.07942.brunel@diku.dk> (Julien Brunel's message of "Thu, 28 Aug 2008 15:31:07 +0200") References: <200808281531.07942.brunel@diku.dk> Message-ID: thanks, applied From rdreier at cisco.com Tue Sep 2 13:28:25 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 02 Sep 2008 13:28:25 -0700 Subject: [ofa-general] [PATCH] IB/ipath - fix SLID generation for RC/UC QPs In-Reply-To: <20080829171645.14033.34664.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Fri, 29 Aug 2008 10:16:45 -0700") References: <20080829171645.14033.34664.stgit@eng-46.mv.qlogic.com> Message-ID: thanks, applied. From rdreier at cisco.com Tue Sep 2 13:28:38 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 02 Sep 2008 13:28:38 -0700 Subject: [ofa-general] Re: [PATCH] IB/mlx4: Set RAE and FRE flags, initialize mtt_sz field in the mpt entry. In-Reply-To: <20080901141103.GA32171@mellanox.co.il> (Vladimir Sokolovsky's message of "Mon, 1 Sep 2008 17:11:03 +0300") References: <20080901141103.GA32171@mellanox.co.il> Message-ID: thanks, applied. From andy.grover at oracle.com Tue Sep 2 13:50:39 2008 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 02 Sep 2008 13:50:39 -0700 Subject: [ofa-general] [RFC] dropping RDS over TCP support In-Reply-To: <20080902202415.GG32022@opengridcomputing.com> References: <48BD9C43.5020400@oracle.com> <20080902202415.GG32022@opengridcomputing.com> Message-ID: <48BDA71F.5030708@oracle.com> Jon Mason wrote: > On Tue, Sep 02, 2008 at 01:04:19PM -0700, Andy Grover wrote: >> We've been discussing dropping RDS's support for using TCP as a >> transport, and just focusing on RDS as a IB and iWARP-focused protocol. 
> I found it useful for early development of RDS iWARP support. Hmm! Were you actually running it or just using it as a reference? > While I > believe that mainline inclusion is much more important, I don't see any > harm in keeping it around. Is there a thread on lkml listing it as an > issue for mainline inclusion? We haven't brought up rds inclusion on lkml yet, but I think positioning RDS as an IB protocol is a good thing. RDS may be a better fit in drivers/infiniband/ulp/rds, rather than net/rds, for example. Removing unused code is also always a good thing. Lastly, we're looking to extend rds with more ib-centric features in the future, so it would be a development burden to tunnel that over the TCP transport...and that starts to sound a lot like reinventing iwarp ;-) Regards -- Andy From jgunthorpe at obsidianresearch.com Tue Sep 2 14:03:09 2008 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 2 Sep 2008 15:03:09 -0600 Subject: [ofa-general] [RFC] dropping RDS over TCP support In-Reply-To: <48BDA71F.5030708@oracle.com> References: <48BD9C43.5020400@oracle.com> <20080902202415.GG32022@opengridcomputing.com> <48BDA71F.5030708@oracle.com> Message-ID: <20080902210309.GW4314@obsidianresearch.com> On Tue, Sep 02, 2008 at 01:50:39PM -0700, Andy Grover wrote: > Lastly, we're looking to extend rds with more ib-centric features in the > future, so it would be a development burden to tunnel that over the TCP > transport...and that starts to sound a lot like reinventing iwarp ;-) Has anyone talked about a SW implementation of iWarp? It seems to me this same question is going to keep coming up the more protocols are developed.. Even if there is no HW offload a SW only version should get reasonable performance relative to straight TCP I'd think.. 
Jason From jon at opengridcomputing.com Tue Sep 2 14:07:09 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Tue, 2 Sep 2008 16:07:09 -0500 Subject: [ofa-general] [RFC] dropping RDS over TCP support In-Reply-To: <48BDA71F.5030708@oracle.com> References: <48BD9C43.5020400@oracle.com> <20080902202415.GG32022@opengridcomputing.com> <48BDA71F.5030708@oracle.com> Message-ID: <20080902210709.GI32022@opengridcomputing.com> On Tue, Sep 02, 2008 at 01:50:39PM -0700, Andy Grover wrote: > Jon Mason wrote: > > On Tue, Sep 02, 2008 at 01:04:19PM -0700, Andy Grover wrote: > >> We've been discussing dropping RDS's support for using TCP as a > >> transport, and just focusing on RDS as a IB and iWARP-focused protocol. > > > I found it useful for early development of RDS iWARP support. > > Hmm! Were you actually running it or just using it as a reference? I was running it to see if there were any issues with iWARP and IB co-existing (as I was developing a stand-alone iWARP RDS method at the time). It does work, contrary to the documentation. My only reason to keep it in would be for those developers who do not have access to IB/iWARP hardware. There may be people who would want to use it for something beyond its current usage, and this would lower the bar of entry....but I can't imagine who they are or why they would want to. > > While I > > believe that mainline inclusion is much more important, I don't see any > > harm in keeping it around. Is there a thread on lkml listing it as an > > issue for mainline inclusion? > > We haven't brought up rds inclusion on lkml yet, but I think positioning > RDS as an IB protocol is a good thing. RDS may be a better fit in > drivers/infiniband/ulp/rds, rather than net/rds, for example. Removing the TCP module would remove the need for it to be in the net/ dir (as it would then be only IB/iWARP). It will also remove the need to have it go through the netdev mailing list..which may or may not be a good thing.
> Removing unused code is also always a good thing. > > Lastly, we're looking to extend rds with more ib-centric features in the > future, so it would be a development burden to tunnel that over the TCP > transport...and that starts to sound a lot like reinventing iwarp ;-) I have no issues with removing it. I simply wanted to make sure that is is not being used. Thanks, Jon > Regards -- Andy From rdreier at cisco.com Tue Sep 2 14:24:22 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 02 Sep 2008 14:24:22 -0700 Subject: [ofa-general] [RFC] dropping RDS over TCP support In-Reply-To: <20080902210309.GW4314@obsidianresearch.com> (Jason Gunthorpe's message of "Tue, 2 Sep 2008 15:03:09 -0600") References: <48BD9C43.5020400@oracle.com> <20080902202415.GG32022@opengridcomputing.com> <48BDA71F.5030708@oracle.com> <20080902210309.GW4314@obsidianresearch.com> Message-ID: > Has anyone talked about a SW implementation of iWarp? http://www.osc.edu/research/network_file/projects/iwarp/iwarp_main.shtml From or.gerlitz at gmail.com Tue Sep 2 14:48:53 2008 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 3 Sep 2008 00:48:53 +0300 Subject: ***SPAM*** Re: [ofa-general] [RFC] dropping RDS over TCP support In-Reply-To: <48BD9C43.5020400@oracle.com> References: <48BD9C43.5020400@oracle.com> Message-ID: <15ddcffd0809021448n55b47b2fv6f23c0e47c38b807@mail.gmail.com> On Tue, Sep 2, 2008 at 11:04 PM, Andy Grover wrote: > We've been discussing dropping RDS's support for using TCP as a transport, > and just focusing on RDS as a IB and iWARP-focused protocol. do we have any results that compare Oracle IPC using rds/bcopy/tcp vs udp? This would simplify the RDS codebase, allow easier inclusion of more > IB-centric features, and also give RDS an easier path towards mainline Linux > kernel inclusion. Also, the imminent RDS iWARP support will address non-IB > use cases. So just to make sure, do IB and iWARP share the same transport code today? 
if yes, does removing TCP means the transport abstraction would not be needed any more, or you still want to maintain it for the loopback case? Generally speaking, the loopback transport also uses IB, correct? and if it doesn't I am quite sure it can. I tend to agree with Jon that removing TCP might help with mainline inclusion or might create damage... Roland, maybe you have more definitive intuitions re the netdev people potential feedback on rds as a new socket type applicable to RDMA cards such as IB and iWARP using a verbs/rdmacm native transport AND to non RDMA cards with TCP transport, vs the case of RDS being "just" a ULP under drivers/infiniband/ulps that defines a new socket type, etc. Or -------------- next part -------------- An HTML attachment was scrubbed... URL: From jon at opengridcomputing.com Tue Sep 2 14:57:21 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Tue, 2 Sep 2008 16:57:21 -0500 Subject: ***SPAM*** Re: [ofa-general] [RFC] dropping RDS over TCP support In-Reply-To: <15ddcffd0809021448n55b47b2fv6f23c0e47c38b807@mail.gmail.com> References: <48BD9C43.5020400@oracle.com> <15ddcffd0809021448n55b47b2fv6f23c0e47c38b807@mail.gmail.com> Message-ID: <20080902215721.GJ32022@opengridcomputing.com> On Wed, Sep 03, 2008 at 12:48:53AM +0300, Or Gerlitz wrote: > On Tue, Sep 2, 2008 at 11:04 PM, Andy Grover wrote: > > > We've been discussing dropping RDS's support for using TCP as a transport, > > and just focusing on RDS as a IB and iWARP-focused protocol. > > > do we have any results that compare Oracle IPC using rds/bcopy/tcp vs udp? > > This would simplify the RDS codebase, allow easier inclusion of more > > IB-centric features, and also give RDS an easier path towards mainline Linux > > kernel inclusion. Also, the imminent RDS iWARP support will address non-IB > > use cases. > > > So just to make sure, do IB and iWARP share the same transport code today? 
> if yes, does removing TCP means the transport abstraction would not be > needed any more, or you still want to maintain it for the loopback case? > Generally speaking, the loopback transport also uses IB, correct? and if it > doesn't I am quite sure it can. Not all iWARP devices can do loopback, so having it done in the IB specific code could be painful. I believe there is a loopback module in RDS which can handle this though. > > I tend to agree with Jon that removing TCP might help with mainline > inclusion or might create damage... > > Roland, maybe you have more definitive intuitions re the netdev people > potential feedback on rds as a new socket type applicable to RDMA cards such > as IB and iWARP using a verbs/rdmacm native transport AND to non RDMA cards > with TCP transport, vs the case of RDS being "just" a ULP under > drivers/infiniband/ulps that defines a new socket type, etc. > > > Or > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From richard.frank at oracle.com Tue Sep 2 15:42:05 2008 From: richard.frank at oracle.com (Richard Frank) Date: Tue, 02 Sep 2008 18:42:05 -0400 Subject: [rds-devel] ***SPAM*** Re: [ofa-general] [RFC] dropping RDS over TCP support In-Reply-To: <20080902215721.GJ32022@opengridcomputing.com> References: <48BD9C43.5020400@oracle.com> <15ddcffd0809021448n55b47b2fv6f23c0e47c38b807@mail.gmail.com> <20080902215721.GJ32022@opengridcomputing.com> Message-ID: <48BDC13D.9050806@oracle.com> Jon Mason wrote: > On Wed, Sep 03, 2008 at 12:48:53AM +0300, Or Gerlitz wrote: > >> On Tue, Sep 2, 2008 at 11:04 PM, Andy Grover wrote: >> >> >>> We've been discussing dropping RDS's support for using TCP as a transport, >>> and just focusing on RDS as a IB and iWARP-focused protocol. 
>>> >> do we have any results that compare Oracle IPC using rds/bcopy/tcp vs udp? >> >> This would simplify the RDS codebase, allow easier inclusion of more >> >>> IB-centric features, and also give RDS an easier path towards mainline Linux >>> kernel inclusion. Also, the imminent RDS iWARP support will address non-IB >>> use cases. >>> >> So just to make sure, do IB and iWARP share the same transport code today? >> if yes, does removing TCP means the transport abstraction would not be >> needed any more, or you still want to maintain it for the loopback case? >> Generally speaking, the loopback transport also uses IB, correct? and if it >> doesn't I am quite sure it can. >> > > Not all iWARP devices can do loopback, so having it done in the IB > specific code could be painful. I believe there is a loopback module in > RDS which can handle this though. > Currently, loop back (connecting to local ip:port) is handled by the transport (with IB it uses a local IB RC) - we moved to this to simplify the driver and support things like performing local process comm including rdma'ing between processes - which is a key feature in of itself. If a particular transport can not support this - then it should either emulate operations when possible - or fail them. For example, getting a key for rdma may fail if the connection is loop back. > >> I tend to agree with Jon that removing TCP might help with mainline >> inclusion or might create damage... >> >> Roland, maybe you have more definitive intuitions re the netdev people >> potential feedback on rds as a new socket type applicable to RDMA cards such >> as IB and iWARP using a verbs/rdmacm native transport AND to non RDMA cards >> with TCP transport, vs the case of RDS being "just" a ULP under >> drivers/infiniband/ulps that defines a new socket type, etc. 
>> >> >> Or >> > > >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > > _______________________________________________ > rds-devel mailing list > rds-devel at oss.oracle.com > http://oss.oracle.com/mailman/listinfo/rds-devel > From richard.frank at oracle.com Tue Sep 2 15:49:18 2008 From: richard.frank at oracle.com (Richard Frank) Date: Tue, 02 Sep 2008 18:49:18 -0400 Subject: [rds-devel] [ofa-general] [RFC] dropping RDS over TCP support In-Reply-To: <15ddcffd0809021448n55b47b2fv6f23c0e47c38b807@mail.gmail.com> References: <48BD9C43.5020400@oracle.com> <15ddcffd0809021448n55b47b2fv6f23c0e47c38b807@mail.gmail.com> Message-ID: <48BDC2EE.5050105@oracle.com> Or Gerlitz wrote: > On Tue, Sep 2, 2008 at 11:04 PM, Andy Grover > wrote: > > We've been discussing dropping RDS's support for using TCP as a > transport, and just focusing on RDS as a IB and iWARP-focused > protocol. > > > do we have any results that compare Oracle IPC using rds/bcopy/tcp vs udp? > Nothing current - would be great to see rds-stress data for RDS/TCP over 10GE compared to RDS/IB over 10GIB and RDS/IWARP over 10G. The purpose of the TCP transport was to support simple ethernet NICs with bcopy. Our thinking at the time was that TCP (even if the path lengths are longer than UDP) would be more efficient under heavy load - than running UDP from user mode. > This would simplify the RDS codebase, allow easier inclusion of > more IB-centric features, and also give RDS an easier path towards > mainline Linux kernel inclusion. Also, the imminent RDS iWARP > support will address non-IB use cases. > > > So just to make sure, do IB and iWARP share the same transport code > today? 
if yes, does removing TCP means the transport abstraction would > not be needed any more, or you still want to maintain it for the > loopback case? Generally speaking, the loopback transport also uses > IB, correct? and if it doesn't I am quite sure it can. > > I tend to agree with Jon that removing TCP might help with mainline > inclusion or might create damage... > Why does having the TCP module affect the issue of main line inclusion - what is / are the issues ? > Roland, maybe you have more definitive intuitions re the netdev people > potential feedback on rds as a new socket type applicable to RDMA > cards such as IB and iWARP using a verbs/rdmacm native transport AND > to non RDMA cards with TCP transport, vs the case of RDS being "just" > a ULP under drivers/infiniband/ulps that defines a new socket type, etc. > > > Or > ------------------------------------------------------------------------ > > _______________________________________________ > rds-devel mailing list > rds-devel at oss.oracle.com > http://oss.oracle.com/mailman/listinfo/rds-devel From amirv at mellanox.co.il Wed Sep 3 00:42:17 2008 From: amirv at mellanox.co.il (Amir Vadai) Date: Wed, 03 Sep 2008 10:42:17 +0300 Subject: [ofa-general] Re: [PATCH] libsdp: enable fallback to TCP for nonblocking sockets In-Reply-To: <48B6E63F.6060309@gmail.com> References: <48AC445D.2050704@gmail.com> <5D49E7A8952DC44FB38C38FA0D758EAD5865EA@mtlexch01.mtl.com> <48AD9C80.8030305@gmail.com> <1219590681.1564.10.camel@amirv-laptop> <48B2CD3A.5020509@gmail.com> <5D49E7A8952DC44FB38C38FA0D758EAD61E699@mtlexch01.mtl.com> <48B6E63F.6060309@gmail.com> Message-ID: <1220427737.6824.3.camel@mtllpt156.mtl.com> Yossi Hi, Because you need things fixed immediately I applied your "enable fallback to TCP..." patch. And will fix it ASAP - not to break the non blocking semantics. If your IO signals solution looks good I'll be happy to use it instead. - Amir. 
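The quoted thread below refers to the connect man page's advice for non-blocking sockets: wait with select() until the socket is writable, then read SO_ERROR with getsockopt() to learn whether connect() succeeded. A minimal sketch of that check follows, assuming plain TCP over loopback. The function names finish_connect_sketch and connect_loopback_sketch are hypothetical and are not part of libsdp.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for a pending non-blocking
 * connect() on fd. Returns 0 if the connection was established, otherwise
 * an errno value (ETIMEDOUT if the socket never became writable). */
static int finish_connect_sketch(int fd, int timeout_ms)
{
    fd_set wfds;
    struct timeval tv;
    int err = 0;
    socklen_t errlen = sizeof(err);
    int n;

    assert(timeout_ms >= 0);

    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(fd + 1, NULL, &wfds, NULL, &tv);
    if (n < 0)
        return errno;
    if (n == 0)
        return ETIMEDOUT;

    /* Writability only says that connect() finished; SO_ERROR says how. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
        return errno;
    return err;
}

/* Hypothetical demo: non-blocking TCP connect to 127.0.0.1:port (port in
 * host byte order). Returns 0 on success, otherwise an errno value. */
static int connect_loopback_sketch(unsigned short port, int timeout_ms)
{
    struct sockaddr_in addr;
    int fd, flags, rc;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return errno;

    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        rc = errno;
        close(fd);
        return rc;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        rc = 0;                                     /* connected at once */
    else if (errno == EINPROGRESS)
        rc = finish_connect_sketch(fd, timeout_ms); /* wait, then check */
    else
        rc = errno;                                 /* immediate failure */

    close(fd);
    return rc;
}
```

A fallback layer along the lines discussed in the thread could presumably start the TCP connect whenever a check like this reports an error for the SDP socket, while still returning EAGAIN to the caller in the meantime.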
On Thu, 2008-08-28 at 20:54 +0300, Yossi Etigin wrote: > Hi, > > I'm attempting to do this with IO signals - install a signal handler > that > will be called when the connect fails, and it will do the fallback. > > --Yossi > > Amir Vadai wrote: > > > > Yossi Hi, > > > > I'm on vacation till Monday. > > I'll check when can we have the full fix - and if it is not in the > near > > future > > we'll put your patch till the full fix be prepared. > > > > - Amir > > > > -----Original Message----- > > From: Yossi Etigin [mailto:yossi.openib at gmail.com] > > Sent: Mon 8/25/2008 6:18 PM > > To: Amir Vadai > > Cc: general list; Oren Duer; Olga Shern > > Subject: Re: [PATCH] libsdp: enable fallback to TCP for nonblocking > sockets > > > > Hi Amir, > > > > The single case in which we block connect() here (and only on SDP, > which > > is rather fast) is the case that is currenlty not supported anyway. > It can > > also be configurable. > > Anyway, we have a client which uses non-blocking sockets and really > needs > > that feature. How about putting this to OFED now and writing > something > > better > > later on? > > > > --Yossi > > > > > > Amir Vadai wrote: > > > See below > > > > > > On Thu, 2008-08-21 at 19:49 +0300, Yossi Etigin wrote: > > >> Hi Amir, > > >> > > >> What you suggesting is to replace almost all socket functions, > and I > > >> don't think that this is good either. > > > I agree - but to break the non-blocking semantics is worse. > > > > > >> It would be write(), send(), recv(), sendto(), recvfrom(), > sendmsg(), > > >> recvmsg(), and also need to change select() (to not return when > > >> fallback > > >> happens if SDP fails), and maybe also poll(). libsdp tries to > avoid > > >> the fast path. > > > I don't see another option. We could have a #ifdef to enable the > user > > > to choose - non blocking support or cleaner fast-path. 
> > >> Besides, how do we know when to do fallback - can we safely assume
> > >> that if some socket operation fails, then it happened because
> > >> connect() failed?
> > > From a brief look at the connect man page, it says we should use select
> > > for writing on the socket. After select indicates writability, use
> > > getsockopt to determine whether connect() completed successfully or not.
> > >> Anyway, if I understand correctly, you suggest something like:
> > >>
> > >> int connect(fd, ...)
> > >> {
> > >>         ...
> > >>         set_state(fd, SDP);
> > >>         ...
> > >> }
> > >>
> > >> int read(int fd, ...)
> > >> {
> > >>         int res = socket_funcs.read(shadow_fd(fd), ...);
> > >>         if (res < 0 && errno != EAGAIN && sock_state(fd) == SDP) {
> > >>                 set_state(fd, TCP);
> > >>                 socket_funcs.connect(fd, ...);
> > >>                 close(shadow_fd(fd));
> > >>                 errno = EAGAIN;
> > >>         }
> > >>         return res;
> > >> }
> > >>
> > > ... again, I don't like it either - but I don't think we should block
> > > connect when the user asks not to.
> > > - Amir.
> > >> --Yossi
> > >>
> > >> Amir Vadai wrote:
> > >>> Yossi Hi,
> > >>>
> > >>> I think that breaking the semantics of a non-blocking socket is a bad
> > >>> idea.
> > >>> There is a solution that won't break these semantics:
> > >>>
> > >>> 1. User app calls connect().
> > >>>    - libsdp tries to connect through SDP.
> > >>> 2. User app tries another operation on the socket (e.g. read/write)
> > >>>    - if the SDP connection was established successfully - great
> > >>>    - if SDP is still not established - return -EAGAIN. This is the
> > >>>      same behaviour as if the TCP connection wasn't connected yet.
> > >>>    - if SDP timed out - return -EAGAIN and initiate TCP connect.
> > >>>    - if the TCP connection is established - use it
> > >>>    - if the TCP connection timed out - return error.
> > >>>
> > >>> Maybe we could optimize it and initiate a TCP connection in parallel
> > >>> with the SDP connection and use it only when the SDP connect has
> > >>> timed out.
> > >>> > > >>> I will add only the second patch (the debug print fix). > > >>> > > >>> - Amir > > >>> > > >>> > > >> > > >> > > > > > > > From sashak at voltaire.com Wed Sep 3 02:47:06 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 3 Sep 2008 12:47:06 +0300 Subject: [ofa-general] Re: [IBSIM] add ReLink command In-Reply-To: <1220386347.29252.358.camel@cardanus.llnl.gov> References: <1219964487.29252.318.camel@cardanus.llnl.gov> <20080831134503.GL27535@sashak.voltaire.com> <1220386347.29252.358.camel@cardanus.llnl.gov> Message-ID: <20080903094706.GE21573@sashak.voltaire.com> On 13:12 Tue 02 Sep , Al Chu wrote: > From 5316be9376a36d3ed075be9ccff58f07aaeb0cbd Mon Sep 17 00:00:00 2001 > From: Albert Chu > Date: Thu, 28 Aug 2008 15:25:14 -0700 > Subject: [PATCH] add relink command > > > Signed-off-by: Albert Chu Applied. Thanks. Sasha From vlad at lists.openfabrics.org Wed Sep 3 03:05:20 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 3 Sep 2008 03:05:20 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20080903-0200 daily build status Message-ID: <20080903100520.B8702E608A2@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with 
linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ia64 with linux-2.6.25
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.24

Failed:

From tziporet at mellanox.co.il Wed Sep 3 09:06:27 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 3 Sep 2008 19:06:27 +0300
Subject: [ofa-general] OFED status toward RC1
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD6BB502@mtlexch01.mtl.com>

Hi

These are the open items we had for RC1:

Done:
- iSER
- MVAPICH2 1.1
- Open MPI 1.2.7
- Extended QP verb - decided not to include this change for 1.4 and use the workaround we had in 1.3

Not done:
- NFS/RDMA support for SLES10 - Jeff, when do you expect this will be ready?

In addition we have a critical bug: IPv6 is not working over IPoIB on kernels 2.6.23 and below - Vlad is debugging this.

We are therefore delaying the RC1 release to Friday (if this issue is closed by tomorrow) or Monday next week.
Tziporet

From Jeffrey.C.Becker at nasa.gov Wed Sep 3 09:14:51 2008
From: Jeffrey.C.Becker at nasa.gov (Jeff Becker)
Date: Wed, 03 Sep 2008 09:14:51 -0700
Subject: [ofa-general] Re: [ewg] OFED status toward RC1
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD6BB502@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD6BB502@mtlexch01.mtl.com>
Message-ID: <48BEB7FB.9010802@nasa.gov>

Hi Tziporet

Tziporet Koren wrote:
> Hi
>
> These are the open items we had for RC1:
> Done:
> - iSER
> - MVAPICH2 1.1
> - Open MPI 1.2.7
> - Extended QP verb - decided not to include this change for 1.4 and
>   use the workaround we had in 1.3
>
> Not done:
> - NFS/RDMA support for SLES10 - Jeff when do you expect this will be
>   ready
>
I should get it to build completely today, and then I will do some light testing. When it passes, I will send my patches to Vlad, hopefully by the end of this week. Thanks.

-jeff

> In addition we have a critical bug that IPv6 is not working over IPoIB
> from kernel 2.6.23 and below - Vlad debugging this
>
> We thus delay the RC1 release to Friday (if this issue will be closed by
> tomorrow) or Monday next week.
>
> Tziporet
>
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>

From chu11 at llnl.gov Wed Sep 3 11:21:18 2008
From: chu11 at llnl.gov (Al Chu)
Date: Wed, 03 Sep 2008 11:21:18 -0700
Subject: [ofa-general] [OPENSM] fix console segfault corner casem
Message-ID: <1220466078.29252.386.camel@cardanus.llnl.gov>

Hey Sasha,

If the call to osm_console_init() fails (most typically because bind fails when the port is already in use), we can fall through into osm_console() and segfault because a bunch of state isn't initialized properly. This can be handled in multiple ways. The patch below makes osm_console_init() return a value instead of void so we can recognize when an error occurred.
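The attached patch itself was scrubbed from the archive, so the following is only a rough sketch of the pattern described above and not the actual OpenSM change: an init routine that reports failure through its return value, and a caller that skips the console loop when init failed. All names here (console_sketch, console_init_sketch, run_console_sketch, the bind_ok flag) are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the console state; not the OpenSM structure. */
struct console_sketch {
    int initialized;  /* set only when init succeeded */
    int listen_fd;    /* -1 while no socket is open */
};

/* Hypothetical init routine: returns 0 on success, -1 on failure.
 * bind_ok simulates whether bind() on the console port succeeded. */
static int console_init_sketch(struct console_sketch *c, int bind_ok)
{
    c->initialized = 0;
    c->listen_fd = -1;

    if (!bind_ok) {
        fprintf(stderr, "console: bind failed, console disabled\n");
        return -1;
    }

    c->listen_fd = 3;  /* placeholder descriptor for the sketch */
    c->initialized = 1;
    return 0;
}

/* Caller side: enter the console loop only when init reported success.
 * Returns 1 if the console would be serviced, 0 if it was skipped. */
static int run_console_sketch(struct console_sketch *c, int bind_ok)
{
    if (console_init_sketch(c, bind_ok) != 0)
        return 0;  /* do not touch uninitialized console state */

    assert(c->initialized);
    /* ... the console loop would run here ... */
    return 1;
}
```

With a void init routine the caller has no way to tell the two cases apart, which appears to be the situation that led to the segfault.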
Al -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-fix-segfault-corner-case-when-osm_console_init-fails.patch Type: text/x-patch Size: 3511 bytes Desc: not available URL: From chu11 at llnl.gov Wed Sep 3 11:21:22 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 03 Sep 2008 11:21:22 -0700 Subject: [ofa-general] [OPENSM] close console socket Message-ID: <1220466082.29252.387.camel@cardanus.llnl.gov> Hey Sasha, While fixing the console segfault issue, I noticed that the console socket never seems to be closed. Al -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0002-close-console-socket-on-cleanup-path.patch Type: text/x-patch Size: 786 bytes Desc: not available URL: From christopher.tanner at gatech.edu Wed Sep 3 11:32:15 2008 From: christopher.tanner at gatech.edu (Christopher Tanner) Date: Wed, 3 Sep 2008 14:32:15 -0400 Subject: [ofa-general] Compiling source using Intel Compiler Message-ID: Has anyone built the various IB source packages using the Intel compilers? The configure, make, and make install all progressed without any errors. However, when I try to start OpenSM, I get the following error error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory The LD_LIBRARY_PATH contains the path to the icc and ifort lib directories, so this is not the problem. The reason I'm building from source is because I'm trying to utilize Infiniband on an Ubuntu cluster. Additionally, I need to use the Intel compilers as some of our Fortran programs cannot be compiled using gfortran... Thanks! 
------------------------------------------- Chris Tanner Space Systems Design Lab Georgia Institute of Technology christopher.tanner at gatech.edu ------------------------------------------- From aj.guillon at gmail.com Wed Sep 3 11:49:00 2008 From: aj.guillon at gmail.com (AJ Guillon) Date: Wed, 3 Sep 2008 14:49:00 -0400 Subject: [ofa-general] ***SPAM*** Interrupt RDMA Read In-Reply-To: <2f3bf9a60808312300q778c7aaen23b2ca70d5f2c1ea@mail.gmail.com> References: <9870a2060808311148h65c7950g735e5d33d4690960@mail.gmail.com> <2f3bf9a60808312300q778c7aaen23b2ca70d5f2c1ea@mail.gmail.com> Message-ID: <5CE471FB-7159-4BFE-BD1B-371089AB8ED6@gmail.com> Hrrrm. That's really too bad because I would like to use RDMA to steal work from other nodes along with dependent memory. If I'm loading memory for a task on one node, and another node steals the task, the node from which the task was stolen should stop fetching memory required for the now stolen task. A more complex scheduler might be able to deal with this but maybe not optimally. Suggestions for workarounds? AJ On Sep 1, 2008, at 2:00 AM, "Dotan Barak" wrote: > As much as i know, once you posted a WR, you can not cancel it. > The only thing that you can do is flush the whole QP by changing the > QP state to ERROR (which flushes the work Queues and produces > completion for every WR) or to RESET, which cleans the Queues from the > WRs. > > > Dotan > > On Sun, Aug 31, 2008 at 9:48 PM, Adrien Guillon > wrote: >> Hey, >> >> How can I interrupt an RDMA read cleanly? In my case, I might decide >> that I don't need to read some memory anymore (because something else >> happened), so I want to abort. 
>>
>> AJ
>>

From tziporet at dev.mellanox.co.il Wed Sep 3 12:32:10 2008
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 03 Sep 2008 22:32:10 +0300
Subject: [ofa-general] Re: [ewg] OFED status toward RC1
In-Reply-To: <48BEB7FB.9010802@nasa.gov>
References: <5D49E7A8952DC44FB38C38FA0D758EAD6BB502@mtlexch01.mtl.com> <48BEB7FB.9010802@nasa.gov>
Message-ID: <48BEE63A.2040707@mellanox.co.il>

Jeff Becker wrote:
>>
>> Not done:
>> - NFS/RDMA support for SLES10 - Jeff when do you expect this will
>>   be ready
>>
> I should get it to completely build today, and then I will do some
> light testing. When it passes, I will send my patches to Vlad,
> hopefully by the end of this week. Thanks.
>
So we will hold RC1 until Monday.
Can you send the first patches to Vlad today so he can try them tomorrow? Otherwise, if we have a problem, it will delay the release to Tuesday.

Thanks
Tziporet

From dotanba at gmail.com Wed Sep 3 23:15:08 2008
From: dotanba at gmail.com (Dotan Barak)
Date: Thu, 4 Sep 2008 09:15:08 +0300
Subject: [ofa-general] ***SPAM*** Interrupt RDMA Read
In-Reply-To: <5CE471FB-7159-4BFE-BD1B-371089AB8ED6@gmail.com>
References: <9870a2060808311148h65c7950g735e5d33d4690960@mail.gmail.com> <2f3bf9a60808312300q778c7aaen23b2ca70d5f2c1ea@mail.gmail.com> <5CE471FB-7159-4BFE-BD1B-371089AB8ED6@gmail.com>
Message-ID: <2f3bf9a60809032315ha6ba13cobdb9682c1cbe0a5f@mail.gmail.com>

How would you solve it if you were using TCP/IP sockets?

Dotan

On Wed, Sep 3, 2008 at 9:49 PM, AJ Guillon wrote:
> Hrrrm. That's really too bad because I would like to use RDMA to steal work
> from other nodes along with dependent memory.
If I'm loading memory for a > task on one node, and another node steals the task, the node from which the > task was stolen should stop fetching memory required for the now stolen > task. A more complex scheduler might be able to deal with this but maybe not > optimally. > > Suggestions for workarounds? > > AJ > > On Sep 1, 2008, at 2:00 AM, "Dotan Barak" wrote: > >> As much as i know, once you posted a WR, you can not cancel it. >> The only thing that you can do is flush the whole QP by changing the >> QP state to ERROR (which flushes the work Queues and produces >> completion for every WR) or to RESET, which cleans the Queues from the >> WRs. >> >> >> Dotan >> >> On Sun, Aug 31, 2008 at 9:48 PM, Adrien Guillon >> wrote: >>> >>> Hey, >>> >>> How can I interrupt an RDMA read cleanly? In my case, I might decide >>> that I don't need to read some memory anymore (because something else >>> happened), so I want to abort. >>> >>> AJ >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >>> > From vlad at lists.openfabrics.org Thu Sep 4 03:03:37 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 4 Sep 2008 03:03:37 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20080904-0200 daily build status Message-ID: <20080904100337.5264EE60A04@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed 
on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.26 Passed on ia64 with linux-2.6.25 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Failed: From chu11 at llnl.gov Thu Sep 4 10:00:58 2008 From: chu11 at llnl.gov (Al Chu) Date: Thu, 04 Sep 2008 10:00:58 -0700 Subject: [ofa-general] [OPENSM] fix console segfault corner casem In-Reply-To: <1220466078.29252.386.camel@cardanus.llnl.gov> References: <1220466078.29252.386.camel@cardanus.llnl.gov> Message-ID: <1220547658.26758.23.camel@cardanus.llnl.gov> Hey Sasha, I thought of a way to make it slightly cleaner. New patch attached. 
Al On Wed, 2008-09-03 at 11:21 -0700, Al Chu wrote: > Hey Sasha, > > If the call to osm_console_init() fails (most typically b/c bind fails > b/c the port is already used), we can fall through into osm_console() > and segfault b/c a bunch of stuff isn't initialized properly. Can be > handled multiple ways. The patch below makes osm_console_init() return > a non-void so we can recognize if an error occurred. > > Al > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-fix-segfault-corner-case-when-osm_console_init-fails.patch Type: text/x-patch Size: 3494 bytes Desc: not available URL: From aj.guillon at gmail.com Thu Sep 4 10:36:09 2008 From: aj.guillon at gmail.com (AJ Guillon) Date: Thu, 4 Sep 2008 13:36:09 -0400 Subject: [ofa-general] ***SPAM*** Interrupt RDMA Read In-Reply-To: <2f3bf9a60809032315ha6ba13cobdb9682c1cbe0a5f@mail.gmail.com> References: <9870a2060808311148h65c7950g735e5d33d4690960@mail.gmail.com> <2f3bf9a60808312300q778c7aaen23b2ca70d5f2c1ea@mail.gmail.com> <5CE471FB-7159-4BFE-BD1B-371089AB8ED6@gmail.com> <2f3bf9a60809032315ha6ba13cobdb9682c1cbe0a5f@mail.gmail.com> Message-ID: Reading the socket would block. Use a signal to interrupt and have a variable set to tell it to abort. AJ On Sep 4, 2008, at 2:15 AM, "Dotan Barak" wrote: > How would you solve it if you would have used TCP/IP sockets? > > Dotan > > On Wed, Sep 3, 2008 at 9:49 PM, AJ Guillon > wrote: >> Hrrrm. That's really too bad because I would like to use RDMA to >> steal work >> from other nodes along with dependent memory. 
If I'm loading memory >> for a >> task on one node, and another node steals the task, the node from >> which the >> task was stolen should stop fetching memory required for the now >> stolen >> task. A more complex scheduler might be able to deal with this but >> maybe not >> optimally. >> >> Suggestions for workarounds? >> >> AJ >> >> On Sep 1, 2008, at 2:00 AM, "Dotan Barak" wrote: >> >>> As much as i know, once you posted a WR, you can not cancel it. >>> The only thing that you can do is flush the whole QP by changing the >>> QP state to ERROR (which flushes the work Queues and produces >>> completion for every WR) or to RESET, which cleans the Queues from >>> the >>> WRs. >>> >>> >>> Dotan >>> >>> On Sun, Aug 31, 2008 at 9:48 PM, Adrien Guillon >> > >>> wrote: >>>> >>>> Hey, >>>> >>>> How can I interrupt an RDMA read cleanly? In my case, I might >>>> decide >>>> that I don't need to read some memory anymore (because something >>>> else >>>> happened), so I want to abort. >>>> >>>> AJ >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit >>>> http://openib.org/mailman/listinfo/openib-general >>>> >> From sashak at voltaire.com Thu Sep 4 11:18:50 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 4 Sep 2008 21:18:50 +0300 Subject: [ofa-general] Compiling source using Intel Compiler In-Reply-To: References: Message-ID: <20080904181850.GC6273@sashak.voltaire.com> On 14:32 Wed 03 Sep , Christopher Tanner wrote: > Has anyone built the various IB source packages using the Intel compilers? > The configure, make, and make install all progressed without any errors. > However, when I try to start OpenSM, I get the following error > > error while loading shared libraries: libimf.so: cannot open shared object > file: No such file or directory We don't have such library libimf.so. 
It is something from icc... > The LD_LIBRARY_PATH contains the path to the icc and ifort lib directories, > so this is not the problem. The reason I'm building from source is because > I'm trying to utilize Infiniband on an Ubuntu cluster. Additionally, I need > to use the Intel compilers as some of our Fortran programs cannot be > compiled using gfortran... But why you cannot use gcc for building OFED packages? Sasha From hal.rosenstock at gmail.com Thu Sep 4 13:26:07 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 4 Sep 2008 16:26:07 -0400 Subject: [ofa-general] Re: [PATCH] ibsim: Add support for vendor ID and system image GUID In-Reply-To: <20080831144645.GN27535@sashak.voltaire.com> References: <48A30108.4010307@obsidianresearch.com> <20080818201718.GJ27204@sashak.voltaire.com> <48B5E711.7030503@obsidianresearch.com> <20080831144645.GN27535@sashak.voltaire.com> Message-ID: Hi Sasha, On Sun, Aug 31, 2008 at 10:46 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 17:45 Wed 27 Aug , Hal Rosenstock wrote: >>>> diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c >>>> index 6e3c0e9..146bcde 100644 >>>> --- a/ibsim/sim_net.c >>>> +++ b/ibsim/sim_net.c >>>> @@ -190,7 +190,9 @@ char (*aliases)[NODEIDLEN + NODEPREFIX + 1]; // >>>> aliases map format: "%s@%s" >>>> int netnodes, netswitches, netports, netaliases; >>>> char netprefix[NODEPREFIX + 1]; >>>> +int netvendid; >>>> int netdevid; >>>> +uint64_t netsysimgguid; >>>> int netwidth = DEFAULT_LINKWIDTH; >>>> int netspeed = DEFAULT_LINKSPEED; >>>> @@ -324,11 +326,12 @@ static Node *new_node(int type, char *nodename, >>>> char *nodedesc, int nodeports) >>>> } >>>> mad_set_field(nd->nodeinfo, 0, IB_NODE_NPORTS_F, nd->numports); >>>> + mad_set_field(nd->nodeinfo, 0, IB_NODE_VENDORID_F, netvendid); >>>> mad_set_field(nd->nodeinfo, 0, IB_NODE_DEVID_F, netdevid); >>>> mad_encode_field(nd->nodeinfo, IB_NODE_GUID_F, &nd->nodeguid); >>>> mad_encode_field(nd->nodeinfo, IB_NODE_PORT_GUID_F, &nd->nodeguid); >>>> - 
mad_encode_field(nd->nodeinfo, IB_NODE_SYSTEM_GUID_F, &nd->nodeguid); >>>> + mad_encode_field(nd->nodeinfo, IB_NODE_SYSTEM_GUID_F, &netsysimgguid); >>>> >>> >>> And when netsysimgguid was not parsed for this node, it will put previous >>> value there (or "0" if it was never parsed)? >>> >> Is "state" for a node in the topology file needed to deal with this ? >> Something like the following: When the vendor ID line is seen, reset >> netsysimgguid and if 0 when new_node is invoked, then use the node GUID as >> currently done. Does that make sense ? > > Why to not reset netsysimgguid unconditionally at end of new_node()? Sure; that's better as the "state" for new node is already determined. Updated patch to follow shortly. -- Hal > The rest could be as you said: > > mad_encode_field(nd->nodeinfo, IB_NODE_SYSTEM_GUID_F, > netsysimgguid ? &netsysimgguid : &nd->nodeguid); > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at obsidianresearch.com Thu Sep 4 13:41:26 2008 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Thu, 04 Sep 2008 14:41:26 -0600 Subject: [ofa-general] [PATCHv2] ibsim: Add support for vendor ID and system image GUID] Message-ID: <48C047F6.9040605@obsidianresearch.com> Sasha, Attached is the updated patch for adding support for vendor ID and system image GUID to ibsim utilizing your idea to reset netsysimgguid to 0 in new_node. -- Hal -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: patch-ibsim-sysimg3 URL: From christopher.tanner at gatech.edu Thu Sep 4 13:50:47 2008 From: christopher.tanner at gatech.edu (Christopher Tanner) Date: Thu, 4 Sep 2008 16:50:47 -0400 Subject: [ofa-general] Compiling source using Intel Compiler In-Reply-To: <20080904181850.GC6273@sashak.voltaire.com> References: <20080904181850.GC6273@sashak.voltaire.com> Message-ID: > We don't have such library libimf.so. It is something from icc... Yes, it is something from icc. The limited support for this type of error states that I need to load a compiler module in order for OpenSM to find the library. However, there's never a mention of the name of the module that I need to load. A modprobe -l *intel* gives the following: lvm-intel.ko intel_vr_nor.ko intelfb.ko intel-agp.ko intel.rng.ko snd-hda-intel.ko snd-intel8x0m.ko snd-intel8x0.ko intel-agp.ich9m.ko Nothing for a modprobe on *icc*. So, I'm stuck... > But why you cannot use gcc for building OFED packages? Our codes have a lot of Fortran 77 in them and gfortran hasn't been compiling those codes very well. Since we're using ifort for Fortran compiling, I figured we ought to use icc (C) and icpc (C++) to use a consistent compiler package. I don't know if programs partially compiled in gcc and ifort will work very well... ------------------------------------------- Chris Tanner Space Systems Design Lab Georgia Institute of Technology christopher.tanner at gatech.edu ------------------------------------------- On Sep 4, 2008, at 2:18 PM, Sasha Khapyorsky wrote: > On 14:32 Wed 03 Sep , Christopher Tanner wrote: >> Has anyone built the various IB source packages using the Intel >> compilers? >> The configure, make, and make install all progressed without any >> errors. >> However, when I try to start OpenSM, I get the following error >> >> error while loading shared libraries: libimf.so: cannot open shared >> object >> file: No such file or directory > > We don't have such library libimf.so. It is something from icc... 
> >> The LD_LIBRARY_PATH contains the path to the icc and ifort lib >> directories, >> so this is not the problem. The reason I'm building from source is >> because >> I'm trying to utilize Infiniband on an Ubuntu cluster. >> Additionally, I need >> to use the Intel compilers as some of our Fortran programs cannot be >> compiled using gfortran... > > But why you cannot use gcc for building OFED packages? > > Sasha From halr at obsidianresearch.com Thu Sep 4 13:56:30 2008 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Thu, 04 Sep 2008 14:56:30 -0600 Subject: [ofa-general] Compiling source using Intel Compiler In-Reply-To: References: <20080904181850.GC6273@sashak.voltaire.com> Message-ID: <48C04B7E.5080609@obsidianresearch.com> Christopher Tanner wrote: >> We don't have such library libimf.so. It is something from icc... > > Yes, it is something from icc. The limited support for this type of > error states that I need to load a compiler module in order for OpenSM > to find the library. However, there's never a mention of the name of > the module that I need to load. A modprobe -l *intel* gives the > following: > lvm-intel.ko > intel_vr_nor.ko > intelfb.ko > intel-agp.ko > intel.rng.ko > snd-hda-intel.ko > snd-intel8x0m.ko > snd-intel8x0.ko > intel-agp.ich9m.ko > > Nothing for a modprobe on *icc*. So, I'm stuck... Isn't it a library, not a module ? Shouldn't it be part of the icc install ? Does the link below help ? http://softwarecommunity.intel.com/isn/Community/en-us/search/SearchResults.aspx?q=libimf.so -- Hal > >> But why you cannot use gcc for building OFED packages? > > Our codes have a lot of Fortran 77 in them and gfortran hasn't been > compiling those codes very well. Since we're using ifort for Fortran > compiling, I figured we ought to use icc (C) and icpc (C++) to use a > consistent compiler package. I don't know if programs partially > compiled in gcc and ifort will work very well... 
> > ------------------------------------------- > Chris Tanner > Space Systems Design Lab > Georgia Institute of Technology > christopher.tanner at gatech.edu > ------------------------------------------- > > > > On Sep 4, 2008, at 2:18 PM, Sasha Khapyorsky wrote: > >> On 14:32 Wed 03 Sep , Christopher Tanner wrote: >>> Has anyone built the various IB source packages using the Intel >>> compilers? >>> The configure, make, and make install all progressed without any >>> errors. >>> However, when I try to start OpenSM, I get the following error >>> >>> error while loading shared libraries: libimf.so: cannot open shared >>> object >>> file: No such file or directory >> >> We don't have such library libimf.so. It is something from icc... >> >>> The LD_LIBRARY_PATH contains the path to the icc and ifort lib >>> directories, >>> so this is not the problem. The reason I'm building from source is >>> because >>> I'm trying to utilize Infiniband on an Ubuntu cluster. Additionally, >>> I need >>> to use the Intel compilers as some of our Fortran programs cannot be >>> compiled using gfortran... >> >> But why you cannot use gcc for building OFED packages? 
>> >> Sasha > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From caitlin.bestler at gmail.com Thu Sep 4 13:55:03 2008 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Thu, 4 Sep 2008 13:55:03 -0700 Subject: [ofa-general] ***SPAM*** Interrupt RDMA Read In-Reply-To: <5CE471FB-7159-4BFE-BD1B-371089AB8ED6@gmail.com> References: <9870a2060808311148h65c7950g735e5d33d4690960@mail.gmail.com> <2f3bf9a60808312300q778c7aaen23b2ca70d5f2c1ea@mail.gmail.com> <5CE471FB-7159-4BFE-BD1B-371089AB8ED6@gmail.com> Message-ID: <469958e00809041355w20969a17gd32e71fd06c7bc2@mail.gmail.com> On Wed, Sep 3, 2008 at 11:49 AM, AJ Guillon wrote: > Hrrrm. That's really too bad because I would like to use RDMA to steal work > from other nodes along with dependent memory. If I'm loading memory for a > task on one node, and another node steals the task, the node from which the > task was stolen should stop fetching memory required for the now stolen > task. A more complex scheduler might be able to deal with this but maybe not > optimally. > > Suggestions for workarounds? > If the target device allows you to have multiple RDMA Reads in flight, you could just break the entire Read down into a series of N smaller reads, and have a fraction of that (say N/3 or N/4) in flight at a time. When you want to "cancel" the read, simply stop issuing the remaining Reads. If your read is small that the RTT time determines its duration rather than the length read then you didn't need to cancel anyway. 
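[Editor's sketch] The chunking idea above can be modeled without any verbs calls. The following is a minimal sketch, not code from the thread: the struct and the chunked_read_* names are hypothetical, and a real application would post one IBV_WR_RDMA_READ work request via ibv_post_send() for every chunk that chunked_read_next() hands out, and call chunked_read_complete() for each completion it polls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one large logical read split into chunks. */
struct chunked_read {
    size_t total;        /* bytes in the whole logical read */
    size_t chunk;        /* maximum bytes per individual read */
    size_t next_off;     /* offset of the next chunk to issue */
    unsigned in_flight;  /* chunks issued but not yet completed */
    unsigned window;     /* maximum chunks allowed in flight */
    int cancelled;       /* set once the caller gives up on the read */
};

void chunked_read_init(struct chunked_read *r, size_t total,
                       size_t chunk, unsigned window)
{
    assert(r != NULL && chunk > 0 && window > 0);
    r->total = total;
    r->chunk = chunk;
    r->next_off = 0;
    r->in_flight = 0;
    r->window = window;
    r->cancelled = 0;
}

/* Returns the length of the next chunk to post and stores its offset in
 * *off. Returns 0 when nothing may be posted right now: the read was
 * cancelled, the window is full, or every chunk has been issued. */
size_t chunked_read_next(struct chunked_read *r, size_t *off)
{
    size_t len;

    if (r->cancelled || r->in_flight >= r->window || r->next_off >= r->total)
        return 0;
    len = r->total - r->next_off;
    if (len > r->chunk)
        len = r->chunk;
    *off = r->next_off;
    r->next_off += len;
    r->in_flight++;
    return len;
}

/* Call once per completed chunk. */
void chunked_read_complete(struct chunked_read *r)
{
    if (r->in_flight > 0)
        r->in_flight--;
}

/* "Cancel" only stops new chunks; chunks already in flight still finish. */
void chunked_read_cancel(struct chunked_read *r)
{
    r->cancelled = 1;
}

/* True once no chunk is outstanding and nothing more will be issued. */
int chunked_read_done(const struct chunked_read *r)
{
    return r->in_flight == 0 && (r->cancelled || r->next_off >= r->total);
}
```

With a window of N/3 or N/4 chunks, the cost of a cancel is bounded by the chunks already in flight, which is the trade-off described above.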
From dillowda at ornl.gov Thu Sep 4 14:01:51 2008 From: dillowda at ornl.gov (David Dillow) Date: Thu, 04 Sep 2008 17:01:51 -0400 Subject: [ofa-general] Compiling source using Intel Compiler In-Reply-To: References: <20080904181850.GC6273@sashak.voltaire.com> Message-ID: <1220562111.7854.12.camel@obelisk.thedillows.org> On Thu, 2008-09-04 at 16:50 -0400, Christopher Tanner wrote: > > We don't have such library libimf.so. It is something from icc... > > Yes, it is something from icc. The limited support for this type of > error states that I need to load a compiler module in order for OpenSM > to find the library. [snip] > Nothing for a modprobe on *icc*. So, I'm stuck... You shouldn't need a kernel module for this.... If you do 'locate libimf.so' what do you get? If you get a path to it, try running OpenSM with LD_LIBRARY_PATH set to include that path, for example: $ locate libimf.so /opt/icc/some/path/libimf.so $ LD_LIBRARY_PATH=/opt/icc/some/path /path/to/opensm options... If that works, then odds are good you installed icc (or its support libraries) incompletely -- are its libraries installed in the correct place (or listed in /etc/ld.so.conf or /etc/ld.so.conf.d/*)? If the path is listed in those places, did you run ldconfig as root after the install? -- Dave Dillow National Center for Computational Science Oak Ridge National Laboratory (865) 241-6602 office From christopher.tanner at gatech.edu Thu Sep 4 14:09:37 2008 From: christopher.tanner at gatech.edu (Christopher Tanner) Date: Thu, 4 Sep 2008 17:09:37 -0400 Subject: [ofa-general] Compiling source using Intel Compiler In-Reply-To: <48C04B7E.5080609@obsidianresearch.com> References: <20080904181850.GC6273@sashak.voltaire.com> <48C04B7E.5080609@obsidianresearch.com> Message-ID: <6032B6F0-F04A-42A0-B155-5A4C57DF8EFB@gatech.edu> > Isn't it a library, not a module ? Yeah, which is why I'm really confused.
Here's the link to the website which says I need to load a module (it's around the middle of the way down) http://asci-training.lanl.gov/BProc/ > Shouldn't it be part of the icc install ? Yup. The library exists and the path to it is in the LD_LIBRARY_PATH. Again, confusion. > Does the link below help ? Yes, I've read this before. However, I've also read that a static compilation won't work very well on a cluster, but I haven't tested it out. Since I'm stuck, I'll try this out to see if it works. Thanks Hal. ------------------------------------------- Chris Tanner Space Systems Design Lab Georgia Institute of Technology christopher.tanner at gatech.edu ------------------------------------------- On Sep 4, 2008, at 4:56 PM, Hal Rosenstock wrote: > Christopher Tanner wrote: >>> We don't have such library libimf.so. It is something from icc... >> >> Yes, it is something from icc. The limited support for this type of >> error states that I need to load a compiler module in order for >> OpenSM to find the library. However, there's never a mention of the >> name of the module that I need to load. A modprobe -l *intel* gives >> the following: >> lvm-intel.ko >> intel_vr_nor.ko >> intelfb.ko >> intel-agp.ko >> intel.rng.ko >> snd-hda-intel.ko >> snd-intel8x0m.ko >> snd-intel8x0.ko >> intel-agp.ich9m.ko >> >> Nothing for a modprobe on *icc*. So, I'm stuck... > Isn't it a library, not a module ? > > Shouldn't it be part of the icc install ? > > Does the link below help ? > > http://softwarecommunity.intel.com/isn/Community/en-us/search/SearchResults.aspx?q=libimf.so > > > -- Hal >> >>> But why you cannot use gcc for building OFED packages? >> >> Our codes have a lot of Fortran 77 in them and gfortran hasn't been >> compiling those codes very well. Since we're using ifort for >> Fortran compiling, I figured we ought to use icc (C) and icpc (C++) >> to use a consistent compiler package. I don't know if programs >> partially compiled in gcc and ifort will work very well... 
>> >> ------------------------------------------- >> Chris Tanner >> Space Systems Design Lab >> Georgia Institute of Technology >> christopher.tanner at gatech.edu >> ------------------------------------------- >> >> >> >> On Sep 4, 2008, at 2:18 PM, Sasha Khapyorsky wrote: >> >>> On 14:32 Wed 03 Sep , Christopher Tanner wrote: >>>> Has anyone built the various IB source packages using the Intel >>>> compilers? >>>> The configure, make, and make install all progressed without any >>>> errors. >>>> However, when I try to start OpenSM, I get the following error >>>> >>>> error while loading shared libraries: libimf.so: cannot open >>>> shared object >>>> file: No such file or directory >>> >>> We don't have such library libimf.so. It is something from icc... >>> >>>> The LD_LIBRARY_PATH contains the path to the icc and ifort lib >>>> directories, >>>> so this is not the problem. The reason I'm building from source >>>> is because >>>> I'm trying to utilize Infiniband on an Ubuntu cluster. >>>> Additionally, I need >>>> to use the Intel compilers as some of our Fortran programs cannot >>>> be >>>> compiled using gfortran... >>> >>> But why you cannot use gcc for building OFED packages? >>> >>> Sasha >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Thu Sep 4 16:08:12 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 5 Sep 2008 02:08:12 +0300 Subject: [ofa-general] Compiling source using Intel Compiler In-Reply-To: References: <20080904181850.GC6273@sashak.voltaire.com> Message-ID: <20080904230812.GJ6273@sashak.voltaire.com> On 16:50 Thu 04 Sep , Christopher Tanner wrote: > > Our codes have a lot of Fortran 77 in them and gfortran hasn't been > compiling those codes very well. 
Since we're using ifort for Fortran > compiling, I figured we ought to use icc (C) and icpc (C++) to use a > consistent compiler package. I don't know if programs partially compiled in > gcc and ifort will work very well... But you don't need ifort or gfortran for building OpenSM. So you can use gcc for OpenSM and icc/... for the rest. Sasha From sashak at voltaire.com Thu Sep 4 16:23:57 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 5 Sep 2008 02:23:57 +0300 Subject: [ofa-general] [OPENSM] fix console segfault corner casem In-Reply-To: <1220547658.26758.23.camel@cardanus.llnl.gov> References: <1220466078.29252.386.camel@cardanus.llnl.gov> <1220547658.26758.23.camel@cardanus.llnl.gov> Message-ID: <20080904232357.GK6273@sashak.voltaire.com> On 10:00 Thu 04 Sep , Al Chu wrote: > >From 28b61a86e83f547409be6cd6b4a3c6a613e1123f Mon Sep 17 00:00:00 2001 > From: Albert Chu > Date: Thu, 4 Sep 2008 09:58:01 -0700 > Subject: [PATCH] fix segfault corner case when osm_console_init fails > > > Signed-off-by: Albert Chu Applied. Thanks. Sasha From sashak at voltaire.com Thu Sep 4 16:28:26 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 5 Sep 2008 02:28:26 +0300 Subject: [ofa-general] Re: [OPENSM] close console socket In-Reply-To: <1220466082.29252.387.camel@cardanus.llnl.gov> References: <1220466082.29252.387.camel@cardanus.llnl.gov> Message-ID: <20080904232826.GL6273@sashak.voltaire.com> Hi Al, On 11:21 Wed 03 Sep , Al Chu wrote: > > diff --git a/opensm/opensm/osm_console_io.c b/opensm/opensm/osm_console_io.c > index 2822737..3d3ece4 100644 > --- a/opensm/opensm/osm_console_io.c > +++ b/opensm/opensm/osm_console_io.c > @@ -118,6 +118,10 @@ static void osm_console_close(osm_console_t * p_oct, osm_log_t * p_log) > p_oct->client_hn, p_oct->client_ip); > cio_close(p_oct); > } > + if (p_oct->socket > 0) { > + close(p_oct->socket); > + p_oct->socket = -1; > + } > #endif > } Would this work good for stdin (when local console is in use)? 
I see that fd_in descriptor is closed in cio_close(), isn't it enough (I didn't look closely yet)? Sasha From weiny2 at llnl.gov Thu Sep 4 16:41:44 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 4 Sep 2008 16:41:44 -0700 Subject: [ofa-general] Compiling source using Intel Compiler In-Reply-To: <20080904230812.GJ6273@sashak.voltaire.com> References: <20080904181850.GC6273@sashak.voltaire.com> <20080904230812.GJ6273@sashak.voltaire.com> Message-ID: <20080904164144.3637dc50.weiny2@llnl.gov> Christopher, Correct me if I am wrong below... On Fri, 5 Sep 2008 02:08:12 +0300 Sasha Khapyorsky wrote: > On 16:50 Thu 04 Sep , Christopher Tanner wrote: > > > > Our codes have a lot of Fortran 77 in them and gfortran hasn't been > > compiling those codes very well. Since we're using ifort for Fortran > > compiling, I figured we ought to use icc (C) and icpc (C++) to use a > > consistent compiler package. I don't know if programs partially compiled in > > gcc and ifort will work very well... > > But you don't need ifort or gfortran for building OpenSM. So you can use > gcc for OpenSM and icc/... for the rest. Sasha, I think he is compiling from the OFED release. Unfortunately I believe this only allows you to specify one compiler for the entire "distro". Christopher, If you absolutely can't figure out why icc's libraries are not being found, I can think of 2 alternatives. 1) Try running install.pl twice, once with each compiler. First to build only the packages required for MPI with icc. Then all the management and support stuff with gcc. I don't know if this is possible because I am afraid to run install.pl as root and have it corrupt one of my nodes right now. However, looking inside the script leads me to believe you can select the packages you want built. 2) Extract (from the OFED tarball) the OpenSM and management source rpms and build them with gcc.
That list would be: opensm-3.2.2-1.ofed1.4.beta1.src.rpm infiniband-diags-1.4.1-1.ofed1.4.beta1.src.rpm libibcommon-1.1.1-1.ofed1.4.beta1.src.rpm libibmad-1.2.1-1.ofed1.4.beta1.src.rpm libibumad-1.2.1-1.ofed1.4.beta1.src.rpm Here at LLNL we have been building OFED pieces by hand for years. YMMV... Hope this helps, Ira From sashak at voltaire.com Thu Sep 4 16:50:04 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 5 Sep 2008 02:50:04 +0300 Subject: [ofa-general] Compiling source using Intel Compiler In-Reply-To: <20080904164144.3637dc50.weiny2@llnl.gov> References: <20080904181850.GC6273@sashak.voltaire.com> <20080904230812.GJ6273@sashak.voltaire.com> <20080904164144.3637dc50.weiny2@llnl.gov> Message-ID: <20080904235004.GM6273@sashak.voltaire.com> On 16:41 Thu 04 Sep , Ira Weiny wrote: > > I think he is compiling from the OFED release. Unfortunately I believe this > only allows you to specify one compiler for the entire "distro". And how about "CC=gcc ./configure"? I guess something similar may work with rpmbuild, although management can be compiled from tarballs or git tree just fine.
Sasha From chu11 at llnl.gov Thu Sep 4 17:07:00 2008 From: chu11 at llnl.gov (Al Chu) Date: Thu, 04 Sep 2008 17:07:00 -0700 Subject: [ofa-general] Re: [OPENSM] close console socket In-Reply-To: <20080904232826.GL6273@sashak.voltaire.com> References: <1220466082.29252.387.camel@cardanus.llnl.gov> <20080904232826.GL6273@sashak.voltaire.com> Message-ID: <1220573220.27074.11.camel@cardanus.llnl.gov> Hey Sasha, On Fri, 2008-09-05 at 02:28 +0300, Sasha Khapyorsky wrote: > Hi Al, > > On 11:21 Wed 03 Sep , Al Chu wrote: > > > > diff --git a/opensm/opensm/osm_console_io.c b/opensm/opensm/osm_console_io.c > > index 2822737..3d3ece4 100644 > > --- a/opensm/opensm/osm_console_io.c > > +++ b/opensm/opensm/osm_console_io.c > > @@ -118,6 +118,10 @@ static void osm_console_close(osm_console_t * p_oct, osm_log_t * p_log) > > p_oct->client_hn, p_oct->client_ip); > > cio_close(p_oct); > > } > > + if (p_oct->socket > 0) { > > + close(p_oct->socket); > > + p_oct->socket = -1; > > + } > > #endif > > } > > Would this work good for stdin (when local console is in use)? As far as I can tell, p_oct->socket is only created when OSM_REMOTE_CONSOLE or OSM_LOOPBACK_CONSOLE is set (in osm_console_init ()). > I see that > fd_in descriptor is closed in cio_close(), isn't it enough (I didn't > look closely yet)? >From osm_console() it seems the in_fd (set via cio_open()) is the socket returned from accept() when a connection is accepted. I couldn't find where the original socket itself was actually being closed. 
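[Editor's sketch] The distinction discussed above is between the listening socket and the per-connection descriptor returned by accept(): closing one does not close the other, so a cleanup path has to release both. A minimal sketch of a close-once helper follows. The name close_and_invalidate is hypothetical and not part of OpenSM, and the sketch assumes -1 is used as the "not open" sentinel. Note that the patch quoted above tests "> 0" instead, which also leaves descriptor 0 (stdin) alone.

```c
#include <assert.h>
#include <unistd.h>

/* Close *fd if it refers to an open descriptor and mark it invalid, so a
 * second call on the cleanup path is a harmless no-op.
 * Returns 0 if nothing had to be closed, otherwise the result of close(). */
int close_and_invalidate(int *fd)
{
    int rc;

    assert(fd != NULL);
    if (*fd < 0)
        return 0;       /* already closed or never opened */
    rc = close(*fd);
    *fd = -1;           /* never reuse the old number, even if close() failed */
    return rc;
}
```

A cleanup routine built this way would call the helper once for the accepted connection and once for the listening socket.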
Al > Sasha -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From raghuarur at gmail.com Thu Sep 4 18:17:46 2008 From: raghuarur at gmail.com (Raghu Arur) Date: Thu, 4 Sep 2008 18:17:46 -0700 Subject: [ofa-general] ***SPAM*** opensm master switchover Message-ID: <90a961640809041817wea775abtfa64aed623abcd2e@mail.gmail.com> When a opensm master changes in a subnet, is there a signal or event that is sent over that applications can listen to ? Thanks, From sashak at voltaire.com Thu Sep 4 18:21:21 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 5 Sep 2008 04:21:21 +0300 Subject: [ofa-general] Re: [OPENSM] close console socket In-Reply-To: <1220573220.27074.11.camel@cardanus.llnl.gov> References: <1220466082.29252.387.camel@cardanus.llnl.gov> <20080904232826.GL6273@sashak.voltaire.com> <1220573220.27074.11.camel@cardanus.llnl.gov> Message-ID: <20080905012121.GN6273@sashak.voltaire.com> On 17:07 Thu 04 Sep , Al Chu wrote: > > As far as I can tell, p_oct->socket is only created when > OSM_REMOTE_CONSOLE or OSM_LOOPBACK_CONSOLE is set (in osm_console_init > ()). Ok, I see now. Sasha From sashak at voltaire.com Thu Sep 4 18:23:00 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 5 Sep 2008 04:23:00 +0300 Subject: [ofa-general] Re: [OPENSM] close console socket In-Reply-To: <1220466082.29252.387.camel@cardanus.llnl.gov> References: <1220466082.29252.387.camel@cardanus.llnl.gov> Message-ID: <20080905012300.GO6273@sashak.voltaire.com> On 11:21 Wed 03 Sep , Al Chu wrote: > Subject: [PATCH] close console socket on cleanup path > > > Signed-off-by: Albert Chu Applied. Thanks. 
Sasha From sashak at voltaire.com Thu Sep 4 18:44:23 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 5 Sep 2008 04:44:23 +0300 Subject: [ofa-general] ***SPAM*** opensm master switchover In-Reply-To: <90a961640809041817wea775abtfa64aed623abcd2e@mail.gmail.com> References: <90a961640809041817wea775abtfa64aed623abcd2e@mail.gmail.com> Message-ID: <20080905014423.GP6273@sashak.voltaire.com> On 18:17 Thu 04 Sep , Raghu Arur wrote: > When a opensm master changes in a subnet, is there a signal or event > that is sent over that applications can listen to ? Look at IBV_EVENT_SM_CHANGE in verbs.h (libibverbs). Sasha From aj.guillon at gmail.com Thu Sep 4 18:53:51 2008 From: aj.guillon at gmail.com (Adrien Guillon) Date: Thu, 4 Sep 2008 21:53:51 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Interrupt RDMA Read In-Reply-To: <469958e00809041355w20969a17gd32e71fd06c7bc2@mail.gmail.com> References: <9870a2060808311148h65c7950g735e5d33d4690960@mail.gmail.com> <2f3bf9a60808312300q778c7aaen23b2ca70d5f2c1ea@mail.gmail.com> <5CE471FB-7159-4BFE-BD1B-371089AB8ED6@gmail.com> <469958e00809041355w20969a17gd32e71fd06c7bc2@mail.gmail.com> Message-ID: <9870a2060809041853m22d820f4gb7d1f533ba79390e@mail.gmail.com> That's another approach I have thought about... breaking the read into bite-sized chunks. However, it seems to me that this could lead to more CPU and process time being spent on managing network traffic. Realistically, I would only want to cancel RDMA reads which will take a relatively long time otherwise. So perhaps I set a data length maximum, and reads are broken down into segments of that maximum size as you suggested.... or I adjust the scheduler to not move tasks with large memory requirements. Another approach is to take a pattern from parallel programming: exponential backoff, but perhaps make it into exponential fetch... I fetch 1U, 2U, 4U, 8U, 16U, 32U... to n*nU (where U is the unit of measure, say KB or MB).
This way I can cancel the request at certain points, but it becomes harder to cancel each time because I already have so much of the data. Some random thoughts anyways :-) AJ From sashak at voltaire.com Thu Sep 4 18:57:12 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 5 Sep 2008 04:57:12 +0300 Subject: [ofa-general] Re: [PATCHv2] ibsim: Add support for vendor ID and system image GUID] In-Reply-To: <48C047F6.9040605@obsidianresearch.com> References: <48C047F6.9040605@obsidianresearch.com> Message-ID: <20080905015712.GQ6273@sashak.voltaire.com> On 14:41 Thu 04 Sep , Hal Rosenstock wrote: > Sasha, > > Attached is the updated patch for adding support for vendor ID and system > image GUID > to ibsim utilizing your idea to reset netsysimgguid to 0 in new_node. > > -- Hal > > ibsim: Add support for vendor ID and system image GUID > > Signed-off-by: Hal Rosenstock Applied with few fixes (see below). Thanks. > --- > v2: Reset netsysimgguid in new_node > > diff --git a/ibsim/sim_cmd.c b/ibsim/sim_cmd.c > index 820f77e..d587128 100644 > --- a/ibsim/sim_cmd.c > +++ b/ibsim/sim_cmd.c > @@ -571,8 +571,8 @@ static int dump_net(FILE * f, char *line) > fprintf(f, "\n%s %d \"%s\"", > node_type_name(node->type), > node->numports, node->nodeid); > - fprintf(f, "\tnodeguid %" PRIx64 "\n", node->nodeguid); > - > + fprintf(f, "\tnodeguid %" PRIx64 "\tsysimgguid %" PRIx64 "\n", > + node->nodeguid, node->sysguid); > nports = node->numports; > if (node->type == SWITCH_NODE) { > nports++; > diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c > index 6e3c0e9..55da898 100644 > --- a/ibsim/sim_net.c > +++ b/ibsim/sim_net.c > @@ -190,7 +190,9 @@ char (*aliases)[NODEIDLEN + NODEPREFIX + 1]; // aliases map format: "%s@%s" > > int netnodes, netswitches, netports, netaliases; > char netprefix[NODEPREFIX + 1]; > +int netvendid; > int netdevid; > +uint64_t netsysimgguid; > int netwidth = DEFAULT_LINKWIDTH; > int netspeed = DEFAULT_LINKSPEED; > > @@ -324,11 +326,12 @@ static Node 
*new_node(int type, char *nodename, char *nodedesc, int nodeports) > } > > mad_set_field(nd->nodeinfo, 0, IB_NODE_NPORTS_F, nd->numports); > + mad_set_field(nd->nodeinfo, 0, IB_NODE_VENDORID_F, netvendid); > mad_set_field(nd->nodeinfo, 0, IB_NODE_DEVID_F, netdevid); > > mad_encode_field(nd->nodeinfo, IB_NODE_GUID_F, &nd->nodeguid); > mad_encode_field(nd->nodeinfo, IB_NODE_PORT_GUID_F, &nd->nodeguid); > - mad_encode_field(nd->nodeinfo, IB_NODE_SYSTEM_GUID_F, &nd->nodeguid); > + mad_encode_field(nd->nodeinfo, IB_NODE_SYSTEM_GUID_F, &netsysimgguid); As we discussed, the system image GUID should be encoded from netsysimgguid if it is present in the file, or otherwise from nodeguid (as it was). So I changed this to: mad_encode_field(nd->nodeinfo, IB_NODE_SYSTEM_GUID_F, netsysimgguid ? &netsysimgguid : &nd->nodeguid); > > if ((nd->portsbase = new_ports(nd, nodeports, firstport)) < 0) { > IBWARN("can't alloc %d ports for node %s", nodeports, > @@ -336,6 +339,8 @@ static Node *new_node(int type, char *nodename, char *nodedesc, int nodeports) > return 0; > } > > + netsysimgguid = 0; The same goes for the newly introduced netvendid, so I added 'netvendid = 0' too.
Sasha From vlad at lists.openfabrics.org Fri Sep 5 03:04:50 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 5 Sep 2008 03:04:50 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20080905-0200 daily build status Message-ID: <20080905100450.5D39CE60972@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ia64 with linux-2.6.25 Passed on ppc64 with linux-2.6.16 
Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Failed: From yossi.openib at gmail.com Fri Sep 5 04:04:58 2008 From: yossi.openib at gmail.com (Yossi Etigin) Date: Fri, 05 Sep 2008 14:04:58 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] libsdp: enable fallback to TCP for nonblocking sockets In-Reply-To: <1220427737.6824.3.camel@mtllpt156.mtl.com> References: <48AC445D.2050704@gmail.com> <5D49E7A8952DC44FB38C38FA0D758EAD5865EA@mtlexch01.mtl.com> <48AD9C80.8030305@gmail.com> <1219590681.1564.10.camel@amirv-laptop> <48B2CD3A.5020509@gmail.com> <5D49E7A8952DC44FB38C38FA0D758EAD61E699@mtlexch01.mtl.com> <48B6E63F.6060309@gmail.com> <1220427737.6824.3.camel@mtllpt156.mtl.com> Message-ID: <48C1125A.2050702@gmail.com> Thanks, Unfortunately the signal solution does not look so good, mainly because it creates a race with 'select', and interrupts system calls. Looks like doing the fallback inside the signal handler is not valid/compatible behaviour. Amir Vadai wrote: > Yossi Hi, > > Because you need things fixed immediately I applied your "enable > fallback to TCP..." patch. > > And will fix it ASAP - not to break the non blocking semantics. > > If your IO signals solution looks good I'll be happy to use it instead. > > - Amir. > > On Thu, 2008-08-28 at 20:54 +0300, Yossi Etigin wrote: >> Hi, >> >> I'm attempting to do this with IO signals - install a signal handler >> that >> will be called when the connect fails, and it will do the fallback. >> >> --Yossi >> >> Amir Vadai wrote: >>> Yossi Hi, >>> >>> I'm on vacation till Monday. >>> I'll check when can we have the full fix - and if it is not in the >> near >>> future >>> we'll put your patch till the full fix be prepared. 
>>> >>> - Amir >>> >>> -----Original Message----- >>> From: Yossi Etigin [mailto:yossi.openib at gmail.com] >>> Sent: Mon 8/25/2008 6:18 PM >>> To: Amir Vadai >>> Cc: general list; Oren Duer; Olga Shern >>> Subject: Re: [PATCH] libsdp: enable fallback to TCP for nonblocking >> sockets >>> Hi Amir, >>> >>> The single case in which we block connect() here (and only on SDP, >> which >>> is rather fast) is the case that is currenlty not supported anyway. >> It can >>> also be configurable. >>> Anyway, we have a client which uses non-blocking sockets and really >> needs >>> that feature. How about putting this to OFED now and writing >> something >>> better >>> later on? >>> >>> --Yossi >>> >>> >>> Amir Vadai wrote: >>> > See below >>> > >>> > On Thu, 2008-08-21 at 19:49 +0300, Yossi Etigin wrote: >>> >> Hi Amir, >>> >> >>> >> What you suggesting is to replace almost all socket functions, >> and I >>> >> don't think that this is good either. >>> > I agree - but to break the non-blocking semantics is worse. >>> > >>> >> It would be write(), send(), recv(), sendto(), recvfrom(), >> sendmsg(), >>> >> recvmsg(), and also need to change select() (to not return when >>> >> fallback >>> >> happens if SDP fails), and maybe also poll(). libsdp tries to >> avoid >>> >> the fast path. >>> > I don't see another option. We could have a #ifdef to enable the >> user >>> > to choose - non blocking support or cleaner fast-path. >>> >> Besides, how do we know when to do fallback - can we safely >> assume >>> >> that if some socket operation fails, then it happened because >>> >> connect() failed? >>> >>From a brief look at connect man page, they say we should use >> select for >>> > writing on the socket. after select indicates writability, use >>> > getsockopt to determine whether connect() completed successfully >> or not. >>> >> Anyway, if I understand correctly, you suggest something like: >>> >> >>> >> int connect(fd, ...) >>> >> { >>> >> ... >>> >> set_state(fd, SDP) >>> >> ... 
>>> >> } >>> >> >>> >> >>> >> int read(int fd, ...) >>> >> { >>> >> int res = socket_funcs.read(shadow_fd(fd), ...); >>> >> if (res < 0 && errno != EAGAIN && sock_state(fd) == SDP) >> { >>> >> sock_state = TCP; >>> >> sockt_funs.connect(fd,...); >>> >> close(shadow_fd(fd)); >>> >> errno = EAGAIN; >>> >> } >>> >> return res; >>> >> } >>> >> >>> >> >>> > ... again, I don't like it too - but I don't think we should >> block >>> > connect when the user asks not to. >>> > - Amir. >>> >> --Yossi >>> >> >>> >> Amir Vadai wrote: >>> >>> Yossi Hi, >>> >>> >>> >>> I think that breaking the semantic of non blocking socket is a >> bad >>> >> idea. >>> >>> There is a solution that won't break this semantics: >>> >>> >>> >>> 1. User app calls connect(). >>> >>> - libsdp try to connect through sdp. >>> >>> 2. User app try another operation on the socket (e.g >> read/write) >>> >>> - if sdp connection established successfully - great >>> >>> - if sdp still not established - return -EAGAIN. This is >> the >>> >>> same behaviour as if the tcp connection wasn't connected yet. >>> >>> - if sdp timedout - return -EAGAIN and initiate TCP >> connect. >>> >>> - if tcp connection established - use it >>> >>> - if tcp connection timedout - return error. >>> >>> >>> >>> Maybe we could optimize it and initiate a tcp connection in >> parallel >>> >>> with the sdp connection and use it only when the sdp connect is >>> >>> timedout. >>> >>> >>> >>> I will add only the second patch (the debug print fix). 
>>> >>> >>> >>> - Amir >>> >>> >>> >>> >>> >> >>> >> >>> > >>> >> > From Sumit.Gaur at Sun.COM Fri Sep 5 06:17:54 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Fri, 05 Sep 2008 18:47:54 +0530 Subject: [ofa-general] upgrade from 1.2.5* to 1.3.1 In-Reply-To: <20080905015440.9C33FE60D8F@openfabrics.org> References: <20080905015440.9C33FE60D8F@openfabrics.org> Message-ID: <48C13182.3020500@Sun.COM> Hi I have upgraded my OFED version from 1.2.5* to 1.3.1, Now application could not communicate with OFED libraries using umad_send and umad_recv function call for IB_SMI_CLASS (with DR path). Is there any major change in umad lib for such requests. Any help or info is appreciated. sumit From truelove at array.ca Fri Sep 5 06:50:14 2008 From: truelove at array.ca (Steven Truelove) Date: Fri, 05 Sep 2008 09:50:14 -0400 Subject: [ofa-general] ConnectX IB HCA with Ubuntu 8.04 In-Reply-To: References: <48AC6495.1040807@array.ca> Message-ID: <48C13916.9020307@array.ca> Hi, Thanks, I grabbed the Intrepid package and got things rolling. My ports are now in the INIT state. I believe I now need to run OpenSM. Is there a Ubuntu/Debian package that contains it? I haven't been able to find one. If not, could you please point me to what I should install to move forward? Thanks, Steven Truelove Roland Dreier wrote: > > I am trying to get Infiniband up and running on a Ubuntu 8.04 > > system. I can load the modules and see plenty of infiniband content > > under /sys/class, but when I try to run ibv_devices, I get this error: > > > > libibverbs: Warning: no userspace device-specific driver found for > > /sys/class/infiniband_verbs/uverbs0 > > That's because you need to install the device-specific userspace driver ;) > > Add my PPA to your software sources: > > deb http://ppa.launchpad.net/roland.dreier/ubuntu hardy main > deb-src http://ppa.launchpad.net/roland.dreier/ubuntu hardy main > > and do "aptitude install libmlx4-1" and you should be all set. 
> (the libmlx4 packages are also in the 8.10/Intrepid archive already). > > Let me know if you have any issues. > > - R. > > -- Steven Truelove Array Systems Computing, Inc. 1120 Finch Avenue West, 7th Floor Toronto, Ontario M3J 3H7 CANADA http://www.array.ca truelove at array.ca Phone: (416) 736-0900 x307 Fax: (416) 736-4715 From sashak at voltaire.com Fri Sep 5 07:22:26 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 5 Sep 2008 17:22:26 +0300 Subject: [ofa-general] ConnectX IB HCA with Ubuntu 8.04 In-Reply-To: <48C13916.9020307@array.ca> References: <48AC6495.1040807@array.ca> <48C13916.9020307@array.ca> Message-ID: <20080905142226.GV6273@sashak.voltaire.com> On 09:50 Fri 05 Sep , Steven Truelove wrote: > > Thanks, I grabbed the Intrepid package and got things rolling. My ports > are now in the INIT state. I believe I now need to run OpenSM. Is there a > Ubuntu/Debian package that contains it? I haven't been able to find one. I am not aware of one. > If not, could you please point me to what I should install to move forward? http://www.openfabrics.org/downloads/management/README , and latest tarballs: http://www.openfabrics.org/downloads/management , or even more recent stuff directly from git tree: git clone git://git.openfabrics.org/~sashak/management Sasha From bs at q-leap.de Fri Sep 5 07:39:56 2008 From: bs at q-leap.de (Bernd Schubert) Date: Fri, 5 Sep 2008 16:39:56 +0200 Subject: [ofa-general] ConnectX IB HCA with Ubuntu 8.04 In-Reply-To: <20080905142226.GV6273@sashak.voltaire.com> References: <48AC6495.1040807@array.ca> <48C13916.9020307@array.ca> <20080905142226.GV6273@sashak.voltaire.com> Message-ID: <200809051639.57300.bs@q-leap.de> On Friday 05 September 2008 16:22:26 Sasha Khapyorsky wrote: > On 09:50 Fri 05 Sep , Steven Truelove wrote: > > Thanks, I grabbed the Intrepid package and got things rolling. My > > ports are now in the INIT state. I believe I now need to run OpenSM. Is > > there a Ubuntu/Debian package that contains it?
I haven't been able to > > find one. > > I am not aware about such. We just didn't have the time yet to complete all the packaging and to push it upstream to Debian, here is what we have so far # Etchy packages, but also should work for hardy deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/etch ./ # Hardy packages, but not recently maintained (only for my workstation) deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/hardy/ ./ Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH From yossi.openib at gmail.com Fri Sep 5 08:00:46 2008 From: yossi.openib at gmail.com (Yossi Etigin) Date: Fri, 05 Sep 2008 18:00:46 +0300 Subject: [ofa-general] ***SPAM*** [PATCH] ipoib: fix hang while bringing down uninitialized interface Message-ID: <48C1499E.4080002@gmail.com> Fix bug #1172: If a pkey for an interface is not found during initialization, then poll_timer is left uninitialized. When the device is brought down, ipoib tries to del_timer_sync() it. This call hangs in an infinite loop in lock_timer_base(), because timer_base is NULL. We should check whether the timer was really initialized. 
Signed-off-by: Yossi Etigin -- diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 66cafa2..3bbf46d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -850,7 +850,10 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush) ipoib_dbg(priv, "All sends and receives done.\n"); timeout: - del_timer_sync(&priv->poll_timer); + /* Make sure the timer is initialized */ + if (priv->poll_timer.function) + del_timer_sync(&priv->poll_timer); + qp_attr.qp_state = IB_QPS_RESET; if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE)) ipoib_warn(priv, "Failed to modify QP to RESET state\n"); --Yossi From sashak at voltaire.com Fri Sep 5 09:14:20 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 5 Sep 2008 19:14:20 +0300 Subject: [ofa-general] ConnectX IB HCA with Ubuntu 8.04 In-Reply-To: <200809051639.57300.bs@q-leap.de> References: <48AC6495.1040807@array.ca> <48C13916.9020307@array.ca> <20080905142226.GV6273@sashak.voltaire.com> <200809051639.57300.bs@q-leap.de> Message-ID: <20080905161420.GZ6273@sashak.voltaire.com> On 16:39 Fri 05 Sep , Bernd Schubert wrote: > > We just didn't have the time yet to complete all the packaging and to push > it upstream to Debian, here is what we have so far > > # Etchy packages, but also should work for hardy > deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/etch ./ > > # Hardy packages, but not recently maintained (only for my workstation) > deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/hardy/ ./ Great! Sasha From tgree at relay.phys.ualberta.ca Fri Sep 5 12:38:17 2008 From: tgree at relay.phys.ualberta.ca (Terry Greeniaus) Date: Fri, 5 Sep 2008 13:38:17 -0600 (MDT) Subject: [ofa-general] ib_cm question Message-ID: Hello all, We are porting our application to run on the OFED stack. As part of the porting process, I have a series of CM unit tests that I need to get to run.
I am having trouble with one in particular. At a high level, the unit test implements a simple protocol for establishing a connection between a client and a server to test basic CM functionality. The protocol uses the private data field of the CM packets to exchange a key that is generated randomly by the server on a per-connection basis. Essentially, the client sends a REQ with a randomly chosen key which will not match the server's. When the server initially receives a REQ for a particular connection, it generates a random key and compares it against the key stored in the REQ. Since they don't match, the server sends a REJ back to the client, and the REJ contains the correct key in the private data field. Finally, the client resends the REQ, this time with the correct key: Client Server REQ -------------------> w/ bad key <------------------- REJ w/ good key REQ -------------------> w/ good key REP/etc. Everything works well until the second REQ is received at the server. It appears that instead of reusing the previous ib_cm_id, the OFED CM generates a new ib_cm_id to handle the second REQ. The unit test thinks that a new connection attempt is being requested instead of a retry of the original attempt and so it generates a new random key, resulting in the protocol being unable to establish a connection. Is something like I have described above supported by the OFED CM? I can try and distill this down to a fairly short code example if that would make things clearer. Thanks, TG From ofed at kononov.ftml.net Fri Sep 5 13:07:34 2008 From: ofed at kononov.ftml.net (Roman Kononov) Date: Fri, 05 Sep 2008 15:07:34 -0500 Subject: [ofa-general] Bogus Receive Completions Message-ID: <48C19186.2050903@kononov.ftml.net> This is continuation of http://lists.openfabrics.org/pipermail/general/2007-December/043658.html Basically, I have two processes on different computers talking to each other over a single QP per process. 
They both post and receive IBV_WR_RDMA_WRITE_WITH_IMM commands. All Send Work Requests are sequentially numbered in the wr_id field. When the process receives a Send Work Completion, wr_id is checked for consistency with the posted number. So far so good. All Receive Work Requests are sequentially numbered in the wr_id field as well. When the process gets a Receive Work Completion, wr_id is checked for consistency with the posted number. The consistency test eventually fails. The Completion status is "success", but wr_id is out of order. I believe that wr_id from Receive Work Completions must arrive in order, but they do not. I managed to reproduce the failure reliably in my environment. Then I modified mthca_tavor_post_recv(), mthca_tavor_post_send() to print all wr->wr_id values passing through them, and I modified mthca_poll_cq() to print all valid wc->wr_id values passing through it. The results from the two processes are attached. In stdout.1.log, one can see that a Receive Work Request with wr_id=0x7f was accepted and immediately completed, while the Receive Queue has 0x7f-0x40=0x3f uncompleted Work Requests. None of the mthca_tavor_post_recv() calls returned an error. This looks like a bug in libmthca or the firmware. I really need this fixed. Where should I go from this point? Any suggestions are appreciated. The QP is created with both SQ and RQ sizes set to 64, with a single CQ. The CQ size is set to 128. I have libibverbs-1.1.2 and libmthca-1.0.5 compiled from sources.
~>cat /etc/issue CentOS release 5.2 (Final) Kernel \r on an \m ~>uname -a Linux node100 2.6.26.3 #1 SMP PREEMPT Wed Sep 3 14:11:03 CDT 2008 x86_64 x86_64 x86_64 GNU/Linux ~>grep 'model name' /proc/cpuinfo model name : Dual Core AMD Opteron(tm) Processor 285 model name : Dual Core AMD Opteron(tm) Processor 285 ~>ibv_devinfo hca_id: mthca0 fw_ver: 4.8.200 node_guid: 0002:c902:0026:dbe0 sys_image_guid: 0002:c902:0026:dbe3 vendor_id: 0x02c9 vendor_part_id: 25208 hw_ver: 0xA0 board_id: MT_02F0110002 phys_port_cnt: 2 ... Thanks, Roman Kononov -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: stdout.1.log URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: stdout.2.log URL: From truelove at array.ca Fri Sep 5 13:09:26 2008 From: truelove at array.ca (Steven Truelove) Date: Fri, 05 Sep 2008 16:09:26 -0400 Subject: [ofa-general] ConnectX IB HCA with Ubuntu 8.04 In-Reply-To: <20080905161420.GZ6273@sashak.voltaire.com> References: <48AC6495.1040807@array.ca> <48C13916.9020307@array.ca> <20080905142226.GV6273@sashak.voltaire.com> <200809051639.57300.bs@q-leap.de> <20080905161420.GZ6273@sashak.voltaire.com> Message-ID: <48C191F6.5020302@array.ca> Okay, thanks, I have run opensm and I have gotten IPoIB working as well, although there is a problem. IPoIB works fine with static IPs, but I can't get DHCP to work. The logs suggest that the DHCP server simply isn't seeing the DHCPDISCOVERs from the client. Here is the relevant chunk of dhcpd.conf: subnet 192.168.200.0 netmask 255.255.255.0 { always-broadcast on; range 192.168.200.10 192.168.200.50; option broadcast-address 192.168.200.255; } host sappsu4-ib { hardware ethernet 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00; fixed-address 192.168.200.104; } Does this have any chance of working? 
Thanks, Steven Truelove Sasha Khapyorsky wrote: > On 16:39 Fri 05 Sep , Bernd Schubert wrote: > >> We just didn't have the time yet to complete all the packaging and to push >> it upstream to Debian, here is what we have so far >> >> # Etchy packages, but also should work for hardy >> deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/etch ./ >> >> # Hardy packages, but not recently maintained (only for my workstation) >> deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/hardy/ ./ >> > > Great! > > Sasha > > -- Steven Truelove Array Systems Computing, Inc. 1120 Finch Avenue West, 7th Floor Toronto, Ontario M3J 3H7 CANADA http://www.array.ca truelove at array.ca Phone: (416) 736-0900 x307 Fax: (416) 736-4715 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Fri Sep 5 15:39:06 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 05 Sep 2008 15:39:06 -0700 Subject: [ofa-general] Bogus Receive Completions In-Reply-To: <48C19186.2050903@kononov.ftml.net> (Roman Kononov's message of "Fri, 05 Sep 2008 15:07:34 -0500") References: <48C19186.2050903@kononov.ftml.net> Message-ID: > I managed to reproduce the failure reliably in my environment. Can you provide the code to reproduce this? I'd like to try it on ConnectX to see if it is HCA-dependent. Also I would suggest raising this issue with whoever sold you the HCAs. 
From weiny2 at llnl.gov Fri Sep 5 15:47:16 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 5 Sep 2008 15:47:16 -0700 Subject: [ofa-general] [PATCH] ibnetdiscover.c: continue processing other ports even if smpquery fails on one port Message-ID: <20080905154716.54d82f0e.weiny2@llnl.gov> >From a08bca968a590bc041dabc733200469c78581d52 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Fri, 5 Sep 2008 15:40:17 -0700 Subject: [PATCH] ibnetdiscover.c: continue processing other ports even if smpquery fails on one port Signed-off-by: Ira Weiny --- infiniband-diags/src/ibnetdiscover.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 803c300..35e7118 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -424,7 +424,7 @@ discover(ib_portid_t *from) if (get_port(&port_buf, i, path) < 0) { IBWARN("can't reach node %s port %d", portid2str(path), i); - return 0; + continue; } port = find_port(node, &port_buf); -- 1.5.4.5 From ofed at kononov.ftml.net Fri Sep 5 16:11:05 2008 From: ofed at kononov.ftml.net (Roman Kononov) Date: Fri, 05 Sep 2008 18:11:05 -0500 Subject: [ofa-general] Bogus Receive Completions In-Reply-To: References: <48C19186.2050903@kononov.ftml.net> Message-ID: <48C1BC89.4080709@kononov.ftml.net> On 2008-09-05 17:39, Roland Dreier wrote: > > I managed to reproduce the failure reliably in my environment. > > Can you provide the code to reproduce this? I'd like to try it on > ConnectX to see if it is HCA-dependent. Perhaps, I can give you the code, but it needs lots of other HW and SW. Setting it up will be a big pain. And, by changing the code and moving stuff around, I can almost mask the problem, and it does not appear that soon. > > Also I would suggest raising this issue with whoever sold you the HCAs. What do you mean? Do you mean that the HCAs could be defective? The manufacturer is HP. The retailer is somebody. 
I'm sure, that money back is the most what I can get from them. BTW, I added more printfs in mthca_poll_one(), when it handles Receive Completions, and have noticed that cqe->wqe is out of sequence: ... wr_id=3a, is_error=0, wqe=e81, wqe_index=3a, cqe=0x84a7c0, imm=8000003a wr_id=3b, is_error=0, wqe=ec1, wqe_index=3b, cqe=0x84a7e0, imm=8000003b wr_id=3c, is_error=0, wqe=f01, wqe_index=3c, cqe=0x84a800, imm=8000003c wr_id=3d, is_error=0, wqe=f41, wqe_index=3d, cqe=0x84a820, imm=8000003d wr_id=3e, is_error=0, wqe=f81, wqe_index=3e, cqe=0x84a840, imm=8000003e wr_id=3f, is_error=0, wqe=fc1, wqe_index=3f, cqe=0x84a860, imm=8000003f wr_id=7f, is_error=0, wqe=fc1, wqe_index=3f, cqe=0x84a880, imm=80000040 "imm" is really cqe->imm_etype_pkey_eec, and it comes from the sender and is in sequence. Roman From rdreier at cisco.com Fri Sep 5 16:17:00 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 05 Sep 2008 16:17:00 -0700 Subject: [ofa-general] Bogus Receive Completions In-Reply-To: <48C1BC89.4080709@kononov.ftml.net> (Roman Kononov's message of "Fri, 05 Sep 2008 18:11:05 -0500") References: <48C19186.2050903@kononov.ftml.net> <48C1BC89.4080709@kononov.ftml.net> Message-ID: > Perhaps, I can give you the code, but it needs lots of other HW and > SW. Setting it up will be a big pain. And, by changing the code and > moving stuff around, I can almost mask the problem, and it does not > appear that soon. I suspect it's going to be hard to debug this without having a way to reproduce it. > > Also I would suggest raising this issue with whoever sold you the HCAs. > > What do you mean? Do you mean that the HCAs could be defective? The > manufacturer is HP. The retailer is somebody. I'm sure, that money > back is the most what I can get from them. I just mean that you should be able to get support as a customer and escalate this issue. Probably Mellanox is the only one who can debug this, especially because it could easily be a firmware issue. - R. 
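The consistency test described in this thread can be sketched in plain C. This is only an illustration under stated assumptions: the struct and function names are hypothetical, the receive work requests are assumed to be posted with consecutive wr_id values starting at first_wr_id, and the WQE index is assumed to equal wr_id modulo the receive queue size (64 in the report), which matches the log above up to wr_id=3f. It does not call libibverbs or libmthca.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One receive completion, reduced to the two fields compared here.
 * Hypothetical type for illustration; not a libibverbs structure. */
struct rx_completion {
    uint64_t wr_id;      /* wr_id reported with the completion */
    uint32_t wqe_index;  /* receive queue slot the CQE refers to */
};

/*
 * Return the position of the first completion that breaks the expected
 * sequence, or -1 if every completion is in order.
 *
 * Assumptions: wr_id values were posted as first_wr_id, first_wr_id + 1,
 * and so on, and rq_size is a power of two.
 */
long first_out_of_order(const struct rx_completion *c, size_t n,
                        uint64_t first_wr_id, uint32_t rq_size)
{
    assert(rq_size != 0 && (rq_size & (rq_size - 1)) == 0);

    for (size_t i = 0; i < n; i++) {
        uint64_t expected = first_wr_id + i;

        /* The wr_id must match the posting order. */
        if (c[i].wr_id != expected)
            return (long)i;

        /* The slot must be the one that wr_id was posted into. */
        if (c[i].wqe_index != (uint32_t)(expected & (rq_size - 1)))
            return (long)i;
    }
    return -1;
}
```

Fed the last three completions from the log above (wr_id 3e, 3f, 7f with wqe_index 3e, 3f, 3f) and first_wr_id = 0x3e, the function reports position 2, the entry where wr_id jumps to 7f while the WQE index repeats 3f.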
From christopher.tanner at gatech.edu Fri Sep 5 17:02:19 2008 From: christopher.tanner at gatech.edu (Christopher Tanner) Date: Fri, 5 Sep 2008 20:02:19 -0400 Subject: [ofa-general] OpenSM Ubuntu-unfriendly Message-ID: I'm trying to start the OpenSM daemon at startup by putting the opensmd file in the /etc/init.d directory. However, I get these errors when it tries to start: Starting opensm: /etc/init.d/opensmd: line 64: success: command not found /etc/init.d/opensmd: line 130: rc_exit: command not found Looking at the opensmd script, I noticed some things: a) It contains commands like rc_status, rc_exit, _rc_status_all which are not valid commands in Debian/Ubuntu b) It contains commands like success() and failure() which are not valid in Debian/Ubuntu I think these commands will work on Redhat or SUSE... Does anyone know the equivalent commands in Debian? For example, I think 'success' can be replaced with 'return 1', but I'm not certain how that affects the script. Thanks! ------------------------------------------- Chris Tanner Space Systems Design Lab Georgia Institute of Technology christopher.tanner at gatech.edu ------------------------------------------- From dledford at redhat.com Fri Sep 5 18:41:48 2008 From: dledford at redhat.com (Doug Ledford) Date: Fri, 05 Sep 2008 21:41:48 -0400 Subject: [ofa-general] ConnectX IB HCA with Ubuntu 8.04 In-Reply-To: <48C191F6.5020302@array.ca> References: <48AC6495.1040807@array.ca> <48C13916.9020307@array.ca> <20080905142226.GV6273@sashak.voltaire.com> <200809051639.57300.bs@q-leap.de> <20080905161420.GZ6273@sashak.voltaire.com> <48C191F6.5020302@array.ca> Message-ID: <1220665308.7801.45.camel@firewall.xsintricity.com> On Fri, 2008-09-05 at 16:09 -0400, Steven Truelove wrote: > Okay, thanks, I have run opensm and I have gotten IPoIB working as > well, although there is a problem. IPoIB works fine with static IPs, > but I can't get DHCP to work. 
The logs suggest that the DHCP server > simply isn't seeing the DHCPDISCOVERs from the client. Here is the > relevant chunk of dhcpd.conf: > > subnet 192.168.200.0 netmask 255.255.255.0 { > > always-broadcast on; > > range 192.168.200.10 192.168.200.50; > > option broadcast-address 192.168.200.255; > } > > > host sappsu4-ib { > hardware ethernet 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00; > fixed-address 192.168.200.104; > } > > > Does this have any chance of working? Will your dhcp server even start up with that hardware ethernet line in it? None of the patches for the dhcp server that I've seen enable dhcp to parse that big of an ethernet definition. > Thanks, > > Steven Truelove > > > > > Sasha Khapyorsky wrote: > > On 16:39 Fri 05 Sep , Bernd Schubert wrote: > > > > > We just didn't have the time yet to complete all the packaging and to push > > > it upstream to Debian, here is what we have so far > > > > > > # Etchy packages, but also should work for hardy > > > deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/etch ./ > > > > > > # Hardy packages, but not recently maintained (only for my workstation) > > > deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/hardy/ ./ > > > > > > > Great! > > > > Sasha > > > > > > -- > Steven Truelove > Array Systems Computing, Inc. > 1120 Finch Avenue West, 7th Floor > Toronto, Ontario > M3J 3H7 > CANADA > http://www.array.ca > truelove at array.ca > Phone: (416) 736-0900 x307 > Fax: (416) 736-4715 > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From vlad at lists.openfabrics.org Sat Sep 6 03:03:16 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 6 Sep 2008 03:03:16 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20080906-0200 daily build status Message-ID: <20080906100316.6609CE60D8B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with 
linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ia64 with linux-2.6.25 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Failed: From truelove at array.ca Sat Sep 6 04:18:23 2008 From: truelove at array.ca (Steven Truelove) Date: Sat, 06 Sep 2008 07:18:23 -0400 Subject: ***SPAM*** Re: [ofa-general] ConnectX IB HCA with Ubuntu 8.04 In-Reply-To: <1220665308.7801.45.camel@firewall.xsintricity.com> References: <48AC6495.1040807@array.ca> <48C13916.9020307@array.ca> <20080905142226.GV6273@sashak.voltaire.com> <200809051639.57300.bs@q-leap.de> <20080905161420.GZ6273@sashak.voltaire.com> <48C191F6.5020302@array.ca> <1220665308.7801.45.camel@firewall.xsintricity.com> Message-ID: <48C266FF.30108@array.ca> Yes, the DHCP server starts just fine. The hardware address was pulled from the output of ifconfig on the client. Even if the hardware address was wrong, or I didn't list the host at all, the 'range' setting should ensure that an address is provided from the open pool. There is output in the logs to indicate that there is no subnet listing for ib1 and eth0, and that it won't be listening on those interfaces. This implies that it is working on eth1 (where DHCP is tested working) and ib0 (where no log references to DHCPDISCOVER are made, even though the client is sending them). That said, am I barking up the wrong tree entirely by even trying to make this work? There are a few references to this being possible when I google for 'infiniband dhcp', and this is where I got the 'always-broadcast on' setting from. Apparently this is necessary. But I couldn't find anything further to help me. Thanks, Steven Truelove Doug Ledford wrote: > On Fri, 2008-09-05 at 16:09 -0400, Steven Truelove wrote: > >> Okay, thanks, I have run opensm and I have gotten IPoIB working as >> well, although there is a problem. 
IPoIB works fine with static IPs, >> but I can't get DHCP to work. The logs suggest that the DHCP server >> simply isn't seeing the DHCPDISCOVERs from the client. Here is the >> relevant chunk of dhcpd.conf: >> >> subnet 192.168.200.0 netmask 255.255.255.0 { >> >> always-broadcast on; >> >> range 192.168.200.10 192.168.200.50; >> >> option broadcast-address 192.168.200.255; >> } >> >> >> host sappsu4-ib { >> hardware ethernet 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00; >> fixed-address 192.168.200.104; >> } >> >> >> Does this have any chance of working? >> > > Will your dhcp server even start up with that hardware ethernet line in > it? None of the patches for the dhcp server that I've seen enable dhcp > to parse that big of an ethernet definition. > > >> Thanks, >> >> Steven Truelove >> >> >> >> >> Sasha Khapyorsky wrote: >> >>> On 16:39 Fri 05 Sep , Bernd Schubert wrote: >>> >>> >>>> We just didn't have the time yet to complete all the packaging and to push >>>> it upstream to Debian, here is what we have so far >>>> >>>> # Etchy packages, but also should work for hardy >>>> deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/etch ./ >>>> >>>> # Hardy packages, but not recently maintained (only for my workstation) >>>> deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/hardy/ ./ >>>> >>>> >>> Great! >>> >>> Sasha >>> >>> >>> >> -- >> Steven Truelove >> Array Systems Computing, Inc. >> 1120 Finch Avenue West, 7th Floor >> Toronto, Ontario >> M3J 3H7 >> CANADA >> http://www.array.ca >> truelove at array.ca >> Phone: (416) 736-0900 x307 >> Fax: (416) 736-4715 >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hal.rosenstock at gmail.com Sat Sep 6 05:23:31 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 6 Sep 2008 08:23:31 -0400 Subject: ***SPAM*** Re: [ofa-general] ConnectX IB HCA with Ubuntu 8.04 In-Reply-To: <48C191F6.5020302@array.ca> References: <48AC6495.1040807@array.ca> <48C13916.9020307@array.ca> <20080905142226.GV6273@sashak.voltaire.com> <200809051639.57300.bs@q-leap.de> <20080905161420.GZ6273@sashak.voltaire.com> <48C191F6.5020302@array.ca> Message-ID: On Fri, Sep 5, 2008 at 4:09 PM, Steven Truelove wrote: > Okay, thanks, I have run opensm and I have gotten IPoIB working as well, > although there is a problem. IPoIB works fine with static IPs, but I can't > get DHCP to work. The logs suggest that the DHCP server simply isn't seeing > the DHCPDISCOVERs from the client. Here is the relevant chunk of > dhcpd.conf: > > subnet 192.168.200.0 netmask 255.255.255.0 { > > always-broadcast on; > > range 192.168.200.10 192.168.200.50; > > option broadcast-address 192.168.200.255; > } > > > host sappsu4-ib { > hardware ethernet 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00; I think it's done with client identifier for IB and it should be set to the IPoIB hardware address (20 bytes) which is QPN + GID. You should be able to determine this from: ip addr show ib e.g. ip addr show ib0 5: ib0: mtu 2044 qdisc noop qlen 128 link/[32] 00:0d:00:48:20:06:00:00:00:00:00:00:00:02:c9:03:00:00:14:91 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff ip addr show ib1 6: ib1: mtu 2044 qdisc noop qlen 128 link/[32] 00:0d:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:00:14:92 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff -- Hal > fixed-address 192.168.200.104; > } > > > Does this have any chance of working? 
> > Thanks, > > Steven Truelove > > > > > Sasha Khapyorsky wrote: > > On 16:39 Fri 05 Sep , Bernd Schubert wrote: > > > We just didn't have the time yet to complete all the packaging and to push > it upstream to Debian, here is what we have so far > # Etchy packages, but also should work for hardy > deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/etch > ./ > # Hardy packages, but not recently maintained (only for my workstation) > deb > http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/hardy/ ./ > > > Great! > Sasha > > > -- > Steven Truelove > Array Systems Computing, Inc. > 1120 Finch Avenue West, 7th Floor > Toronto, Ontario > M3J 3H7 > CANADA > http://www.array.ca > truelove at array.ca > Phone: (416) 736-0900 x307 > Fax: (416) 736-4715 > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Sat Sep 6 06:33:55 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 6 Sep 2008 09:33:55 -0400 Subject: [ofa-general] ib_cm question In-Reply-To: References: Message-ID: Hi, On Fri, Sep 5, 2008 at 3:38 PM, Terry Greeniaus wrote: > Hello all, > > We are porting out application to run on the OFED stack. As part of the > porting process, I have a series of CM unit tests that I need to get to > run. I am having trouble with one in particular. The CM maintainer is currently on sabbatical for a little while. FWIW, I'll provide my take on this. > At a high level, the unit test implements a simple protocol for > establishing a connection between a client and a server to test basic CM > functionality. The protocol uses the private data field of the CM > packets to exchange a key that is generated randomly by the server on a > per-connection basis. 
Essentially, the client sends a REQ with a > randomly chosen key which will not match the server's. When the server > initially receives a REQ for a particular connection, it generates a > random key and compares it against the key stored in the REQ. Since > they don't match, the server sends a REJ back to the client, and the REJ > contains the correct key in the private data field. Finally, the client > resends the REQ, this time with the correct key: > > Client Server > REQ -------------------> > w/ bad key > > <------------------- REJ > w/ good key > > REQ -------------------> > w/ good key > > REP/etc. Are the keys in the private data? Out of curiosity, what REJ code is used? > Everything works well until the second REQ is received at the server. > It appears that instead of reusing the previous ib_cm_id, the OFED CM > generates a new ib_cm_id to handle the second REQ. The unit test thinks > that a new connection attempt is being requested instead of a retry of > the original attempt and so it generates a new random key, resulting in > the protocol being unable to establish a connection. > > Is something like I have described above supported by the OFED CM? ib_cm.h states: * ib_cm_handler - User-defined callback to process communication events. * @cm_id: Communication identifier associated with the reported event. * @event: Information about the communication event. * * IB_CM_REQ_RECEIVED and IB_CM_SIDR_REQ_RECEIVED communication events * generated as a result of listen requests result in the allocation of a * new @cm_id. The new @cm_id is returned to the user through this callback. Although some other CMs may have reused the same "cm id" on the passive side, I don't think that there's a requirement to do so. I think it's valid either way per the spec. IMO the unit test/protocol should not depend on implementation-specific behavior which is what I think this amounts to.
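One way a protocol could avoid depending on cm_id reuse is to have the server track each connection attempt by an application-level token that the client repeats in the private data of every REQ. The sketch below is an illustration under assumptions, not ib_cm code: the token, struct pending, struct server and handle_req() are all hypothetical, and a simple counter stands in for the random key generator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16

/* Hypothetical per-attempt state, keyed by a client-chosen token
 * instead of by the cm_id the CM hands to the callback. */
struct pending {
    int      used;
    uint64_t token;  /* repeated by the client in every REQ */
    uint32_t key;    /* key the server generated for this attempt */
};

struct server {
    struct pending tab[MAX_PENDING];
    uint32_t next_key;  /* stand-in for a random key generator */
};

enum verdict { ACCEPT, REJECT_WITH_KEY, REJECT_NO_ROOM };

/* Decide how to answer a REQ carrying (token, req_key) in its private
 * data. On return, *good_key holds the key the server expects. */
enum verdict handle_req(struct server *s, uint64_t token,
                        uint32_t req_key, uint32_t *good_key)
{
    struct pending *slot = NULL;

    assert(s != NULL && good_key != NULL);

    for (size_t i = 0; i < MAX_PENDING; i++) {
        struct pending *p = &s->tab[i];

        if (p->used && p->token == token) {
            /* Retry of a known attempt: compare with the stored key. */
            *good_key = p->key;
            if (req_key != p->key)
                return REJECT_WITH_KEY;
            p->used = 0;  /* attempt finished, free the slot */
            return ACCEPT;
        }
        if (!p->used && slot == NULL)
            slot = p;     /* remember the first free slot */
    }

    /* First REQ for this token: generate a key and remember it. */
    *good_key = s->next_key++;
    if (req_key == *good_key)
        return ACCEPT;
    if (slot == NULL)
        return REJECT_NO_ROOM;
    slot->used = 1;
    slot->token = token;
    slot->key = *good_key;
    return REJECT_WITH_KEY;
}
```

With this shape the second REQ is matched to the first by its token, so it does not matter whether the CM delivers it on a new cm_id or on the original one.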
I don't sufficiently understand the details of your protocol (as to why the initial connection need be rejected) as opposed to passing the key back in the REP. There may also be other possibilities if a protocol change for your application is feasible. -- Hal > I can try and distill this down to a fairly short code example if that > would make things clearer. > > Thanks, > TG > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From ofed at kononov.ftml.net Sat Sep 6 07:03:01 2008 From: ofed at kononov.ftml.net (Roman Kononov) Date: Sat, 06 Sep 2008 09:03:01 -0500 Subject: [ofa-general] Bogus Receive Completions In-Reply-To: References: <48C19186.2050903@kononov.ftml.net> <48C1BC89.4080709@kononov.ftml.net> Message-ID: <48C28D95.5060004@kononov.ftml.net> Roland Dreier wrote: > > Perhaps, I can give you the code, but it needs lots of other HW and > > SW. Setting it up will be a big pain. And, by changing the code and > > moving stuff around, I can almost mask the problem, and it does not > > appear that soon. > > Probably Mellanox is the only one who can debug this, especially because > it could easily be a firmware issue. Mellanox! Please! I can setup an SSH connection to the failing system and provide any assistance here 24/7. 
Roman From dledford at redhat.com Sat Sep 6 09:51:31 2008 From: dledford at redhat.com (Doug Ledford) Date: Sat, 06 Sep 2008 12:51:31 -0400 Subject: [ofa-general] ConnectX IB HCA with Ubuntu 8.04 In-Reply-To: <48C266FF.30108@array.ca> References: <48AC6495.1040807@array.ca> <48C13916.9020307@array.ca> <20080905142226.GV6273@sashak.voltaire.com> <200809051639.57300.bs@q-leap.de> <20080905161420.GZ6273@sashak.voltaire.com> <48C191F6.5020302@array.ca> <1220665308.7801.45.camel@firewall.xsintricity.com> <48C266FF.30108@array.ca> Message-ID: <1220719891.7801.48.camel@firewall.xsintricity.com> On Sat, 2008-09-06 at 07:18 -0400, Steven Truelove wrote: > Yes, the DHCP server starts just fine. The hardware address was > pulled from the output of ifconfig on the client. Even if the hardware > address was wrong, or I didn't list the host at all, the 'range' > setting should ensure that an address is provided from the open pool. > > There is output in the logs to indicate that there is no subnet > listing for ib1 and eth0, and that it won't be listening on those > interfaces. This implies that it is working on eth1 (where DHCP is > tested working) and ib0 (where no log references to DHCPDISCOVER are > made, even though the client is sending them). > > That said, am I barking up the wrong tree entirely by even trying to > make this work? There are a few references to this being possible > when I google for 'infiniband dhcp', and this is where I got the > 'always-broadcast on' setting from. Apparently this is necessary. > But I couldn't find anything further to help me. Did you apply the dhcp patch that's in the OFED distribution to the dhcp server and recompile? Without, it doesn't know how to parse IB broadcast packets (and with it, it still doesn't, but it switches from raw mode to cooked socket mode where it doesn't have to know the structure of a raw IPoIB packet). 
It would certainly explain the dhcp server silently dropping the packets, they wouldn't look like dhcp requests in raw mode. > Thanks, > > Steven Truelove > > > > Doug Ledford wrote: > > On Fri, 2008-09-05 at 16:09 -0400, Steven Truelove wrote: > > > > > Okay, thanks, I have run opensm and I have gotten IPoIB working as > > > well, although there is a problem. IPoIB works fine with static IPs, > > > but I can't get DHCP to work. The logs suggest that the DHCP server > > > simply isn't seeing the DHCPDISCOVERs from the client. Here is the > > > relevant chunk of dhcpd.conf: > > > > > > subnet 192.168.200.0 netmask 255.255.255.0 { > > > > > > always-broadcast on; > > > > > > range 192.168.200.10 192.168.200.50; > > > > > > option broadcast-address 192.168.200.255; > > > } > > > > > > > > > host sappsu4-ib { > > > hardware ethernet 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00; > > > fixed-address 192.168.200.104; > > > } > > > > > > > > > Does this have any chance of working? > > > > > > > Will your dhcp server even start up with that hardware ethernet line in > > it? None of the patches for the dhcp server that I've seen enable dhcp > > to parse that big of an ethernet definition. > > > > > > > Thanks, > > > > > > Steven Truelove > > > > > > > > > > > > > > > Sasha Khapyorsky wrote: > > > > > > > On 16:39 Fri 05 Sep , Bernd Schubert wrote: > > > > > > > > > > > > > We just didn't have the time yet to complete all the packaging and to push > > > > > it upstream to Debian, here is what we have so far > > > > > > > > > > # Etchy packages, but also should work for hardy > > > > > deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/etch ./ > > > > > > > > > > # Hardy packages, but not recently maintained (only for my workstation) > > > > > deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/hardy/ ./ > > > > > > > > > > > > > > Great! 
> > > > > > > > Sasha > > > > > > > > > > > > > > > -- > > > Steven Truelove > > > Array Systems Computing, Inc. > > > 1120 Finch Avenue West, 7th Floor > > > Toronto, Ontario > > > M3J 3H7 > > > CANADA > > > http://www.array.ca > > > truelove at array.ca > > > Phone: (416) 736-0900 x307 > > > Fax: (416) 736-4715 > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From truelove at array.ca Sat Sep 6 09:57:10 2008 From: truelove at array.ca (Steven Truelove) Date: Sat, 06 Sep 2008 12:57:10 -0400 Subject: ***SPAM*** Re: [ofa-general] ConnectX IB HCA with Ubuntu 8.04 In-Reply-To: <1220719891.7801.48.camel@firewall.xsintricity.com> References: <48AC6495.1040807@array.ca> <48C13916.9020307@array.ca> <20080905142226.GV6273@sashak.voltaire.com> <200809051639.57300.bs@q-leap.de> <20080905161420.GZ6273@sashak.voltaire.com> <48C191F6.5020302@array.ca> <1220665308.7801.45.camel@firewall.xsintricity.com> <48C266FF.30108@array.ca> <1220719891.7801.48.camel@firewall.xsintricity.com> Message-ID: <48C2B666.3020206@array.ca> Okay, thanks, this is likely what I missed. Steven Truelove Doug Ledford wrote: > On Sat, 2008-09-06 at 07:18 -0400, Steven Truelove wrote: > >> Yes, the DHCP server starts just fine. The hardware address was >> pulled from the output of ifconfig on the client. 
Even if the hardware >> address was wrong, or I didn't list the host at all, the 'range' >> setting should ensure that an address is provided from the open pool. >> >> There is output in the logs to indicate that there is no subnet >> listing for ib1 and eth0, and that it won't be listening on those >> interfaces. This implies that it is working on eth1 (where DHCP is >> tested working) and ib0 (where no log references to DHCPDISCOVER are >> made, even though the client is sending them). >> >> That said, am I barking up the wrong tree entirely by even trying to >> make this work? There are a few references to this being possible >> when I google for 'infiniband dhcp', and this is where I got the >> 'always-broadcast on' setting from. Apparently this is necessary. >> But I couldn't find anything further to help me. >> > > Did you apply the dhcp patch that's in the OFED distribution to the dhcp > server and recompile? Without, it doesn't know how to parse IB > broadcast packets (and with it, it still doesn't, but it switches from > raw mode to cooked socket mode where it doesn't have to know the > structure of a raw IPoIB packet). It would certainly explain the dhcp > server silently dropping the packets, they wouldn't look like dhcp > requests in raw mode. > > >> Thanks, >> >> Steven Truelove >> >> >> >> Doug Ledford wrote: >> >>> On Fri, 2008-09-05 at 16:09 -0400, Steven Truelove wrote: >>> >>> >>>> Okay, thanks, I have run opensm and I have gotten IPoIB working as >>>> well, although there is a problem. IPoIB works fine with static IPs, >>>> but I can't get DHCP to work. The logs suggest that the DHCP server >>>> simply isn't seeing the DHCPDISCOVERs from the client. 
Here is the >>>> relevant chunk of dhcpd.conf: >>>> >>>> subnet 192.168.200.0 netmask 255.255.255.0 { >>>> >>>> always-broadcast on; >>>> >>>> range 192.168.200.10 192.168.200.50; >>>> >>>> option broadcast-address 192.168.200.255; >>>> } >>>> >>>> >>>> host sappsu4-ib { >>>> hardware ethernet 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00; >>>> fixed-address 192.168.200.104; >>>> } >>>> >>>> >>>> Does this have any chance of working? >>>> >>>> >>> Will your dhcp server even start up with that hardware ethernet line in >>> it? None of the patches for the dhcp server that I've seen enable dhcp >>> to parse that big of an ethernet definition. >>> >>> >>> >>>> Thanks, >>>> >>>> Steven Truelove >>>> >>>> >>>> >>>> >>>> Sasha Khapyorsky wrote: >>>> >>>> >>>>> On 16:39 Fri 05 Sep , Bernd Schubert wrote: >>>>> >>>>> >>>>> >>>>>> We just didn't have the time yet to complete all the packaging and to push >>>>>> it upstream to Debian, here is what we have so far >>>>>> >>>>>> # Etchy packages, but also should work for hardy >>>>>> deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/etch ./ >>>>>> >>>>>> # Hardy packages, but not recently maintained (only for my workstation) >>>>>> deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/hardy/ ./ >>>>>> >>>>>> >>>>>> >>>>> Great! >>>>> >>>>> Sasha >>>>> >>>>> >>>>> >>>>> >>>> -- >>>> Steven Truelove >>>> Array Systems Computing, Inc. >>>> 1120 Finch Avenue West, 7th Floor >>>> Toronto, Ontario >>>> M3J 3H7 >>>> CANADA >>>> http://www.array.ca >>>> truelove at array.ca >>>> Phone: (416) 736-0900 x307 >>>> Fax: (416) 736-4715 >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tziporet at dev.mellanox.co.il Sun Sep 7 02:10:43 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 07 Sep 2008 12:10:43 +0300 Subject: [ofa-general] Bogus Receive Completions In-Reply-To: <48C28D95.5060004@kononov.ftml.net> References: <48C19186.2050903@kononov.ftml.net> <48C1BC89.4080709@kononov.ftml.net> <48C28D95.5060004@kononov.ftml.net> Message-ID: <48C39A93.8060809@mellanox.co.il> Roman Kononov wrote: > Roland Dreier wrote: >> > Perhaps, I can give you the code, but it needs lots of other HW and >> > SW. Setting it up will be a big pain. And, by changing the code and >> > moving stuff around, I can almost mask the problem, and it does not >> > appear that soon. >> >> Probably Mellanox is the only one who can debug this, especially because >> it could easily be a firmware issue. > > Mellanox! Please! > > I can setup an SSH connection to the failing system and provide any > assistance here 24/7. > > Roman > We will work with you to debug it off line. Tziporet From vlad at lists.openfabrics.org Sun Sep 7 03:05:19 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 7 Sep 2008 03:05:19 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20080907-0200 daily build status Message-ID: <20080907100519.102C0E60B08@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed 
on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Failed: From vlad at lists.openfabrics.org Mon Sep 8 03:07:52 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 8 Sep 2008 03:07:52 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20080908-0200 daily build status Message-ID: <20080908100752.C4970E60975@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with 
linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Failed: From cap at nsc.liu.se Mon Sep 8 03:50:35 2008 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Mon, 8 Sep 2008 12:50:35 +0200 Subject: [ofa-general] Compiling source using Intel Compiler In-Reply-To: References: <20080904181850.GC6273@sashak.voltaire.com> Message-ID: <200809081250.37307.cap@nsc.liu.se> On Thursday 04 September 2008, Christopher Tanner wrote: > > But why you cannot use gcc for building OFED packages? > > Our codes have a lot of Fortran 77 in them and gfortran hasn't been > compiling those codes very well. 
Since we're using ifort for Fortran > compiling, I figured we ought to use icc (C) and icpc (C++) to use a > consistent compiler package. I don't know if programs partially > compiled in gcc and ifort will work very well... This is the case for a lot of users and sites (if not most HPC sites). There is no need what-so-ever to compile the IB-stack with icc. Just build in the recommended way and compile your applications with icc/ifort. /Peter
From tziporet at mellanox.co.il Mon Sep 8 04:55:34 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 08 Sep 2008 14:55:34 +0300 Subject: [ofa-general] OFED meeting agenda for Sep-8 on OFED 1.4 release status Message-ID: <48C512B6.5080906@mellanox.co.il>
This is the agenda for the OFED meeting today (8-Sep):
1. RC1 status: We have built RC1 today and will run testing today - should be out tomorrow.
2. Missing features for RC2:
   - NFS-RDMA over RHEL 5.1
   - OSM: Cached routing
3. Bug list review:
   bug_id  bug_severity  op_sys   assigned_to                      short_short_desc
   1128    blocker       Other    stefan.roscher at de.ibm.com     release IPoIB-CM QP resources in flushing CQE context
   1171    critical      Other    swise at opengridcomputing.com   no mac stats with ofed-1.4 cxgb3
   1113    critical      RHEL 4   vu at mellanox.com               rpm -e scsi-target-utils-0.1-2008715 fails
   1117    critical      SLES 10  yannick.cote at qlogic.com       ib_ipath module hangs on unload
   1172    major         RHEL 5   eli at mellanox.co.il            soft lockup in ipoib during hw driver unload
   1153    major         Other    vlad at mellanox.co.il           OpenSM- Multicast group will not open when IB host is the client (joined as send only).
   1164    normal        SLES 10  eli at mellanox.co.il            iperf over IPoIB fails for 100 tcp connections
   1131    normal        Other    sashak at voltaire.com           ibnetdiscover - some options are mentioned in the man, but not implemented
   1132    normal        Other    sashak at voltaire.com           ibclearcounters - -N flag couse to irrelevant error (usage of perfquery)
   1136    normal        All      sashak at voltaire.com           ibtracert - some flags mentioned in man page but doesn't implemented
4. Open Discussion
Tziporet
From Sumit.Gaur at Sun.COM Mon Sep 8 07:50:39 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Mon, 08 Sep 2008 20:20:39 +0530 Subject: [ofa-general] open_node_name_map on OFED 1.3.1 In-Reply-To: <20080907190004.4D9DDE60B0F@openfabrics.org> References: <20080907190004.4D9DDE60B0F@openfabrics.org> Message-ID: <48C53BBF.7030304@Sun.COM>
I have upgraded my OFED version from 1.2.5* to 1.3.1. With OFED 1.3.1 I am facing problems with umad_send and umad_recv. I have gone through the OFED code for these calls, and the only extra thing I observed is a call to open_node_name_map(node_name_map_file);
Is it necessary in OFED 1.3.1 to call this function before making any smpquery?
Thanks
sumit
From hal.rosenstock at gmail.com Mon Sep 8 09:01:31 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 8 Sep 2008 12:01:31 -0400 Subject: [ofa-general] upgrade from 1.2.5* to 1.3.1 In-Reply-To: <48C13182.3020500@Sun.COM> References: <20080905015440.9C33FE60D8F@openfabrics.org> <48C13182.3020500@Sun.COM> Message-ID:
On Fri, Sep 5, 2008 at 9:17 AM, Sumit Gaur - Sun Microsystem wrote: > Hi > I have upgraded my OFED version from 1.2.5* to 1.3.1, Now application could > not communicate with OFED libraries using umad_send and umad_recv function > call for IB_SMI_CLASS (with DR path). Is there any major change in umad lib > for such requests. Any help or info is appreciated. What kernel is being used ? On what machine architecture are you running ?
Is it perhaps big endian ? I think there was a change that could affect those machines at a minimum. -- Hal > sumit > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Mon Sep 8 09:05:34 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 8 Sep 2008 12:05:34 -0400 Subject: [ofa-general] open_node_name_map on OFED 1.3.1 In-Reply-To: <48C53BBF.7030304@Sun.COM> References: <20080907190004.4D9DDE60B0F@openfabrics.org> <48C53BBF.7030304@Sun.COM> Message-ID: On Mon, Sep 8, 2008 at 10:50 AM, Sumit Gaur - Sun Microsystem wrote: > I have upgraded my OFED version from 1.2.5* to 1.3.1. With OFED 1.3.1 I am > facing problems in umad_send and umad_recv I have gone through OFED code for > the same. Only extra thing I observed is call of > open_node_name_map(node_name_map_file); > > Is it necessary in OFED 1.3.1 to use above function before making any > smpquery ? Are you talking about the smpquery diag or a custom SMP query ? In the former case, it should work whether or not there is a node name map file (which is optional). In the latter case, there is no need to issue this call (which is in the common diags). It is merely for getting more user friendly node names if they exist. I don't think it's related to any problems you are observing with umad_send/recv. 
-- Hal > Thanks > sumit
From jsquyres at cisco.com Mon Sep 8 09:45:14 2008 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 8 Sep 2008 12:45:14 -0400 Subject: [ofa-general] sched_setaffinity / sched_getaffinity Message-ID: <51D89AA7-D416-4315-AC33-964C998DC67B@cisco.com>
There's at least one warning in OFED that Betsy mentioned today about sched_setaffinity():

ch3_smp_progress.c:2427: warning: passing argument 3 of 'sched_setaffinity' from incompatible pointer type

Be advised that the prototypes for sched_setaffinity() and sched_getaffinity() have changed multiple times over the life of the 2.4 and 2.6 kernel series. Even worse, the signatures in glibc have not always matched those in the kernel -- even in shipping Linux distros.

The Open MPI project spun off a tiny library to solve exactly this problem (because it really has nothing to do with MPI): the Portable Linux Processor Affinity (PLPA) project. PLPA provides plpa_sched_setaffinity() and plpa_sched_getaffinity() API calls with constant signatures and will do the Right Thing regardless of what version of glibc and/or kernel you have. PLPA is fully embeddable in other software projects (e.g., we embed it in Open MPI; htop also embeds it, IIRC). PLPA's license is BSD. See the project page here: http://www.open-mpi.org/projects/plpa/

Ping me on the PLPA mailing list if you have any questions / comments / suggestions / patches / etc. You have to be subscribed to post, sorry. Enjoy.
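For reference, here is a minimal sketch of the affinity calls using the three-argument form found in current glibc (pid, mask size in bytes, pointer to cpu_set_t). It assumes a glibc that provides cpu_set_t and the CPU_* macros; older glibc releases used different prototypes, which is exactly the portability problem described above. The helper names (first_allowed_cpu, allowed_cpu_count, pin_to_cpu) are made up for this illustration, and nothing here is taken from ch3_smp_progress.c. One plausible cause of an "incompatible pointer type" warning on argument 3 is passing a plain integer mask pointer where cpu_set_t * is expected, but that is only a guess about the code in question.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for cpu_set_t and the CPU_* macros */
#endif
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on, or -1 on error. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Return how many CPUs are in the calling thread's affinity mask, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* Pin the calling thread to a single CPU. Returns 0 on success, -1 on error
 * (errno is set by sched_setaffinity). Argument 3 must be a cpu_set_t pointer. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

A caller would typically do something like pin_to_cpu(first_allowed_cpu()) and check the return value. Code that has to build against many glibc/kernel combinations may still be better served by PLPA, as suggested above.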
-- Jeff Squyres Cisco Systems
From tgree at relay.phys.ualberta.ca Mon Sep 8 10:14:46 2008 From: tgree at relay.phys.ualberta.ca (Terry Greeniaus) Date: Mon, 8 Sep 2008 11:14:46 -0600 (MDT) Subject: [ofa-general] ib_cm question In-Reply-To: References: Message-ID:
On Sat, 6 Sep 2008, Hal Rosenstock wrote:
> On Fri, Sep 5, 2008 at 3:38 PM, Terry Greeniaus wrote:
>
> The CM maintainer is currently on sabbatical for a little while. FWIW,
> I'll provide my take on this.

Thanks for your response Hal!

> > Client                        Server
> > REQ ------------------->
> > w/ bad key
> >
> >     <------------------- REJ
> >                          w/ good key
> >
> > REQ ------------------->
> > w/ good key
> >
> > REP/etc.
>
> Are the keys in the private data ?

Yes.

> Out of curiosity, what REJ code is used ?

28 - consumer reject.

> ib_cm.h states:
> * ib_cm_handler - User-defined callback to process communication events.
> * @cm_id: Communication identifier associated with the reported event.
> * @event: Information about the communication event.
> *
> * IB_CM_REQ_RECEIVED and IB_CM_SIDR_REQ_RECEIVED communication events
> * generated as a result of listen requests result in the allocation of a
> * new @cm_id. The new @cm_id is returned to the user through this callback.
>
> Although some other CM's may have reused the same "cm id" on the
> passive side, I don't think that there's a requirement to do so. I
> think it's valid either way per the spec. IMO the unit test/protocol
> should not depend on implementation specific behavior which is what I
> think this amounts to.

You may be right here. The passive side of the CM state machine diagram (Fig 132 p 688 in my copy of the IBA) has an arc from "REJ Sent" to "REQ Rcvd" labelled "(retry) Rcv REQ". It also has an arc labelled "(no retry)" which essentially frees up the cm id. Unfortunately the spec doesn't specify which arc you should follow.
We had interpreted this as the passive side waiting for a retried REQ if the number of CM retries as specified in the original REQ packet had not yet been exhausted. However, with dropped packets (UD) this could result in the passive cm id never being freed - so the OFED interpretation of freeing it immediately after sending the REJ and using a new cm id for subsequent REQs may be the more sensible interpretation. > I don't sufficiently understand the details of your protocol (as to > why the initial connection need be rejected) as opposed to passing the > key back in the REP. There may also be other possibilities if a > protocol change for your application is feasible. The protocol is completely contrived for this particular unit test - it isn't used anywhere in our application and was meant to test these particular state transitions in our CM implementation. It's comforting to know that they did their job and found this difference in how the two CMs work, but I will argue that it shouldn't be run against the OFED stack or that it should be modified to take the OFED interpretation of the spec into consideration. Thanks for your time, TG From tgree at relay.phys.ualberta.ca Mon Sep 8 11:33:34 2008 From: tgree at relay.phys.ualberta.ca (Terry Greeniaus) Date: Mon, 8 Sep 2008 12:33:34 -0600 (MDT) Subject: [ofa-general] DM question Message-ID: Hi all, Our application, which we are porting to OFED, makes use of Device Management to both advertise and discover services on the network. Our application uses the user-level MAD library and is currently able to perform DM queries on the subnet. Unfortunately, in order to advertise DM services our application needs to be able to set the "isDeviceManagementSupported" bit on our local port's capabilityMask. I don't see a way to do that from userspace in OFED. This is critical for our application since without that bit set nobody else on the fabric will be able to use our services. Perhaps I have overlooked something? 
If not, what would the recommended way of setting this bit in the capabilityMask be? Thanks, TG
From hal.rosenstock at gmail.com Mon Sep 8 13:24:08 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 8 Sep 2008 16:24:08 -0400 Subject: [ofa-general] DM question In-Reply-To: References: Message-ID:
Hi Terry, On Mon, Sep 8, 2008 at 2:33 PM, Terry Greeniaus wrote: > Hi all, > > Our application, which we are porting to OFED, makes use of Device > Management to both advertise and discover services on the network. Our > application uses the user-level MAD library and is currently able to > perform DM queries on the subnet. Unfortunately, in order to advertise > DM services our application needs to be able to set the > "isDeviceManagementSupported" bit on our local port's capabilityMask. I > don't see a way to do that from userspace in OFED. This is critical for > our application since without that bit set nobody else on the fabric > will be able to use our services. > > Perhaps I have overlooked something? If not, what would the recommended > way of setting this bit in the capabilityMask be?
The only way I see to do this from user space is something like the following:

SubnGet PortInfo of local port
Set IsDeviceManagementSupported bit in PortInfo.CapabilityMask
Change any other PortInfo fields so set will work (LinkState and PortPhysicalState set to no state change, don't think any others need changing)
SubnSet PortInfo of local port (make sure set worked)

The downside is that if your application crashes you will need to have a cleanup program to unset that bit. Hope this helps.
-- Hal > Thanks, > TG > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From tgree at relay.phys.ualberta.ca Mon Sep 8 13:49:26 2008 From: tgree at relay.phys.ualberta.ca (Terry Greeniaus) Date: Mon, 8 Sep 2008 14:49:26 -0600 (MDT) Subject: [ofa-general] DM question In-Reply-To: References: Message-ID: On Mon, 8 Sep 2008, Hal Rosenstock wrote: > > Perhaps I have overlooked something? If not, what would the recommended > > way of setting this bit in the capabilityMask be? > > The only way I see to do this from user space is something like the following: > > SubnGet PortInfo of local port > Set IsDeviceManagementSupport bit in PortInfo.CapabilityMask > Change any other PortInfo fields so set will work (LinkState > andPortPhysicalState set to no state change, don't think any others > need changing) > SubnSet PortInfo of local port (make sure set worked) > > The downside is that if your application crashes you will need to have > a cleanup program to unset that bit. The IBA lists the capabilityMask field as read-only. Is doing a SubnSet on the PortInfo.CapabilityMask field supported in OFED? That would solve the immediate problem. TG From ralph.campbell at qlogic.com Mon Sep 8 13:57:35 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Mon, 08 Sep 2008 13:57:35 -0700 Subject: [ofa-general] DM question In-Reply-To: References: Message-ID: <1220907455.30937.125.camel@chromite.mv.qlogic.com> On Mon, 2008-09-08 at 14:49 -0600, Terry Greeniaus wrote: > On Mon, 8 Sep 2008, Hal Rosenstock wrote: > > > > Perhaps I have overlooked something? If not, what would the recommended > > > way of setting this bit in the capabilityMask be? 
> > > > The only way I see to do this from user space is something like the following: > > > > SubnGet PortInfo of local port > > Set IsDeviceManagementSupport bit in PortInfo.CapabilityMask > > Change any other PortInfo fields so set will work (LinkState > > andPortPhysicalState set to no state change, don't think any others > > need changing) > > SubnSet PortInfo of local port (make sure set worked) > > > > The downside is that if your application crashes you will need to have > > a cleanup program to unset that bit. > > The IBA lists the capabilityMask field as read-only. Is doing a > SubnSet on the PortInfo.CapabilityMask field supported in OFED? That > would solve the immediate problem. You can't use the SubnSet(Portinfo) MADs to change the PortInfo.CapabilityMask. You (or someone else) will need to modify the kernel to call ib_modify_port() with the bit set in ib_port_modify.set_port_cap_mask. From hal.rosenstock at gmail.com Mon Sep 8 16:33:15 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 8 Sep 2008 19:33:15 -0400 Subject: ***SPAM*** Re: [ofa-general] DM question In-Reply-To: References: Message-ID: On Mon, Sep 8, 2008 at 4:49 PM, Terry Greeniaus wrote: > On Mon, 8 Sep 2008, Hal Rosenstock wrote: > >> > Perhaps I have overlooked something? If not, what would the recommended >> > way of setting this bit in the capabilityMask be? >> >> The only way I see to do this from user space is something like the following: >> >> SubnGet PortInfo of local port >> Set IsDeviceManagementSupport bit in PortInfo.CapabilityMask >> Change any other PortInfo fields so set will work (LinkState >> andPortPhysicalState set to no state change, don't think any others >> need changing) >> SubnSet PortInfo of local port (make sure set worked) >> >> The downside is that if your application crashes you will need to have >> a cleanup program to unset that bit. > > The IBA lists the capabilityMask field as read-only. 
Is doing a > SubnSet on the PortInfo.CapabilityMask field supported in OFED? You're right; my bad :-( -- Hal > That > would solve the immediate problem. > > TG > From hal.rosenstock at gmail.com Mon Sep 8 16:36:05 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 8 Sep 2008 19:36:05 -0400 Subject: [ofa-general] DM question In-Reply-To: <1220907455.30937.125.camel@chromite.mv.qlogic.com> References: <1220907455.30937.125.camel@chromite.mv.qlogic.com> Message-ID: On Mon, Sep 8, 2008 at 4:57 PM, Ralph Campbell wrote: > On Mon, 2008-09-08 at 14:49 -0600, Terry Greeniaus wrote: >> On Mon, 8 Sep 2008, Hal Rosenstock wrote: >> >> > > Perhaps I have overlooked something? If not, what would the recommended >> > > way of setting this bit in the capabilityMask be? >> > >> > The only way I see to do this from user space is something like the following: >> > >> > SubnGet PortInfo of local port >> > Set IsDeviceManagementSupport bit in PortInfo.CapabilityMask >> > Change any other PortInfo fields so set will work (LinkState >> > andPortPhysicalState set to no state change, don't think any others >> > need changing) >> > SubnSet PortInfo of local port (make sure set worked) >> > >> > The downside is that if your application crashes you will need to have >> > a cleanup program to unset that bit. >> >> The IBA lists the capabilityMask field as read-only. Is doing a >> SubnSet on the PortInfo.CapabilityMask field supported in OFED? That >> would solve the immediate problem. > > You can't use the SubnSet(Portinfo) MADs to change the > PortInfo.CapabilityMask. Right. >You (or someone else) will need > to modify the kernel to call ib_modify_port() with the > bit set in ib_port_modify.set_port_cap_mask. Doing it in the kernel is the straightforward part. It's not available from user space so something needs to be added for that. It could be done like issm but it was decided not to chew up additional fds needlessly. 
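To make the set/clear idea in this thread concrete, here is a small userspace model of how a capability-mask update could be computed. It is a sketch under stated assumptions, not kernel code: the struct and function names (port_modify_model, apply_port_modify, advertises_dm) are hypothetical, the bit position (19) is taken from the PortInfo.CapabilityMask layout for IsDeviceManagementSupported and should be checked against the kernel's IB_PORT_DEVICE_MGMT_SUP definition, and the "(mask | set) & ~clear" rule is an assumption about how a driver applies the two masks.

```c
#include <stdint.h>

/* Assumed: bit 19 of PortInfo.CapabilityMask is IsDeviceManagementSupported.
 * Verify against IB_PORT_DEVICE_MGMT_SUP in the kernel headers before relying on it. */
#define PORT_CAP_DEVICE_MGMT_SUP (UINT32_C(1) << 19)

/* Hypothetical userspace stand-in for the set/clear pair carried by the
 * kernel's struct ib_port_modify. */
struct port_modify_model {
    uint32_t set_port_cap_mask;   /* bits to turn on  */
    uint32_t clr_port_cap_mask;   /* bits to turn off */
};

/* Model of the update: turn on the "set" bits, then turn off the "clear" bits. */
uint32_t apply_port_modify(uint32_t cap_mask, const struct port_modify_model *m)
{
    return (cap_mask | m->set_port_cap_mask) & ~m->clr_port_cap_mask;
}

/* Nonzero if the mask advertises Device Management support. */
int advertises_dm(uint32_t cap_mask)
{
    return (cap_mask & PORT_CAP_DEVICE_MGMT_SUP) != 0;
}
```

In the kernel, the corresponding step would be to fill in set_port_cap_mask and call ib_modify_port() on the device and port, as Ralph describes, and to clear the same bit (via clr_port_cap_mask) when the service goes away so that a stale advertisement is not left behind.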
-- Hal >
From christopher.tanner at gatech.edu Mon Sep 8 22:08:00 2008 From: christopher.tanner at gatech.edu (Christopher Tanner) Date: Tue, 9 Sep 2008 01:08:00 -0400 Subject: [ofa-general] Compiled IB packages Message-ID: <0709481C-38BC-4598-870F-44FE8AE44FCE@gatech.edu>
I am setting up a 16-node (homogeneous) cluster running Ubuntu 8.04 server with Mellanox Infiniband cards. I downloaded (from the OpenFabrics website), compiled, and installed the following IB packages on the master node into the /usr/local/lib directory. The /usr/local directory is being shared to all of the nodes via NFS. All packages seemed to compile and install fine.

libibverbs
librdmacm
libibcm
libipathverbs
dapl
compat-dapl
libmlx4
libmthca
libcxgb3
libibcommon
libibumad
libibmad
opensm
infiniband-diags

I have a few questions:

a) Do I need to run 'make install' on each node or just the master node? All of the libraries in /usr/local/lib are visible to all nodes... Stated another way, does 'make install' put files elsewhere besides the /usr/local/lib directory? Does it alter OS configuration files to tell it to look for certain files in /usr/local/lib?

b) I know I need to load the IB kernel modules (mlx4_core, mlx4_ib, rdma_ucm, ib_core, ib_mad, ib_mthca, ib_umad, ib_uverbs) in order for the IB cards to work. Are these compiled and installed with the above packages? How does the kernel know where to look for modules? (Sorry, this question is very similar to the first one).

c) The OFED software stack contains some stuff that isn't available for source download (e.g. ib-bonding, ibsim, libsdp). Are these necessary for the IB network to operate correctly? Since I'm running Ubuntu, obviously the src.rpm file won't work...

Thanks to all for your help. Previous responses regarding issues with OpenSM worked great.
------------------------------------------- Chris Tanner Space Systems Design Lab Georgia Institute of Technology christopher.tanner at gatech.edu ------------------------------------------- From vlad at lists.openfabrics.org Tue Sep 9 03:09:00 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 9 Sep 2008 03:09:00 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20080909-0200 daily build status Message-ID: <20080909100900.62052E60CF5@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 
with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Failed: From mschlining at datadirectnet.com Tue Sep 9 06:07:23 2008 From: mschlining at datadirectnet.com (Marty Schlining) Date: Tue, 9 Sep 2008 06:07:23 -0700 Subject: [ofa-general] Forcing a DDR HCA to SDR speeds Message-ID: <60BA2AA14940C9429038D4E2BC53008D1645AB068D@MAILBOXCLUSTER.datadirect.datadirectnet.com> With OFED 1.3.1 or 1.4, is it possible to force the link speed of a DDR HCA port or the entire DDR HCA from a DDR link to strictly SDR? If so, how can it be done? The HCAs in question is a Mellanox MT25208 dual port HCA, rev A3, firmware 4.8.2. Martin Schlining mschlining at datadirectnet.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From trzyna at us.ibm.com Tue Sep 9 06:01:18 2008 From: trzyna at us.ibm.com (Matthew Trzyna) Date: Tue, 9 Sep 2008 07:01:18 -0600 Subject: [ofa-general] OpenSM Problems/Questions Message-ID: Hello A "Basic Fabric Diagram" at the end. I am working with a customer implementing a large IB fabric and is encountering problems with OpenSM (OFED 1.3) when they added a new 264 node cluster (with its own 288 port IB switch) to their existing cluster. Two more 264 clusters are planned to be added in the near future. They recently moved to SLES 10 SP1 and OFED 1.3 (before adding the new cluster) and had not been experiencing these problems before. Could you help provide answers to the questions listed below? Additional information about the configuration including a basic fabric diagram are provided after the questions. What parameters should be set on the non-SM nodes that affect how the Subnet Administrator functions? 
What parameters should be set on the SM node(s) that affect how the Subnet Administrator functions? And, what parameters should be removed from the SM node(s)? (ie. ib_sa paths_per_dest=0x7f) How should SM failover be setup? How many failover SM's should be configured? This must happen quickly and transparently or GPFS will die everywhere due to timeouts if this takes too long). Are there SA (Subnet Administrator) commands that should not be executed on a large "live" fabric? (ie. "saquery -p") Should GPFS be configured "off" on the SM node(s)? Do you know of any other OpenSM implementations that have 5 (or more) 288 port IB switches that might have already encountered/resolved some of these issues? The following problem that is being encountered may also be SA/SM related. A node (NodeX) may be seen (through IPoIB) by all but a few nodes (NodesA-G). A ping from those node (NodesA-G) to NodeX returns "Destination Host Unreachable". A ping from NodeX to NodesA-G works. -------------------------------------------------------------------------------------------------- System Information Here is the current opensm.conf file: (See attached file: opensm.conf) It is the default configuration from the OFED 1.3 build with "priority" added at the bottom. Note that the /etc/init.d/opensmd sources /etc/sysconfig/opensm not etc/sysconfig/opensm.conf (opensm.conf was just copied to opensm). There are a couple of "proposed" settings that are commented out, that were found them on the web. 
Following are the present settings that may affect the Fabric: /etc/infiniband/openib.conf SET_IPOIB_CM=no /etc/modprobe.conf.local options ib_ipoib send_queue_size=512 recv_queue_size=512 options ib_sa paths_per_dest=0x7f /etc/sysctl.conf net.ipv4.neigh.ib0.base_reachable_time = 1200 net.ipv4.neigh.default.gc_thresh3 = 3072 net.ipv4.neigh.default.gc_thresh2 = 2500 net.ipv4.neigh.default.gc_thresh1 = 2048 /etc/sysconfig/opensm All defaults as supplied with OFED 1.3 OpenSM ------------------------------------------------------- Basic Fabric Diagram +----------+ |Top Level |-------------------+ 20 IO nodes +-----------------| 288 port |----------------+ 16 Viual nodes | | IB Sw |------------+ | 2 Admin nodes | +------| |---+ | | (SM nodes) | | +----------+ | | | 4 Support nodes | | | | | | | | | | | | 24 24 24 24 24 24 <--uplinks | | | | | | | | | | | +------+ | | | | | | |(BASE) |(SCU1) |(SCU2) |(SCU3) |(SCU4) |(SCU5) +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |288-port| |288-port| |288-port| |288-port| |288-port| |288-port| | IB Sw | | IB Sw | | IB Sw | | IB Sw | | IB Sw | | IB Sw | +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ 140-nodes 264-nodes 264-nodes 264-nodes 264-nodes 264-nodes WhiteBox Dell Dell IBM IBM IBM (future) NOTE: SCU4 is not currently connected to the Top Level Switch. We'd like to address these issues before making that connection. Subnet Managers are configured on nodes connected to the Top Leval Switch. Let me know if you need any more information. Any help you could provide would be most appreciated. Thanks. Matt Trzyna IBM Linux Cluster Enablement 3039 Cornwallis Rd. RTP, NC 27709 e-mail: trzyna at us.ibm.com Office: (919) 254-9917 Tie Line: 444 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: opensm.conf
Type: application/octet-stream
Size: 4797 bytes
Desc: not available
URL:

From tziporet at mellanox.co.il Tue Sep 9 07:22:33 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 9 Sep 2008 17:22:33 +0300
Subject: [ofa-general] OFED meeting summary for Sep 8, 2008 on 1.4 status
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD75971B@mtlexch01.mtl.com>

OFED meeting summary for Sep 8, 2008 on 1.4 status
==================================================

Summary:
========
- 1.4-rc1 is done on Sep 9.
- Took a target to clean all compilation warnings for RC2
- Moved to weekly meetings starting this week

Details:
========
1. Missing features for RC2:
   - NFS-RDMA over RHEL 5.1
   - OSM: Cached routing

2. We decided to clean up all warnings for RC2
   Each module owner - please start the cleanup work

3. Bugs review:

bug_id  bug_severity  assigned_to                       Status update
1128    blocker       stefan.roscher at de.ibm.com      fix under test in low level driver - should be done this week
1171    critical      swise at opengridcomputing.com    should be fixed for rc2
1113    critical      vu at mellanox.com                on work
1117    critical      yannick.cote at qlogic.com        should be fixed in rc1 - to be tested by Qlogic
1153    major         yosefe at voltaire.com            On work - Voltaire
1164    normal        eli at mellanox.co.il             need details on the HCA type
1178    normal        pasha at mellanox.co.il
1160    normal        perkinjo at cse.ohio-state.edu    Should be fixed with new package
1131    normal        sashak at voltaire.com
1132    normal        sashak at voltaire.com
1136    normal        sashak at voltaire.com

4.
Testing matrix:
   Betsy needs to send Qlogic testing table
   Tziporet should publish the full matrix

Tziporet

From hal.rosenstock at gmail.com Tue Sep 9 07:36:50 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 9 Sep 2008 10:36:50 -0400
Subject: Re: [ofa-general] Forcing a DDR HCA to SDR speeds
In-Reply-To: <60BA2AA14940C9429038D4E2BC53008D1645AB068D@MAILBOXCLUSTER.datadirect.datadirectnet.com>
References: <60BA2AA14940C9429038D4E2BC53008D1645AB068D@MAILBOXCLUSTER.datadirect.datadirectnet.com>
Message-ID:

On Tue, Sep 9, 2008 at 9:07 AM, Marty Schlining wrote:
> With OFED 1.3.1 or 1.4, is it possible to force the link speed of a DDR HCA
> port or the entire DDR HCA from a DDR link to strictly SDR? If so, how can
> it be done? The HCAs in question is a Mellanox MT25208 dual port HCA, rev
> A3, firmware 4.8.2.

Yes. For an individual port, look at the infiniband-diags ibportstate command included in management:

ibportstate speed

There are ramifications on the SM to use this in that it must not overwrite the speed (PortInfo:LinkSpeedEnabled). OpenSM has a force_link_speed option for this which can be set to 0 for this type of operation.

You will also need to reset the port to make it take effect as renegotiation does not occur unless this is done.

ibportstate reset

The reset is only allowed on switch ports, so the peer port must be found.

If you want all ports to be SDR and are using OpenSM, you can just set force_link_speed to SDR (1).
-- Hal > > Martin Schlining > > mschlining at datadirectnet.com > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From akepner at sgi.com Tue Sep 9 07:54:35 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Tue, 9 Sep 2008 07:54:35 -0700 Subject: [ofa-general] [PATCH] ipoib: defer skb_orphan() until irqs enabled Message-ID: <20080909145435.GO2316@sgi.com> If a socket's sk_write_space() method expects to run with interrupts enabled, syslog can get very noisy with messages like: Badness in local_bh_enable at kernel/softirq.c:140 Call Trace: [] show_stack+0x40/0xa0 [] dump_stack+0x30/0x60 [] local_bh_enable+0x90/0x140 [] _spin_unlock_bh+0x30/0x60 [] svc_sock_enqueue+0x750/0x780 [sunrpc] [] svc_write_space+0xc0/0x1c0 [sunrpc] [] sock_wfree+0xd0/0x140 [] ipoib_send+0x1120/0x14a0 [ib_ipoib] [] ipoib_start_xmit+0x380/0x1140 [ib_ipoib] [] dev_hard_start_xmit+0x4b0/0x680 [] __qdisc_run+0x2d0/0x680 A simple fix is to defer calling skb_orphan() until interrupts have been reenabled. 
Signed-off-by: Arthur Kepner --- diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index b0ffc9a..8c9dcf1 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -440,7 +440,7 @@ int ipoib_open(struct net_device *dev); int ipoib_add_pkey_attr(struct net_device *dev); int ipoib_add_umcast_attr(struct net_device *dev); -void ipoib_send(struct net_device *dev, struct sk_buff *skb, +int ipoib_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_ah *address, u32 qpn); void ipoib_reap_ah(struct work_struct *work); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 66cafa2..711a3ac 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -525,13 +525,14 @@ static inline int post_send(struct ipoib_dev_priv *priv, return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr); } -void ipoib_send(struct net_device *dev, struct sk_buff *skb, +int ipoib_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_ah *address, u32 qpn) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_tx_buf *tx_req; int hlen; void *phead; + int ret = 1; /* assume the worst */ if (skb_is_gso(skb)) { hlen = skb_transport_offset(skb) + tcp_hdrlen(skb); @@ -541,7 +542,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, ++dev->stats.tx_dropped; ++dev->stats.tx_errors; dev_kfree_skb_any(skb); - return; + return 1; } } else { if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) { @@ -550,7 +551,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, ++dev->stats.tx_dropped; ++dev->stats.tx_errors; ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu); - return; + return 1; } phead = NULL; hlen = 0; @@ -571,7 +572,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) { ++dev->stats.tx_errors; dev_kfree_skb_any(skb); - return; + 
return 1; } if (skb->ip_summed == CHECKSUM_PARTIAL) @@ -593,6 +594,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, --priv->tx_outstanding; ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(skb); + ret = 1; if (netif_queue_stopped(dev)) netif_wake_queue(dev); } else { @@ -600,13 +602,14 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, address->last_send = priv->tx_head; ++priv->tx_head; - skb_orphan(skb); - + ret = 0; } if (unlikely(priv->tx_outstanding > MAX_SEND_CQE)) while (poll_tx(priv)) ; /* nothing */ + + return ret; } static void __ipoib_reap_ah(struct net_device *dev) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 7e9e218..b67c793 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -604,7 +604,7 @@ static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) goto err_drop; } } else - ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha)); + (void) ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha)); } else { neigh->ah = NULL; @@ -685,7 +685,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, ipoib_dbg(priv, "Send unicast ARP to %04x\n", be16_to_cpu(path->pathrec.dlid)); - ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr)); + (void) ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr)); } else if ((path->query || !path_rec_start(dev, path)) && skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) { /* put pseudoheader back on for next time */ @@ -704,6 +704,7 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_neigh *neigh; unsigned long flags; + int orphan = 0; if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags))) return NETDEV_TX_LOCKED; @@ -743,7 +744,9 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) goto out; } } else 
if (neigh->ah) { - ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha)); + int ret; + ret = ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha)); + orphan = !ret; goto out; } @@ -788,6 +791,8 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) out: spin_unlock_irqrestore(&priv->tx_lock, flags); + if (orphan) + skb_orphan(skb); return NETDEV_TX_OK; } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index ac33c8f..d491801 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -723,7 +723,7 @@ out: } } - ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + (void) ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); } unlock: From vlad at mellanox.co.il Tue Sep 9 08:03:42 2008 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 09 Sep 2008 18:03:42 +0300 Subject: [ofa-general] Compiled IB packages In-Reply-To: <0709481C-38BC-4598-870F-44FE8AE44FCE@gatech.edu> References: <0709481C-38BC-4598-870F-44FE8AE44FCE@gatech.edu> Message-ID: <48C6904E.1020606@mellanox.co.il> Christopher Tanner wrote: > I am setting up a 16-node (homogeneous) cluster running Ubuntu 8.04 > server with Mellanox Infiniband cards. I downloaded (from the > OpenFabrics website), compiled, and installed the following IB packages > on the master node into the /usr/local/lib directory. The /usr/local > directory is being shared to all of the nodes via NFS. All packages > seemed to compile and install fine. > > libibverbs > librdmacm > libibcm > libipathverbs > dapl > compat-dapl > libmlx4 > libmthca > libcxgb3 > libibcommon > libibumad > libibmad > opensm > infiniband-diags > > I have a few questions: > a) Do I need to run 'make install' on each node or just the master node? > All of the libraries in /usr/local/lib are visible to all nodes... 
> Stated another way, does 'make install' put files elsewhere beside the
> /usr/local/lib directory? Does it alter OS configuration files to tell
> it to look for certain files in /usr/local/lib?
>

No, all the packages above will put their files under /usr/local

> b) I know I need to load the IB kernel modules (mlx4_core, mlx4_ib,
> rdma_ucm, ib_core, ib_mad, ib_mthca, ib_umad, ib_uverbs) in order for
> the IB cards to work. Are these compiled and installed with the above
> packages? Where does the kernel know where to look for modules? (Sorry,
> this question is very similar to the first one).
>

The packages above are user space libraries/binaries. To install kernel modules you should download the latest version of the ofa_1_4_kernel tgz file from:

http://www.openfabrics.org/downloads/ofa_1_4_kernel/

To install, run:

./configure --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mthca-mod --with-mthca_debug-mod --with-mlx4-mod --with-mlx4_en-mod --with-mlx4_debug-mod --with-cxgb3-mod --with-ehca-mod --with-ipoib-mod --with-ipoib_debug-mod (... , see --help)
make
make install

> c) The OFED software stack contains some stuff that isn't available for
> source download (e.g. ib-bonding, ibsim, libsdp). Are these necessary
> for the IB network to operate correctly? Since I'm running Ubuntu,
> obviously the src.rpm file won't work...
>

All OFED tgz files are available under:
http://www.openfabrics.org/~vlad/ofed_1_4/SOURCES/

The ib-bonding source RPM can be downloaded from (you can open it to get the tgz file using cpio, if you need):
http://www.openfabrics.org/~monis/ofed_1_4/

These packages are not necessary for the IB network to operate correctly, but it depends on what you are planning to do.

Regards,
Vladimir

> Thanks to all for you help. Previous responses regarding issues with
> OpenSM worked great.
>
> -------------------------------------------
> Chris Tanner
> Space Systems Design Lab
> Georgia Institute of Technology
> christopher.tanner at gatech.edu
> -------------------------------------------

From tziporet at mellanox.co.il Tue Sep 9 08:20:23 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 9 Sep 2008 18:20:23 +0300
Subject: [ofa-general] OFED 1.4-RC1 is available
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD75979A@mtlexch01.mtl.com>

Hi,

OFED 1.4-RC1 release is available on
http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-rc1.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/ for OFED 1.4

Tziporet & Vladimir

========================================================================

Release information:
--------------------
Linux Operating Systems:
- RedHat EL4 up4: 2.6.9-42.ELsmp *
- RedHat EL4 up5: 2.6.9-55.ELsmp
- RedHat EL4 up6: 2.6.9-67.ELsmp
- RedHat EL4 up7: 2.6.9-78.ELsmp
- RedHat EL5: 2.6.18-8.el5
- RedHat EL5 up1: 2.6.18-53.el5
- RedHat EL5 up2: 2.6.18-92.el5
- CentOS 5.2: 2.6.18-92.el5
- Fedora C9: 2.6.25-14.fc9 *
- SLES10: 2.6.16.21-0.8-smp
- SLES10 SP1: 2.6.16.46-0.12-smp
- SLES10 SP1 up1: 2.6.16.53-0.16-smp
- SLES10 SP2: 2.6.16.60-0.21-smp
- OpenSuSE 10.3: 2.6.22.5-31 *
- kernel.org: 2.6.26 and 2.6.27-rc5

* Minimal QA for these versions

Systems:
* x86_64
* x86
* ia64
* ppc64

Main Changes from OFED 1.4-beta
===============================
o Kernel code based on 2.6.27-rc5
o Added NFS-RDMA support for SLES10 SP2 and kernel 2.6.26 and 27
o iSER backports added and it's now available
o New MPI packages: Open MPI 1.2.7, MVAPICH 1.1 and MVAPICH2 1.1
o New DAPL libraries
o 37 bugs fixed (see attached for details)

Tasks that should be completed for the RC2:
===========================================
1. NFS-RDMA to work on RHEL 5.1
2. OSM: Cached routing
3. Clean up compilation warnings
4. Bug fixes

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4-rc1-fixed-bugs.csv Type: application/octet-stream Size: 3628 bytes Desc: ofed-1.4-rc1-fixed-bugs.csv URL: From yossi.openib at gmail.com Tue Sep 9 09:52:17 2008 From: yossi.openib at gmail.com (Yossi Etigin) Date: Tue, 09 Sep 2008 19:52:17 +0300 Subject: [ofa-general] ***SPAM*** [PATCH] ipoib: send creation parameters when doing send-only join Message-ID: <48C6A9C1.5070108@gmail.com> If creation parameters are not sent a sender will not trigger mcast group creation. Fixes bug #1153 in bugzilla. Signed-off-by: Yossi Etigin -- Index: b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2008-09-08 23:04:46.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2008-09-09 19:40:26.000000000 +0300 @@ -327,6 +327,7 @@ static int ipoib_mcast_sendonly_join(str .join_state = 1 #endif }; + ib_sa_comp_mask comp_mask; int ret = 0; if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { @@ -339,16 +340,37 @@ static int ipoib_mcast_sendonly_join(str return -EBUSY; } - rec.mgid = mcast->mcmember.mgid; - rec.port_gid = priv->local_gid; - rec.pkey = cpu_to_be16(priv->pkey); + comp_mask = + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE | + IB_SA_MCMEMBER_REC_QKEY | + IB_SA_MCMEMBER_REC_MTU_SELECTOR | + IB_SA_MCMEMBER_REC_MTU | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS | + IB_SA_MCMEMBER_REC_RATE_SELECTOR | + IB_SA_MCMEMBER_REC_RATE | + IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_HOP_LIMIT; + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = cpu_to_be16(priv->pkey); + rec.qkey = priv->broadcast->mcmember.qkey; + rec.mtu_selector = IB_SA_EQ; + rec.mtu = priv->broadcast->mcmember.mtu; + rec.traffic_class = priv->broadcast->mcmember.traffic_class; + rec.rate_selector = IB_SA_EQ; + rec.rate = 
priv->broadcast->mcmember.rate; + rec.sl = priv->broadcast->mcmember.sl; + rec.flow_label = priv->broadcast->mcmember.flow_label; + rec.hop_limit = priv->broadcast->mcmember.hop_limit; mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port, &rec, - IB_SA_MCMEMBER_REC_MGID | - IB_SA_MCMEMBER_REC_PORT_GID | - IB_SA_MCMEMBER_REC_PKEY | - IB_SA_MCMEMBER_REC_JOIN_STATE, + comp_mask, GFP_ATOMIC, ipoib_mcast_sendonly_join_complete, mcast); -- From chu11 at llnl.gov Tue Sep 9 10:01:43 2008 From: chu11 at llnl.gov (Al Chu) Date: Tue, 09 Sep 2008 10:01:43 -0700 Subject: [ofa-general] [OpenSM][Trivial] Fix comment typo Message-ID: <1220979703.27074.56.camel@cardanus.llnl.gov> Hey Sasha, Noticed it while looking at some other code in the header file. Al -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-fix-comment-typo.patch Type: text/x-patch Size: 759 bytes Desc: not available URL: From hal.rosenstock at gmail.com Tue Sep 9 11:35:48 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 9 Sep 2008 14:35:48 -0400 Subject: ***SPAM*** Re: [ofa-general] OpenSM Problems/Questions In-Reply-To: References: Message-ID: Hi, On Tue, Sep 9, 2008 at 9:01 AM, Matthew Trzyna wrote: > Hello > > > A "Basic Fabric Diagram" at the end. > > > I am working with a customer implementing a large IB fabric and is > encountering problems with OpenSM (OFED 1.3) when they added a new 264 node > cluster (with its own 288 port IB switch) to their existing cluster. Two > more 264 clusters are planned to be added in the near future. They recently > moved to SLES 10 SP1 and OFED 1.3 (before adding the new cluster) and had > not been experiencing these problems before. > > Could you help provide answers to the questions listed below? 
Additional > information about the configuration including a basic fabric diagram are > provided after the questions. > > What parameters should be set on the non-SM nodes that affect how the Subnet > Administrator functions? > What parameters should be set on the SM node(s) that affect how the Subnet > Administrator functions? And, what parameters should be removed from the SM > node(s)? (ie. ib_sa paths_per_dest=0x7f) > How should SM failover be setup? How many failover SM's should be > configured? This must happen quickly and transparently or GPFS will die > everywhere due to timeouts if this takes too long). What is quickly enough ? > Are there SA (Subnet Administrator) commands that should not be executed on > a large "live" fabric? (ie. "saquery -p") > Should GPFS be configured "off" on the SM node(s)? > Do you know of any other OpenSM implementations that have 5 (or more) 288 > port IB switches that might have already encountered/resolved some of these > issues? There are some deployments with multiple large switches deployed. Not sure what you mean by issues; I see questions above. > The following problem that is being encountered may also be SA/SM related. A > node (NodeX) may be seen (through IPoIB) by all but a few nodes (NodesA-G). > A ping from those node (NodesA-G) to NodeX returns "Destination Host > Unreachable". A ping from NodeX to NodesA-G works. Sounds like perhaps those nodes were unable to join the broadcast group perhaps due to a rate issue. -- Hal > -------------------------------------------------------------------------------------------------- > > System Information > > Here is the current opensm.conf file: (See attached file: opensm.conf) > > It is the default configuration from the OFED 1.3 build with "priority" > added at the bottom. Note that the /etc/init.d/opensmd sources > /etc/sysconfig/opensm not etc/sysconfig/opensm.conf (opensm.conf was just > copied to opensm). 
There are a couple of "proposed" settings that are > commented out, that were found them on the web. > > Following are the present settings that may affect the Fabric: > > /etc/infiniband/openib.conf > SET_IPOIB_CM=no > > /etc/modprobe.conf.local > options ib_ipoib send_queue_size=512 recv_queue_size=512 > options ib_sa paths_per_dest=0x7f > > /etc/sysctl.conf > net.ipv4.neigh.ib0.base_reachable_time = 1200 > net.ipv4.neigh.default.gc_thresh3 = 3072 > net.ipv4.neigh.default.gc_thresh2 = 2500 > net.ipv4.neigh.default.gc_thresh1 = 2048 > > /etc/sysconfig/opensm > All defaults as supplied with OFED 1.3 OpenSM > > > ------------------------------------------------------- > > > Basic Fabric Diagram > > +----------+ > |Top Level |-------------------+ 20 IO nodes > +-----------------| 288 port |----------------+ 16 Viual nodes > | | IB Sw |------------+ | 2 Admin nodes > | +------| |---+ | | (SM nodes) > | | +----------+ | | | 4 Support nodes > | | | | | | > | | | | | | > 24 24 24 24 24 24 <--uplinks > | | | | | | > | | | | | +------+ > | | | | | | > |(BASE) |(SCU1) |(SCU2) |(SCU3) |(SCU4) |(SCU5) > +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ > |288-port| |288-port| |288-port| |288-port| |288-port| |288-port| > | IB Sw | | IB Sw | | IB Sw | | IB Sw | | IB Sw | | IB Sw | > +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ > 140-nodes 264-nodes 264-nodes 264-nodes 264-nodes 264-nodes > WhiteBox Dell Dell IBM IBM IBM (future) > > NOTE: SCU4 is not currently connected to the Top Level Switch. > We'd like to address these issues before making that connection. > > Subnet Managers are configured on nodes connected to the > Top Leval Switch. > > Let me know if you need any more information. > > Any help you could provide would be most appreciated. > > Thanks. > > Matt Trzyna > IBM Linux Cluster Enablement > 3039 Cornwallis Rd. 
> RTP, NC 27709 > e-mail: trzyna at us.ibm.com > Office: (919) 254-9917 Tie Line: 444 > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From christopher.tanner at gatech.edu Tue Sep 9 11:53:39 2008 From: christopher.tanner at gatech.edu (Christopher Tanner) Date: Tue, 9 Sep 2008 14:53:39 -0400 Subject: [ofa-general] Compiled IB packages In-Reply-To: <48C6904E.1020606@mellanox.co.il> References: <0709481C-38BC-4598-870F-44FE8AE44FCE@gatech.edu> <48C6904E.1020606@mellanox.co.il> Message-ID: Thanks Vladimir - very helpful. However, I'm running into a problem with compiling the ofa package. First, I had to specify the source location on the command line (Ubuntu puts it in a different place than RedHat or SUSE): $ ./configure --kernel-sources=/usr/src/linux-source-2.6.24 ... (other stuff) I'm getting this error: ERROR: Kernel configuration is invalid. include/linux/autoconf.h or include/config/auto.conf are missing. Run 'make oldconfig && make prepare' on kernel src to fix it. This is confusing b/c both of those files exist. $ locate autoconf.h /usr/src/linux-headers-2.6.24-19-generic/include/linux/autoconf.h $ locate auto.conf /usr/src/linux-headers-2.6.24-19-generic/include/config/auto.conf There's a whole bunch more errors that I assume spawn because of this initial error. The output from 'make' is attached (it's pretty long). Let me know what you think. Thanks! ------------------------------------------- Chris Tanner Space Systems Design Lab Georgia Institute of Technology christopher.tanner at gatech.edu ------------------------------------------- On Sep 9, 2008, at 11:03 AM, Vladimir Sokolovsky wrote: > Christopher Tanner wrote: >> I am setting up a 16-node (homogeneous) cluster running Ubuntu 8.04 >> server with Mellanox Infiniband cards. 
I downloaded (from the >> OpenFabrics website), compiled, and installed the following IB >> packages on the master node into the /usr/local/lib directory. The / >> usr/local directory is being shared to all of the nodes via NFS. >> All packages seemed to compile and install fine. >> libibverbs >> librdmacm >> libibcm >> libipathverbs >> dapl >> compat-dapl >> libmlx4 >> libmthca >> libcxgb3 >> libibcommon >> libibumad >> libibmad >> opensm >> infiniband-diags >> I have a few questions: >> a) Do I need to run 'make install' on each node or just the master >> node? All of the libraries in /usr/local/lib are visible to all >> nodes... Stated another way, does 'make install' put files >> elsewhere beside the /usr/local/lib directory? Does it alter OS >> configuration files to tell it to look for certain files in /usr/ >> local/lib? > > No, all the packages above will put their files under /usr/local > >> b) I know I need to load the IB kernel modules (mlx4_core, >> mlx4_ib, rdma_ucm, ib_core, ib_mad, ib_mthca, ib_umad, ib_uverbs) >> in order for the IB cards to work. Are these compiled and installed >> with the above packages? Where does the kernel know where to look >> for modules? (Sorry, this question is very similar to the first one). > > The packages above are user space libraries/binaries. To install > kernel > modules you should download the latest version of the ofa_1_4_kernel > tgz file from: > > http://www.openfabrics.org/downloads/ofa_1_4_kernel/ > To install, run: > ./configure --with-core-mod --with-user_mad-mod --with-user_access- > mod --with-addr_trans-mod --with-mthca-mod --with-mthca_debug-mod -- > with-mlx4-mod --with-mlx4_en-mod --with-mlx4_debug-mod --with-cxgb3- > mod --with-ehca-mod --with-ipoib-mod --with-ipoib_debug-mod (... , > see --help) > make > make install > > >> c) The OFED software stack contains some stuff that isn't available >> for source download (e.g. ib-bonding, ibsim, libsdp). 
Are these >> necessary for the IB network to operate correctly? Since I'm >> running Ubuntu, obviously the src.rpm file won't work... > > All OFED tgz files that are available under: > http://www.openfabrics.org/~vlad/ofed_1_4/SOURCES/ > > ib-bonding source RPM can be downloaded from (you can open it to get > tgz file using cpio, if you need): > http://www.openfabrics.org/~monis/ofed_1_4/ > > This packages are not necessary for the IB network to operate > correctly, but > it depends on what are you planning to do. > > Regards, > Vladimir > >> Thanks to all for you help. Previous responses regarding issues >> with OpenSM worked great. >> ------------------------------------------- >> Chris Tanner >> Space Systems Design Lab >> Georgia Institute of Technology >> christopher.tanner at gatech.edu >> ------------------------------------------- From weiny2 at llnl.gov Tue Sep 9 12:11:40 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 9 Sep 2008 12:11:40 -0700 Subject: ***SPAM*** Re: [ofa-general] OpenSM Problems/Questions In-Reply-To: References: Message-ID: <20080909121140.1ec7838b.weiny2@llnl.gov> On Tue, 9 Sep 2008 14:35:48 -0400 "Hal Rosenstock" wrote: > Hi, > > On Tue, Sep 9, 2008 at 9:01 AM, Matthew Trzyna wrote: > > Hello > > > > > > A "Basic Fabric Diagram" at the end. > > > > > > I am working with a customer implementing a large IB fabric and is > > encountering problems with OpenSM (OFED 1.3) when they added a new 264 node > > cluster (with its own 288 port IB switch) to their existing cluster. Two > > more 264 clusters are planned to be added in the near future. They recently > > moved to SLES 10 SP1 and OFED 1.3 (before adding the new cluster) and had > > not been experiencing these problems before. Are there routing issues? > > > > Could you help provide answers to the questions listed below? Additional > > information about the configuration including a basic fabric diagram are > > provided after the questions. 
> > > > What parameters should be set on the non-SM nodes that affect how the Subnet > > Administrator functions? > > What parameters should be set on the SM node(s) that affect how the Subnet > > Administrator functions? And, what parameters should be removed from the SM > > node(s)? (ie. ib_sa paths_per_dest=0x7f) > > How should SM failover be setup? How many failover SM's should be > > configured? This must happen quickly and transparently or GPFS will die > > everywhere due to timeouts if this takes too long). > > What is quickly enough ? What does GPFS do that requires the SM/SA to be constantly available? Lustre is pretty stable (IB wise) once connected. Our SysAdmins can restart the SM almost at will without issues. As an aside, we do not run with a standby SM. We have not had many instances where OpenSM crashes (probably about 3 times in 3 years). So I think it is important to find out why GPFS needs the SM/SA and then make sure that is available. > > > Are there SA (Subnet Administrator) commands that should not be executed on > > a large "live" fabric? (ie. "saquery -p") > > Should GPFS be configured "off" on the SM node(s)? > > Do you know of any other OpenSM implementations that have 5 (or more) 288 > > port IB switches that might have already encountered/resolved some of these > > issues? > > There are some deployments with multiple large switches deployed. We have 2 clusters which currently have 4x288 port switches in them. Plus many more 24 port "leafs" off of those cores. OpenSM, while not perfect, does work quite well for us. > > Not sure what you mean by issues; I see questions above. I am not sure what the questions are either. Are you having problems with any particular diag or with OpenSM not running (routing?) correctly? > > > The following problem that is being encountered may also be SA/SM related. A > > node (NodeX) may be seen (through IPoIB) by all but a few nodes (NodesA-G).
> > A ping from those node (NodesA-G) to NodeX returns "Destination Host > > Unreachable". A ping from NodeX to NodesA-G works. > > Sounds like perhaps those nodes were unable to join the broadcast > group perhaps due to a rate issue. Hal is correct, and saquery is your friend here. If you use "genders" and "whatsup" (https://computing.llnl.gov/linux/downloads.html) I have a series of tools "Pragmatic InfiniBand Utilities (PIU)" (https://computing.llnl.gov/linux/piu.html) which includes a tool called "ibnodeinmcast" which can help debug this. What it does is use saquery [-g|-m] to find nodes in the multicast groups. With the addition of other LLNL tools this can be boiled down to which nodes "should" be in the group but are not. You are welcome to download that package and adapt it to your environment. Another cause could be that OpenSM is not routing something correctly. That will require some more debugging with dump_lfts.sh and dump_mfts.sh. Ira > > -- Hal > > > -------------------------------------------------------------------------------------------------- > > > > System Information > > > > Here is the current opensm.conf file: (See attached file: opensm.conf) > > > > It is the default configuration from the OFED 1.3 build with "priority" > > added at the bottom. Note that the /etc/init.d/opensmd sources > > /etc/sysconfig/opensm not etc/sysconfig/opensm.conf (opensm.conf was just > > copied to opensm). There are a couple of "proposed" settings that are > > commented out, that were found them on the web.
> > > > Following are the present settings that may affect the Fabric: > > > > /etc/infiniband/openib.conf > > SET_IPOIB_CM=no > > > > /etc/modprobe.conf.local > > options ib_ipoib send_queue_size=512 recv_queue_size=512 > > options ib_sa paths_per_dest=0x7f > > > > /etc/sysctl.conf > > net.ipv4.neigh.ib0.base_reachable_time = 1200 > > net.ipv4.neigh.default.gc_thresh3 = 3072 > > net.ipv4.neigh.default.gc_thresh2 = 2500 > > net.ipv4.neigh.default.gc_thresh1 = 2048 > > > > /etc/sysconfig/opensm > > All defaults as supplied with OFED 1.3 OpenSM > > > > > > ------------------------------------------------------- > > > > > > Basic Fabric Diagram > > > > +----------+ > > |Top Level |-------------------+ 20 IO nodes > > +-----------------| 288 port |----------------+ 16 Viual nodes > > | | IB Sw |------------+ | 2 Admin nodes > > | +------| |---+ | | (SM nodes) > > | | +----------+ | | | 4 Support nodes > > | | | | | | > > | | | | | | > > 24 24 24 24 24 24 <--uplinks > > | | | | | | > > | | | | | +------+ > > | | | | | | > > |(BASE) |(SCU1) |(SCU2) |(SCU3) |(SCU4) |(SCU5) > > +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ > > |288-port| |288-port| |288-port| |288-port| |288-port| |288-port| > > | IB Sw | | IB Sw | | IB Sw | | IB Sw | | IB Sw | | IB Sw | > > +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ > > 140-nodes 264-nodes 264-nodes 264-nodes 264-nodes 264-nodes > > WhiteBox Dell Dell IBM IBM IBM (future) > > > > NOTE: SCU4 is not currently connected to the Top Level Switch. > > We'd like to address these issues before making that connection. > > > > Subnet Managers are configured on nodes connected to the > > Top Leval Switch. > > > > Let me know if you need any more information. > > > > Any help you could provide would be most appreciated. > > > > Thanks. > > > > Matt Trzyna > > IBM Linux Cluster Enablement > > 3039 Cornwallis Rd. 
> > RTP, NC 27709 > > e-mail: trzyna at us.ibm.com > > Office: (919) 254-9917 Tie Line: 444 > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sweitzen at cisco.com Tue Sep 9 12:52:44 2008 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 9 Sep 2008 12:52:44 -0700 Subject: [ofa-general] OFED 1.4-RC1 is available In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD75979A@mtlexch01.mtl.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD75979A@mtlexch01.mtl.com> Message-ID: I am unable to build MVAPICH2 for multiple compilers: Building the MVAPICH2 RPM [OFA]... Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' --define 'dist %{nil}' --target x86_64 --define '_name mvapich2_gcc' --define 'impl ofa' --define 'rdma --with-rdma=gen2' --define 'ib_include --with-ib-include=/usr/include' --define 'ib_libpath --with-ib-libpath=/usr/lib64' --define 'shared_libs 1' --define 'romio 1' --define 'comp_env CC=gcc CXX=g++ F77=gfortran F90=gfortran' --define 'auto_req 0' --define 'mpi_selector /usr/bin/mpi-selector' --define '_prefix /usr/mpi/gcc/mvapich2-1.2rc2' /tmp/OFED-1.4-rc1/SRPMS/mvapich2-1.2rc2-4.src.rpm Install mvapich2_gcc RPM: Running rpm -iv --nodeps /tmp/OFED-1.4-rc1/RPMS/redhat-release-4AS-4.1/x86_64/mvapich2_gcc-1.2rc2-4.x86_64.rpm Build mvapich2_pgi RPM Building the MVAPICH2 RPM [OFA]...
Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' --define 'dist %{nil}' --target x86_64 --define '_name mvapich2_pgi' --define 'impl ofa' --define 'rdma --with-rdma=gen2' --define 'ib_include --with-ib-include=/usr/include' --define 'ib_libpath --with-ib-libpath=/usr/lib64' --define 'shared_libs 1' --define 'romio 1' --define 'comp_env CC=pgcc CXX=pgCC F77=pgf77 F90=pgf90' --define 'auto_req 0' --define 'mpi_selector /usr/bin/mpi-selector' --define '_prefix /usr/mpi/pgi/mvapich2-1.2rc2' /tmp/OFED-1.4-rc1/SRPMS/mvapich2-1.2rc2-4.src.rpm Install mvapich2_pgi RPM: Running rpm -iv --nodeps /tmp/OFED-1.4-rc1/RPMS/redhat-release-4AS-4.1/x86_64/mvapich2_pgi-1.2rc2-4.x86_64.rpm Failed to install mvapich2_pgi RPM See /tmp/OFED.12539.logs/mvapich2_pgi.rpminstall.log # more /tmp/OFED.12539.logs/mvapich2_pgi.rpminstall.log Preparing packages for installation... file /etc/mpe_graphics.conf from install of mvapich2_pgi-1.2rc2-4 conflicts with file from package mvapich2_gcc-1.2rc2-4 file /etc/mpe_log.conf from install of mvapich2_pgi-1.2rc2-4 conflicts with file from package mvapich2_gcc-1.2rc2-4 file /etc/mpe_mpianim.conf from install of mvapich2_pgi-1.2rc2-4 conflicts with file from package mvapich2_gcc-1.2rc2-4 file /etc/mpe_mpicheck.conf from install of mvapich2_pgi-1.2rc2-4 conflicts with file from package mvapich2_gcc-1.2rc2-4 file /etc/mpe_mpilog.conf from install of mvapich2_pgi-1.2rc2-4 conflicts with file from package mvapich2_gcc-1.2rc2-4 file /etc/mpe_mpitrace.conf from install of mvapich2_pgi-1.2rc2-4 conflicts with file from package mvapich2_gcc-1.2rc2-4 file /etc/mpe_nolog.conf from install of mvapich2_pgi-1.2rc2-4 conflicts with file from package mvapich2_gcc-1.2rc2-4 file /etc/mpicc.conf from install of mvapich2_pgi-1.2rc2-4 conflicts with file from package mvapich2_gcc-1.2rc2-4 file /etc/mpicxx.conf from install of mvapich2_pgi-1.2rc2-4 conflicts with file from package mvapich2_gcc-1.2rc2-4 file /etc/mpif77.conf from install of
mvapich2_pgi-1.2rc2-4 conflicts with file from package mvapich2_gcc-1.2rc2-4 file /etc/mpif90.conf from install of mvapich2_pgi-1.2rc2-4 conflicts with file from package mvapich2_gcc-1.2rc2-4 Scott Weitzenkamp SQA and Release Manager Server Access Virtualization Business Unit Cisco Systems > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > Tziporet Koren > Sent: Tuesday, September 09, 2008 8:20 AM > To: ewg at lists.openfabrics.org > Cc: general at lists.openfabrics.org > Subject: [ofa-general] OFED 1.4-RC1 is available > > Hi, > OFED 1.4-RC1 release is available on > http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-rc1.tgz > > To get BUILD_ID run ofed_info > > Please report any issues in bugzilla https://bugs.openfabrics.org/ for > OFED 1.4 > > Tziporet & Vladimir > > ============================================================== > ========== > > Release information: > -------------------- > Linux Operating Systems: > - RedHat EL4 up4: 2.6.9-42.ELsmp * > - RedHat EL4 up5: 2.6.9-55.ELsmp > - RedHat EL4 up6: 2.6.9-67.ELsmp > - RedHat EL4 up7: 2.6.9-78.ELsmp > - RedHat EL5: 2.6.18-8.el5 > - RedHat EL5 up1: 2.6.18-53.el5 > - RedHat EL5 up2: 2.6.18-92.el5 > - CentOS 5.2: 2.6.18-92.el5 > - Fedora C9: 2.6.25-14.fc9 * > - SLES10: 2.6.16.21-0.8-smp > - SLES10 SP1: 2.6.16.46-0.12-smp > - SLES10 SP1 up1: 2.6.16.53-0.16-smp > - SLES10 SP2: 2.6.16.60-0.21-smp > - OpenSuSE 10.3: 2.6.22.5-31 * > - kernel.org: 2.6.26 and 2.6.27-rc5 > > * Minimal QA for these versions > > Systems: > * x86_64 > * x86 > * ia64 > * ppc64 > > > Main Changes from OFED 1.4-beta > =============================== > o Kernel code based on 2.6.27-rc5 > o Added NFS-RDMA support for SLES10 SP2 and kernel 2.6.26 and 27 > o iSER backports added and its now available > o New MPI packages: Open MPI 1.2.7, MVAPICH 1.1 and MVAPICH2 1.1 > o New DAPL libraries > o 37 bugs fixed (see attached for details) > > >
Tasks that should be completed for the RC2: > =========================================== > 1. NFS-RDMA to work on RHEL 5.1 > 2. OSM: Cashed routing > 3. Cleanup compilation warning > 4. Bug fixes > From rdreier at cisco.com Tue Sep 9 12:45:28 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Sep 2008 12:45:28 -0700 Subject: [ofa-general] Re: [PATCH] ipoib: send creation parameters when doing send-only join In-Reply-To: <48C6A9C1.5070108@gmail.com> (Yossi Etigin's message of "Tue, 09 Sep 2008 19:52:17 +0300") References: <48C6A9C1.5070108@gmail.com> Message-ID: > If creation parameters are not sent a sender will not trigger > mcast group creation. Fixes bug #1153 in bugzilla. If there are no receivers and only senders, why would we want to create a multicast group? - R. From hal.rosenstock at gmail.com Tue Sep 9 13:16:28 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 9 Sep 2008 16:16:28 -0400 Subject: [ofa-general] Re: [PATCH] ipoib: send creation parameters when doing send-only join In-Reply-To: References: <48C6A9C1.5070108@gmail.com> Message-ID: On Tue, Sep 9, 2008 at 3:45 PM, Roland Dreier wrote: > > If creation parameters are not sent a sender will not trigger > > mcast group creation. Fixes bug #1153 in bugzilla. > > If there are no receivers and only senders, why would we want to create > a multicast group? IBA states for a MC group to be present there must be at least one "full" member (sender and receiver). So it's not only just senders (SendOnlyNonMembers) but also only just receivers (NonMembers) which won't cause group creation. -- Hal > > - R. 
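The rule Hal quotes can be put into a toy model. The sketch below is Python and purely illustrative: `ToySA`, `McGroup` and `JoinState` are hypothetical names, not OpenSM or kernel interfaces, and the model ignores the creation parameters that the patch is about. It assumes only the rule stated above, namely that a group is present only while it has at least one full member.

```python
from dataclasses import dataclass, field
from enum import Flag, auto


class JoinState(Flag):
    """Join states named in the thread (illustrative, not a real API)."""
    FULL_MEMBER = auto()
    NON_MEMBER = auto()            # receive-only
    SEND_ONLY_NON_MEMBER = auto()  # send-only


@dataclass
class McGroup:
    members: dict = field(default_factory=dict)  # port GUID -> JoinState

    def has_full_member(self) -> bool:
        return any(JoinState.FULL_MEMBER in s for s in self.members.values())


class ToySA:
    """Toy subnet administrator that tracks multicast groups by MGID."""

    def __init__(self):
        self.groups = {}  # MGID -> McGroup

    def join(self, mgid, port_guid, state) -> bool:
        group = self.groups.get(mgid)
        if group is None:
            # Only a full-member join may bring a group into existence.
            if JoinState.FULL_MEMBER not in state:
                return False
            group = self.groups[mgid] = McGroup()
        group.members[port_guid] = state
        return True

    def leave(self, mgid, port_guid) -> None:
        group = self.groups.get(mgid)
        if group is None:
            return
        group.members.pop(port_guid, None)
        # Without a full member the group is no longer present.
        if not group.has_full_member():
            del self.groups[mgid]


sa = ToySA()
print(sa.join("mgid-A", 1, JoinState.SEND_ONLY_NON_MEMBER))  # refused: no group yet
print(sa.join("mgid-A", 2, JoinState.FULL_MEMBER))           # creates the group
print(sa.join("mgid-A", 1, JoinState.SEND_ONLY_NON_MEMBER))  # now accepted
```

In this model a sender that joins first is simply refused, which is the behaviour the thread is debating for send-only joins.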
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From perkinjo at cse.ohio-state.edu Tue Sep 9 13:25:21 2008 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Tue, 9 Sep 2008 16:25:21 -0400 Subject: [ewg] RE: [ofa-general] OFED 1.4-RC1 is available In-Reply-To: References: <5D49E7A8952DC44FB38C38FA0D758EAD75979A@mtlexch01.mtl.com> Message-ID: <20080909202521.GG3716@cse.ohio-state.edu> Thanks for the note. We are taking a look at this. On Tue, Sep 09, 2008 at 12:52:44PM -0700, Scott Weitzenkamp (sweitzen) wrote: > I am unable to build MVAPICH2 for multiple compilers: > > Building the MVAPICH2 RPM [OFA]... > Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' > --define 'di > st %{nil}' --target x86_64 --define '_name mvapich2_gcc' --define 'impl > ofa' --d > efine 'rdma --with-rdma=gen2' --define 'ib_include > --with-ib-include=/usr/includ > e' --define 'ib_libpath --with-ib-libpath=/usr/lib64' --define > 'shared_libs 1' - > -define 'romio 1' --define 'comp_env CC=gcc CXX=g++ F77=gfortran > F90=gfortran' - > -define 'auto_req 0' --define 'mpi_selector /usr/bin/mpi-selector' > --define '_pr > efix /usr/mpi/gcc/mvapich2-1.2rc2' > /tmp/OFED-1.4-rc1/SRPMS/mvapich2-1.2rc2-4.src > .rpm > Install mvapich2_gcc RPM: > Running rpm -iv --nodeps > /tmp/OFED-1.4-rc1/RPMS/redhat-release-4AS-4.1/x86_64/mv > apich2_gcc-1.2rc2-4.x86_64.rpm > Build mvapich2_pgi RPM > Building the MVAPICH2 RPM [OFA]... 
> Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' > --define 'di > st %{nil}' --target x86_64 --define '_name mvapich2_pgi' --define 'impl > ofa' --d > efine 'rdma --with-rdma=gen2' --define 'ib_include > --with-ib-include=/usr/includ > e' --define 'ib_libpath --with-ib-libpath=/usr/lib64' --define > 'shared_libs 1' - > -define 'romio 1' --define 'comp_env CC=pgcc CXX=pgCC F77=pgf77 > F90=pgf90' --def > ine 'auto_req 0' --define 'mpi_selector /usr/bin/mpi-selector' --define > '_prefix > /usr/mpi/pgi/mvapich2-1.2rc2' > /tmp/OFED-1.4-rc1/SRPMS/mvapich2-1.2rc2-4.src.rpm > Install mvapich2_pgi RPM: > Running rpm -iv --nodeps > /tmp/OFED-1.4-rc1/RPMS/redhat-release-4AS-4.1/x86_64/mv > apich2_pgi-1.2rc2-4.x86_64.rpm > Failed to install mvapich2_pgi RPM > See /tmp/OFED.12539.logs/mvapich2_pgi.rpminstall.log > > # more /tmp/OFED.12539.logs/mvapich2_pgi.rpminstall.log > Preparing packages for installation... > file /etc/mpe_graphics.conf from install of > mvapich2_pgi-1.2rc2-4 confli > cts with file from package mvapich2_gcc-1.2rc2-4 > file /etc/mpe_log.conf from install of mvapich2_pgi-1.2rc2-4 > conflicts w > ith file from package mvapich2_gcc-1.2rc2-4 > file /etc/mpe_mpianim.conf from install of mvapich2_pgi-1.2rc2-4 > conflic > ts with file from package mvapich2_gcc-1.2rc2-4 > file /etc/mpe_mpicheck.conf from install of > mvapich2_pgi-1.2rc2-4 confli > cts with file from package mvapich2_gcc-1.2rc2-4 > file /etc/mpe_mpilog.conf from install of mvapich2_pgi-1.2rc2-4 > conflict > s with file from package mvapich2_gcc-1.2rc2-4 > file /etc/mpe_mpitrace.conf from install of > mvapich2_pgi-1.2rc2-4 confli > cts with file from package mvapich2_gcc-1.2rc2-4 > file /etc/mpe_nolog.conf from install of mvapich2_pgi-1.2rc2-4 > conflicts > with file from package mvapich2_gcc-1.2rc2-4 > file /etc/mpicc.conf from install of mvapich2_pgi-1.2rc2-4 > conflicts wit > h file from package mvapich2_gcc-1.2rc2-4 > file /etc/mpicxx.conf from install of mvapich2_pgi-1.2rc2-4 
> conflicts wi > th file from package mvapich2_gcc-1.2rc2-4 > file /etc/mpif77.conf from install of mvapich2_pgi-1.2rc2-4 > conflicts wi > th file from package mvapich2_gcc-1.2rc2-4 > file /etc/mpif90.conf from install of mvapich2_pgi-1.2rc2-4 > conflicts wi > th file from package mvapich2_gcc-1.2rc2-4 > > Scott Weitzenkamp > SQA and Release Manager > Server Access Virtualization Business Unit > Cisco Systems > > > > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org > > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > > Tziporet Koren > > Sent: Tuesday, September 09, 2008 8:20 AM > > To: ewg at lists.openfabrics.org > > Cc: general at lists.openfabrics.org > > Subject: [ofa-general] OFED 1.4-RC1 is available > > > > Hi, > > OFED 1.4-RC1 release is available on > > http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-rc1.tgz > > > > To get BUILD_ID run ofed_info > > > > Please report any issues in bugzilla https://bugs.openfabrics.org/ for > > OFED 1.4 > > > > Tziporet & Vladimir > > > > ============================================================== > > ========== > > > > Release information: > > -------------------- > > Linux Operating Systems: > > - RedHat EL4 up4: 2.6.9-42.ELsmp * > > - RedHat EL4 up5: 2.6.9-55.ELsmp > > - RedHat EL4 up6: 2.6.9-67.ELsmp > > - RedHat EL4 up7: 2.6.9-78.ELsmp > > - RedHat EL5: 2.6.18-8.el5 > > - RedHat EL5 up1: 2.6.18-53.el5 > > - RedHat EL5 up2: 2.6.18-92.el5 > > - CentOS 5.2: 2.6.18-92.el5 > > - Fedora C9: 2.6.25-14.fc9 * > > - SLES10: 2.6.16.21-0.8-smp > > - SLES10 SP1: 2.6.16.46-0.12-smp > > - SLES10 SP1 up1: 2.6.16.53-0.16-smp > > - SLES10 SP2: 2.6.16.60-0.21-smp > > - OpenSuSE 10.3: 2.6.22.5-31 * > > - kernel.org: 2.6.26 and 2.6.27-rc5 > > > > * Minimal QA for these versions > > > > Systems: > > * x86_64 > > * x86 > > * ia64 > > * ppc64 > > > > > > Main Changes from OFED 1.4-beta > > =============================== > > o Kernel code based on 2.6.27-rc5 > > o Added NFS-RDMA 
support for SLES10 SP2 and kernel 2.6.26 and 27 > > o iSER backports added and its now available > > o New MPI packages: Open MPI 1.2.7, MVAPICH 1.1 and MVAPICH2 1.1 > > o New DAPL libraries > > o 37 bugs fixed (see attached for details) > > > > > > Tasks that should be completed for the RC2: > > =========================================== > > 1. NFS-RDMA to work on RHEL 5.1 > > 2. OSM: Cashed routing > > 3. Cleanup compilation warning > > 4. Bug fixes > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From rdreier at cisco.com Tue Sep 9 13:30:16 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Sep 2008 13:30:16 -0700 Subject: [ofa-general] Re: [PATCH] ipoib: send creation parameters when doing send-only join In-Reply-To: (Hal Rosenstock's message of "Tue, 9 Sep 2008 16:16:28 -0400") References: <48C6A9C1.5070108@gmail.com> Message-ID: > IBA states for a MC group to be present there must be at least one > "full" member (sender and receiver). So it's not only just senders > (SendOnlyNonMembers) but also only just receivers (NonMembers) which > won't cause group creation. Sure, but the same question still applies. More pedantically, if there are only non-members and send-only non-members of a group, why would we expect it to be created? The patch in question wouldn't even work if we actually used the send-only join status in IPoIB. - R. From chu11 at llnl.gov Tue Sep 9 13:46:44 2008 From: chu11 at llnl.gov (Al Chu) Date: Tue, 09 Sep 2008 13:46:44 -0700 Subject: [ofa-general] [OpenSM][Trivial] remove old comments Message-ID: <1220993204.27074.61.camel@cardanus.llnl.gov> Hey Sasha, I assume some legacy comment that is no longer relevant (variable does not exist in the source). 
Al -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-remove-old-comment.patch Type: text/x-patch Size: 924 bytes Desc: not available URL: From rdreier at cisco.com Tue Sep 9 14:32:44 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Sep 2008 14:32:44 -0700 Subject: [ofa-general] [PATCH] ipoib: defer skb_orphan() until irqs enabled In-Reply-To: <20080909145435.GO2316@sgi.com> (akepner@sgi.com's message of "Tue, 9 Sep 2008 07:54:35 -0700") References: <20080909145435.GO2316@sgi.com> Message-ID: thanks, looks like a good fix. Good debugging too. I'll try to get this into 2.6.27. By the way, looking at this stuff again, it seems we have (a possibly quite unlikely) race where a send can complete before the xmit method finishes, and we end up running skb_orphan on an skb that another context has already freed. I'll have to think about how we can fix that -- but any good ideas are appreciated... - R.
From rdreier at cisco.com Tue Sep 9 14:42:00 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 09 Sep 2008 14:42:00 -0700 Subject: [ofa-general] [PATCH] ipoib: defer skb_orphan() until irqs enabled In-Reply-To: (Roland Dreier's message of "Tue, 09 Sep 2008 14:32:44 -0700") References: <20080909145435.GO2316@sgi.com> Message-ID: Actually I see this is not a regression from 2.6.26 (the bad patch was already in 2.6.26). So I'll queue this for 2.6.28 and hope to come up with a fix for the race in time too. - R From twbowman at gmail.com Tue Sep 9 14:53:56 2008 From: twbowman at gmail.com (Todd Bowman) Date: Tue, 9 Sep 2008 15:53:56 -0600 Subject: [ofa-general] ***SPAM*** opensm failure Message-ID: OpenSM Rev:openib-3.0.13 The opensm segfaulted during an initialization that seems to have been the result of a link state trap (type 1 num12) 09:49:51 914967 [41001960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x011A TID:0x00000000000016cc 09:49:51 948014 [41001960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x011A GID:0xfe80000000000000,0x0008f104003f0ab5 09:49:51 948477 [41802960] -> osm_report_notice: Reporting Generic Notice type:3 num:67 from LID:0x00FD GID:0xfe80000000000000,0x0002c902002064ad 09:49:51 948497 [41802960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x00FD GID:0xfe80000000000000,0x0002c902002064ad 09:49:51 948502 [41802960] -> __osm_drop_mgr_remove_port: Removed port with GUID:0x0002c90200207801 LID range [0x89,0x89] of node:n1008 09:49:51 948519 [41802960] -> osm_report_notice: Reporting Generic Notice type:3 num:67 from LID:0x00FD GID:0xfe80000000000000,0x0002c902002064ad 09:49:51 948529 [41802960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x00FD GID:0xfe80000000000000,0x0002c902002064ad ... ... ... 
09:49:51 962126 [41802960] -> __osm_drop_mgr_remove_port: Removed port with GUID:0x0002c902002064ad LID range [0xFD,0xFD] of node:hn HCA-1 09:49:52 044097 [41802960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's port object, GUID 0x0002c902002064ad 09:49:52 098558 [41001960] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_CHANGE_DETECTED(2) in state OSM_SM_STATE_SET_SUBNET_UCAST_LIDS_WAIT 09:49:52 098917 [41001960] -> __osm_state_mgr_check_tbl_consistency: ERR 3322: lid 0x6E is wrongly assigned to port 0x0008f104003f2cdb in port_lid_tbl 09:49:52 098936 [41001960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x00FD GID:0xfe80000000000000,0x0002c902002064ad 09:49:52 098944 [41001960] -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x0008f104003f2cdb LID range [0x0,0x0] of node:ISR9288/ISR9096 Voltaire sLB-24 09:49:52 098957 [41001960] -> osm_ucast_mgr_process: null (min-hop) tables configured on all switches 09:49:52 098992 [41001960] -> __osm_ucast_mgr_process_port: ERR 3A04: Port 0x8f104003f2cdb has LID 0. An initialization error occurred. Ignoring port 09:49:52 103405 [41802960] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_CHANGE_DETECTED(2) in state OSM_SM_STATE_SET_LINK_PORTS_WAIT 09:49:52 103626 [41001960] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_CHANGE_DETECTED(2) in state OSM_SM_STATE_SET_LINK_PORTS_WAIT 09:49:52 103856 [41001960] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_CHANGE_DETECTED(2) in state OSM_SM_STATE_SET_LINK_PORTS_WAIT 09:49:52 104077 [41802960] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_CHANGE_DETECTED(2) in state OSM_SM_STATE_SET_LINK_PORTS_WAIT ... ... ... 1) Why does the link down trap start the long chain of __osm_drop_mgr_remove_port? 2) Which of the errors may have caused the segfault?
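A rough way to get an overview of an excerpt like the one above is to count the ERR entries by code and reporting function. The sketch below is a minimal Python example and not part of OpenSM; the regular expression assumes the log layout shown in this message ("-> function: ERR code:"), and `tally_errors` and `SAMPLE_LOG` are hypothetical names. It only summarizes the log and does not by itself show which error led to the segfault.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt in this message:
#   HH:MM:SS usec [thread] -> function_name: ERR code: text
ERR_RE = re.compile(r"->\s+(\w+):\s+ERR\s+([0-9A-Fa-f]{4}):")


def tally_errors(log_text):
    """Count ERR entries, keyed by (error code, reporting function)."""
    return Counter((m.group(2), m.group(1)) for m in ERR_RE.finditer(log_text))


# Four entries copied from the log excerpt above.
SAMPLE_LOG = """\
09:49:52 044097 [41802960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's port object, GUID 0x0002c902002064ad
09:49:52 098558 [41001960] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_CHANGE_DETECTED(2) in state OSM_SM_STATE_SET_SUBNET_UCAST_LIDS_WAIT
09:49:52 098992 [41001960] -> __osm_ucast_mgr_process_port: ERR 3A04: Port 0x8f104003f2cdb has LID 0. An initialization error occurred. Ignoring port
09:49:52 103405 [41802960] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_CHANGE_DETECTED(2) in state OSM_SM_STATE_SET_LINK_PORTS_WAIT
"""

# Print the most frequent errors first.
for (code, func), count in tally_errors(SAMPLE_LOG).most_common():
    print(f"ERR {code} x{count} in {func}")
```

Run against the full opensm log rather than the sample, the same tally would show whether ERR 3303 dominates or whether an earlier, rarer error precedes the crash.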
Thanks, Todd -------------- next part -------------- An HTML attachment was scrubbed... URL: From ramachandra.kuchimanchi at qlogic.com Tue Sep 9 23:56:45 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Wed, 10 Sep 2008 12:26:45 +0530 Subject: ***SPAM*** Re: [ofa-general] Compiled IB packages In-Reply-To: References: <0709481C-38BC-4598-870F-44FE8AE44FCE@gatech.edu> <48C6904E.1020606@mellanox.co.il> Message-ID: <71d336490809092356y93ba6bcx304f119c496f0fcf@mail.gmail.com> On Wed, Sep 10, 2008 at 12:23 AM, Christopher Tanner wrote: > $ locate autoconf.h > /usr/src/linux-headers-2.6.24-19-generic/include/linux/autoconf.h Though it may not be related to this error, one issue that I see is that this kernel version may not work with OFED-1.4. OFED-1.4 has a list of kernels it supports and I don't think 2.6.24-19 is supported. One option could be to upgrade the kernel to kernel.org 2.6.26. Regards, Ram From vlad at mellanox.co.il Wed Sep 10 00:28:55 2008 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 10 Sep 2008 10:28:55 +0300 Subject: [ofa-general] Compiled IB packages In-Reply-To: References: <0709481C-38BC-4598-870F-44FE8AE44FCE@gatech.edu> <48C6904E.1020606@mellanox.co.il> Message-ID: <1221031735.6948.12.camel@vlad-laptop> Hi, From the log file, I see a mismatch between the sources you are passing to the configure command and the autoconf.h/auto.conf files below: /usr/src/linux-headers-2.6.24-19-generic/include/linux/autoconf.h /usr/src/linux-headers-2.6.24-19-generic/include/config/auto.conf From the log file: Kernel version: 2.6.24-16-server Modules directory: //lib/modules/2.6.24-16-server/updates Kernel sources: /usr/src/linux-source-2.6.24 Check that you have the linux-headers package that matches the running kernel installed; then you don't have to pass the --kernel-sources and --kernel parameters to the configure script. E.g.
for kernel 2.6.24-19-generic it is linux-headers-2.6.24-19-generic Regards, Vladimir On Tue, 2008-09-09 at 14:53 -0400, Christopher Tanner wrote: > Thanks Vladimir - very helpful. However, I'm running into a problem > with compiling the ofa package. First, I had to specify the source > location on the command line (Ubuntu puts it in a different place than > RedHat or SUSE): > > $ ./configure --kernel-sources=/usr/src/linux-source-2.6.24 ... (other > stuff) > > I'm getting this error: > > ERROR: Kernel configuration is invalid. > include/linux/autoconf.h or include/config/auto.conf are > missing. > Run 'make oldconfig && make prepare' on kernel src to fix it. > > This is confusing b/c both of those files exist. > $ locate autoconf.h > /usr/src/linux-headers-2.6.24-19-generic/include/linux/autoconf.h > > $ locate auto.conf > /usr/src/linux-headers-2.6.24-19-generic/include/config/auto.conf > > There's a whole bunch more errors that I assume spawn because of this > initial error. The output from 'make' is attached (it's pretty long). > Let me know what you think. Thanks! > > ------------------------------------------- > Chris Tanner > Space Systems Design Lab > Georgia Institute of Technology > christopher.tanner at gatech.edu > ------------------------------------------- > > > > On Sep 9, 2008, at 11:03 AM, Vladimir Sokolovsky wrote: > > > Christopher Tanner wrote: > >> I am setting up a 16-node (homogeneous) cluster running Ubuntu 8.04 > >> server with Mellanox Infiniband cards. I downloaded (from the > >> OpenFabrics website), compiled, and installed the following IB > >> packages on the master node into the /usr/local/lib directory. The / > >> usr/local directory is being shared to all of the nodes via NFS. > >> All packages seemed to compile and install fine. 
> >> libibverbs > >> librdmacm > >> libibcm > >> libipathverbs > >> dapl > >> compat-dapl > >> libmlx4 > >> libmthca > >> libcxgb3 > >> libibcommon > >> libibumad > >> libibmad > >> opensm > >> infiniband-diags > >> I have a few questions: > >> a) Do I need to run 'make install' on each node or just the master > >> node? All of the libraries in /usr/local/lib are visible to all > >> nodes... Stated another way, does 'make install' put files > >> elsewhere beside the /usr/local/lib directory? Does it alter OS > >> configuration files to tell it to look for certain files in /usr/ > >> local/lib? > > > > No, all the packages above will put their files under /usr/local > > > >> b) I know I need to load the IB kernel modules (mlx4_core, > >> mlx4_ib, rdma_ucm, ib_core, ib_mad, ib_mthca, ib_umad, ib_uverbs) > >> in order for the IB cards to work. Are these compiled and installed > >> with the above packages? Where does the kernel know where to look > >> for modules? (Sorry, this question is very similar to the first one). > > > > The packages above are user space libraries/binaries. To install > > kernel > > modules you should download the latest version of the ofa_1_4_kernel > > tgz file from: > > > > http://www.openfabrics.org/downloads/ofa_1_4_kernel/ > > To install, run: > > ./configure --with-core-mod --with-user_mad-mod --with-user_access- > > mod --with-addr_trans-mod --with-mthca-mod --with-mthca_debug-mod -- > > with-mlx4-mod --with-mlx4_en-mod --with-mlx4_debug-mod --with-cxgb3- > > mod --with-ehca-mod --with-ipoib-mod --with-ipoib_debug-mod (... , > > see --help) > > make > > make install > > > > > >> c) The OFED software stack contains some stuff that isn't available > >> for source download (e.g. ib-bonding, ibsim, libsdp). Are these > >> necessary for the IB network to operate correctly? Since I'm > >> running Ubuntu, obviously the src.rpm file won't work... 
> > > > All OFED tgz files that are available under: > > http://www.openfabrics.org/~vlad/ofed_1_4/SOURCES/ > > > > ib-bonding source RPM can be downloaded from (you can open it to get > > tgz file using cpio, if you need): > > http://www.openfabrics.org/~monis/ofed_1_4/ > > > > This packages are not necessary for the IB network to operate > > correctly, but > > it depends on what are you planning to do. > > > > Regards, > > Vladimir > > > >> Thanks to all for you help. Previous responses regarding issues > >> with OpenSM worked great. > >> ------------------------------------------- > >> Chris Tanner > >> Space Systems Design Lab > >> Georgia Institute of Technology > >> christopher.tanner at gatech.edu > >> ------------------------------------------- > From kliteyn at dev.mellanox.co.il Wed Sep 10 01:25:12 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 10 Sep 2008 11:25:12 +0300 Subject: [ofa-general] ***SPAM*** opensm failure In-Reply-To: References: Message-ID: <48C78468.9060101@dev.mellanox.co.il> Hi Todd, Todd Bowman wrote: > OpenSM Rev:openib-3.0.13 Can you upgrade to OFED 1.3.1? We had some bug that was causing opensm to drop the wrong transactions, and the errors in your log could be caused by that. 
The bug was fixed in OFED 1.3 -- Yevgeny > The opensm segfaulted during an initialization that seems to have been > the result of a link state trap (type 1 num12) > > > 09:49:51 914967 [41001960] -> __osm_trap_rcv_process_ > request: Received Generic Notice type:0x01 num:128 Producer:2 from > LID:0x011A TID:0x00000000000016cc > 09:49:51 948014 [41001960] -> osm_report_notice: Reporting Generic > Notice type:1 num:128 from LID:0x011A > GID:0xfe80000000000000,0x0008f104003f0ab5 > 09:49:51 948477 [41802960] -> osm_report_notice: Reporting Generic > Notice type:3 num:67 from LID:0x00FD > GID:0xfe80000000000000,0x0002c902002064ad > 09:49:51 948497 [41802960] -> osm_report_notice: Reporting Generic > Notice type:3 num:65 from LID:0x00FD > GID:0xfe80000000000000,0x0002c902002064ad > 09:49:51 948502 [41802960] -> __osm_drop_mgr_remove_port: Removed port > with GUID:0x0002c90200207801 LID range [0x89,0x89] of node:n1008 > 09:49:51 948519 [41802960] -> osm_report_notice: Reporting Generic > Notice type:3 num:67 from LID:0x00FD > GID:0xfe80000000000000,0x0002c902002064ad > 09:49:51 948529 [41802960] -> osm_report_notice: Reporting Generic > Notice type:3 num:65 from LID:0x00FD > GID:0xfe80000000000000,0x0002c902002064ad > ... > ... > ... 
> > 09:49:51 962126 [41802960] -> __osm_drop_mgr_remove_port: Removed port > with GUID:0x0002c902002064ad LID range [0xFD,0xFD] of node:hn HCA-1 > 09:49:52 044097 [41802960] -> __osm_lid_mgr_process_our_sm_node: ERR > 0308: Can't acquire SM's port object, GUID 0x0002c902002064ad > 09:49:52 098558 [41001960] -> __osm_state_mgr_signal_error: ERR 3303: > Invalid signal OSM_SIGNAL_CHANGE_DETECTED(2) in state > OSM_SM_STATE_SET_SUBNET_UCAST_LIDS_WAIT > 09:49:52 098917 [41001960] -> __osm_state_mgr_check_tbl_consistency: ERR > 3322: lid 0x6E is wrongly assigned to port 0x0008f104003f2cdb in > port_lid_tbl > 09:49:52 098936 [41001960] -> osm_report_notice: Reporting Generic > Notice type:3 num:64 from LID:0x00FD > GID:0xfe80000000000000,0x0002c902002064ad > 09:49:52 098944 [41001960] -> __osm_state_mgr_report_new_ports: > Discovered new port with GUID:0x0008f104003f2cdb LID range [0x0,0x0] of > node:ISR9288/ISR9096 Voltaire sLB-24 > 09:49:52 098957 [41001960] -> osm_ucast_mgr_process: null (min-hop) > tables configured on all switches > 09:49:52 098992 [41001960] -> __osm_ucast_mgr_process_port: ERR 3A04: > Port 0x8f104003f2cdb has LID 0. An initialization error occurred. > Ignoring port > 09:49:52 103405 [41802960] -> __osm_state_mgr_signal_error: ERR 3303: > Invalid signal OSM_SIGNAL_CHANGE_DETECTED(2) in state > OSM_SM_STATE_SET_LINK_PORTS_WAIT > 09:49:52 103626 [41001960] -> __osm_state_mgr_signal_error: ERR 3303: > Invalid signal OSM_SIGNAL_CHANGE_DETECTED(2) in state > OSM_SM_STATE_SET_LINK_PORTS_WAIT > 09:49:52 103856 [41001960] -> __osm_state_mgr_signal_error: ERR 3303: > Invalid signal OSM_SIGNAL_CHANGE_DETECTED(2) in state > OSM_SM_STATE_SET_LINK_PORTS_WAIT > 09:49:52 104077 [41802960] -> __osm_state_mgr_signal_error: ERR 3303: > Invalid signal OSM_SIGNAL_CHANGE_DETECTED(2) in state > OSM_SM_STATE_SET_LINK_PORTS_WAIT > ... > ... > ... > > > 1) Why does the link down trap, start the long chain of > __osm_drop_mgr_remove_port? 
>
> 2) Which of the errors may have caused the segfault?
>
>
>
> Thanks,
> Todd
>
>
> ------------------------------------------------------------------------
>

From yossi.openib at gmail.com Wed Sep 10 02:18:07 2008
From: yossi.openib at gmail.com (Yossi Etigin)
Date: Wed, 10 Sep 2008 12:18:07 +0300
Subject: [ofa-general] Re: [PATCH] ipoib: send creation parameters when doing send-only join
In-Reply-To:
References: <48C6A9C1.5070108@gmail.com>
Message-ID: <48C790CF.4050505@gmail.com>

Roland Dreier wrote:
> > IBA states for a MC group to be present there must be at least one
> > "full" member (sender and receiver). So it's not only just senders
> > (SendOnlyNonMembers) but also only just receivers (NonMembers) which
> > won't cause group creation.
>
> Sure, but the same question still applies. More pedantically, if there
> are only non-members and send-only non-members of a group, why would we
> expect it to be created?

ipoib senders are FullMembers so sm will try to create the group even if
there are only ipoib senders.

>
> The patch in question wouldn't even work if we actually used the
> send-only join status in IPoIB.
>
> - R.
>

But we don't, so it's required. Please see bug #1153 for case description.
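
For readers following the thread, the join-state rule discussed above can be
sketched roughly as follows. This is an illustration only, not code from IPoIB
or OpenSM: the enum names and the helper join_may_create_group() are
hypothetical, and the bit layout is assumed from the IBA MCMemberRecord
JoinState field (bit 0 FullMember, bit 1 NonMember, bit 2 SendOnlyNonMember).

```c
#include <stdbool.h>
#include <stdint.h>

/* JoinState bits of an MCMemberRecord (assumed layout, hypothetical names). */
enum join_state {
	JOIN_FULL_MEMBER          = 1 << 0,
	JOIN_NON_MEMBER           = 1 << 1,
	JOIN_SEND_ONLY_NON_MEMBER = 1 << 2,
};

/*
 * A join request can lead to creation of a missing group only if it asks
 * for full membership. NonMember and SendOnlyNonMember joins can only
 * attach to a group that already exists.
 */
static bool join_may_create_group(uint8_t join_state)
{
	return (join_state & JOIN_FULL_MEMBER) != 0;
}
```

Under these assumptions an IPoIB sender, which joins as a full member, gives
join_may_create_group(JOIN_FULL_MEMBER) == true, which is why its join request
needs to carry the group creation parameters.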

From vlad at lists.openfabrics.org Wed Sep 10 03:09:31 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 10 Sep 2008 03:09:31 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20080910-0200 daily build status
Message-ID: <20080910100931.A84D0E60D74@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.24

Failed:

From Sumit.Gaur at Sun.COM Wed Sep 10 03:32:28 2008
From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem)
Date: Wed, 10 Sep 2008 16:02:28 +0530
Subject: [ofa-general] upgrade from 1.2.5* to 1.3.1
In-Reply-To:
References: <20080905015440.9C33FE60D8F@openfabrics.org> <48C13182.3020500@Sun.COM>
Message-ID: <48C7A23C.8070209@Sun.COM>

Hi Hal,
I did some more debugging and found that only requests with hopcount 1 or
more are failing with recv packet status 110. I searched for this status in
error.h but found no value for it. Any idea?

sumit

Hal Rosenstock wrote:
> On Fri, Sep 5, 2008 at 9:17 AM, Sumit Gaur - Sun Microsystem
> wrote:
>
>>Hi
>>I have upgraded my OFED version from 1.2.5* to 1.3.1, Now application could
>>not communicate with OFED libraries using umad_send and umad_recv function
>>call for IB_SMI_DIRECT_CLASS (with DR path) requests. Is there any major change in umad lib
>>for such requests. Any help or info is appreciated.
>
>
> What kernel is being used ? On what machine architecture are you
> running ? Is it perhaps big endian ? I think there was a change that
> could affect those machines at a minimum.
>
> -- Hal
>
>
>>sumit
>>

From hal.rosenstock at gmail.com Wed Sep 10 06:07:06 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 10 Sep 2008 09:07:06 -0400
Subject: [ofa-general] upgrade from 1.2.5* to 1.3.1
In-Reply-To: <48C7A23C.8070209@Sun.COM>
References: <20080905015440.9C33FE60D8F@openfabrics.org> <48C13182.3020500@Sun.COM> <48C7A23C.8070209@Sun.COM>
Message-ID:

Hi Sumit,

On Wed, Sep 10, 2008 at 6:32 AM, Sumit Gaur - Sun Microsystem wrote:
> Hi Hal,
> I did some more debugging and found that only requests with hopcount 1 or more
> are failing with recv packet status 110. I searched for this status in error.h
> but found no value for it. Any idea?

110 is ETIMEDOUT

-- Hal

> sumit
>
> Hal Rosenstock wrote:
>>
>> On Fri, Sep 5, 2008 at 9:17 AM, Sumit Gaur - Sun Microsystem
>> wrote:
>>
>>> Hi
>>> I have upgraded my OFED version from 1.2.5* to 1.3.1, Now application
>>> could
>>> not communicate with OFED libraries using umad_send and umad_recv
>>> function
>>> call for IB_SMI_DIRECT_CLASS (with DR path) requests. Is there any major
>>> change in umad lib
>>> for such requests. Any help or info is appreciated.
>>
>>
>> What kernel is being used ? On what machine architecture are you
>> running ? Is it perhaps big endian ? I think there was a change that
>> could affect those machines at a minimum.
>>
>> -- Hal
>>
>>
>>> sumit
>>>
>

From halr at obsidianresearch.com Wed Sep 10 06:19:20 2008
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Wed, 10 Sep 2008 07:19:20 -0600
Subject: [ofa-general] [PATCH][TRIVIAL]osm_(helper trap_rcv).c: Change output format of notice type to unsigned decimal
Message-ID: <48C7C958.6080100@obsidianresearch.com>

Sasha,

Attached is a trivial patch to modify the output format of notice type
to unsigned decimal.

-- Hal
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch-notice-type1
URL:

From eli at dev.mellanox.co.il Wed Sep 10 06:51:16 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 10 Sep 2008 16:51:16 +0300
Subject: [ofa-general] [PATCH] ipoib: defer skb_orphan() until irqs enabled
In-Reply-To:
References: <20080909145435.GO2316@sgi.com>
Message-ID: <20080910135116.GB26881@mtls03>

On Tue, Sep 09, 2008 at 02:32:44PM -0700, Roland Dreier wrote:
> By the way, looking at this stuff again, it seems we have (a possibly
> quite unlikely) race where a send can complete before the xmit method
> finishes, and we end up running skb_orphan on an skb that another
> context has already freed. I'll have to think about how we can fix
> that -- but any good ideas are appreciated...
>

We can check if there are outstanding WRs after poll_tx is called. If
there are no outstanding WRs, it means that the SKB has been freed. If
there are outstanding WRs, it means that the last post has not been
freed so we can call skb_orphan().
Like the following patch (on top of Arthur's):

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 711a3ac..332526a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -532,7 +532,7 @@ int ipoib_send(struct net_device *dev, struct sk_buff *skb,
 	struct ipoib_tx_buf *tx_req;
 	int hlen;
 	void *phead;
-	int ret = 1; /* assume the worst */
+	int sent;

 	if (skb_is_gso(skb)) {
 		hlen = skb_transport_offset(skb) + tcp_hdrlen(skb);
@@ -594,7 +594,7 @@ int ipoib_send(struct net_device *dev, struct sk_buff *skb,
 		--priv->tx_outstanding;
 		ipoib_dma_unmap_tx(priv->ca, tx_req);
 		dev_kfree_skb_any(skb);
-		ret = 1;
+		sent = 0;
 		if (netif_queue_stopped(dev))
 			netif_wake_queue(dev);
 	} else {
@@ -602,14 +602,14 @@ int ipoib_send(struct net_device *dev, struct sk_buff *skb,
 		address->last_send = priv->tx_head;
 		++priv->tx_head;

-		ret = 0;
+		sent = 1;
 	}

 	if (unlikely(priv->tx_outstanding > MAX_SEND_CQE))
 		while (poll_tx(priv))
 			; /* nothing */

-	return ret;
+	return !(sent && priv->tx_outstanding);
 }

 static void __ipoib_reap_ah(struct net_device *dev)

From christopher.tanner at gatech.edu Wed Sep 10 06:59:45 2008
From: christopher.tanner at gatech.edu (Christopher Tanner)
Date: Wed, 10 Sep 2008 09:59:45 -0400
Subject: [ofa-general] Compiled IB packages
In-Reply-To: <1221031735.6948.12.camel@vlad-laptop>
References: <0709481C-38BC-4598-870F-44FE8AE44FCE@gatech.edu> <48C6904E.1020606@mellanox.co.il> <1221031735.6948.12.camel@vlad-laptop>
Message-ID: <94325E85-9403-4264-A4FE-90A567A8655B@gatech.edu>

Vladimir -

Good catch on the linux headers version - I fixed that now. The
problem persisted after fixing the headers... but I finally figured
out what the issues were. On the configure line:

a) the --kernel-sources option needs the path to the linux HEADERS
(linux-headers-), not the linux SOURCE (linux-source-).
Terminology there is confusing...
b) If I didn't specify anything for the --modules-dir option, it defaults to /lib/modules/2.6.24-16-server/updates. I don't know what the 'updates' gets appended onto the end, but that is not correct. So I had to specify --modules-dir=/lib/modules/2.6.24-16-server It compiled and installed just fine! My final question - how do I install the kernel modules on the rest of the nodes? The source was compiled in the /home directory, which is shared to all nodes via NFS. However, the kernel headers are NOT shared to the rest of the nodes. Do you recommend I: a) Install the linux headers on all of the nodes and execute 'make install' on all nodes b) Look at where the modules installed to (from the make install output) and copy the files manually Thanks! ------------------------------------------- Chris Tanner Space Systems Design Lab Georgia Institute of Technology christopher.tanner at gatech.edu ------------------------------------------- On Sep 10, 2008, at 3:28 AM, Vladimir Sokolovsky wrote: > Hi, >> From the log file, I see the mismatch between the sources you are > passing to configure command and autoconf.h/auto.conf below: > > /usr/src/linux-headers-2.6.24-19-generic/include/linux/autoconf.h > /usr/src/linux-headers-2.6.24-19-generic/include/config/auto.conf > >> From the log file: > Kernel version: 2.6.24-16-server > Modules directory: //lib/modules/2.6.24-16-server/updates > Kernel sources: /usr/src/linux-source-2.6.24 > > Check that you have corresponding (matching the running kernel) > linux-headers package installed and then you don't have to pass > --kernel-sources and --kernel parameters to the configure script. > > E.g. > for kernel 2.6.24-19-generic it is linux-headers-2.6.24-19-generic > > Regards, > Vladimir > > On Tue, 2008-09-09 at 14:53 -0400, Christopher Tanner wrote: >> Thanks Vladimir - very helpful. However, I'm running into a problem >> with compiling the ofa package. 
First, I had to specify the source >> location on the command line (Ubuntu puts it in a different place >> than >> RedHat or SUSE): >> >> $ ./configure --kernel-sources=/usr/src/linux-source-2.6.24 ... >> (other >> stuff) >> >> I'm getting this error: >> >> ERROR: Kernel configuration is invalid. >> include/linux/autoconf.h or include/config/auto.conf are >> missing. >> Run 'make oldconfig && make prepare' on kernel src to fix >> it. >> >> This is confusing b/c both of those files exist. >> $ locate autoconf.h >> /usr/src/linux-headers-2.6.24-19-generic/include/linux/autoconf.h >> >> $ locate auto.conf >> /usr/src/linux-headers-2.6.24-19-generic/include/config/auto.conf >> >> There's a whole bunch more errors that I assume spawn because of this >> initial error. The output from 'make' is attached (it's pretty long). >> Let me know what you think. Thanks! >> >> ------------------------------------------- >> Chris Tanner >> Space Systems Design Lab >> Georgia Institute of Technology >> christopher.tanner at gatech.edu >> ------------------------------------------- >> >> >> >> On Sep 9, 2008, at 11:03 AM, Vladimir Sokolovsky wrote: >> >>> Christopher Tanner wrote: >>>> I am setting up a 16-node (homogeneous) cluster running Ubuntu 8.04 >>>> server with Mellanox Infiniband cards. I downloaded (from the >>>> OpenFabrics website), compiled, and installed the following IB >>>> packages on the master node into the /usr/local/lib directory. >>>> The / >>>> usr/local directory is being shared to all of the nodes via NFS. >>>> All packages seemed to compile and install fine. >>>> libibverbs >>>> librdmacm >>>> libibcm >>>> libipathverbs >>>> dapl >>>> compat-dapl >>>> libmlx4 >>>> libmthca >>>> libcxgb3 >>>> libibcommon >>>> libibumad >>>> libibmad >>>> opensm >>>> infiniband-diags >>>> I have a few questions: >>>> a) Do I need to run 'make install' on each node or just the master >>>> node? All of the libraries in /usr/local/lib are visible to all >>>> nodes... 
Stated another way, does 'make install' put files >>>> elsewhere beside the /usr/local/lib directory? Does it alter OS >>>> configuration files to tell it to look for certain files in /usr/ >>>> local/lib? >>> >>> No, all the packages above will put their files under /usr/local >>> >>>> b) I know I need to load the IB kernel modules (mlx4_core, >>>> mlx4_ib, rdma_ucm, ib_core, ib_mad, ib_mthca, ib_umad, ib_uverbs) >>>> in order for the IB cards to work. Are these compiled and installed >>>> with the above packages? Where does the kernel know where to look >>>> for modules? (Sorry, this question is very similar to the first >>>> one). >>> >>> The packages above are user space libraries/binaries. To install >>> kernel >>> modules you should download the latest version of the ofa_1_4_kernel >>> tgz file from: >>> >>> http://www.openfabrics.org/downloads/ofa_1_4_kernel/ >>> To install, run: >>> ./configure --with-core-mod --with-user_mad-mod --with-user_access- >>> mod --with-addr_trans-mod --with-mthca-mod --with-mthca_debug-mod -- >>> with-mlx4-mod --with-mlx4_en-mod --with-mlx4_debug-mod --with-cxgb3- >>> mod --with-ehca-mod --with-ipoib-mod --with-ipoib_debug-mod (... , >>> see --help) >>> make >>> make install >>> >>> >>>> c) The OFED software stack contains some stuff that isn't available >>>> for source download (e.g. ib-bonding, ibsim, libsdp). Are these >>>> necessary for the IB network to operate correctly? Since I'm >>>> running Ubuntu, obviously the src.rpm file won't work... >>> >>> All OFED tgz files that are available under: >>> http://www.openfabrics.org/~vlad/ofed_1_4/SOURCES/ >>> >>> ib-bonding source RPM can be downloaded from (you can open it to get >>> tgz file using cpio, if you need): >>> http://www.openfabrics.org/~monis/ofed_1_4/ >>> >>> This packages are not necessary for the IB network to operate >>> correctly, but >>> it depends on what are you planning to do. >>> >>> Regards, >>> Vladimir >>> >>>> Thanks to all for you help. 
Previous responses regarding issues
>>>> with OpenSM worked great.
>>>> -------------------------------------------
>>>> Chris Tanner
>>>> Space Systems Design Lab
>>>> Georgia Institute of Technology
>>>> christopher.tanner at gatech.edu
>>>> -------------------------------------------
>>

From vlad at mellanox.co.il Wed Sep 10 07:14:18 2008
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 10 Sep 2008 17:14:18 +0300
Subject: [ofa-general] Compiled IB packages
In-Reply-To: <94325E85-9403-4264-A4FE-90A567A8655B@gatech.edu>
References: <0709481C-38BC-4598-870F-44FE8AE44FCE@gatech.edu> <48C6904E.1020606@mellanox.co.il> <1221031735.6948.12.camel@vlad-laptop> <94325E85-9403-4264-A4FE-90A567A8655B@gatech.edu>
Message-ID: <48C7D63A.8090005@mellanox.co.il>

Christopher Tanner wrote:
> Vladimir -
>
> Good catch on the linux headers version - I fixed that now. The problem
> persisted after fixing the headers... but I finally figured out what the
> issues were. On the configure line:
>
> a) the --kernel-sources option needs the path to the linux HEADERS
> (linux-headers-), not the linux SOURCE (linux-source-).
> Terminology there is confusing...
>

If you are compiling for the running kernel, configure will find the kernel
sources through the /lib/modules/`uname -r`/build link.
So you don't have to pass '--kernel-sources' and '--kernel'.

> b) If I didn't specify anything for the --modules-dir option, it
> defaults to /lib/modules/2.6.24-16-server/updates. I don't know what the
> 'updates' gets appended onto the end, but that is not correct. So I had
> to specify --modules-dir=/lib/modules/2.6.24-16-server

Why do you think that updates is wrong?
modprobe works with the /lib/modules/`uname -r`/updates directory in the
following way:
if a kernel module with the same name is present both under
/lib/modules/`uname -r`/kernel and under /lib/modules/`uname -r`/updates,
then the module from updates will be loaded.

>
> It compiled and installed just fine!
> > My final question - how do I install the kernel modules on the rest of > the nodes? The source was compiled in the /home directory, which is > shared to all nodes via NFS. However, the kernel headers are NOT shared > to the rest of the nodes. Do you recommend I: > > a) Install the linux headers on all of the nodes and execute 'make > install' on all nodes > b) Look at where the modules installed to (from the make install output) > and copy the files manually > Both options are good. Note, if you use option b) then you need to run "depmod" after copying kernel modules. Regards, Vladimir > Thanks! > > ------------------------------------------- > Chris Tanner > Space Systems Design Lab > Georgia Institute of Technology > christopher.tanner at gatech.edu > ------------------------------------------- > > > > On Sep 10, 2008, at 3:28 AM, Vladimir Sokolovsky wrote: > >> Hi, >>> From the log file, I see the mismatch between the sources you are >> passing to configure command and autoconf.h/auto.conf below: >> >> /usr/src/linux-headers-2.6.24-19-generic/include/linux/autoconf.h >> /usr/src/linux-headers-2.6.24-19-generic/include/config/auto.conf >> >>> From the log file: >> Kernel version: 2.6.24-16-server >> Modules directory: //lib/modules/2.6.24-16-server/updates >> Kernel sources: /usr/src/linux-source-2.6.24 >> >> Check that you have corresponding (matching the running kernel) >> linux-headers package installed and then you don't have to pass >> --kernel-sources and --kernel parameters to the configure script. >> >> E.g. >> for kernel 2.6.24-19-generic it is linux-headers-2.6.24-19-generic >> >> Regards, >> Vladimir >> >> On Tue, 2008-09-09 at 14:53 -0400, Christopher Tanner wrote: >>> Thanks Vladimir - very helpful. However, I'm running into a problem >>> with compiling the ofa package. 
First, I had to specify the source >>> location on the command line (Ubuntu puts it in a different place than >>> RedHat or SUSE): >>> >>> $ ./configure --kernel-sources=/usr/src/linux-source-2.6.24 ... (other >>> stuff) >>> >>> I'm getting this error: >>> >>> ERROR: Kernel configuration is invalid. >>> include/linux/autoconf.h or include/config/auto.conf are >>> missing. >>> Run 'make oldconfig && make prepare' on kernel src to fix it. >>> >>> This is confusing b/c both of those files exist. >>> $ locate autoconf.h >>> /usr/src/linux-headers-2.6.24-19-generic/include/linux/autoconf.h >>> >>> $ locate auto.conf >>> /usr/src/linux-headers-2.6.24-19-generic/include/config/auto.conf >>> >>> There's a whole bunch more errors that I assume spawn because of this >>> initial error. The output from 'make' is attached (it's pretty long). >>> Let me know what you think. Thanks! >>> >>> ------------------------------------------- >>> Chris Tanner >>> Space Systems Design Lab >>> Georgia Institute of Technology >>> christopher.tanner at gatech.edu >>> ------------------------------------------- >>> >>> >>> >>> On Sep 9, 2008, at 11:03 AM, Vladimir Sokolovsky wrote: >>> >>>> Christopher Tanner wrote: >>>>> I am setting up a 16-node (homogeneous) cluster running Ubuntu 8.04 >>>>> server with Mellanox Infiniband cards. I downloaded (from the >>>>> OpenFabrics website), compiled, and installed the following IB >>>>> packages on the master node into the /usr/local/lib directory. The / >>>>> usr/local directory is being shared to all of the nodes via NFS. >>>>> All packages seemed to compile and install fine. >>>>> libibverbs >>>>> librdmacm >>>>> libibcm >>>>> libipathverbs >>>>> dapl >>>>> compat-dapl >>>>> libmlx4 >>>>> libmthca >>>>> libcxgb3 >>>>> libibcommon >>>>> libibumad >>>>> libibmad >>>>> opensm >>>>> infiniband-diags >>>>> I have a few questions: >>>>> a) Do I need to run 'make install' on each node or just the master >>>>> node? 
All of the libraries in /usr/local/lib are visible to all >>>>> nodes... Stated another way, does 'make install' put files >>>>> elsewhere beside the /usr/local/lib directory? Does it alter OS >>>>> configuration files to tell it to look for certain files in /usr/ >>>>> local/lib? >>>> >>>> No, all the packages above will put their files under /usr/local >>>> >>>>> b) I know I need to load the IB kernel modules (mlx4_core, >>>>> mlx4_ib, rdma_ucm, ib_core, ib_mad, ib_mthca, ib_umad, ib_uverbs) >>>>> in order for the IB cards to work. Are these compiled and installed >>>>> with the above packages? Where does the kernel know where to look >>>>> for modules? (Sorry, this question is very similar to the first one). >>>> >>>> The packages above are user space libraries/binaries. To install >>>> kernel >>>> modules you should download the latest version of the ofa_1_4_kernel >>>> tgz file from: >>>> >>>> http://www.openfabrics.org/downloads/ofa_1_4_kernel/ >>>> To install, run: >>>> ./configure --with-core-mod --with-user_mad-mod --with-user_access- >>>> mod --with-addr_trans-mod --with-mthca-mod --with-mthca_debug-mod -- >>>> with-mlx4-mod --with-mlx4_en-mod --with-mlx4_debug-mod --with-cxgb3- >>>> mod --with-ehca-mod --with-ipoib-mod --with-ipoib_debug-mod (... , >>>> see --help) >>>> make >>>> make install >>>> >>>> >>>>> c) The OFED software stack contains some stuff that isn't available >>>>> for source download (e.g. ib-bonding, ibsim, libsdp). Are these >>>>> necessary for the IB network to operate correctly? Since I'm >>>>> running Ubuntu, obviously the src.rpm file won't work... 
>>>>
>>>> All OFED tgz files that are available under:
>>>> http://www.openfabrics.org/~vlad/ofed_1_4/SOURCES/
>>>>
>>>> ib-bonding source RPM can be downloaded from (you can open it to get
>>>> tgz file using cpio, if you need):
>>>> http://www.openfabrics.org/~monis/ofed_1_4/
>>>>
>>>> This packages are not necessary for the IB network to operate
>>>> correctly, but
>>>> it depends on what are you planning to do.
>>>>
>>>> Regards,
>>>> Vladimir
>>>>
>>>>> Thanks to all for you help. Previous responses regarding issues
>>>>> with OpenSM worked great.
>>>>> -------------------------------------------
>>>>> Chris Tanner
>>>>> Space Systems Design Lab
>>>>> Georgia Institute of Technology
>>>>> christopher.tanner at gatech.edu
>>>>> -------------------------------------------
>>>
>

From yossi.openib at gmail.com Wed Sep 10 07:32:27 2008
From: yossi.openib at gmail.com (Yossi Etigin)
Date: Wed, 10 Sep 2008 17:32:27 +0300
Subject: ***SPAM*** Fwd: [ofa-general] [PATCH] ipoib: fix hang while bringing down uninitialized interface
Message-ID: <48C7DA7B.3050706@gmail.com>

Roland,
Can you comment on this? It fixes a soft lockup during ipoib stop.

-------- Original Message --------
Subject: [ofa-general] ***SPAM*** [PATCH] ipoib: fix hang while bringing down uninitialized interface
Date: Fri, 05 Sep 2008 18:00:46 +0300
From: Yossi Etigin
To: Roland Dreier
CC: Olga Shern , general list

Fix bug #1172:
If a pkey for an interface is not found during initialization, then
poll_timer is left uninitialized. When the device is brought down, ipoib
tries to del_timer_sync() it. This call hangs in an infinite loop in
lock_timer_base(), because timer_base is NULL.
We should check whether the timer was really initialized.

Signed-off-by: Yossi Etigin

--

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 66cafa2..3bbf46d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -850,7 +850,10 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 	ipoib_dbg(priv, "All sends and receives done.\n");

 timeout:
-	del_timer_sync(&priv->poll_timer);
+	/* Make sure the timer is initialized */
+	if (priv->poll_timer.function)
+		del_timer_sync(&priv->poll_timer);
+
 	qp_attr.qp_state = IB_QPS_RESET;
 	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
 		ipoib_warn(priv, "Failed to modify QP to RESET state\n");

--Yossi

From alexs at linux.vnet.ibm.com Wed Sep 10 07:23:56 2008
From: alexs at linux.vnet.ibm.com (Alexander Schmidt)
Date: Wed, 10 Sep 2008 16:23:56 +0200
Subject: [ofa-general] [PATCH] ib/ehca: add flush CQE generation
Message-ID: <20080910162356.7294fe87@BL3D1974.boeblingen.de.ibm.com>

When a QP goes into error state, it is required that flush CQEs are
delivered to the application for any outstanding work requests. eHCA does not
do this in hardware, so this patch adds software flush CQE generation to the
ehca driver.

Whenever a QP gets into error state, it is added to the QP error list of its
respective CQ. If the error QP list of a CQ is not empty, poll_cq()
generates flush CQEs before polling the actual CQ.

Signed-off-by: Alexander Schmidt
---
Applies on top of 2.6.27-rc3, please consider this for 2.6.28.

 drivers/infiniband/hw/ehca/ehca_classes.h |   14 +
 drivers/infiniband/hw/ehca/ehca_cq.c      |    3
 drivers/infiniband/hw/ehca/ehca_iverbs.h  |    2
 drivers/infiniband/hw/ehca/ehca_qp.c      |  225 ++++++++++++++++++++++++++++--
 drivers/infiniband/hw/ehca/ehca_reqs.c    |  211 ++++++++++++++++++++++++----
 5 files changed, 412 insertions(+), 43 deletions(-)

--- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_classes.h
+++ infiniband.git/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -164,6 +164,13 @@ struct ehca_qmap_entry {
 	u16 reported;
 };

+struct ehca_queue_map {
+	struct ehca_qmap_entry *map;
+	unsigned int entries;
+	unsigned int tail;
+	unsigned int left_to_poll;
+};
+
 struct ehca_qp {
 	union {
 		struct ib_qp ib_qp;
@@ -173,8 +180,9 @@ struct ehca_qp {
 	enum ehca_ext_qp_type ext_type;
 	enum ib_qp_state state;
 	struct ipz_queue ipz_squeue;
-	struct ehca_qmap_entry *sq_map;
+	struct ehca_queue_map sq_map;
 	struct ipz_queue ipz_rqueue;
+	struct ehca_queue_map rq_map;
 	struct h_galpas galpas;
 	u32 qkey;
 	u32 real_qp_num;
@@ -204,6 +212,8 @@ struct ehca_qp {
 	atomic_t nr_events; /* events seen */
 	wait_queue_head_t wait_completion;
 	int mig_armed;
+	struct list_head sq_err_node;
+	struct list_head rq_err_node;
 };

 #define IS_SRQ(qp) (qp->ext_type == EQPT_SRQ)
@@ -233,6 +243,8 @@ struct ehca_cq {
 	/* mmap counter for resources mapped into user space */
 	u32 mm_count_queue;
 	u32 mm_count_galpa;
+	struct list_head sqp_err_list;
+	struct list_head rqp_err_list;
 };

 enum ehca_mr_flag {
--- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ infiniband.git/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -53,9 +53,25 @@
 /* in RC traffic, insert an empty RDMA READ every this many packets */
 #define ACK_CIRC_THRESHOLD 2000000

+static u64 replace_wr_id(u64 wr_id, u16 idx)
+{
+	u64 ret;
+
+	ret = wr_id & ~QMAP_IDX_MASK;
+	ret |= idx & QMAP_IDX_MASK;
+
+	return ret;
+}
+
+static u16 get_app_wr_id(u64 wr_id)
+{
+	return wr_id & QMAP_IDX_MASK;
+}
+
 static inline int ehca_write_rwqe(struct ipz_queue *ipz_rqueue,
 				  struct ehca_wqe *wqe_p,
-				  struct ib_recv_wr *recv_wr)
+				  struct ib_recv_wr *recv_wr,
+				  u32 rq_map_idx)
 {
 	u8 cnt_ds;
 	if (unlikely((recv_wr->num_sge < 0) ||
@@ -69,7 +85,7 @@ static inline int ehca_write_rwqe(struct
 	/* clear wqe header until sglist */
 	memset(wqe_p, 0, offsetof(struct ehca_wqe, u.ud_av.sg_list));
 
-	wqe_p->work_request_id = recv_wr->wr_id;
+	wqe_p->work_request_id = replace_wr_id(recv_wr->wr_id, rq_map_idx);
 	wqe_p->nr_of_data_seg = recv_wr->num_sge;
 
 	for (cnt_ds = 0; cnt_ds < recv_wr->num_sge; cnt_ds++) {
@@ -146,6 +162,7 @@ static inline int ehca_write_swqe(struct
 	u64 dma_length;
 	struct ehca_av *my_av;
 	u32 remote_qkey = send_wr->wr.ud.remote_qkey;
+	struct ehca_qmap_entry *qmap_entry = &qp->sq_map.map[sq_map_idx];
 
 	if (unlikely((send_wr->num_sge < 0) ||
 		     (send_wr->num_sge > qp->ipz_squeue.act_nr_of_sg))) {
@@ -158,11 +175,10 @@ static inline int ehca_write_swqe(struct
 	/* clear wqe header until sglist */
 	memset(wqe_p, 0, offsetof(struct ehca_wqe, u.ud_av.sg_list));
 
-	wqe_p->work_request_id = send_wr->wr_id & ~QMAP_IDX_MASK;
-	wqe_p->work_request_id |= sq_map_idx & QMAP_IDX_MASK;
+	wqe_p->work_request_id = replace_wr_id(send_wr->wr_id, sq_map_idx);
 
-	qp->sq_map[sq_map_idx].app_wr_id = send_wr->wr_id & QMAP_IDX_MASK;
-	qp->sq_map[sq_map_idx].reported = 0;
+	qmap_entry->app_wr_id = get_app_wr_id(send_wr->wr_id);
+	qmap_entry->reported = 0;
 
 	switch (send_wr->opcode) {
 	case IB_WR_SEND:
@@ -496,7 +512,9 @@ static int internal_post_recv(struct ehc
 	struct ehca_wqe *wqe_p;
 	int wqe_cnt = 0;
 	int ret = 0;
+	u32 rq_map_idx;
 	unsigned long flags;
+	struct ehca_qmap_entry *qmap_entry;
 
 	if (unlikely(!HAS_RQ(my_qp))) {
 		ehca_err(dev, "QP has no RQ ehca_qp=%p qp_num=%x ext_type=%d",
@@ -524,8 +542,15 @@ static int internal_post_recv(struct ehc
 			}
 			goto post_recv_exit0;
 		}
+		/*
+		 * Get the index of the WQE in the recv queue. The same index
+		 * is used for writing into the rq_map.
+		 */
+		rq_map_idx = start_offset / my_qp->ipz_rqueue.qe_size;
+
 		/* write a RECV WQE into the QUEUE */
-		ret = ehca_write_rwqe(&my_qp->ipz_rqueue, wqe_p, cur_recv_wr);
+		ret = ehca_write_rwqe(&my_qp->ipz_rqueue, wqe_p, cur_recv_wr,
+				rq_map_idx);
 		/*
 		 * if something failed,
 		 * reset the free entry pointer to the start value
@@ -540,6 +565,11 @@ static int internal_post_recv(struct ehc
 			}
 			goto post_recv_exit0;
 		}
+
+		qmap_entry = &my_qp->rq_map.map[rq_map_idx];
+		qmap_entry->app_wr_id = get_app_wr_id(cur_recv_wr->wr_id);
+		qmap_entry->reported = 0;
+
 		wqe_cnt++;
 	} /* eof for cur_recv_wr */
 
@@ -596,10 +626,12 @@ static const u8 ib_wc_opcode[255] = {
 /* internal function to poll one entry of cq */
 static inline int ehca_poll_cq_one(struct ib_cq *cq, struct ib_wc *wc)
 {
-	int ret = 0;
+	int ret = 0, qmap_tail_idx;
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
 	struct ehca_cqe *cqe;
 	struct ehca_qp *my_qp;
+	struct ehca_qmap_entry *qmap_entry;
+	struct ehca_queue_map *qmap;
 	int cqe_count = 0, is_error;
 
 repoll:
@@ -674,27 +706,52 @@ repoll:
 		goto repoll;
 	wc->qp = &my_qp->ib_qp;
 
-	if (!(cqe->w_completion_flags & WC_SEND_RECEIVE_BIT)) {
-		struct ehca_qmap_entry *qmap_entry;
+	if (is_error) {
 		/*
-		 * We got a send completion and need to restore the original
-		 * wr_id.
+		 * set left_to_poll to 0 because in error state, we will not
+		 * get any additional CQEs
 		 */
-		qmap_entry = &my_qp->sq_map[cqe->work_request_id &
-					    QMAP_IDX_MASK];
+		ehca_add_to_err_list(my_qp, 1);
+		my_qp->sq_map.left_to_poll = 0;
 
-		if (qmap_entry->reported) {
-			ehca_warn(cq->device, "Double cqe on qp_num=%#x",
-				  my_qp->real_qp_num);
-			/* found a double cqe, discard it and read next one */
-			goto repoll;
-		}
-		wc->wr_id = cqe->work_request_id & ~QMAP_IDX_MASK;
-		wc->wr_id |= qmap_entry->app_wr_id;
-		qmap_entry->reported = 1;
-	} else
+		if (HAS_RQ(my_qp))
+			ehca_add_to_err_list(my_qp, 0);
+		my_qp->rq_map.left_to_poll = 0;
+	}
+
+	qmap_tail_idx = get_app_wr_id(cqe->work_request_id);
+	if (!(cqe->w_completion_flags & WC_SEND_RECEIVE_BIT))
+		/* We got a send completion. */
+		qmap = &my_qp->sq_map;
+	else
 		/* We got a receive completion. */
-		wc->wr_id = cqe->work_request_id;
+		qmap = &my_qp->rq_map;
+
+	qmap_entry = &qmap->map[qmap_tail_idx];
+	if (qmap_entry->reported) {
+		ehca_warn(cq->device, "Double cqe on qp_num=%#x",
+				my_qp->real_qp_num);
+		/* found a double cqe, discard it and read next one */
+		goto repoll;
+	}
+
+	wc->wr_id = replace_wr_id(cqe->work_request_id, qmap_entry->app_wr_id);
+	qmap_entry->reported = 1;
+
+	/* this is a proper completion, we need to advance the tail pointer */
+	if (++qmap->tail == qmap->entries)
+		qmap->tail = 0;
+
+	/* if left_to_poll is decremented to 0, add the QP to the error list */
+	if (qmap->left_to_poll > 0) {
+		qmap->left_to_poll--;
+		if ((my_qp->sq_map.left_to_poll == 0) &&
+				(my_qp->rq_map.left_to_poll == 0)) {
+			ehca_add_to_err_list(my_qp, 1);
+			if (HAS_RQ(my_qp))
+				ehca_add_to_err_list(my_qp, 0);
+		}
+	}
 
 	/* eval ib_wc_opcode */
 	wc->opcode = ib_wc_opcode[cqe->optype]-1;
@@ -733,13 +790,88 @@ poll_cq_one_exit0:
 	return ret;
 }
 
+static int generate_flush_cqes(struct ehca_qp *my_qp, struct ib_cq *cq,
+			       struct ib_wc *wc, int num_entries,
+			       struct ipz_queue *ipz_queue, int on_sq)
+{
+	int nr = 0;
+	struct ehca_wqe *wqe;
+	u64 offset;
+	struct ehca_queue_map *qmap;
+	struct ehca_qmap_entry *qmap_entry;
+
+	if (on_sq)
+		qmap = &my_qp->sq_map;
+	else
+		qmap = &my_qp->rq_map;
+
+	qmap_entry = &qmap->map[qmap->tail];
+
+	while ((nr < num_entries) && (qmap_entry->reported == 0)) {
+		/* generate flush CQE */
+		memset(wc, 0, sizeof(*wc));
+
+		offset = qmap->tail * ipz_queue->qe_size;
+		wqe = (struct ehca_wqe *)ipz_qeit_calc(ipz_queue, offset);
+		if (!wqe) {
+			ehca_err(cq->device, "Invalid wqe offset=%#lx on "
+				 "qp_num=%#x", offset, my_qp->real_qp_num);
+			return nr;
+		}
+
+		wc->wr_id = replace_wr_id(wqe->work_request_id,
+					  qmap_entry->app_wr_id);
+
+		if (on_sq) {
+			switch (wqe->optype) {
+			case WQE_OPTYPE_SEND:
+				wc->opcode = IB_WC_SEND;
+				break;
+			case WQE_OPTYPE_RDMAWRITE:
+				wc->opcode = IB_WC_RDMA_WRITE;
+				break;
+			case WQE_OPTYPE_RDMAREAD:
+				wc->opcode = IB_WC_RDMA_READ;
+				break;
+			default:
+				ehca_err(cq->device, "Invalid optype=%x",
+						wqe->optype);
+				return nr;
+			}
+		} else
+			wc->opcode = IB_WC_RECV;
+
+		if (wqe->wr_flag & WQE_WRFLAG_IMM_DATA_PRESENT) {
+			wc->ex.imm_data = wqe->immediate_data;
+			wc->wc_flags |= IB_WC_WITH_IMM;
+		}
+
+		wc->status = IB_WC_WR_FLUSH_ERR;
+
+		wc->qp = &my_qp->ib_qp;
+
+		/* mark as reported and advance tail pointer */
+		qmap_entry->reported = 1;
+		if (++qmap->tail == qmap->entries)
+			qmap->tail = 0;
+		qmap_entry = &qmap->map[qmap->tail];
+
+		wc++; nr++;
+	}
+
+	return nr;
+
+}
+
 int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc)
 {
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
 	int nr;
+	struct ehca_qp *err_qp;
 	struct ib_wc *current_wc = wc;
 	int ret = 0;
 	unsigned long flags;
+	int entries_left = num_entries;
 
 	if (num_entries < 1) {
 		ehca_err(cq->device, "Invalid num_entries=%d ehca_cq=%p "
@@ -749,15 +881,40 @@ int ehca_poll_cq(struct ib_cq *cq, int n
 	}
 
 	spin_lock_irqsave(&my_cq->spinlock, flags);
-	for (nr = 0; nr < num_entries; nr++) {
+
+	/* generate flush cqes for send queues */
+	list_for_each_entry(err_qp, &my_cq->sqp_err_list, sq_err_node) {
+		nr = generate_flush_cqes(err_qp, cq, current_wc, entries_left,
+				&err_qp->ipz_squeue, 1);
+		entries_left -= nr;
+		current_wc += nr;
+
+		if (entries_left == 0)
+			break;
+	}
+
+	/* generate flush cqes for receive queues */
+	list_for_each_entry(err_qp, &my_cq->rqp_err_list, rq_err_node) {
+		nr = generate_flush_cqes(err_qp, cq, current_wc, entries_left,
+				&err_qp->ipz_rqueue, 0);
+		entries_left -= nr;
+		current_wc += nr;
+
+		if (entries_left == 0)
+			break;
+	}
+
+	for (nr = 0; nr < entries_left; nr++) {
 		ret = ehca_poll_cq_one(cq, current_wc);
 		if (ret)
 			break;
 		current_wc++;
 	} /* eof for nr */
+	entries_left -= nr;
+
 	spin_unlock_irqrestore(&my_cq->spinlock, flags);
 	if (ret == -EAGAIN || !ret)
-		ret = nr;
+		ret = num_entries - entries_left;
 
 poll_cq_exit0:
 	return ret;
--- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_cq.c
+++ infiniband.git/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -276,6 +276,9 @@ struct ib_cq *ehca_create_cq(struct ib_d
 	for (i = 0; i < QP_HASHTAB_LEN; i++)
 		INIT_HLIST_HEAD(&my_cq->qp_hashtab[i]);
 
+	INIT_LIST_HEAD(&my_cq->sqp_err_list);
+	INIT_LIST_HEAD(&my_cq->rqp_err_list);
+
 	if (context) {
 		struct ipz_queue *ipz_queue = &my_cq->ipz_queue;
 		struct ehca_create_cq_resp resp;
--- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_qp.c
+++ infiniband.git/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -396,6 +396,50 @@ static void ehca_determine_small_queue(s
 	queue->is_small = (queue->page_size != 0);
 }
 
+/* needs to be called with cq->spinlock held */
+void ehca_add_to_err_list(struct ehca_qp *qp, int on_sq)
+{
+	struct list_head *list, *node;
+
+	/* TODO: support low latency QPs */
+	if (qp->ext_type == EQPT_LLQP)
+		return;
+
+	if (on_sq) {
+		list = &qp->send_cq->sqp_err_list;
+		node = &qp->sq_err_node;
+	} else {
+		list = &qp->recv_cq->rqp_err_list;
+		node = &qp->rq_err_node;
+	}
+
+	if (list_empty(node))
+		list_add_tail(node, list);
+
+	return;
+}
+
+static void del_from_err_list(struct ehca_cq *cq, struct list_head *node)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cq->spinlock, flags);
+
+	if (!list_empty(node))
+		list_del_init(node);
+
+	spin_unlock_irqrestore(&cq->spinlock, flags);
+}
+
+static void reset_queue_map(struct ehca_queue_map *qmap)
+{
+	int i;
+
+	qmap->tail = 0;
+	for (i = 0; i < qmap->entries; i++)
+		qmap->map[i].reported = 1;
+}
+
 /*
  * Create an ib_qp struct that is either a QP or an SRQ, depending on
  * the value of the is_srq parameter. If init_attr and srq_init_attr share
@@ -407,12 +451,11 @@ static struct ehca_qp *internal_create_q
 					  struct ib_srq_init_attr *srq_init_attr,
 					  struct ib_udata *udata, int is_srq)
 {
-	struct ehca_qp *my_qp;
+	struct ehca_qp *my_qp, *my_srq = NULL;
 	struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd);
 	struct ehca_shca *shca = container_of(pd->device, struct ehca_shca,
 					      ib_device);
 	struct ib_ucontext *context = NULL;
-	u32 nr_qes;
 	u64 h_ret;
 	int is_llqp = 0, has_srq = 0;
 	int qp_type, max_send_sge, max_recv_sge, ret;
@@ -457,8 +500,7 @@ static struct ehca_qp *internal_create_q
 
 	/* handle SRQ base QPs */
 	if (init_attr->srq) {
-		struct ehca_qp *my_srq =
-			container_of(init_attr->srq, struct ehca_qp, ib_srq);
+		my_srq = container_of(init_attr->srq, struct ehca_qp, ib_srq);
 
 		has_srq = 1;
 		parms.ext_type = EQPT_SRQBASE;
@@ -716,15 +758,19 @@ static struct ehca_qp *internal_create_q
 				 "and pages ret=%i", ret);
 			goto create_qp_exit2;
 		}
-		nr_qes = my_qp->ipz_squeue.queue_length /
+
+		my_qp->sq_map.entries = my_qp->ipz_squeue.queue_length /
 			 my_qp->ipz_squeue.qe_size;
-		my_qp->sq_map = vmalloc(nr_qes *
+		my_qp->sq_map.map = vmalloc(my_qp->sq_map.entries *
 					sizeof(struct ehca_qmap_entry));
-		if (!my_qp->sq_map) {
+		if (!my_qp->sq_map.map) {
 			ehca_err(pd->device, "Couldn't allocate squeue "
 				 "map ret=%i", ret);
 			goto create_qp_exit3;
 		}
+		INIT_LIST_HEAD(&my_qp->sq_err_node);
+		/* to avoid the generation of bogus flush CQEs */
+		reset_queue_map(&my_qp->sq_map);
 	}
 
 	if (HAS_RQ(my_qp)) {
@@ -736,6 +782,25 @@ static struct ehca_qp *internal_create_q
 				 "and pages ret=%i", ret);
 			goto create_qp_exit4;
 		}
+
+		my_qp->rq_map.entries = my_qp->ipz_rqueue.queue_length /
+			my_qp->ipz_rqueue.qe_size;
+		my_qp->rq_map.map = vmalloc(my_qp->rq_map.entries *
+				sizeof(struct ehca_qmap_entry));
+		if (!my_qp->rq_map.map) {
+			ehca_err(pd->device, "Couldn't allocate squeue "
+				 "map ret=%i", ret);
+			goto create_qp_exit5;
+		}
+		INIT_LIST_HEAD(&my_qp->rq_err_node);
+		/* to avoid the generation of bogus flush CQEs */
+		reset_queue_map(&my_qp->rq_map);
+	} else if (init_attr->srq) {
+		/* this is a base QP, use the queue map of the SRQ */
+		my_qp->rq_map = my_srq->rq_map;
+		INIT_LIST_HEAD(&my_qp->rq_err_node);
+
+		my_qp->ipz_rqueue = my_srq->ipz_rqueue;
 	}
 
 	if (is_srq) {
@@ -799,7 +864,7 @@ static struct ehca_qp *internal_create_q
 		if (ret) {
 			ehca_err(pd->device,
 				 "Couldn't assign qp to send_cq ret=%i", ret);
-			goto create_qp_exit6;
+			goto create_qp_exit7;
 		}
 	}
 
@@ -825,25 +890,29 @@ static struct ehca_qp *internal_create_q
 		if (ib_copy_to_udata(udata, &resp, sizeof resp)) {
 			ehca_err(pd->device, "Copy to udata failed");
 			ret = -EINVAL;
-			goto create_qp_exit7;
+			goto create_qp_exit8;
 		}
 	}
 
 	return my_qp;
 
-create_qp_exit7:
+create_qp_exit8:
 	ehca_cq_unassign_qp(my_qp->send_cq, my_qp->real_qp_num);
 
-create_qp_exit6:
+create_qp_exit7:
 	kfree(my_qp->mod_qp_parm);
 
+create_qp_exit6:
+	if (HAS_RQ(my_qp))
+		vfree(my_qp->rq_map.map);
+
 create_qp_exit5:
 	if (HAS_RQ(my_qp))
 		ipz_queue_dtor(my_pd, &my_qp->ipz_rqueue);
 
 create_qp_exit4:
 	if (HAS_SQ(my_qp))
-		vfree(my_qp->sq_map);
+		vfree(my_qp->sq_map.map);
 
 create_qp_exit3:
 	if (HAS_SQ(my_qp))
@@ -1035,6 +1104,101 @@ static int prepare_sqe_rts(struct ehca_q
 	return 0;
 }
 
+static int calc_left_cqes(u64 wqe_p, struct ipz_queue *ipz_queue,
+			  struct ehca_queue_map *qmap)
+{
+	void *wqe_v;
+	u64 q_ofs;
+	u32 wqe_idx;
+
+	/* convert real to abs address */
+	wqe_p = wqe_p & (~(1UL << 63));
+
+	wqe_v = abs_to_virt(wqe_p);
+
+	if (ipz_queue_abs_to_offset(ipz_queue, wqe_p, &q_ofs)) {
+		ehca_gen_err("Invalid offset for calculating left cqes "
+				"wqe_p=%#lx wqe_v=%p\n", wqe_p, wqe_v);
+		return -EFAULT;
+	}
+
+	wqe_idx = q_ofs / ipz_queue->qe_size;
+	if (wqe_idx < qmap->tail)
+		qmap->left_to_poll = (qmap->entries - qmap->tail) + wqe_idx;
+	else
+		qmap->left_to_poll = wqe_idx - qmap->tail;
+
+	return 0;
+}
+
+static int check_for_left_cqes(struct ehca_qp *my_qp, struct ehca_shca *shca)
+{
+	u64 h_ret;
+	void *send_wqe_p, *recv_wqe_p;
+	int ret;
+	unsigned long flags;
+	int qp_num = my_qp->ib_qp.qp_num;
+
+	/* this hcall is not supported on base QPs */
+	if (my_qp->ext_type != EQPT_SRQBASE) {
+		/* get send and receive wqe pointer */
+		h_ret = hipz_h_disable_and_get_wqe(shca->ipz_hca_handle,
+				my_qp->ipz_qp_handle, &my_qp->pf,
+				&send_wqe_p, &recv_wqe_p, 4);
+		if (h_ret != H_SUCCESS) {
+			ehca_err(&shca->ib_device, "disable_and_get_wqe() "
+				 "failed ehca_qp=%p qp_num=%x h_ret=%li",
+				 my_qp, qp_num, h_ret);
+			return ehca2ib_return_code(h_ret);
+		}
+
+		/*
+		 * acquire lock to ensure that nobody is polling the cq which
+		 * could mean that the qmap->tail pointer is in an
+		 * inconsistent state.
+		 */
+		spin_lock_irqsave(&my_qp->send_cq->spinlock, flags);
+		ret = calc_left_cqes((u64)send_wqe_p, &my_qp->ipz_squeue,
+				&my_qp->sq_map);
+		spin_unlock_irqrestore(&my_qp->send_cq->spinlock, flags);
+		if (ret)
+			return ret;
+
+
+		spin_lock_irqsave(&my_qp->recv_cq->spinlock, flags);
+		ret = calc_left_cqes((u64)recv_wqe_p, &my_qp->ipz_rqueue,
+				&my_qp->rq_map);
+		spin_unlock_irqrestore(&my_qp->recv_cq->spinlock, flags);
+		if (ret)
+			return ret;
+	} else {
+		spin_lock_irqsave(&my_qp->send_cq->spinlock, flags);
+		my_qp->sq_map.left_to_poll = 0;
+		spin_unlock_irqrestore(&my_qp->send_cq->spinlock, flags);
+
+		spin_lock_irqsave(&my_qp->recv_cq->spinlock, flags);
+		my_qp->rq_map.left_to_poll = 0;
+		spin_unlock_irqrestore(&my_qp->recv_cq->spinlock, flags);
+	}
+
+	/* this assures flush cqes being generated only for pending wqes */
+	if ((my_qp->sq_map.left_to_poll == 0) &&
+				(my_qp->rq_map.left_to_poll == 0)) {
+		spin_lock_irqsave(&my_qp->send_cq->spinlock, flags);
+		ehca_add_to_err_list(my_qp, 1);
+		spin_unlock_irqrestore(&my_qp->send_cq->spinlock, flags);
+
+		if (HAS_RQ(my_qp)) {
+			spin_lock_irqsave(&my_qp->recv_cq->spinlock, flags);
+			ehca_add_to_err_list(my_qp, 0);
+			spin_unlock_irqrestore(&my_qp->recv_cq->spinlock,
+					flags);
+		}
+	}
+
+	return 0;
+}
+
 /*
  * internal_modify_qp with circumvention to handle aqp0 properly
  * smi_reset2init indicates if this is an internal reset-to-init-call for
@@ -1539,10 +1703,27 @@ static int internal_modify_qp(struct ib_
 			goto modify_qp_exit2;
 		}
 	}
+	if ((qp_new_state == IB_QPS_ERR) && (qp_cur_state != IB_QPS_ERR)) {
+		ret = check_for_left_cqes(my_qp, shca);
+		if (ret)
+			goto modify_qp_exit2;
+	}
 
 	if (statetrans == IB_QPST_ANY2RESET) {
 		ipz_qeit_reset(&my_qp->ipz_rqueue);
 		ipz_qeit_reset(&my_qp->ipz_squeue);
+
+		if (qp_cur_state == IB_QPS_ERR) {
+			del_from_err_list(my_qp->send_cq, &my_qp->sq_err_node);
+
+			if (HAS_RQ(my_qp))
+				del_from_err_list(my_qp->recv_cq,
+						  &my_qp->rq_err_node);
+		}
+		reset_queue_map(&my_qp->sq_map);
+
+		if (HAS_RQ(my_qp))
+			reset_queue_map(&my_qp->rq_map);
 	}
 
 	if (attr_mask & IB_QP_QKEY)
@@ -1958,6 +2139,16 @@ static int internal_destroy_qp(struct ib
 	idr_remove(&ehca_qp_idr, my_qp->token);
 	write_unlock_irqrestore(&ehca_qp_idr_lock, flags);
 
+	/*
+	 * SRQs will never get into an error list and do not have a recv_cq,
+	 * so we need to skip them here.
+	 */
+	if (HAS_RQ(my_qp) && !IS_SRQ(my_qp))
+		del_from_err_list(my_qp->recv_cq, &my_qp->rq_err_node);
+
+	if (HAS_SQ(my_qp))
+		del_from_err_list(my_qp->send_cq, &my_qp->sq_err_node);
+
 	/* now wait until all pending events have completed */
 	wait_event(my_qp->wait_completion, !atomic_read(&my_qp->nr_events));
 
@@ -1983,7 +2174,7 @@ static int internal_destroy_qp(struct ib
 	if (qp_type == IB_QPT_GSI) {
 		struct ib_event event;
 		ehca_info(dev, "device %s: port %x is inactive.",
-			  shca->ib_device.name, port_num);
+				shca->ib_device.name, port_num);
 		event.device = &shca->ib_device;
 		event.event = IB_EVENT_PORT_ERR;
 		event.element.port_num = port_num;
@@ -1991,11 +2182,15 @@ static int internal_destroy_qp(struct ib
 		ib_dispatch_event(&event);
 	}
 
-	if (HAS_RQ(my_qp))
+	if (HAS_RQ(my_qp)) {
 		ipz_queue_dtor(my_pd, &my_qp->ipz_rqueue);
+
+		vfree(my_qp->rq_map.map);
+	}
 	if (HAS_SQ(my_qp)) {
 		ipz_queue_dtor(my_pd, &my_qp->ipz_squeue);
-		vfree(my_qp->sq_map);
+
+		vfree(my_qp->sq_map.map);
 	}
 	kmem_cache_free(qp_cache, my_qp);
 	atomic_dec(&shca->num_qps);
--- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ infiniband.git/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -197,6 +197,8 @@ void ehca_poll_eqs(unsigned long data);
 int ehca_calc_ipd(struct ehca_shca *shca, int port,
 		  enum ib_rate path_rate, u32 *ipd);
 
+void ehca_add_to_err_list(struct ehca_qp *qp, int on_sq);
+
 #ifdef CONFIG_PPC_64K_PAGES
 void *ehca_alloc_fw_ctrlblock(gfp_t flags);
 void ehca_free_fw_ctrlblock(void *ptr);

From sashak at voltaire.com Wed Sep 10 09:14:41 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 10 Sep 2008 19:14:41 +0300
Subject: [ofa-general] Re: [OpenSM][Trivial] Fix comment typo
In-Reply-To: <1220979703.27074.56.camel@cardanus.llnl.gov>
References: <1220979703.27074.56.camel@cardanus.llnl.gov>
Message-ID: <20080910161441.GC11923@sashak.voltaire.com>

On 10:01 Tue 09 Sep , Al Chu wrote:
>
> Noticed it while looking at some other code in the header file.
>
> Signed-off-by: Albert Chu

Applied. Thanks.

Sasha

From sashak at voltaire.com Wed Sep 10 09:27:38 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 10 Sep 2008 19:27:38 +0300
Subject: [ofa-general] Re: [OpenSM][Trivial] remove old comments
In-Reply-To: <1220993204.27074.61.camel@cardanus.llnl.gov>
References: <1220993204.27074.61.camel@cardanus.llnl.gov>
Message-ID: <20080910162738.GD11923@sashak.voltaire.com>

On 13:46 Tue 09 Sep , Al Chu wrote:
> Hey Sasha,
>
> I assume some legacy comment that is no longer relevant (variable does
> not exist in the source).
>
> Al
>
> Signed-off-by: Albert Chu

Applied. Thanks.

Sasha

From sashak at voltaire.com Wed Sep 10 10:02:00 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 10 Sep 2008 20:02:00 +0300
Subject: [ofa-general] Re: [PATCH][TRIVIAL]osm_(helper trap_rcv).c: Change output format of notice type to unsigned decimal
In-Reply-To: <48C7C958.6080100@obsidianresearch.com>
References: <48C7C958.6080100@obsidianresearch.com>
Message-ID: <20080910170200.GG11923@sashak.voltaire.com>

On 07:19 Wed 10 Sep , Hal Rosenstock wrote:
> Sasha,
>
> Attached is a trivial patch to modify the output format of notice type to
> unsigned decimal.
>
> -- Hal
>
> opensm/osm_(helper trap_rcv).c: Display type in unsigned decimal rather
> than hex for better clarity and to be consistent with format in osm_inform.c
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From sashak at voltaire.com Wed Sep 10 10:17:16 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 10 Sep 2008 20:17:16 +0300
Subject: [ofa-general] Re: [PATCH] ibnetdiscover.c: continue processing other ports even if smpquery fails on one port
In-Reply-To: <20080905154716.54d82f0e.weiny2@llnl.gov>
References: <20080905154716.54d82f0e.weiny2@llnl.gov>
Message-ID: <20080910171716.GH11923@sashak.voltaire.com>

On 15:47 Fri 05 Sep , Ira Weiny wrote:
>
> Signed-off-by: Ira Weiny

Applied. Thanks.

Sasha

From rdreier at cisco.com Wed Sep 10 11:26:16 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 10 Sep 2008 11:26:16 -0700
Subject: [ofa-general] [PATCH] ipoib: defer skb_orphan() until irqs enabled
In-Reply-To: <20080910135116.GB26881@mtls03> (Eli Cohen's message of "Wed, 10 Sep 2008 16:51:16 +0300")
References: <20080909145435.GO2316@sgi.com> <20080910135116.GB26881@mtls03>
Message-ID:

On Tue, Sep 09, 2008 at 02:32:44PM -0700, Roland Dreier wrote:
> By the way, looking at this stuff again, it seems we have (a possibly
> quite unlikely) race where a send can complete before the xmit method
> finishes, and we end up running skb_orphan on an skb that another
> context has already freed. I'll have to think about how we can fix
> that -- but any good ideas are appreciated...

Actually it looks like Arthur's patch introduces this race. The current
code is OK because skb_orphan is called under tx_lock, which is also
held when we poll the send CQ. But of course the status quo is no good
exactly because of the locking issue Arthur found.

> We can check if there are outstanding WRs after poll_tx is called. If
> there are no outstanding WRs, it means that the SKB has been freed. If
> there are outstanding WRs, it means that the last post has not been
> freed so we can call skb_orphan(). Like the following patch (on top of
> Arthur's):

I don't think this closes the race completely: at the point skb_orphan
is called (after Arthur's patch, by design), we have no locks held. And
so the timer-driven send completion handling could already have run and
freed the skb between when we drop tx_lock and when we call skb_orphan.

 - R.

From rdreier at cisco.com Wed Sep 10 11:29:36 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 10 Sep 2008 11:29:36 -0700
Subject: [ofa-general] Re: [PATCH] ipoib: send creation parameters when doing send-only join
In-Reply-To: <48C790CF.4050505@gmail.com> (Yossi Etigin's message of "Wed, 10 Sep 2008 12:18:07 +0300")
References: <48C6A9C1.5070108@gmail.com> <48C790CF.4050505@gmail.com>
Message-ID:

> But we don't, so it's required. Please see bug #1153 for case description.

Yes, I looked at the bug and I don't see the actual problem that is
caused by the current code. OK, the group doesn't get created if there
are only senders -- so what?

It seems a better fix would be just to get rid of the #if 0 and use
send-only membership after all these years?

 - R.

From rdreier at cisco.com Wed Sep 10 11:31:01 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 10 Sep 2008 11:31:01 -0700
Subject: Fwd: [ofa-general] [PATCH] ipoib: fix hang while bringing down uninitialized interface
In-Reply-To: <48C7DA7B.3050706@gmail.com> (Yossi Etigin's message of "Wed, 10 Sep 2008 17:32:27 +0300")
References: <48C7DA7B.3050706@gmail.com>
Message-ID:

> Subject: [ofa-general] ***SPAM*** [PATCH] ipoib: fix hang while bringing down uninitialized interface

Didn't see this the first time around, I guess because some mail server
flagged it as spam.

Looks like a real issue. Is this a regression from 2.6.26? (ie what
introduced this bug?)

From akepner at sgi.com Wed Sep 10 13:21:25 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 10 Sep 2008 13:21:25 -0700
Subject: [ofa-general] [PATCH] ipoib: defer skb_orphan() until irqs enabled
In-Reply-To:
References: <20080909145435.GO2316@sgi.com> <20080910135116.GB26881@mtls03>
Message-ID: <20080910202125.GD31435@sgi.com>

On Wed, Sep 10, 2008 at 11:26:16AM -0700, Roland Dreier wrote:
> ....
> I don't think this closes the race completely: at the point skb_orphan
> is called (after Arthur's patch, by design), we have no locks held. And
> so the timer-driven send completion handling could already have run and
> freed the skb between when we drop tx_lock and when we call skb_orphan.
>

Suppose we could just remove the skb_orphan() call from ipoib_send()
entirely, and wait for net_tx_action() to do it for us. But I imagine
there must be a (performance-related) reason why it's done the way it is.

--
Arthur

From christopher.tanner at gatech.edu Wed Sep 10 14:13:55 2008
From: christopher.tanner at gatech.edu (Christopher Tanner)
Date: Wed, 10 Sep 2008 17:13:55 -0400
Subject: [ofa-general] Compiled IB packages
In-Reply-To: <48C7D63A.8090005@mellanox.co.il>
References: <0709481C-38BC-4598-870F-44FE8AE44FCE@gatech.edu> <48C6904E.1020606@mellanox.co.il> <1221031735.6948.12.camel@vlad-laptop> <94325E85-9403-4264-A4FE-90A567A8655B@gatech.edu> <48C7D63A.8090005@mellanox.co.il>
Message-ID:

> If you compiling for the running kernel then configure will find
> kernel sources using /lib/modules/`uname -r`/build link.
> So, you don't have to pass '--kernel-sources' and '--kernel'.

Ah, I see now.

> Why you think that updates is wrong?
>
> modprobe works with /lib/modules/`uname -r`/updates directory in the
> following way:
> if kernel module with the same name is present under /lib/modules/
> `uname -r`/kernel and
> under /lib/modules/`uname -r`/updates then the module from updates
> will be loaded.

I only said this because, on my system, the /lib/modules/2.6.24-16-
server/updates directory doesn't exist; thus the make process was
having an error. However, the /lib/modules/2.6.24-16-server/kernel
does exist, but this directory wasn't searched by the make process (as
far as I can tell).

> Both options are good.
> Note, if you use option b) then you need to run "depmod" after
> copying kernel modules.

Ah, thanks for the heads up on depmod.

-------------------------------------------
Chris Tanner
Space Systems Design Lab
Georgia Institute of Technology
christopher.tanner at gatech.edu
-------------------------------------------

On Sep 10, 2008, at 10:14 AM, Vladimir Sokolovsky wrote:

> Christopher Tanner wrote:
>> Vladimir -
>> Good catch on the linux headers version - I fixed that now. The
>> problem persisted after fixing the headers... but I finally figured
>> out what the issues were. On the configure line:
>> a) the --kernel-sources option needs the path to the linux HEADERS
>> (linux-headers-), not the linux SOURCE (linux-source-).
>> Terminology there is confusing...
>
> If you compiling for the running kernel then configure will find
> kernel sources using /lib/modules/`uname -r`/build link.
> So, you don't have to pass '--kernel-sources' and '--kernel'.
>
>> b) If I didn't specify anything for the --modules-dir option, it
>> defaults to /lib/modules/2.6.24-16-server/updates. I don't know
>> what the 'updates' gets appended onto the end, but that is not
>> correct. So I had to specify --modules-dir=/lib/modules/2.6.24-16-
>> server
>
> Why you think that updates is wrong?
>
> modprobe works with /lib/modules/`uname -r`/updates directory in the
> following way:
> if kernel module with the same name is present under /lib/modules/
> `uname -r`/kernel and
> under /lib/modules/`uname -r`/updates then the module from updates
> will be loaded.
>
>> It compiled and installed just fine!
>> My final question - how do I install the kernel modules on the rest
>> of the nodes? The source was compiled in the /home directory, which
>> is shared to all nodes via NFS. However, the kernel headers are NOT
>> shared to the rest of the nodes. Do you recommend I:
>> a) Install the linux headers on all of the nodes and execute 'make
>> install' on all nodes
>> b) Look at where the modules installed to (from the make install
>> output) and copy the files manually
>
> Both options are good.
> Note, if you use option b) then you need to run "depmod" after
> copying kernel modules.
>
> Regards,
> Vladimir
>
>> Thanks!
>> -------------------------------------------
>> Chris Tanner
>> Space Systems Design Lab
>> Georgia Institute of Technology
>> christopher.tanner at gatech.edu
>> -------------------------------------------
>> On Sep 10, 2008, at 3:28 AM, Vladimir Sokolovsky wrote:
>>> Hi,
>>>> From the log file, I see the mismatch between the sources you are
>>> passing to configure command and autoconf.h/auto.conf below:
>>>
>>> /usr/src/linux-headers-2.6.24-19-generic/include/linux/autoconf.h
>>> /usr/src/linux-headers-2.6.24-19-generic/include/config/auto.conf
>>>
>>>> From the log file:
>>> Kernel version: 2.6.24-16-server
>>> Modules directory: //lib/modules/2.6.24-16-server/updates
>>> Kernel sources: /usr/src/linux-source-2.6.24
>>>
>>> Check that you have corresponding (matching the running kernel)
>>> linux-headers package installed and then you don't have to pass
>>> --kernel-sources and --kernel parameters to the configure script.
>>>
>>> E.g.
>>> for kernel 2.6.24-19-generic it is linux-headers-2.6.24-19-generic
>>>
>>> Regards,
>>> Vladimir
>>>
>>> On Tue, 2008-09-09 at 14:53 -0400, Christopher Tanner wrote:
>>>> Thanks Vladimir - very helpful. However, I'm running into a problem
>>>> with compiling the ofa package. First, I had to specify the source
>>>> location on the command line (Ubuntu puts it in a different place
>>>> than
>>>> RedHat or SUSE):
>>>>
>>>> $ ./configure --kernel-sources=/usr/src/linux-source-2.6.24 ...
>>>> (other
>>>> stuff)
>>>>
>>>> I'm getting this error:
>>>>
>>>> ERROR: Kernel configuration is invalid.
>>>> include/linux/autoconf.h or include/config/auto.conf are
>>>> missing.
>>>> Run 'make oldconfig && make prepare' on kernel src to fix
>>>> it.
>>>>
>>>> This is confusing b/c both of those files exist.
>>>> $ locate autoconf.h
>>>> /usr/src/linux-headers-2.6.24-19-generic/include/linux/autoconf.h
>>>>
>>>> $ locate auto.conf
>>>> /usr/src/linux-headers-2.6.24-19-generic/include/config/auto.conf
>>>>
>>>> There's a whole bunch more errors that I assume spawn because of
>>>> this
>>>> initial error. The output from 'make' is attached (it's pretty
>>>> long).
>>>> Let me know what you think. Thanks!
>>>>
>>>> -------------------------------------------
>>>> Chris Tanner
>>>> Space Systems Design Lab
>>>> Georgia Institute of Technology
>>>> christopher.tanner at gatech.edu
>>>> -------------------------------------------
>>>>
>>>>
>>>>
>>>> On Sep 9, 2008, at 11:03 AM, Vladimir Sokolovsky wrote:
>>>>
>>>>> Christopher Tanner wrote:
>>>>>> I am setting up a 16-node (homogeneous) cluster running Ubuntu
>>>>>> 8.04
>>>>>> server with Mellanox Infiniband cards. I downloaded (from the
>>>>>> OpenFabrics website), compiled, and installed the following IB
>>>>>> packages on the master node into the /usr/local/lib directory.
>>>>>> The /
>>>>>> usr/local directory is being shared to all of the nodes via NFS.
>>>>>> All packages seemed to compile and install fine.
>>>>>> libibverbs
>>>>>> librdmacm
>>>>>> libibcm
>>>>>> libipathverbs
>>>>>> dapl
>>>>>> compat-dapl
>>>>>> libmlx4
>>>>>> libmthca
>>>>>> libcxgb3
>>>>>> libibcommon
>>>>>> libibumad
>>>>>> libibmad
>>>>>> opensm
>>>>>> infiniband-diags
>>>>>> I have a few questions:
>>>>>> a) Do I need to run 'make install' on each node or just the
>>>>>> master
>>>>>> node? All of the libraries in /usr/local/lib are visible to all
>>>>>> nodes... Stated another way, does 'make install' put files
>>>>>> elsewhere beside the /usr/local/lib directory? Does it alter OS
>>>>>> configuration files to tell it to look for certain files in /usr/
>>>>>> local/lib?
>>>>>
>>>>> No, all the packages above will put their files under /usr/local
>>>>>
>>>>>> b) I know I need to load the IB kernel modules (mlx4_core,
>>>>>> mlx4_ib, rdma_ucm, ib_core, ib_mad, ib_mthca, ib_umad, ib_uverbs)
>>>>>> in order for the IB cards to work. Are these compiled and
>>>>>> installed
>>>>>> with the above packages? Where does the kernel know where to look
>>>>>> for modules? (Sorry, this question is very similar to the first
>>>>>> one).
>>>>>
>>>>> The packages above are user space libraries/binaries. To install
>>>>> kernel
>>>>> modules you should download the latest version of the
>>>>> ofa_1_4_kernel
>>>>> tgz file from:
>>>>>
>>>>> http://www.openfabrics.org/downloads/ofa_1_4_kernel/
>>>>> To install, run:
>>>>> ./configure --with-core-mod --with-user_mad-mod --with-
>>>>> user_access-
>>>>> mod --with-addr_trans-mod --with-mthca-mod --with-mthca_debug-
>>>>> mod --
>>>>> with-mlx4-mod --with-mlx4_en-mod --with-mlx4_debug-mod --with-
>>>>> cxgb3-
>>>>> mod --with-ehca-mod --with-ipoib-mod --with-ipoib_debug-mod (... ,
>>>>> see --help)
>>>>> make
>>>>> make install
>>>>>
>>>>>
>>>>>> c) The OFED software stack contains some stuff that isn't
>>>>>> available
>>>>>> for source download (e.g. ib-bonding, ibsim, libsdp). Are these
>>>>>> necessary for the IB network to operate correctly? Since I'm
>>>>>> running Ubuntu, obviously the src.rpm file won't work...
>>>>>
>>>>> All OFED tgz files that are available under:
>>>>> http://www.openfabrics.org/~vlad/ofed_1_4/SOURCES/
>>>>>
>>>>> ib-bonding source RPM can be downloaded from (you can open it to
>>>>> get
>>>>> tgz file using cpio, if you need):
>>>>> http://www.openfabrics.org/~monis/ofed_1_4/
>>>>>
>>>>> This packages are not necessary for the IB network to operate
>>>>> correctly, but
>>>>> it depends on what are you planning to do.
>>>>>
>>>>> Regards,
>>>>> Vladimir
>>>>>
>>>>>> Thanks to all for you help. Previous responses regarding issues
>>>>>> with OpenSM worked great.
>>>>>> -------------------------------------------
>>>>>> Chris Tanner
>>>>>> Space Systems Design Lab
>>>>>> Georgia Institute of Technology
>>>>>> christopher.tanner at gatech.edu
>>>>>> -------------------------------------------
>>>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From sweitzen at cisco.com Wed Sep 10 14:14:44 2008
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 10 Sep 2008 14:14:44 -0700
Subject: [ewg] RE: [ofa-general] OFED 1.4-RC1 is available
In-Reply-To: <20080909202521.GG3716@cse.ohio-state.edu>
References: <5D49E7A8952DC44FB38C38FA0D758EAD75979A@mtlexch01.mtl.com> <20080909202521.GG3716@cse.ohio-state.edu>
Message-ID:

I'm also not getting mpiexec built, at least on the first distro I tried
(RHEL4 x86_64):

# rpm -qlip mvapich2_gcc-1.2rc2-4.x86_64.rpm | fgrep mpiexec
/usr/mpi/gcc/mvapich2-1.2rc2/bin/mpiexec.mpd

Scott

> -----Original Message-----
> From: Jonathan Perkins [mailto:perkinjo at cse.ohio-state.edu]
> Sent: Tuesday, September 09, 2008 1:25 PM
> To: Scott Weitzenkamp (sweitzen)
> Cc: Tziporet Koren; ewg at lists.openfabrics.org;
> general at lists.openfabrics.org
> Subject: Re: [ewg] RE: [ofa-general] OFED 1.4-RC1 is available
>
> Thanks for the note. We are taking a look at this.
>
> On Tue, Sep 09, 2008 at 12:52:44PM -0700, Scott Weitzenkamp
> (sweitzen) wrote:
> > I am unable to build MVAPICH2 for multiple compilers:
> >
> > Building the MVAPICH2 RPM [OFA]...
> > Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir'
> > --define 'di
> > st %{nil}' --target x86_64 --define '_name mvapich2_gcc'
> --define 'impl
> > ofa' --d
> > efine 'rdma --with-rdma=gen2' --define 'ib_include
> > --with-ib-include=/usr/includ
> > e' --define 'ib_libpath --with-ib-libpath=/usr/lib64' --define
> > 'shared_libs 1' -
> > -define 'romio 1' --define 'comp_env CC=gcc CXX=g++ F77=gfortran
> > F90=gfortran' -
> > -define 'auto_req 0' --define 'mpi_selector /usr/bin/mpi-selector'
> > --define '_pr
> > efix /usr/mpi/gcc/mvapich2-1.2rc2'
> > /tmp/OFED-1.4-rc1/SRPMS/mvapich2-1.2rc2-4.src
> > .rpm
> > Install mvapich2_gcc RPM:
> > Running rpm -iv --nodeps
> > /tmp/OFED-1.4-rc1/RPMS/redhat-release-4AS-4.1/x86_64/mv
> > apich2_gcc-1.2rc2-4.x86_64.rpm
> > Build mvapich2_pgi RPM
> > Building the MVAPICH2 RPM [OFA]...
> > Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' > > --define 'di > > st %{nil}' --target x86_64 --define '_name mvapich2_pgi' > --define 'impl > > ofa' --d > > efine 'rdma --with-rdma=gen2' --define 'ib_include > > --with-ib-include=/usr/includ > > e' --define 'ib_libpath --with-ib-libpath=/usr/lib64' --define > > 'shared_libs 1' - > > -define 'romio 1' --define 'comp_env CC=pgcc CXX=pgCC F77=pgf77 > > F90=pgf90' --def > > ine 'auto_req 0' --define 'mpi_selector > /usr/bin/mpi-selector' --define > > '_prefix > > /usr/mpi/pgi/mvapich2-1.2rc2' > > /tmp/OFED-1.4-rc1/SRPMS/mvapich2-1.2rc2-4.src.rpm > > Install mvapich2_pgi RPM: > > Running rpm -iv --nodeps > > /tmp/OFED-1.4-rc1/RPMS/redhat-release-4AS-4.1/x86_64/mv > > apich2_pgi-1.2rc2-4.x86_64.rpm > > Failed to install mvapich2_pgi RPM > > See /tmp/OFED.12539.logs/mvapich2_pgi.rpminstall.log > > > > # more /tmp/OFED.12539.logs/mvapich2_pgi.rpminstall.log > > Preparing packages for installation... > > file /etc/mpe_graphics.conf from install of > > mvapich2_pgi-1.2rc2-4 confli > > cts with file from package mvapich2_gcc-1.2rc2-4 > > file /etc/mpe_log.conf from install of mvapich2_pgi-1.2rc2-4 > > conflicts w > > ith file from package mvapich2_gcc-1.2rc2-4 > > file /etc/mpe_mpianim.conf from install of > mvapich2_pgi-1.2rc2-4 > > conflic > > ts with file from package mvapich2_gcc-1.2rc2-4 > > file /etc/mpe_mpicheck.conf from install of > > mvapich2_pgi-1.2rc2-4 confli > > cts with file from package mvapich2_gcc-1.2rc2-4 > > file /etc/mpe_mpilog.conf from install of > mvapich2_pgi-1.2rc2-4 > > conflict > > s with file from package mvapich2_gcc-1.2rc2-4 > > file /etc/mpe_mpitrace.conf from install of > > mvapich2_pgi-1.2rc2-4 confli > > cts with file from package mvapich2_gcc-1.2rc2-4 > > file /etc/mpe_nolog.conf from install of > mvapich2_pgi-1.2rc2-4 > > conflicts > > with file from package mvapich2_gcc-1.2rc2-4 > > file /etc/mpicc.conf from install of mvapich2_pgi-1.2rc2-4 > > conflicts wit > > 
h file from package mvapich2_gcc-1.2rc2-4 > > file /etc/mpicxx.conf from install of mvapich2_pgi-1.2rc2-4 > > conflicts wi > > th file from package mvapich2_gcc-1.2rc2-4 > > file /etc/mpif77.conf from install of mvapich2_pgi-1.2rc2-4 > > conflicts wi > > th file from package mvapich2_gcc-1.2rc2-4 > > file /etc/mpif90.conf from install of mvapich2_pgi-1.2rc2-4 > > conflicts wi > > th file from package mvapich2_gcc-1.2rc2-4 > > > > Scott Weitzenkamp > > SQA and Release Manager > > Server Access Virtualization Business Unit > > Cisco Systems > > > > > > > > > > > -----Original Message----- > > > From: general-bounces at lists.openfabrics.org > > > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > > > Tziporet Koren > > > Sent: Tuesday, September 09, 2008 8:20 AM > > > To: ewg at lists.openfabrics.org > > > Cc: general at lists.openfabrics.org > > > Subject: [ofa-general] OFED 1.4-RC1 is available > > > > > > Hi, > > > OFED 1.4-RC1 release is available on > > > > http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-rc1.tgz > > > > > > To get BUILD_ID run ofed_info > > > > > > Please report any issues in bugzilla > https://bugs.openfabrics.org/ for > > > OFED 1.4 > > > > > > Tziporet & Vladimir > > > > > > ============================================================== > > > ========== > > > > > > Release information: > > > -------------------- > > > Linux Operating Systems: > > > - RedHat EL4 up4: 2.6.9-42.ELsmp * > > > - RedHat EL4 up5: 2.6.9-55.ELsmp > > > - RedHat EL4 up6: 2.6.9-67.ELsmp > > > - RedHat EL4 up7: 2.6.9-78.ELsmp > > > - RedHat EL5: 2.6.18-8.el5 > > > - RedHat EL5 up1: 2.6.18-53.el5 > > > - RedHat EL5 up2: 2.6.18-92.el5 > > > - CentOS 5.2: 2.6.18-92.el5 > > > - Fedora C9: 2.6.25-14.fc9 * > > > - SLES10: 2.6.16.21-0.8-smp > > > - SLES10 SP1: 2.6.16.46-0.12-smp > > > - SLES10 SP1 up1: 2.6.16.53-0.16-smp > > > - SLES10 SP2: 2.6.16.60-0.21-smp > > > - OpenSuSE 10.3: 2.6.22.5-31 * > > > - kernel.org: 2.6.26 and 2.6.27-rc5 > > > > > > * 
Minimal QA for these versions > > > > > > Systems: > > > * x86_64 > > > * x86 > > > * ia64 > > > * ppc64 > > > > > > > > > Main Changes from OFED 1.4-beta > > > =============================== > > > o Kernel code based on 2.6.27-rc5 > > > o Added NFS-RDMA support for SLES10 SP2 and kernel 2.6.26 and 27 > > > o iSER backports added and its now available > > > o New MPI packages: Open MPI 1.2.7, MVAPICH 1.1 and MVAPICH2 1.1 > > > o New DAPL libraries > > > o 37 bugs fixed (see attached for details) > > > > > > > > > Tasks that should be completed for the RC2: > > > =========================================== > > > 1. NFS-RDMA to work on RHEL 5.1 > > > 2. OSM: Cashed routing > > > 3. Cleanup compilation warning > > > 4. Bug fixes > > > > > _______________________________________________ > > ewg mailing list > > ewg at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > -- > Jonathan Perkins > http://www.cse.ohio-state.edu/~perkinjo > From panda at cse.ohio-state.edu Wed Sep 10 15:14:37 2008 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Wed, 10 Sep 2008 18:14:37 -0400 (EDT) Subject: [ewg] RE: [ofa-general] OFED 1.4-RC1 is available In-Reply-To: Message-ID: Hi Scott, Thanks for your note. Starting with MVAPICH2 1.2, a new scalable mpirun_rsh job start-up framework (similar to the one used in MVAPICH) has been introduced. This allows MVAPICH2 to start on multi-thousand core clusters with very little time (like MVAPICH). It also allows job start-up scheme to be uniform across MVAPICH and MVAPICH2. The traditional MPD/mpiexec job start-up option is still there. In the latest MVAPICH2 1.2 SRPM (1.2rc2-4), the default has been set for the new scalable start-up scheme. That's why you are not able to have it built with mpiexec. Since Jonathan is updating the SRPM to take care of the multiple compilers errors (you reported yesterday), we will also include an option to have either of these two job start-up schemes (A. 
the new scalable mpirun_rsh framework or B. the traditional MPD-based framework) installed. The new SRPM to be uploaded by tomorrow will have all these fixes. Let us know if this will work out for you. Thanks, DK On Wed, 10 Sep 2008, Scott Weitzenkamp (sweitzen) wrote: > I'm also not getting mpiexec built, at least on the first distro I tried > (RHEL4 x86_64): > > # rpm -qlip mvapich2_gcc-1.2rc2-4.x86_64.rpm | fgrep mpiexec > /usr/mpi/gcc/mvapich2-1.2rc2/bin/mpiexec.mpd > > Scott
From sweitzen at cisco.com Wed Sep 10 15:15:53 2008 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 10 Sep 2008 15:15:53 -0700 Subject: [ewg] RE: [ofa-general] OFED 1.4-RC1 is available In-Reply-To: References:

Message-ID: So I can have mpirun_rsh *or* mpiexec, but not both? Scott
From panda at cse.ohio-state.edu Wed Sep 10 15:24:46 2008 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Wed, 10 Sep 2008 18:24:46 -0400 (EDT) Subject: [ewg] RE: [ofa-general] OFED 1.4-RC1 is available In-Reply-To: Message-ID: > So I can have mpirun_rsh *or* mpiexec, but not both? Our goal is to provide both. However, it will take us some time to make the necessary changes to the SRPM creation process and test it. For tomorrow's SRPM version, we will provide the option for having one of these two. By next week, we will have an SRPM update to have both. Will that work out? Thanks, DK
From sweitzen at cisco.com Wed Sep 10 15:27:17 2008 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 10 Sep 2008 15:27:17 -0700 Subject: [ewg] RE: [ofa-general] OFED 1.4-RC1 is available In-Reply-To: References:

Message-ID: Sure, that's fine. I don't really have a strong opinion, just wanted to know what you were up to. Scott > -----Original Message----- > From: Dhabaleswar Panda [mailto:panda at cse.ohio-state.edu] > Sent: Wednesday, September 10, 2008 3:25 PM > To: Scott Weitzenkamp (sweitzen) > Cc: Jonathan Perkins; ewg at lists.openfabrics.org; > general at lists.openfabrics.org; Dhabaleswar Panda > Subject: RE: [ewg] RE: [ofa-general] OFED 1.4-RC1 is available > > > So I can have mpirun_rsh *or* mpiexec, but not both? > > Our goal is to provide both. However, it will take us some > time to make > the necessary changes to the SRPM creation process and test it. For > tomorrow's SRPM version, we will provide the option for having one of > these two. By next week, we will have an SRPM update to have both. > > Will that work out? > > Thanks, > > DK > > > Scott > > > > > > > > > -----Original Message----- > > > From: Dhabaleswar Panda [mailto:panda at cse.ohio-state.edu] > > > Sent: Wednesday, September 10, 2008 3:15 PM > > > To: Scott Weitzenkamp (sweitzen) > > > Cc: Jonathan Perkins; ewg at lists.openfabrics.org; > > > general at lists.openfabrics.org; Dhabaleswar Panda > > > Subject: RE: [ewg] RE: [ofa-general] OFED 1.4-RC1 is available > > > > > > Hi Scott, > > > > > > Thanks for your note. Starting with MVAPICH2 1.2, a new scalable > > > mpirun_rsh job start-up framework (similar to the one used in > > > MVAPICH) has > > > been introduced. This allows MVAPICH2 to start on > multi-thousand core > > > clusters with very little time (like MVAPICH). It also allows > > > job start-up > > > scheme to be uniform across MVAPICH and MVAPICH2. The traditional > > > MPD/mpiexec job start-up option is still there. In the latest > > > MVAPICH2 1.2 > > > SRPM (1.2rc2-4), the default has been set for the new > > > scalable start-up > > > scheme. That's why you are not able to have it built with > > > mpiexec. 
Since > > > Jonathan is updating the SRPM to take care of the > multiple compilers > > > errors (you reported yesterday), we will also include an > > > option to have > > > either of these two job start-up schemes (A. the new scalable > > > mpirun_rsh > > > framework or B. the traditional MPD-based framework) > > > installed. The new > > > SRPM to be uploaded by tomorrow will have all these fixes. > > > > > > Let us know if this will work out for you. > > > > > > Thanks, > > > > > > DK > > > > > > On Wed, 10 Sep 2008, Scott Weitzenkamp (sweitzen) wrote: > > > > > > > I'm also not getting mpiexec built, at least on the first > > > distro I tried > > > > (RHEL4 x86_64): > > > > > > > > # rpm -qlip mvapich2_gcc-1.2rc2-4.x86_64.rpm | fgrep mpiexec > > > > /usr/mpi/gcc/mvapich2-1.2rc2/bin/mpiexec.mpd > > > > > > > > Scott > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: Jonathan Perkins [mailto:perkinjo at cse.ohio-state.edu] > > > > > Sent: Tuesday, September 09, 2008 1:25 PM > > > > > To: Scott Weitzenkamp (sweitzen) > > > > > Cc: Tziporet Koren; ewg at lists.openfabrics.org; > > > > > general at lists.openfabrics.org > > > > > Subject: Re: [ewg] RE: [ofa-general] OFED 1.4-RC1 is available > > > > > > > > > > Thanks for the note. We are taking a look at this. > > > > > > > > > > On Tue, Sep 09, 2008 at 12:52:44PM -0700, Scott Weitzenkamp > > > > > (sweitzen) wrote: > > > > > > I am unable to build MVAPICH2 for multiple compilers: > > > > > > > > > > > > Building the MVAPICH2 RPM [OFA]... 
> > > > > > Running rpmbuild --rebuild --define '_topdir > > > /var/tmp/OFED_topdir' > > > > > > --define 'di > > > > > > st %{nil}' --target x86_64 --define '_name mvapich2_gcc' > > > > > --define 'impl > > > > > > ofa' --d > > > > > > efine 'rdma --with-rdma=gen2' --define 'ib_include > > > > > > --with-ib-include=/usr/includ > > > > > > e' --define 'ib_libpath > --with-ib-libpath=/usr/lib64' --define > > > > > > 'shared_libs 1' - > > > > > > -define 'romio 1' --define 'comp_env CC=gcc CXX=g++ > F77=gfortran > > > > > > F90=gfortran' - > > > > > > -define 'auto_req 0' --define 'mpi_selector > > > /usr/bin/mpi-selector' > > > > > > --define '_pr > > > > > > efix /usr/mpi/gcc/mvapich2-1.2rc2' > > > > > > /tmp/OFED-1.4-rc1/SRPMS/mvapich2-1.2rc2-4.src > > > > > > .rpm > > > > > > Install mvapich2_gcc RPM: > > > > > > Running rpm -iv --nodeps > > > > > > /tmp/OFED-1.4-rc1/RPMS/redhat-release-4AS-4.1/x86_64/mv > > > > > > apich2_gcc-1.2rc2-4.x86_64.rpm > > > > > > Build mvapich2_pgi RPM > > > > > > Building the MVAPICH2 RPM [OFA]... 
> > > > > > Running rpmbuild --rebuild --define '_topdir > > > /var/tmp/OFED_topdir' > > > > > > --define 'di > > > > > > st %{nil}' --target x86_64 --define '_name mvapich2_pgi' > > > > > --define 'impl > > > > > > ofa' --d > > > > > > efine 'rdma --with-rdma=gen2' --define 'ib_include > > > > > > --with-ib-include=/usr/includ > > > > > > e' --define 'ib_libpath > --with-ib-libpath=/usr/lib64' --define > > > > > > 'shared_libs 1' - > > > > > > -define 'romio 1' --define 'comp_env CC=pgcc > CXX=pgCC F77=pgf77 > > > > > > F90=pgf90' --def > > > > > > ine 'auto_req 0' --define 'mpi_selector > > > > > /usr/bin/mpi-selector' --define > > > > > > '_prefix > > > > > > /usr/mpi/pgi/mvapich2-1.2rc2' > > > > > > /tmp/OFED-1.4-rc1/SRPMS/mvapich2-1.2rc2-4.src.rpm > > > > > > Install mvapich2_pgi RPM: > > > > > > Running rpm -iv --nodeps > > > > > > /tmp/OFED-1.4-rc1/RPMS/redhat-release-4AS-4.1/x86_64/mv > > > > > > apich2_pgi-1.2rc2-4.x86_64.rpm > > > > > > Failed to install mvapich2_pgi RPM > > > > > > See /tmp/OFED.12539.logs/mvapich2_pgi.rpminstall.log > > > > > > > > > > > > # more /tmp/OFED.12539.logs/mvapich2_pgi.rpminstall.log > > > > > > Preparing packages for installation... 
> > > > > > file /etc/mpe_graphics.conf from install of > > > > > > mvapich2_pgi-1.2rc2-4 confli > > > > > > cts with file from package mvapich2_gcc-1.2rc2-4 > > > > > > file /etc/mpe_log.conf from install of > > > mvapich2_pgi-1.2rc2-4 > > > > > > conflicts w > > > > > > ith file from package mvapich2_gcc-1.2rc2-4 > > > > > > file /etc/mpe_mpianim.conf from install of > > > > > mvapich2_pgi-1.2rc2-4 > > > > > > conflic > > > > > > ts with file from package mvapich2_gcc-1.2rc2-4 > > > > > > file /etc/mpe_mpicheck.conf from install of > > > > > > mvapich2_pgi-1.2rc2-4 confli > > > > > > cts with file from package mvapich2_gcc-1.2rc2-4 > > > > > > file /etc/mpe_mpilog.conf from install of > > > > > mvapich2_pgi-1.2rc2-4 > > > > > > conflict > > > > > > s with file from package mvapich2_gcc-1.2rc2-4 > > > > > > file /etc/mpe_mpitrace.conf from install of > > > > > > mvapich2_pgi-1.2rc2-4 confli > > > > > > cts with file from package mvapich2_gcc-1.2rc2-4 > > > > > > file /etc/mpe_nolog.conf from install of > > > > > mvapich2_pgi-1.2rc2-4 > > > > > > conflicts > > > > > > with file from package mvapich2_gcc-1.2rc2-4 > > > > > > file /etc/mpicc.conf from install of > > > mvapich2_pgi-1.2rc2-4 > > > > > > conflicts wit > > > > > > h file from package mvapich2_gcc-1.2rc2-4 > > > > > > file /etc/mpicxx.conf from install of > > > mvapich2_pgi-1.2rc2-4 > > > > > > conflicts wi > > > > > > th file from package mvapich2_gcc-1.2rc2-4 > > > > > > file /etc/mpif77.conf from install of > > > mvapich2_pgi-1.2rc2-4 > > > > > > conflicts wi > > > > > > th file from package mvapich2_gcc-1.2rc2-4 > > > > > > file /etc/mpif90.conf from install of > > > mvapich2_pgi-1.2rc2-4 > > > > > > conflicts wi > > > > > > th file from package mvapich2_gcc-1.2rc2-4 > > > > > > > > > > > > Scott Weitzenkamp > > > > > > SQA and Release Manager > > > > > > Server Access Virtualization Business Unit > > > > > > Cisco Systems > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original 
Message----- > > > > > > > From: general-bounces at lists.openfabrics.org > > > > > > > [mailto:general-bounces at lists.openfabrics.org] On > Behalf Of > > > > > > > Tziporet Koren > > > > > > > Sent: Tuesday, September 09, 2008 8:20 AM > > > > > > > To: ewg at lists.openfabrics.org > > > > > > > Cc: general at lists.openfabrics.org > > > > > > > Subject: [ofa-general] OFED 1.4-RC1 is available > > > > > > > > > > > > > > Hi, > > > > > > > OFED 1.4-RC1 release is available on > > > > > > > > > > > > > > > > http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-rc1.tgz > > > > > > > > > > > > > > To get BUILD_ID run ofed_info > > > > > > > > > > > > > > Please report any issues in bugzilla > > > > > https://bugs.openfabrics.org/ for > > > > > > > OFED 1.4 > > > > > > > > > > > > > > Tziporet & Vladimir > > > > > > > > > > > > > > > ============================================================== > > > > > > > ========== > > > > > > > > > > > > > > Release information: > > > > > > > -------------------- > > > > > > > Linux Operating Systems: > > > > > > > - RedHat EL4 up4: 2.6.9-42.ELsmp * > > > > > > > - RedHat EL4 up5: 2.6.9-55.ELsmp > > > > > > > - RedHat EL4 up6: 2.6.9-67.ELsmp > > > > > > > - RedHat EL4 up7: 2.6.9-78.ELsmp > > > > > > > - RedHat EL5: 2.6.18-8.el5 > > > > > > > - RedHat EL5 up1: 2.6.18-53.el5 > > > > > > > - RedHat EL5 up2: 2.6.18-92.el5 > > > > > > > - CentOS 5.2: 2.6.18-92.el5 > > > > > > > - Fedora C9: 2.6.25-14.fc9 * > > > > > > > - SLES10: 2.6.16.21-0.8-smp > > > > > > > - SLES10 SP1: 2.6.16.46-0.12-smp > > > > > > > - SLES10 SP1 up1: 2.6.16.53-0.16-smp > > > > > > > - SLES10 SP2: 2.6.16.60-0.21-smp > > > > > > > - OpenSuSE 10.3: 2.6.22.5-31 * > > > > > > > - kernel.org: 2.6.26 and 2.6.27-rc5 > > > > > > > > > > > > > > * Minimal QA for these versions > > > > > > > > > > > > > > Systems: > > > > > > > * x86_64 > > > > > > > * x86 > > > > > > > * ia64 > > > > > > > * ppc64 > > > > > > > > > > > > > > > > > > > > > Main Changes from OFED 
1.4-beta
> > > > > > > ===============================
> > > > > > > o Kernel code based on 2.6.27-rc5
> > > > > > > o Added NFS-RDMA support for SLES10 SP2 and kernel 2.6.26 and 27
> > > > > > > o iSER backports added and its now available
> > > > > > > o New MPI packages: Open MPI 1.2.7, MVAPICH 1.1 and MVAPICH2 1.1
> > > > > > > o New DAPL libraries
> > > > > > > o 37 bugs fixed (see attached for details)
> > > > > > >
> > > > > > >
> > > > > > > Tasks that should be completed for the RC2:
> > > > > > > ===========================================
> > > > > > > 1. NFS-RDMA to work on RHEL 5.1
> > > > > > > 2. OSM: Cashed routing
> > > > > > > 3. Cleanup compilation warning
> > > > > > > 4. Bug fixes
> > > > > > >
> > > > > > _______________________________________________
> > > > > > ewg mailing list
> > > > > > ewg at lists.openfabrics.org
> > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> > > > >
> > > > > --
> > > > > Jonathan Perkins
> > > > > http://www.cse.ohio-state.edu/~perkinjo
> > > > >
> > > > _______________________________________________
> > > > general mailing list
> > > > general at lists.openfabrics.org
> > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > >
> > > > To unsubscribe, please visit
> > > http://openib.org/mailman/listinfo/openib-general
> > > >
> > >
> > >
> >
> >

From christopher.tanner at gatech.edu Wed Sep 10 17:52:12 2008
From: christopher.tanner at gatech.edu (Christopher Tanner)
Date: Wed, 10 Sep 2008 20:52:12 -0400
Subject: [ofa-general] Permission denied
Message-ID:

I'm receiving this error when I try to execute an MPI executable:

[node2][0,1,1][btl_openib_component.c:466:init_one_hca] error obtaining device context for mthca0 errno says Permission denied
--------------------------------------------------------------------------
WARNING: There were errors during IB HCA initialization on host 'node2'.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There is at least on IB HCA found on host 'node2', but there is
no active ports detected. This is most certainly not what you wanted.
Check your cables and SM configuration.
--------------------------------------------------------------------------

I'm confused about the 'Permission denied'. My user is part of the group
'rdma', which I thought was supposed to give them permission to access
the Infiniband devices. I'm also confused because the trivial test cases
such as 'Hello World' and 'hostname' execute on all nodes without errors.

The 'no active ports' is also curious. On the master node, I am running
OpenSM and it indicates that the port is active (using ibv_devinfo).
However, I notice that the 'ibv_devinfo' command can only be run by
root. Is this an indication that permissions are not set correctly?

As another note, my cluster is running Ubuntu 8.04, so I couldn't use
the OFED scripts to install the Infiniband drivers, so I had to compile
and install everything from source (which seemed to go fine).

Thanks for your help!
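
One way to narrow down the 'Permission denied' question above is to look at the device node itself. The sketch below is only an illustration: it assumes the userspace verbs node is /dev/infiniband/uverbs0 (the usual libibverbs location, but not confirmed anywhere in this report), and the function name check_node is made up for this example.

```shell
#!/bin/sh
# Report whether the current user can open a device node read-write.
# The default path below is an assumption; pass another node as $1.
check_node() {
    node=$1
    if [ ! -e "$node" ]; then
        echo "$node: missing"
        return 1
    fi
    # Show owner, group and mode bits of the node.
    ls -l "$node"
    if [ -r "$node" ] && [ -w "$node" ]; then
        echo "$node: read-write access OK"
    else
        echo "$node: no read-write access for $(id -un) (groups: $(id -Gn))"
        return 1
    fi
}

check_node "${1:-/dev/infiniband/uverbs0}" ||
    echo "compare the node's group and mode with the user's groups"
```

If the node turns out to be owned by root with mode 0600, membership in the 'rdma' group would not help by itself, which would also be consistent with ibv_devinfo working only as root.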
-------------------------------------------
Chris Tanner
Space Systems Design Lab
Georgia Institute of Technology
christopher.tanner at gatech.edu
-------------------------------------------

From vlad at dev.mellanox.co.il Wed Sep 10 21:52:24 2008
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Thu, 11 Sep 2008 07:52:24 +0300
Subject: [ofa-general] Compiled IB packages
In-Reply-To:
References: <0709481C-38BC-4598-870F-44FE8AE44FCE@gatech.edu>
    <48C6904E.1020606@mellanox.co.il>
    <1221031735.6948.12.camel@vlad-laptop>
    <94325E85-9403-4264-A4FE-90A567A8655B@gatech.edu>
    <48C7D63A.8090005@mellanox.co.il>
Message-ID: <48C8A408.9060300@dev.mellanox.co.il>

Christopher Tanner wrote:
>
> I only said this because, on my system, the
> /lib/modules/2.6.24-16-server/updates directory doesn't exist; thus the
> make process was having an error. However, the
> /lib/modules/2.6.24-16-server/kernel does exist, but this directory
> wasn't searched by the make process (as far as I can tell).
>

/lib/modules/`uname -r`/updates directory will be created by the "make
install" command. kernel and updates directories (under
/lib/modules/`uname -r`) are the target directories for modules
installation and they are not searched by the make process.
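
The two target directories named above can be listed for a given kernel release with a short sketch like the following. The function name module_dirs is invented for this example; the paths are the ones given in the reply.

```shell
#!/bin/sh
# Print the two module installation directories for a kernel release
# (default: the running kernel) and say whether each one exists yet.
module_dirs() {
    release=${1:-$(uname -r)}
    for sub in kernel updates; do
        dir="/lib/modules/$release/$sub"
        if [ -d "$dir" ]; then
            echo "$dir exists"
        else
            echo "$dir missing"
        fi
    done
}

module_dirs
```

A missing updates directory before "make install" has been run is what the explanation above would lead one to expect.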
Regards,
Vladimir

From vlad at lists.openfabrics.org Thu Sep 11 03:07:55 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 11 Sep 2008 03:07:55 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20080911-0200 daily build status
Message-ID: <20080911100755.B1354E60DEB@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.24

Failed:

From eli at dev.mellanox.co.il Thu Sep 11 04:37:46 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Thu, 11 Sep 2008 14:37:46 +0300
Subject: [ofa-general] [PATCH] ipoib: defer skb_orphan() until irqs enabled
In-Reply-To:
References: <20080909145435.GO2316@sgi.com> <20080910135116.GB26881@mtls03>
Message-ID: <20080911113746.GA6298@mtls03>

On Wed, Sep 10, 2008 at 11:26:16AM -0700, Roland Dreier wrote:
> On Tue, Sep 09, 2008 at 02:32:44PM -0700, Roland Dreier wrote:
> > By the way, looking at this stuff again, it seems we have (a possibly
> > quite unlikely) race where a send can complete before the xmit method
> > finishes, and we end up running skb_orphan on an skb that another
> > context has already freed. I'll have to think about how we can fix
> > that -- but any good ideas are appreciated...
>
> Actually it looks like Arthur's patch introduces this race. The current
> code is OK because skb_orphan is called under tx_lock, which is also
> held when we poll the send CQ. But of course the status quo is no good
> exactly because of the locking issue Arthur found.
>
> > We can check if there are outstanding WRs after poll_tx is called. If
> > there are no outstanding WRs, it means that the SKB has been freed. If
> > there are outstanding WRs, it means that the last post has not been
> > freed so we can call skb_orphan(). Like the following patch (on top of
> > Arthur's):
>
> I don't think this closes the race completely: at the point skb_orphan
> is called (after Arthur's patch, by design), we have no locks held. And
> so the timer-driven send completion handling could already have run and
> freed the skb between when we drop tx_lock and when we call skb_orphan.
>

I don't think there is a problem.
The only SKB which is subject to this race is the one that we posted
right after stopping the net queue. But the interrupt handler (resulting
from arming the CQ), and possibly the following timer invocations, will
drain the CQ up to the point where half the queue's WRs are still
outstanding. And this one is in the other half of the queue.

From olga.shern at gmail.com Thu Sep 11 05:12:18 2008
From: olga.shern at gmail.com (Olga Shern (Voltaire))
Date: Thu, 11 Sep 2008 15:12:18 +0300
Subject: [ofa-general] Re: [PATCH] ipoib: send creation parameters when doing send-only join
In-Reply-To:
References: <48C6A9C1.5070108@gmail.com> <48C790CF.4050505@gmail.com>
Message-ID:

> Yes, I looked at the bug and I don't see the actual problem that is
> caused by the current code. OK, the group doesn't get created if there
> are only senders -- so what?

The issue occurs when the senders are on the InfiniBand side and the
receivers are on the IP side, in a setup that includes IP-to-IB gateways.

> It seems a better fix would be just to get rid of the #if 0 and use
> send-only membership after all these years?
>

So we are back to the same issue we raised in the following thread, to
which we did not get your reply:
http://lists.openfabrics.org/pipermail/general/2008-July/053037.html

Olga

From julia at diku.dk Thu Sep 11 05:33:01 2008
From: julia at diku.dk (Julia Lawall)
Date: Thu, 11 Sep 2008 14:33:01 +0200 (CEST)
Subject: [ofa-general] [PATCH 1/5] drivers/infiniband/hw: Drop code after return
Message-ID:

From: Julia Lawall

The break after the return serves no purpose.

Signed-off-by: Julia Lawall
---
 drivers/infiniband/hw/amso1100/c2_provider.c | 1 -
 drivers/infiniband/hw/nes/nes_verbs.c | 3 ---
 2 files changed, 4 deletions(-)

diff -u -p a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -272,7 +272,6 @@ static struct ib_qp *c2_create_qp(struct
 		pr_debug("%s: Invalid QP type: %d\n", __func__,
 			init_attr->qp_type);
 		return ERR_PTR(-EINVAL);
-		break;
 	}
 
 	if (err) {
diff -u -p a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -1467,7 +1467,6 @@ static struct ib_qp *nes_create_qp(struc
 	default:
 		nes_debug(NES_DBG_QP, "Invalid QP type: %d\n", init_attr->qp_type);
 		return ERR_PTR(-EINVAL);
-		break;
 	}
 
 	/* update the QP table */
@@ -2498,7 +2497,6 @@ static struct ib_mr *nes_reg_user_mr(str
 		nes_debug(NES_DBG_MR, "Leaving, ibmr=%p", ibmr);
 
 		return ibmr;
-		break;
 	case IWNES_MEMREG_TYPE_QP:
 	case IWNES_MEMREG_TYPE_CQ:
 		nespbl = kzalloc(sizeof(*nespbl), GFP_KERNEL);
@@ -2572,7 +2570,6 @@ static struct ib_mr *nes_reg_user_mr(str
 		nesmr->ibmr.lkey = -1;
 		nesmr->mode = req.reg_type;
 		return &nesmr->ibmr;
-		break;
 	}
 
 	return ERR_PTR(-ENOSYS);

From richard.genoud at gmail.com Thu Sep 11 06:46:11 2008
From: richard.genoud at gmail.com (Richard Genoud)
Date: Thu, 11 Sep 2008 15:46:11 +0200
Subject: [ofa-general] Re: [PATCH 1/5] drivers/infiniband/hw: Drop code after return
In-Reply-To:
References:
Message-ID: <80b317760809110646j1e9c5171v3fe78546cd62c9a7@mail.gmail.com>

2008/9/11 Julia Lawall :
> From: Julia Lawall
>
> The break after the return serves no purpose.
>
> Signed-off-by: Julia Lawall

Reviewed-by: Richard Genoud

> ---
>  drivers/infiniband/hw/amso1100/c2_provider.c | 1 -
>  drivers/infiniband/hw/nes/nes_verbs.c | 3 ---
>  2 files changed, 4 deletions(-)
>
> diff -u -p a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
> --- a/drivers/infiniband/hw/amso1100/c2_provider.c
> +++ b/drivers/infiniband/hw/amso1100/c2_provider.c
> @@ -272,7 +272,6 @@ static struct ib_qp *c2_create_qp(struct
>  		pr_debug("%s: Invalid QP type: %d\n", __func__,
>  			init_attr->qp_type);
>  		return ERR_PTR(-EINVAL);
> -		break;
>  	}
>
>  	if (err) {
> diff -u -p a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
> --- a/drivers/infiniband/hw/nes/nes_verbs.c
> +++ b/drivers/infiniband/hw/nes/nes_verbs.c
> @@ -1467,7 +1467,6 @@ static struct ib_qp *nes_create_qp(struc
>  	default:
>  		nes_debug(NES_DBG_QP, "Invalid QP type: %d\n", init_attr->qp_type);
>  		return ERR_PTR(-EINVAL);
> -		break;
>  	}
>
>  	/* update the QP table */
> @@ -2498,7 +2497,6 @@ static struct ib_mr *nes_reg_user_mr(str
>  		nes_debug(NES_DBG_MR, "Leaving, ibmr=%p", ibmr);
>
>  		return ibmr;
> -		break;
>  	case IWNES_MEMREG_TYPE_QP:
>  	case IWNES_MEMREG_TYPE_CQ:
>  		nespbl = kzalloc(sizeof(*nespbl), GFP_KERNEL);
> @@ -2572,7 +2570,6 @@ static struct ib_mr *nes_reg_user_mr(str
>  		nesmr->ibmr.lkey = -1;
>  		nesmr->mode = req.reg_type;
>  		return &nesmr->ibmr;
> -		break;
>  	}
>
>  	return ERR_PTR(-ENOSYS);
>

From yossi.openib at gmail.com Thu Sep 11 09:09:34 2008
From: yossi.openib at gmail.com (Yossi Etigin)
Date: Thu, 11 Sep 2008 19:09:34 +0300
Subject: ***SPAM*** Re: Fwd: [ofa-general] [PATCH] ipoib: fix hang while bringing down uninitialized interface
In-Reply-To:
References: <48C7DA7B.3050706@gmail.com>
Message-ID: <48C942BE.7010606@gmail.com>

> Looks like a real issue. Is this a regression from 2.6.26? (ie what
> introduced this bug?)
>

Commit
http://www.openfabrics.org/git/?p=ofed_1_4/linux-2.6.git;a=commit;h=57ce41d1d18279cc90223f3deadca70c7de1cfca
put the bug in ipoib, but maybe this causes a hang only in recent kernels
due to modifications in timer code.

--Yossi

From tdhanu_2000 at yahoo.com Thu Sep 11 11:07:05 2008
From: tdhanu_2000 at yahoo.com (dhananjay tembe)
Date: Thu, 11 Sep 2008 23:37:05 +0530 (IST)
Subject: [ofa-general] ***SPAM*** Where can I find the topology file?
Message-ID: <803776.49372.qm@web94206.mail.in2.yahoo.com>

Hi,

I am using the OFED stack and OpenSM. I was running some IB tools such as
ibdiagnet and ibdiagpath. The man page for ibdiagnet shows a -t option
with which you can specify the topology file. Could you please tell me
what this topology file contains and how I can create or generate it?

Thanks in advance.

---Dhananjay.

From sashak at voltaire.com Thu Sep 11 13:11:26 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 11 Sep 2008 23:11:26 +0300
Subject: [ofa-general] [PATCH] opensm/opensm.spec: comment out service auto-startup setup
Message-ID: <20080911201126.GK25831@sashak.voltaire.com>

This addresses bug#1181.

Comment out opensm service auto-startup setup at %post section.

Signed-off-by: Sasha Khapyorsky
---

I don't really know why it was done this way originally. So please send
any comments and/or objections.

 opensm/opensm.spec.in | 10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in
index 2e3abfc..fc7677d 100644
--- a/opensm/opensm.spec.in
+++ b/opensm/opensm.spec.in
@@ -104,11 +104,11 @@ install -m 755 scripts/sldd.sh $RPM_BUILD_ROOT%{_sbindir}/sldd.sh
 rm -rf $RPM_BUILD_ROOT
 
 %post
-if [ $1 = 1 ]; then
-	/sbin/chkconfig --add opensmd
-else
-	/sbin/service opensmd condrestart
-fi
+#if [ $1 = 1 ]; then
+#	/sbin/chkconfig --add opensmd
+#else
+#	/sbin/service opensmd condrestart
+#fi
 
 %preun
 if [ $1 = 0 ]; then
--
1.5.4.rc2.60.gb2e62

From sashak at voltaire.com Thu Sep 11 13:36:27 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 11 Sep 2008 23:36:27 +0300
Subject: ***SPAM*** Re: [ofa-general] OpenSM Problems/Questions
In-Reply-To: <20080909121140.1ec7838b.weiny2@llnl.gov>
References: <20080909121140.1ec7838b.weiny2@llnl.gov>
Message-ID: <20080911203627.GN25831@sashak.voltaire.com>

On 12:11 Tue 09 Sep , Ira Weiny wrote:
> >
> > > The following problem that is being encountered may also be SA/SM related. A
> > > node (NodeX) may be seen (through IPoIB) by all but a few nodes (NodesA-G).
> > > A ping from those node (NodesA-G) to NodeX returns "Destination Host
> > > Unreachable". A ping from NodeX to NodesA-G works.
> >
> > Sounds like perhaps those nodes were unable to join the broadcast
> > group perhaps due to a rate issue.
>
> Hal is correct, and saquery is your friend here. If you use "genders" and
> "whatsup" (https://computing.llnl.gov/linux/downloads.html) I have a series of
> tools "Pragmatic InfiniBand Utilities (PIU)"
> (https://computing.llnl.gov/linux/piu.html) which includes a tool called
> "ibnodeinmcast" which can help debug this. What it does is use saquery [-g|-m]
> to find nodes in the multicast groups. With the addition of other LLNL tools
> this can be boiled down to which nodes "should" be in the group but are not.
> You are welcome to download that package and adapt it to your environment.

Also there was your fix (after OFED 1.3) which is pretty related to
unstable links.

Sasha


commit e40NB597af556fce55e3b205b0cc4ffa6805aeaa
Author: Ira Weiny
Date: Thu Apr 24 18:16:57 2008 -0700

    opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit

    (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.)

    I did not get any output with multicast_debug_level! But I added some more
    debugging and finally realized that the set was not being sent. :-( I put a
    debug statement in OpenSM where the flag was set and therefore thought that
    OpenSM had set the rereg bit. However, since no other data had changed the
    "set" MAD was not sent. (I am getting a bit tongue tied reading this back. I
    hope that all makes sense.)

    Here is a patch which fixes the problem. (At least with the partial sub-nets
    configuration I explained before.) I will have to verify this fixes the problem
    I originally reported.

    Ira

    From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001
    From: Ira K. Weiny
    Date: Thu, 24 Apr 2008 18:05:01 -0700
    Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit

    Signed-off-by: Ira K. Weiny
    Signed-off-by: Sasha Khapyorsky

diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index ab23929..4d628d2 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -1099,9 +1099,14 @@ __osm_lid_mgr_set_physp_pi(IN osm_lid_mgr_t * const p_mgr,
 	if ((p_mgr->p_subn->first_time_master_sweep == TRUE || p_port->is_new)
 	    && !p_mgr->p_subn->opt.no_clients_rereg
 	    && ((p_old_pi->capability_mask & IB_PORT_CAP_HAS_CLIENT_REREG) !=
-		0))
+		0)) {
+		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+			"Seting client rereg on %s, port %d\n",
+			p_port->p_node->print_desc,
+			p_port->p_physp->port_num);
 		ib_port_info_set_client_rereg(p_pi, 1);
-	else
+		send_set = TRUE;
+	} else
 		ib_port_info_set_client_rereg(p_pi, 0);
 
 	/* We need to send the PortInfo Set request with the new sm_lid

From weiny2 at llnl.gov Thu Sep 11 14:13:01 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 11 Sep 2008 14:13:01 -0700
Subject: ***SPAM*** Re: [ofa-general] OpenSM Problems/Questions
In-Reply-To: <20080911203627.GN25831@sashak.voltaire.com>
References: <20080909121140.1ec7838b.weiny2@llnl.gov> <20080911203627.GN25831@sashak.voltaire.com>
Message-ID: <20080911141301.4823682d.weiny2@llnl.gov>

On Thu, 11 Sep 2008 23:36:27 +0300 Sasha Khapyorsky wrote:
> On 12:11 Tue 09 Sep , Ira Weiny wrote:
> > > >
> > > > The following problem that is being encountered may also be SA/SM related. A
> > > > node (NodeX) may be seen (through IPoIB) by all but a few nodes (NodesA-G).
> > > > A ping from those node (NodesA-G) to NodeX returns "Destination Host
> > > > Unreachable". A ping from NodeX to NodesA-G works.
> > >
> > > Sounds like perhaps those nodes were unable to join the broadcast
> > > group perhaps due to a rate issue.
> >
> > Hal is correct, and saquery is your friend here.
If you use "genders" and > > "whatsup" (https:// computing.llnl.gov/linux/downloads.html) I have a series of > > tools "Pragmatic InfiniBand Utilities (PIU)" > > (https:// computing.llnl.gov/linux/piu.html) which includes a tool called > > "ibnodeinmcast" which can help debug this. What it does is use saquery [-g|-m] > > to find nodes in the multicast groups. With the addition of other LLNL tools > > this can be boiled down to which nodes "should" be in the group but are not. > > You are welcome to download that package and adapt it to your environment. > > Also there was your fix (after OFED 1.3) which is pretty related to > unstable links. True, but as I understood this is happening right after boot. Is this true? Ira > > Sasha > > > commit e40NB597af556fce55e3b205b0cc4ffa6805aeaa > Author: Ira Weiny > Date: Thu Apr 24 18:16:57 2008 -0700 > > opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit > > (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.) > > I did not get any output with multicast_debug_level! But I added some more > debugging and finally realized that the set was not being sent. :-( I put a > debug statement in OpenSM where the flag was set and therefore thought that > OpenSM had set the rereg bit. However, since no other data had changed the > "set" MAD was not sent. (I am getting a bit tongue tied reading this back. I > hope that all makes sense.) > > Here is a patch which fixes the problem. (At least with the partial sub-nets > configuration I explained before.) I will have to verify this fixes the problem > I originally reported. > > Ira > > From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001 > From: Ira K. Weiny > Date: Thu, 24 Apr 2008 18:05:01 -0700 > Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit > > Signed-off-by: Ira K. 
Weiny > Signed-off-by: Sasha Khapyorsky > > diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c > index ab23929..4d628d2 100644 > --- a/opensm/opensm/osm_lid_mgr.c > +++ b/opensm/opensm/osm_lid_mgr.c > @@ -1099,9 +1099,14 @@ __osm_lid_mgr_set_physp_pi(IN osm_lid_mgr_t * const p_mgr, > if ((p_mgr->p_subn->first_time_master_sweep == TRUE || p_port->is_new) > && !p_mgr->p_subn->opt.no_clients_rereg > && ((p_old_pi->capability_mask & IB_PORT_CAP_HAS_CLIENT_REREG) != > - 0)) > + 0)) { > + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, > + "Seting client rereg on %s, port %d\n", > + p_port->p_node->print_desc, > + p_port->p_physp->port_num); > ib_port_info_set_client_rereg(p_pi, 1); > - else > + send_set = TRUE; > + } else > ib_port_info_set_client_rereg(p_pi, 0); > > /* We need to send the PortInfo Set request with the new sm_lid > From rdreier at cisco.com Thu Sep 11 14:19:24 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 11 Sep 2008 14:19:24 -0700 Subject: [ofa-general] [PATCH] ipoib: defer skb_orphan() until irqs enabled In-Reply-To: <20080911113746.GA6298@mtls03> (Eli Cohen's message of "Thu, 11 Sep 2008 14:37:46 +0300") References: <20080909145435.GO2316@sgi.com> <20080910135116.GB26881@mtls03> <20080911113746.GA6298@mtls03> Message-ID: > I don't think there is a problem. The only SKB which is subject to > this race, is the one we that we posted right after stopping the net > queue. But the interrupt handler (resulting from arming the CQ) and > possibly the following timer invocations, will drain the CQ up to the > point where there are half the queue outstanding WRs. But and this one > is at the other half of the queue. Maybe I'm missing something but where is the logic that stops draining the CQ? 
I just see static int poll_tx(struct ipoib_dev_priv *priv) { int n, i; n = ib_poll_cq(priv->send_cq, MAX_SEND_CQE, priv->send_wc); for (i = 0; i < n; ++i) ipoib_ib_handle_tx_wc(priv->dev, priv->send_wc + i); return n == MAX_SEND_CQE; } and static void drain_tx_cq(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned long flags; spin_lock_irqsave(&priv->tx_lock, flags); while (poll_tx(priv)) ; /* nothing */ which seem like they could easily poll that last completion. From rdreier at cisco.com Thu Sep 11 14:20:39 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 11 Sep 2008 14:20:39 -0700 Subject: Fwd: [ofa-general] [PATCH] ipoib: fix hang while bringing down uninitialized interface In-Reply-To: <48C942BE.7010606@gmail.com> (Yossi Etigin's message of "Thu, 11 Sep 2008 19:09:34 +0300") References: <48C7DA7B.3050706@gmail.com> <48C942BE.7010606@gmail.com> Message-ID: > Commit http://www.openfabrics.org/git/?p=ofed_1_4/linux-2.6.git;a=commit;h=57ce41d1d18279cc90223f3deadca70c7de1cfca > put the bug in ipoib, but maybe this causes a hang only in recent kernels > due to modifications in timer code. So it looks like not a regression from 2.6.26... I'll queue this for 2.6.28 From chu11 at llnl.gov Thu Sep 11 14:27:59 2008 From: chu11 at llnl.gov (Al Chu) Date: Thu, 11 Sep 2008 14:27:59 -0700 Subject: [ofa-general] [PATCH] opensm/opensm.spec: comment out service auto-startup setup In-Reply-To: <20080911201126.GK25831@sashak.voltaire.com> References: <20080911201126.GK25831@sashak.voltaire.com> Message-ID: <1221168479.19185.135.camel@cardanus.llnl.gov> Hey Sasha, Although the %post script below may not be 100% portable, I think it's pretty typical for system daemon rpms. A quick "rpm -q --scripts " shows its pretty common for system daemons on RHEL. It should be tweaked for portability rather than being removed. Personally, I've never done "/sbin/service FOO condrestart" in rpm scripts. I do "%{initrddir}/FOO condrestart". 
Maybe that's more portable?? Al On Thu, 2008-09-11 at 23:11 +0300, Sasha Khapyorsky wrote: > This addresses bug#1181. > > Comment out opensm service auto-startup setup at %post section. > > Signed-off-by: Sasha Khapyorsky > --- > > I don't really know why it was done this way originally. So please send > any comments and/or objections. > > opensm/opensm.spec.in | 10 +++++----- > 1 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in > index 2e3abfc..fc7677d 100644 > --- a/opensm/opensm.spec.in > +++ b/opensm/opensm.spec.in > @@ -104,11 +104,11 @@ install -m 755 scripts/sldd.sh $RPM_BUILD_ROOT%{_sbindir}/sldd.sh > rm -rf $RPM_BUILD_ROOT > > %post > -if [ $1 = 1 ]; then > - /sbin/chkconfig --add opensmd > -else > - /sbin/service opensmd condrestart > -fi > +#if [ $1 = 1 ]; then > +# /sbin/chkconfig --add opensmd > +#else > +# /sbin/service opensmd condrestart > +#fi > > %preun > if [ $1 = 0 ]; then -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From rdreier at cisco.com Thu Sep 11 19:59:32 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 11 Sep 2008 19:59:32 -0700 Subject: [ofa-general] Re: [PATCH] ipoib: send creation parameters when doing send-only join In-Reply-To: (Olga Shern's message of "Thu, 11 Sep 2008 15:12:18 +0300") References: <48C6A9C1.5070108@gmail.com> <48C790CF.4050505@gmail.com> Message-ID: > The issue accrues when senders are at Infiniband side and receivers > are at IP side, when setup includes IP to IB Gateways. Shouldn't the IP to IB gateway be a full member of any multicast groups it wants to forward? And figure out which groups to forward by snooping IGMP? 
> So we are back to the same issue we have raised in the following > thread and didn't get your reply > http://lists.openfabrics.org/pipermail/general/2008-July/053037.html Sorry I let that drop, but I don't see what that issue has to do with the question of whether IPoIB should finally use send-only membership? - R. From vlad at lists.openfabrics.org Fri Sep 12 03:07:27 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 12 Sep 2008 03:07:27 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20080912-0200 daily build status Message-ID: <20080912100727.BEFD9E60E09@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with 
linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Failed: From yossi.openib at gmail.com Fri Sep 12 04:22:09 2008 From: yossi.openib at gmail.com (Yossi Etigin) Date: Fri, 12 Sep 2008 14:22:09 +0300 Subject: [ofa-general] [PATCH v2] ipiob: fix rtnl deadlock In-Reply-To: References: <4899CF0A.1060509@Voltaire.COM> <32cb786f0808081155o19f8fb9dm217cd6996dffa3e5@mail.gmail.com> <32cb786f0808090538j272842b1r5117547cccde0d06@mail.gmail.com> <32cb786f0808161218o417553b5w1738a517f0eb468a@mail.gmail.com> Message-ID: <48CA50E1.2090309@gmail.com> Seems like taking rtnl_lock in ipoib_mcast_join_complete() also causes a deadlock. See bug #1186. Roland Dreier wrote: > > What if you bring the device down, while you get a join completion event? > > ipoib_stop() can run in parellel with ipoib_mcast_join_complete(), and you > > will just wait for ipoib_stop() to finish to do netif_carrier_on() afterwards. > > Yes, but after ipoib_stop() finishes, netif_carrier_on() doesn't do > anything that could cause a problem, since the netdev is down. > > - R. 
From eli at dev.mellanox.co.il Fri Sep 12 07:25:05 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Fri, 12 Sep 2008 17:25:05 +0300
Subject: [ofa-general] [PATCH] ipoib: defer skb_orphan() until irqs enabled
In-Reply-To:
References: <20080909145435.GO2316@sgi.com> <20080910135116.GB26881@mtls03> <20080911113746.GA6298@mtls03>
Message-ID: <1221229505.6869.11.camel@eli-lt>

On Thu, 2008-09-11 at 14:19 -0700, Roland Dreier wrote:
> Maybe I'm missing something but where is the logic that stops draining
> the CQ? I just see
>
Well, there is a hole after all... you're right.

> static int poll_tx(struct ipoib_dev_priv *priv)
> {
> 	int n, i;
>
> 	n = ib_poll_cq(priv->send_cq, MAX_SEND_CQE, priv->send_wc);
> 	for (i = 0; i < n; ++i)
> 		ipoib_ib_handle_tx_wc(priv->dev, priv->send_wc + i);
>
> 	return n == MAX_SEND_CQE;
> }
>
> and
>
> static void drain_tx_cq(struct net_device *dev)
> {
> 	struct ipoib_dev_priv *priv = netdev_priv(dev);
> 	unsigned long flags;
>
> 	spin_lock_irqsave(&priv->tx_lock, flags);
> 	while (poll_tx(priv))
> 		; /* nothing */
>
> which seem like they could easily poll that last completion.

There is no problem with polling the last completion. It is a problem
only if this code polls the last completion before the transmit function
calls skb_orphan() on it, and that, I think, has little chance of
happening. But if we agree that the SKB posted just before the queue is
stopped is the problematic one, we can extend the following condition
to be:

	if (unlikely(priv->tx_outstanding > MAX_SEND_CQE))
		while (poll_tx(priv))
			; /* nothing */

-	return ret;
+	return !(sent && priv->tx_outstanding && !netif_queue_stopped(dev));
 }

What do you think?
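The draining behaviour discussed above can be modelled in user space to see why the loop keeps polling until a short batch comes back, and so can reap even the most recently posted completion. This is only a sketch under stated assumptions: `mock_cq`, `mock_poll_cq()`, `mock_poll_tx()` and `mock_drain()` are hypothetical stand-ins for the send CQ, `ib_poll_cq()`, `poll_tx()` and the loop in `drain_tx_cq()`. Locking and the real completion handler are left out, and the batch size of 16 is an assumed value.

```c
#define MAX_SEND_CQE 16	/* assumed batch size, for illustration only */

/* Hypothetical stand-in for a send CQ: just counters. */
struct mock_cq {
	int pending;	/* completions waiting to be reaped */
	int handled;	/* completions already processed */
};

/* Stand-in for ib_poll_cq(): reap at most max_entries completions. */
int mock_poll_cq(struct mock_cq *cq, int max_entries)
{
	int n = cq->pending < max_entries ? cq->pending : max_entries;

	cq->pending -= n;
	return n;
}

/* Same shape as poll_tx(): nonzero only when a full batch was reaped. */
int mock_poll_tx(struct mock_cq *cq)
{
	int n = mock_poll_cq(cq, MAX_SEND_CQE);

	cq->handled += n;
	return n == MAX_SEND_CQE;
}

/* Same shape as the loop in drain_tx_cq(): it stops only after a short
 * (possibly empty) batch. Returns how many times the CQ was polled. */
int mock_drain(struct mock_cq *cq)
{
	int polls = 1;

	while (mock_poll_tx(cq))
		polls++;
	return polls;
}
```

Nothing in the loop itself bounds how far it drains: whatever is in the CQ when it runs is reaped, which is the point of Roland's question.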
From rdreier at cisco.com Fri Sep 12 08:20:29 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 12 Sep 2008 08:20:29 -0700
Subject: [ofa-general] [PATCH v2] ipiob: fix rtnl deadlock
In-Reply-To: <48CA50E1.2090309@gmail.com> (Yossi Etigin's message of "Fri, 12 Sep 2008 14:22:09 +0300")
References: <4899CF0A.1060509@Voltaire.COM> <32cb786f0808081155o19f8fb9dm217cd6996dffa3e5@mail.gmail.com> <32cb786f0808090538j272842b1r5117547cccde0d06@mail.gmail.com> <32cb786f0808161218o417553b5w1738a517f0eb468a@mail.gmail.com> <48CA50E1.2090309@gmail.com>
Message-ID:

 > Seems like taking rtnl_lock in ipoib_mcast_join_complete() also causes a deadlock.
 > See bug #1186.

I have to admit the deadlock isn't obvious to me...
ipoib_mcast_join_complete() runs in the ib_mad1 thread, so I'm not sure
how that thread is getting flushed. Can you reproduce this deadlock
with lockdep enabled and get the output from that?

 - R.

From yossi.openib at gmail.com Fri Sep 12 08:25:26 2008
From: yossi.openib at gmail.com (Yossi Etigin)
Date: Fri, 12 Sep 2008 18:25:26 +0300
Subject: Re: [ofa-general] [PATCH v2] ipiob: fix rtnl deadlock
In-Reply-To:
References: <4899CF0A.1060509@Voltaire.COM> <32cb786f0808081155o19f8fb9dm217cd6996dffa3e5@mail.gmail.com> <32cb786f0808090538j272842b1r5117547cccde0d06@mail.gmail.com> <32cb786f0808161218o417553b5w1738a517f0eb468a@mail.gmail.com> <48CA50E1.2090309@gmail.com>
Message-ID: <48CA89E6.8030301@gmail.com>

ipoib_stop() calls ipoib_ib_dev_down(), which calls ipoib_mcast_dev_flush(),
which calls ipoib_mcast_free(), which calls ipoib_mcast_leave(). The latter
calls ib_sa_free_multicast(), and this waits until the multicast completion
handler finishes. This happens to be ipoib_mcast_join_complete(), which
waits for the rtnl_lock(), which was already taken by ipoib_stop().

Roland Dreier wrote:
>  > Seems like taking rtnl_lock in ipoib_mcast_join_complete() also causes a deadlock.
>  > See bug #1186.
>
> I have to admit the deadlock isn't obvious to
> me...
ipoib_mcast_join_complete() runs in the ib_mad1 thread, so I'm not > sure how that thread is getting flushed. Can you reproduce this > deadlock with lockdep enabled and get the output from that? > > - R. > From ctung at neteffect.com Fri Sep 12 09:22:15 2008 From: ctung at neteffect.com (Chien Tung) Date: Fri, 12 Sep 2008 11:22:15 -0500 Subject: [ofa-general] ***SPAM*** [PATCH] RDMA/nes: client side QP destroy Message-ID: <200809121622.m8CGMFZ6001609@velma.neteffect.com> Author: Faisal Latif * Fixed QP not destroyed properly on the client. * Misc cleanup in nes_cm.c patch verified with rping. Signed-off-by: Faisal Latif -- Roland, Please consider this for 2.6.27. It has been applied and tested against 2.6.27-rc5. drivers/infiniband/hw/nes/nes_cm.c | 20 +++++++------------- 1 files changed, 7 insertions(+), 13 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index 9f0b964..8793aa4 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -1145,7 +1145,7 @@ static int rem_ref_cm_node(struct nes_cm_core *cm_core, struct nes_timer_entry *recv_entry; struct iw_cm_id *cm_id; struct list_head *list_core, *list_node_temp; - struct nes_qp *nesqp; + struct nes_qp *nesqp = NULL; if (!cm_node) return -EINVAL; @@ -1826,7 +1826,7 @@ static struct nes_cm_listener *mini_cm_listen(struct nes_cm_core *cm_core, /** * mini_cm_connect - make a connection node with params */ -struct nes_cm_node *mini_cm_connect(struct nes_cm_core *cm_core, +static struct nes_cm_node *mini_cm_connect(struct nes_cm_core *cm_core, struct nes_vnic *nesvnic, u16 private_data_len, void *private_data, struct nes_cm_info *cm_info) { @@ -1835,7 +1835,7 @@ struct nes_cm_node *mini_cm_connect(struct nes_cm_core *cm_core, struct nes_cm_listener *loopbackremotelistener; struct nes_cm_node *loopbackremotenode; struct nes_cm_info loopback_cm_info; - u16 mpa_frame_size = sizeof(struct ietf_mpa_frame) + private_data_len; + u16 
mpa_frame_size = 0; struct ietf_mpa_frame *mpa_frame = NULL; /* create a CM connection node */ @@ -1847,7 +1847,8 @@ struct nes_cm_node *mini_cm_connect(struct nes_cm_core *cm_core, mpa_frame->flags = IETF_MPA_FLAGS_CRC; mpa_frame->rev = IETF_MPA_VERSION; mpa_frame->priv_data_len = htons(private_data_len); - + mpa_frame_size = sizeof(struct ietf_mpa_frame) + + private_data_len; /* set our node side to client (active) side */ cm_node->tcp_cntxt.client = 1; cm_node->tcp_cntxt.rcv_wscale = NES_CM_DEFAULT_RCV_WND_SCALE; @@ -1956,13 +1957,6 @@ static int mini_cm_reject(struct nes_cm_core *cm_core, return ret; cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_CLOSED; - ret = send_fin(cm_node, NULL); - - if (cm_node->accept_pend) { - BUG_ON(!cm_node->listener); - atomic_dec(&cm_node->listener->pend_accepts_cnt); - BUG_ON(atomic_read(&cm_node->listener->pend_accepts_cnt) < 0); - } ret = send_reset(cm_node, NULL); return ret; @@ -2383,6 +2377,7 @@ static int nes_cm_disconn_true(struct nes_qp *nesqp) atomic_inc(&cm_disconnects); cm_event.event = IW_CM_EVENT_DISCONNECT; if (last_ae == NES_AEQE_AEID_LLP_CONNECTION_RESET) { + issued_disconnect_reset = 1; cm_event.status = IW_CM_EVENT_STATUS_RESET; nes_debug(NES_DBG_CM, "Generating a CM " "Disconnect Event (status reset) for " @@ -2508,7 +2503,6 @@ static int nes_disconnect(struct nes_qp *nesqp, int abrupt) nes_debug(NES_DBG_CM, "Call close API\n"); g_cm_core->api->close(g_cm_core, nesqp->cm_node); - nesqp->cm_node = NULL; } return ret; @@ -2837,6 +2831,7 @@ int nes_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) cm_node->apbvt_set = 1; nesqp->cm_node = cm_node; cm_node->nesqp = nesqp; + nes_add_ref(&nesqp->ibqp); return 0; } @@ -3167,7 +3162,6 @@ static void cm_event_connect_error(struct nes_cm_event *event) if (ret) printk(KERN_ERR "%s[%u] OFA CM event_handler returned, " "ret=%d\n", __func__, __LINE__, ret); - nes_rem_ref(&nesqp->ibqp); cm_id->rem_ref(cm_id); 
rem_ref_cm_node(event->cm_node->cm_core, event->cm_node); From ctung at neteffect.com Fri Sep 12 09:22:15 2008 From: ctung at neteffect.com (Chien Tung) Date: Fri, 12 Sep 2008 11:22:15 -0500 Subject: [ofa-general] ***SPAM*** [PATCH] RDMA/nes: 4 port 1G HP blade card support Message-ID: <200809121622.m8CGMFVS001611@velma.neteffect.com> * Adding support for NetEffect 4 port 1G HP blade card. The mapping between physical port and MAC is different from the standup card. Signed-off-by: Chien Tung -- Roland, Please consider this for 2.6.27. It has been applied and tested against 2.6.27-rc5. drivers/infiniband/hw/nes/nes.c | 29 +++++++++++++--- drivers/infiniband/hw/nes/nes_hw.c | 66 +++++++++++++++++++++++++++-------- drivers/infiniband/hw/nes/nes_hw.h | 1 + 3 files changed, 76 insertions(+), 20 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes.c b/drivers/infiniband/hw/nes/nes.c index b0cab64..a539685 100644 --- a/drivers/infiniband/hw/nes/nes.c +++ b/drivers/infiniband/hw/nes/nes.c @@ -562,7 +562,26 @@ static int __devinit nes_probe(struct pci_dev *pcidev, const struct pci_device_i nesdev->nesadapter->pd_config_base[PCI_FUNC(nesdev->pcidev->devfn)]; */ nesdev->base_doorbell_index = 1; nesdev->doorbell_start = nesdev->nesadapter->doorbell_start; - nesdev->mac_index = PCI_FUNC(nesdev->pcidev->devfn) % nesdev->nesadapter->port_count; + if (nesdev->nesadapter->phy_type[0] == NES_PHY_TYPE_PUMA_1G) { + switch (PCI_FUNC(nesdev->pcidev->devfn) % + nesdev->nesadapter->port_count) { + case 1: + nesdev->mac_index = 2; + break; + case 2: + nesdev->mac_index = 1; + break; + case 3: + nesdev->mac_index = 3; + break; + case 0: + default: + nesdev->mac_index = 0; + } + } else { + nesdev->mac_index = PCI_FUNC(nesdev->pcidev->devfn) % + nesdev->nesadapter->port_count; + } tasklet_init(&nesdev->dpc_tasklet, nes_dpc, (unsigned long)nesdev); @@ -581,7 +600,7 @@ static int __devinit nes_probe(struct pci_dev *pcidev, const struct pci_device_i nesdev->int_req = (0x101 << 
PCI_FUNC(nesdev->pcidev->devfn)) | (1 << (PCI_FUNC(nesdev->pcidev->devfn)+16)); if (PCI_FUNC(nesdev->pcidev->devfn) < 4) { - nesdev->int_req |= (1 << (PCI_FUNC(nesdev->pcidev->devfn)+24)); + nesdev->int_req |= (1 << (PCI_FUNC(nesdev->mac_index)+24)); } /* TODO: This really should be the first driver to load, not function 0 */ @@ -772,14 +791,14 @@ static ssize_t nes_show_adapter(struct device_driver *ddp, char *buf) list_for_each_entry(nesdev, &nes_dev_list, list) { if (i == ee_flsh_adapter) { - devfn = nesdev->nesadapter->devfn; - bus_number = nesdev->nesadapter->bus_number; + devfn = nesdev->pcidev->devfn; + bus_number = nesdev->pcidev->bus->number; break; } i++; } - return snprintf(buf, PAGE_SIZE, "%x:%x", bus_number, devfn); + return snprintf(buf, PAGE_SIZE, "%x:%x\n", bus_number, devfn); } static ssize_t nes_store_adapter(struct device_driver *ddp, diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index 1513d40..bdd98e6 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -61,7 +61,7 @@ u32 int_mod_cq_depth_1; static void nes_cqp_ce_handler(struct nes_device *nesdev, struct nes_hw_cq *cq); static void nes_init_csr_ne020(struct nes_device *nesdev, u8 hw_rev, u8 port_count); static int nes_init_serdes(struct nes_device *nesdev, u8 hw_rev, u8 port_count, - u8 OneG_Mode); + struct nes_adapter *nesadapter, u8 OneG_Mode); static void nes_nic_napi_ce_handler(struct nes_device *nesdev, struct nes_hw_nic_cq *cq); static void nes_process_aeq(struct nes_device *nesdev, struct nes_hw_aeq *aeq); static void nes_process_ceq(struct nes_device *nesdev, struct nes_hw_ceq *ceq); @@ -292,9 +292,6 @@ struct nes_adapter *nes_init_adapter(struct nes_device *nesdev, u8 hw_rev) { if ((port_count = nes_reset_adapter_ne020(nesdev, &OneG_Mode)) == 0) return NULL; - if (nes_init_serdes(nesdev, hw_rev, port_count, OneG_Mode)) - return NULL; - nes_init_csr_ne020(nesdev, hw_rev, port_count); max_qp = 
nes_read_indexed(nesdev, NES_IDX_QP_CTX_SIZE); nes_debug(NES_DBG_INIT, "QP_CTX_SIZE=%u\n", max_qp); @@ -353,6 +350,19 @@ struct nes_adapter *nes_init_adapter(struct nes_device *nesdev, u8 hw_rev) { nes_debug(NES_DBG_INIT, "Allocating new nesadapter @ %p, size = %u (actual size = %u).\n", nesadapter, (u32)sizeof(struct nes_adapter), adapter_size); + if (nes_read_eeprom_values(nesdev, nesadapter)) { + printk(KERN_ERR PFX "Unable to read EEPROM data.\n"); + kfree(nesadapter); + return NULL; + } + + if (nes_init_serdes(nesdev, hw_rev, port_count, nesadapter, + OneG_Mode)) { + kfree(nesadapter); + return NULL; + } + nes_init_csr_ne020(nesdev, hw_rev, port_count); + /* populate the new nesadapter */ nesadapter->devfn = nesdev->pcidev->devfn; nesadapter->bus_number = nesdev->pcidev->bus->number; @@ -468,20 +478,25 @@ struct nes_adapter *nes_init_adapter(struct nes_device *nesdev, u8 hw_rev) { /* setup port configuration */ if (nesadapter->port_count == 1) { - u32temp = 0x00000000; + nesadapter->log_port = 0x00000000; if (nes_drv_opt & NES_DRV_OPT_DUAL_LOGICAL_PORT) nes_write_indexed(nesdev, NES_IDX_TX_POOL_SIZE, 0x00000002); else nes_write_indexed(nesdev, NES_IDX_TX_POOL_SIZE, 0x00000003); } else { - if (nesadapter->port_count == 2) - u32temp = 0x00000044; - else - u32temp = 0x000000e4; + if (nesadapter->phy_type[0] == NES_PHY_TYPE_PUMA_1G) { + nesadapter->log_port = 0x000000D8; + } else { + if (nesadapter->port_count == 2) + nesadapter->log_port = 0x00000044; + else + nesadapter->log_port = 0x000000e4; + } nes_write_indexed(nesdev, NES_IDX_TX_POOL_SIZE, 0x00000003); } - nes_write_indexed(nesdev, NES_IDX_NIC_LOGPORT_TO_PHYPORT, u32temp); + nes_write_indexed(nesdev, NES_IDX_NIC_LOGPORT_TO_PHYPORT, + nesadapter->log_port); nes_debug(NES_DBG_INIT, "Probe time, LOG2PHY=%u\n", nes_read_indexed(nesdev, NES_IDX_NIC_LOGPORT_TO_PHYPORT)); @@ -706,23 +721,43 @@ static unsigned int nes_reset_adapter_ne020(struct nes_device *nesdev, u8 *OneG_ * nes_init_serdes */ static int 
nes_init_serdes(struct nes_device *nesdev, u8 hw_rev, u8 port_count, - u8 OneG_Mode) + struct nes_adapter *nesadapter, u8 OneG_Mode) { int i; u32 u32temp; + u32 serdes_common_control; if (hw_rev != NE020_REV) { /* init serdes 0 */ nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL0, 0x000000FF); - if (!OneG_Mode) + if (nesadapter->phy_type[0] == NES_PHY_TYPE_PUMA_1G) { + serdes_common_control = nes_read_indexed(nesdev, + NES_IDX_ETH_SERDES_COMMON_CONTROL0); + serdes_common_control |= 0x000000100; + nes_write_indexed(nesdev, + NES_IDX_ETH_SERDES_COMMON_CONTROL0, + serdes_common_control); + } else if (!OneG_Mode) { nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_HIGHZ_LANE_MODE0, 0x11110000); - if (port_count > 1) { + } + if (((port_count > 1) && + (nesadapter->phy_type[0] != NES_PHY_TYPE_PUMA_1G)) || + ((port_count > 2) && + (nesadapter->phy_type[0] == NES_PHY_TYPE_PUMA_1G))) { /* init serdes 1 */ nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL1, 0x000000FF); - if (!OneG_Mode) + if (nesadapter->phy_type[0] == NES_PHY_TYPE_PUMA_1G) { + serdes_common_control = nes_read_indexed(nesdev, + NES_IDX_ETH_SERDES_COMMON_CONTROL1); + serdes_common_control |= 0x000000100; + nes_write_indexed(nesdev, + NES_IDX_ETH_SERDES_COMMON_CONTROL1, + serdes_common_control); + } else if (!OneG_Mode) { nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_HIGHZ_LANE_MODE1, 0x11110000); } + } } else { /* init serdes 0 */ nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL0, 0x00000008); @@ -2258,7 +2293,8 @@ static void nes_process_mac_intr(struct nes_device *nesdev, u32 mac_number) spin_unlock_irqrestore(&nesadapter->phy_lock, flags); } /* read the PHY interrupt status register */ - if (nesadapter->OneG_Mode) { + if ((nesadapter->OneG_Mode) && + (nesadapter->phy_type[mac_index] != NES_PHY_TYPE_PUMA_1G)) { do { nes_read_1G_phy_reg(nesdev, 0x1a, nesadapter->phy_index[mac_index], &phy_data); diff --git a/drivers/infiniband/hw/nes/nes_hw.h b/drivers/infiniband/hw/nes/nes_hw.h 
index 7b81e0a..fc0f063 100644
--- a/drivers/infiniband/hw/nes/nes_hw.h
+++ b/drivers/infiniband/hw/nes/nes_hw.h
@@ -1100,6 +1100,7 @@ struct nes_adapter {
 	u8 mac_sw_state[4];
 	u8 mac_link_down[4];
 	u8 phy_type[4];
+	u8 log_port;

 	/* PCI information */
 	unsigned int devfn;

From AHKumar at odu.edu Fri Sep 12 09:43:09 2008
From: AHKumar at odu.edu (Kumar, Amit H.)
Date: Fri, 12 Sep 2008 12:43:09 -0400
Subject: [ofa-general] Usage of Infiniband Protocol Stack ?
Message-ID:

I have some applications(mvapich2, pvfs2 ...) compiled to use the OFED
Infiniband protocol stack.

May be a stupid question ..:
Is it valid to see at the "ifconfig ib0" stats to report the usage of IB
protocol stack, regardless what application I making use of the IB stack.?

Thank you,
Amit

From landman at scalableinformatics.com Fri Sep 12 09:46:44 2008
From: landman at scalableinformatics.com (Joe Landman)
Date: Fri, 12 Sep 2008 12:46:44 -0400
Subject: [ofa-general] Usage of Infiniband Protocol Stack ?
In-Reply-To:
References:
Message-ID: <48CA9CF4.3070502@scalableinformatics.com>

Kumar, Amit H. wrote:
> I have some applications(mvapich2, pvfs2 …) compiled to use the OFED
> Infiniband protocol stack.
>
> May be a stupid question ..:
> Is it valid to see at the “ifconfig ib0” stats to report the usage of IB
> protocol stack, regardless what application I making use of the IB stack.?

Hi Amit:

Only if they have loaded/configured IPoIB. If they haven't configured
it, you might be able to try ibnodes and see if this reports anything on
the network.

We do tend to configure this for our customers clusters precisely as a
diagnostics/testing tool (and as a way to enable infiniband-ignoring MPI
stacks such as MPICH1/2 to have a fighting chance of using infiniband).
Joe > > Thank you, > Amit > > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From chu11 at llnl.gov Fri Sep 12 09:59:50 2008 From: chu11 at llnl.gov (Al Chu) Date: Fri, 12 Sep 2008 09:59:50 -0700 Subject: [ofa-general] [OpenSM][Trivial] fix routing algorithm description Message-ID: <1221238790.6274.7.camel@cardanus.llnl.gov> Hey Sasha, I think the text was just old. There are more algorithms than just minhop, updn, and file nowadays. Al -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-fix-minhop-algorithm-description.patch Type: text/x-patch Size: 1698 bytes Desc: not available URL: From hal.rosenstock at gmail.com Fri Sep 12 10:08:53 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 12 Sep 2008 13:08:53 -0400 Subject: ***SPAM*** Re: [ofa-general] Usage of Infiniband Protocol Stack ? In-Reply-To: References: Message-ID: On Fri, Sep 12, 2008 at 12:43 PM, Kumar, Amit H. wrote: > I have some applications(mvapich2, pvfs2 …) compiled to use the OFED > Infiniband protocol stack. > > May be a stupid question ..: > Is it valid to see at the "ifconfig ib0" stats to report the usage of IB > protocol stack, regardless what application I making use of the IB stack.? ifconfig for IB interfaces shows the IPoIB stats. "Pure" IB stats are available from the PMA. 
These stats (bytes*4,packets x in/out) are total (across all applications being run). They can be obtained by the perfquery diagnostic tool or via a Performance Manager. -- Hal > Thank you, > Amit > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Fri Sep 12 11:24:10 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 12 Sep 2008 21:24:10 +0300 Subject: [ofa-general] Re: [OpenSM][Trivial] fix routing algorithm description In-Reply-To: <1221238790.6274.7.camel@cardanus.llnl.gov> References: <1221238790.6274.7.camel@cardanus.llnl.gov> Message-ID: <20080912182410.GA17315@sashak.voltaire.com> On 09:59 Fri 12 Sep , Al Chu wrote: > Hey Sasha, > > I think the text was just old. There are more algorithms than just > minhop, updn, and file nowadays. > > Al > > -- > Albert Chu > chu11 at llnl.gov > 925-422-5311 > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > From 6014c58bc3e63df98135bbb987e89e9b3ae4f706 Mon Sep 17 00:00:00 2001 > From: Albert Chu > Date: Fri, 12 Sep 2008 09:55:30 -0700 > Subject: [PATCH] fix minhop algorithm description > > > Signed-off-by: Albert Chu Applied. Thanks. 
Sasha

From rdreier at cisco.com Fri Sep 12 11:34:09 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 12 Sep 2008 11:34:09 -0700
Subject: [ofa-general] [PATCH v2] ipiob: fix rtnl deadlock
In-Reply-To: <48CA89E6.8030301@gmail.com> (Yossi Etigin's message of "Fri, 12 Sep 2008 18:25:26 +0300")
References: <4899CF0A.1060509@Voltaire.COM> <32cb786f0808081155o19f8fb9dm217cd6996dffa3e5@mail.gmail.com> <32cb786f0808090538j272842b1r5117547cccde0d06@mail.gmail.com> <32cb786f0808161218o417553b5w1738a517f0eb468a@mail.gmail.com> <48CA50E1.2090309@gmail.com> <48CA89E6.8030301@gmail.com>
Message-ID:

 > ipoib_stop() calls ipoib_ib_dev_down() which calls ipoib_mcast_dev_flush()
 > which calls ipoib_mcast_free(), which calls ipoib_mcast_leave(). The latter
 > calls ib_sa_free_multicast(), and this waits until the multicast
 > completion handler finishes. This happens to be
 > ipoib_mcast_join_complete(), which
 > waits for the rtnl_lock(), which was already taken by ipoib_stop().

I see... I wonder why lockdep didn't warn about this in my testing.

Anyway, any ideas how we want to fix this?

 - R.

From sashak at voltaire.com Fri Sep 12 11:34:07 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 12 Sep 2008 21:34:07 +0300
Subject: Re: [ofa-general] OpenSM Problems/Questions
In-Reply-To: <20080911141301.4823682d.weiny2@llnl.gov>
References: <20080909121140.1ec7838b.weiny2@llnl.gov> <20080911203627.GN25831@sashak.voltaire.com> <20080911141301.4823682d.weiny2@llnl.gov>
Message-ID: <20080912183407.GB17315@sashak.voltaire.com>

On 14:13 Thu 11 Sep , Ira Weiny wrote:
> >
> > Also there was your fix (after OFED 1.3) which is pretty related to
> > unstable links.
>
> True, but as I understood this is happening right after boot. Is this true?

I think it could be related to a group of nodes where links are unstable.
Of course I may be wrong about it - Matt should know better. And if it
is - the fix should be relevant.
Sasha From AHKumar at odu.edu Fri Sep 12 12:24:49 2008 From: AHKumar at odu.edu (Kumar, Amit H.) Date: Fri, 12 Sep 2008 15:24:49 -0400 Subject: [ofa-general] Usage of Infiniband Protocol Stack ? In-Reply-To: References: Message-ID: Thank you Hal & Joe for your prompt reply. Two more questions: If ifconfig for IB is just for IPoIB, is it okay to bring down this interface(ib0) and still be able to run applications like Mvapich2 and pvfs2 ?? And I also assume that we At Least Need 1 Ethernet Interface Up for the correct operation of IB compiled Applications, without which IB compiled Applications will fail. Is this correct ?? Thank you, Amit > -----Original Message----- > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > Sent: Friday, September 12, 2008 1:09 PM > To: Kumar, Amit H. > Cc: general at lists.openfabrics.org > Subject: Re: [ofa-general] Usage of Infiniband Protocol Stack ? > > On Fri, Sep 12, 2008 at 12:43 PM, Kumar, Amit H. > wrote: > > I have some applications(mvapich2, pvfs2 ...) compiled to use the OFED > > Infiniband protocol stack. > > > > May be a stupid question ..: > > Is it valid to see at the "ifconfig ib0" stats to report the usage of > IB > > protocol stack, regardless what application I making use of the IB > stack.? > > ifconfig for IB interfaces shows the IPoIB stats. > > "Pure" IB stats are available from the PMA. These stats > (bytes*4,packets x in/out) are total (across all applications being > run). They can be obtained by the perfquery diagnostic tool or via a > Performance Manager. 
> > -- Hal > > > Thank you, > > Amit > > > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > From aj.guillon at gmail.com Fri Sep 12 16:52:19 2008 From: aj.guillon at gmail.com (Adrien Guillon) Date: Fri, 12 Sep 2008 19:52:19 -0400 Subject: [ofa-general] Sharing CQs across multiple connections with librdmacm Message-ID: <9870a2060809121652g3a54f817x6d92ad953bcf863f@mail.gmail.com> Hey... I want to allocate a send CQ and receive CQ for each HCA, to be shared by all connections using that HCA. This seems possible according to the Infiniband standard, but I can't see how to do this in practice using the ibverbs. I'm using librdmacm for the actual connections. My problem is that ibv_create_cq() takes an ibv_context* as an argument. With librdmacm, I can get this through rdma_cm_id->verbs. However it looks like ibv_context objects are associated with particular connections, not particular HCAs which is what is confusing me. It seems to me that ibv_create_cq() should be associated with a handle to the HCA itself, as the "Infiniband Network Architecture" book says. Ideally I would allocate a data structure with HCA specific data for each device (e.g. PD, CQ, etc.) and use the kernel name (e.g. mctha0) to lookup the HCA specific data. That way I can check the ibv_context to see if I can use existing specific data or create new. Whew. So the question is... how do I do this given that ibv_create_cq() takes ibv_context* as an argument? Will it internally just use the ibv_context to look up the device? What happens when that ibv_context is destroyed, but I want the PD to remain open (e.g. connection destroyed, others still open)? 
Can I create CQs and PDs using ibv_device at initialization time, so I
don't have to wait for the first connection on each device to come in?

Thanks!

AJ

From sashak at voltaire.com Fri Sep 12 18:18:09 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 13 Sep 2008 04:18:09 +0300
Subject: [ofa-general] [PATCH] infiniband-diags/ibtracert: fix port by direct path resolving
Message-ID: <20080913011809.GC17315@sashak.voltaire.com>

When option '-D' is used, ports are provided to ibtracert in direct path
format. This option was broken (bug #1136) due to incorrect resolution -
lack of lid. This addresses bug #1136.

Signed-off-by: Sasha Khapyorsky
---
 infiniband-diags/src/ibtracert.c |   23 +++++++++++++++++++++++
 1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c
index eb9329c..21edfba 100644
--- a/infiniband-diags/src/ibtracert.c
+++ b/infiniband-diags/src/ibtracert.c
@@ -673,6 +673,20 @@ free_name:
 	free(nodename);
 }
 
+static int resolve_lid(ib_portid_t *portid, const void *srcport)
+{
+	uint8_t portinfo[64];
+	uint16_t lid;
+
+	if (!smp_query_via(portinfo, portid, IB_ATTR_PORT_INFO, 0, 0, srcport))
+		return -1;
+	mad_decode_field(portinfo, IB_PORT_LID_F, &lid);
+
+	ib_portid_set(portid, lid, 0, 0);
+
+	return 0;
+}
+
 static void
 usage(void)
 {
@@ -806,6 +820,15 @@ main(int argc, char **argv)
 	if (ib_resolve_portid_str(&dest_portid, argv[1], dest_type, sm_id) < 0)
 		IBERROR("can't resolve destination port %s", argv[1]);
 
+	if (dest_type == IB_DEST_DRPATH) {
+		if (resolve_lid(&src_portid, NULL) < 0)
+			IBERROR("cannot resolve lid for port \'%s\'",
+				portid2str(&src_portid));
+		if (resolve_lid(&dest_portid, NULL) < 0)
+			IBERROR("cannot resolve lid for port \'%s\'",
+				portid2str(&dest_portid));
+	}
+
 	if (dest_portid.lid == 0 || src_portid.lid == 0) {
 		IBWARN("bad src/dest lid");
 		usage();
-- 
1.6.0.1.196.g01914

From sashak at voltaire.com Fri Sep 12 19:12:00 2008
From: sashak at voltaire.com (Sasha
Khapyorsky) Date: Sat, 13 Sep 2008 05:12:00 +0300 Subject: [ofa-general] [PATCH] opensm/opensm.spec: comment out service auto-startup setup In-Reply-To: <1221168479.19185.135.camel@cardanus.llnl.gov> References: <20080911201126.GK25831@sashak.voltaire.com> <1221168479.19185.135.camel@cardanus.llnl.gov> Message-ID: <20080913021200.GE17315@sashak.voltaire.com> Hi Al, On 14:27 Thu 11 Sep , Al Chu wrote: > > Although the %post script below may not be 100% portable, I think it's > pretty typical for system daemon rpms. A quick "rpm -q --scripts > " shows its pretty common for system daemons on RHEL. It > should be tweaked for portability rather than being removed. The issue is that it starts opensm service on boot automatically after installation (without user requesting this with 'chkconfig' or so) - see bug #1181 - https://bugs.openfabrics.org/show_bug.cgi?id=1181 > Personally, I've never done "/sbin/service FOO condrestart" in rpm > scripts. I do "%{initrddir}/FOO condrestart". Maybe that's more > portable?? So script itself should support 'condrestart' command? Sasha > > Al > > On Thu, 2008-09-11 at 23:11 +0300, Sasha Khapyorsky wrote: > > This addresses bug#1181. > > > > Comment out opensm service auto-startup setup at %post section. > > > > Signed-off-by: Sasha Khapyorsky > > --- > > > > I don't really know why it was done this way originally. So please send > > any comments and/or objections. 
> > > > opensm/opensm.spec.in | 10 +++++----- > > 1 files changed, 5 insertions(+), 5 deletions(-) > > > > diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in > > index 2e3abfc..fc7677d 100644 > > --- a/opensm/opensm.spec.in > > +++ b/opensm/opensm.spec.in > > @@ -104,11 +104,11 @@ install -m 755 scripts/sldd.sh $RPM_BUILD_ROOT%{_sbindir}/sldd.sh > > rm -rf $RPM_BUILD_ROOT > > > > %post > > -if [ $1 = 1 ]; then > > - /sbin/chkconfig --add opensmd > > -else > > - /sbin/service opensmd condrestart > > -fi > > +#if [ $1 = 1 ]; then > > +# /sbin/chkconfig --add opensmd > > +#else > > +# /sbin/service opensmd condrestart > > +#fi > > > > %preun > > if [ $1 = 0 ]; then > -- > Albert Chu > chu11 at llnl.gov > 925-422-5311 > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > From chu11 at llnl.gov Fri Sep 12 20:52:22 2008 From: chu11 at llnl.gov (Al Chu) Date: Fri, 12 Sep 2008 23:52:22 -0400 Subject: [ofa-general] [PATCH] opensm/opensm.spec: comment out service auto-startup setup In-Reply-To: <20080913021200.GE17315@sashak.voltaire.com> References: <20080911201126.GK25831@sashak.voltaire.com> <1221168479.19185.135.camel@cardanus.llnl.gov> <20080913021200.GE17315@sashak.voltaire.com> Message-ID: <1221277942.3059.46.camel@whatsup> Hey Sasha, > The issue is that it starts opensm service on boot automatically after > installation (without user requesting this with 'chkconfig' or so) - see > bug #1181 - https://bugs.openfabrics.org/show_bug.cgi?id=1181 I don't know how this issue is handled in Suse, but I don't think it should be handled by removing the post-install script in the rpm spec file. I believe the post install script is typical for redhat/fedora systems. On my RHEL server, a quick look shows most of the popular daemons automatically add the daemon to chkconfig. 
# > rpm -q --scripts vixie-cron postinstall scriptlet (using /bin/sh): /sbin/chkconfig --add crond # > rpm -q --scripts openssh-server postinstall scriptlet (using /bin/sh): /sbin/chkconfig --add sshd # > rpm -q --scripts httpd postinstall scriptlet (using /bin/sh): # Register the httpd service /sbin/chkconfig --add httpd # > rpm -q --scripts mysql-server postinstall scriptlet (using /bin/sh): if [ $1 = 1 ]; then /sbin/chkconfig --add mysqld fi Whether the daemon should be started up automatically on boot is configured in the init.d script by specifying what run levels it should be configured on/off automatically. I see in the git master opensm it seems to be off by default: # > grep chkconfig redhat-opensm.init.in # chkconfig: - 15 85 So perhaps we need to bug some Suse knowledgeable people on how to do this properly. Because I think this patch will break RHEL behavior. > So script itself should support 'condrestart' command? This is the way I've personally done it. I can't say what the most common method is, but it seems fairly common. A grep in /etc/init.d on my RHEL system shows it is all over the place. Al On Sat, 2008-09-13 at 05:12 +0300, Sasha Khapyorsky wrote: > Hi Al, > > On 14:27 Thu 11 Sep , Al Chu wrote: > > > > Although the %post script below may not be 100% portable, I think it's > > pretty typical for system daemon rpms. A quick "rpm -q --scripts > > " shows its pretty common for system daemons on RHEL. It > > should be tweaked for portability rather than being removed. > > The issue is that it starts opensm service on boot automatically after > installation (without user requesting this with 'chkconfig' or so) - see > bug #1181 - https:// bugs.openfabrics.org/show_bug.cgi?id=1181 > > > Personally, I've never done "/sbin/service FOO condrestart" in rpm > > scripts. I do "%{initrddir}/FOO condrestart". Maybe that's more > > portable?? > > So script itself should support 'condrestart' command? 
> > Sasha > > > > > Al > > > > On Thu, 2008-09-11 at 23:11 +0300, Sasha Khapyorsky wrote: > > > This addresses bug#1181. > > > > > > Comment out opensm service auto-startup setup at %post section. > > > > > > Signed-off-by: Sasha Khapyorsky > > > --- > > > > > > I don't really know why it was done this way originally. So please send > > > any comments and/or objections. > > > > > > opensm/opensm.spec.in | 10 +++++----- > > > 1 files changed, 5 insertions(+), 5 deletions(-) > > > > > > diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in > > > index 2e3abfc..fc7677d 100644 > > > --- a/opensm/opensm.spec.in > > > +++ b/opensm/opensm.spec.in > > > @@ -104,11 +104,11 @@ install -m 755 scripts/sldd.sh $RPM_BUILD_ROOT%{_sbindir}/sldd.sh > > > rm -rf $RPM_BUILD_ROOT > > > > > > %post > > > -if [ $1 = 1 ]; then > > > - /sbin/chkconfig --add opensmd > > > -else > > > - /sbin/service opensmd condrestart > > > -fi > > > +#if [ $1 = 1 ]; then > > > +# /sbin/chkconfig --add opensmd > > > +#else > > > +# /sbin/service opensmd condrestart > > > +#fi > > > > > > %preun > > > if [ $1 = 0 ]; then > > -- > > Albert Chu > > chu11 at llnl.gov > > 925-422-5311 > > Computer Scientist > > High Performance Systems Division > > Lawrence Livermore National Laboratory > > > -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From chu11 at llnl.gov Fri Sep 12 21:05:43 2008 From: chu11 at llnl.gov (Al Chu) Date: Sat, 13 Sep 2008 00:05:43 -0400 Subject: [ofa-general] [PATCH] opensm/opensm.spec: comment out service auto-startup setup In-Reply-To: <1221277942.3059.46.camel@whatsup> References: <20080911201126.GK25831@sashak.voltaire.com> <1221168479.19185.135.camel@cardanus.llnl.gov> <20080913021200.GE17315@sashak.voltaire.com> <1221277942.3059.46.camel@whatsup> Message-ID: <1221278743.3059.54.camel@whatsup> Hey Sasha, I suddenly remembered that I've had to support a daemon on Suse before and looked at that 
project's init.d script :-) I *think* I remember how this is handled in Suse now. It's a different set of comments at the top of the init.d script. In opensm/scripts/opensm.init.in I see this: ### BEGIN INIT INFO # Provides: opensm # Required-Start: $syslog # Default-Start: 2 3 5 # Default-Stop: 0 1 6 # Description: Manage OpenSM ### END INIT INFO I think this indicates that by default opensm should start on boot on run levels 2 3 5. Which I guess is what we don't want. I'm going to take a guess that the following patch will fix the problem. Patch is completely untested (I don't have a suse system). So hopefully someone else can try it out. Al On Fri, 2008-09-12 at 23:52 -0400, Al Chu wrote: > Hey Sasha, > > > The issue is that it starts opensm service on boot automatically after > > installation (without user requesting this with 'chkconfig' or so) - see > > bug #1181 - https:// bugs.openfabrics.org/show_bug.cgi?id=1181 > > I don't know how this issue is handled in Suse, but I don't think it > should be handled by removing the post-install script in the rpm spec > file. I believe the post install script is typical for redhat/fedora > systems. On my RHEL server, a quick look shows most of the popular > daemons automatically add the daemon to chkconfig. > > # > rpm -q --scripts vixie-cron > postinstall scriptlet (using /bin/sh): > /sbin/chkconfig --add crond > > # > rpm -q --scripts openssh-server > postinstall scriptlet (using /bin/sh): > /sbin/chkconfig --add sshd > > # > rpm -q --scripts httpd > postinstall scriptlet (using /bin/sh): > # Register the httpd service > /sbin/chkconfig --add httpd > > # > rpm -q --scripts mysql-server > postinstall scriptlet (using /bin/sh): > if [ $1 = 1 ]; then > /sbin/chkconfig --add mysqld > fi > > Whether the daemon should be started up automatically on boot is > configured in the init.d script by specifying what run levels it should > be configured on/off automatically. 
I see in the git master opensm it > seems to be off by default: > > # > grep chkconfig redhat-opensm.init.in > # chkconfig: - 15 85 > > So perhaps we need to bug some Suse knowledgeable people on how to do > this properly. Because I think this patch will break RHEL behavior. > > > So script itself should support 'condrestart' command? > > This is the way I've personally done it. I can't say what the most > common method is, but it seems fairly common. A grep in /etc/init.d on > my RHEL system shows it is all over the place. > > Al > > On Sat, 2008-09-13 at 05:12 +0300, Sasha Khapyorsky wrote: > > Hi Al, > > > > On 14:27 Thu 11 Sep , Al Chu wrote: > > > > > > Although the %post script below may not be 100% portable, I think it's > > > pretty typical for system daemon rpms. A quick "rpm -q --scripts > > > " shows its pretty common for system daemons on RHEL. It > > > should be tweaked for portability rather than being removed. > > > > The issue is that it starts opensm service on boot automatically after > > installation (without user requesting this with 'chkconfig' or so) - see > > bug #1181 - https:// bugs.openfabrics.org/show_bug.cgi?id=1181 > > > > > Personally, I've never done "/sbin/service FOO condrestart" in rpm > > > scripts. I do "%{initrddir}/FOO condrestart". Maybe that's more > > > portable?? > > > > So script itself should support 'condrestart' command? > > > > Sasha > > > > > > > > Al > > > > > > On Thu, 2008-09-11 at 23:11 +0300, Sasha Khapyorsky wrote: > > > > This addresses bug#1181. > > > > > > > > Comment out opensm service auto-startup setup at %post section. > > > > > > > > Signed-off-by: Sasha Khapyorsky > > > > --- > > > > > > > > I don't really know why it was done this way originally. So please send > > > > any comments and/or objections. 
> > > > > > > > opensm/opensm.spec.in | 10 +++++----- > > > > 1 files changed, 5 insertions(+), 5 deletions(-) > > > > > > > > diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in > > > > index 2e3abfc..fc7677d 100644 > > > > --- a/opensm/opensm.spec.in > > > > +++ b/opensm/opensm.spec.in > > > > @@ -104,11 +104,11 @@ install -m 755 scripts/sldd.sh $RPM_BUILD_ROOT%{_sbindir}/sldd.sh > > > > rm -rf $RPM_BUILD_ROOT > > > > > > > > %post > > > > -if [ $1 = 1 ]; then > > > > - /sbin/chkconfig --add opensmd > > > > -else > > > > - /sbin/service opensmd condrestart > > > > -fi > > > > +#if [ $1 = 1 ]; then > > > > +# /sbin/chkconfig --add opensmd > > > > +#else > > > > +# /sbin/service opensmd condrestart > > > > +#fi > > > > > > > > %preun > > > > if [ $1 = 0 ]; then > > > -- > > > Albert Chu > > > chu11 at llnl.gov > > > 925-422-5311 > > > Computer Scientist > > > High Performance Systems Division > > > Lawrence Livermore National Laboratory > > > > > > -- > Albert Chu > chu11 at llnl.gov > 925-422-5311 > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general > -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 0001-do-not-start-opensm-on-boot-automatically.patch
Type: application/mbox
Size: 770 bytes
Desc: not available
URL:

From sashak at voltaire.com Sat Sep 13 03:04:45 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 13 Sep 2008 13:04:45 +0300
Subject: [ofa-general] [PATCH] opensm/opensm.spec: comment out service auto-startup setup
In-Reply-To: <1221278743.3059.54.camel@whatsup>
References: <20080911201126.GK25831@sashak.voltaire.com> <1221168479.19185.135.camel@cardanus.llnl.gov> <20080913021200.GE17315@sashak.voltaire.com> <1221277942.3059.46.camel@whatsup> <1221278743.3059.54.camel@whatsup>
Message-ID: <20080913100445.GF17315@sashak.voltaire.com>

Hi Al,

On 00:05 Sat 13 Sep , Al Chu wrote:
>
> I *think* I remember how this is handled in Suse now. It's a different
> set of comments at the top of the init.d script. In
> opensm/scripts/opensm.init.in I see this:
>
> ### BEGIN INIT INFO
> # Provides: opensm
> # Required-Start: $syslog
> # Default-Start: 2 3 5
> # Default-Stop: 0 1 6
> # Description: Manage OpenSM
> ### END INIT INFO

It was my original thought too. But actually those fields are used as
recommendation to chkconfig --add, chkconfig --del, etc..

> I think this indicates that by default opensm should start on boot on
> run levels 2 3 5. Which I guess is what we don't want. I'm going to
> take a guess that the following patch will fix the problem. Patch is
> completely untested (I don't have a suse system). So hopefully someone
> else can try it out.

The patch is good since it drops unneeded assumption about configured
runlevels. Without this system defaults will be used by chkconfig, and I
guess it is more portable.

Unfortunately it doesn't solve the original issue.
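
For readers following the runlevel discussion, here is a minimal sketch of how the `# Default-Start:` tag can be read out of an LSB-style init script header like the one quoted in this message. The helper name `lsb_default_start` and the temporary sample file are hypothetical and exist only for illustration; this is not how chkconfig itself is implemented.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "# Default-Start:" tag
# found inside the LSB INIT INFO block of an init script.
lsb_default_start() {
    sed -n '/^### BEGIN INIT INFO/,/^### END INIT INFO/s/^# Default-Start:[[:space:]]*//p' "$1"
}

# Sample header copied from the opensm.init.in excerpt quoted in this thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
### BEGIN INIT INFO
# Provides: opensm
# Required-Start: $syslog
# Default-Start: 2 3 5
# Default-Stop: 0 1 6
# Description: Manage OpenSM
### END INIT INFO
EOF

levels=$(lsb_default_start "$sample")
echo "Default-Start runlevels: ${levels:-<none listed>}"
rm -f "$sample"
```

Running the same helper against an installed init script gives a quick way to see which runlevels the script requests before registering it.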
Sasha

From vlad at lists.openfabrics.org Sat Sep 13 03:08:51 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 13 Sep 2008 03:08:51 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20080913-0200 daily build status
Message-ID: <20080913100851.485E7E608FF@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.24

Failed:

From sashak at voltaire.com Sat Sep 13 09:25:29 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 13 Sep 2008 19:25:29 +0300
Subject: [ofa-general] [PATCH] opensm/opensm.spec: comment out service auto-startup setup
In-Reply-To: <1221277942.3059.46.camel@whatsup>
References: <20080911201126.GK25831@sashak.voltaire.com> <1221168479.19185.135.camel@cardanus.llnl.gov> <20080913021200.GE17315@sashak.voltaire.com> <1221277942.3059.46.camel@whatsup>
Message-ID: <20080913162529.GH17315@sashak.voltaire.com>

Hi Al,

On 23:52 Fri 12 Sep , Al Chu wrote:
>
> I don't know how this issue is handled in Suse, but I don't think it
> should be handled by removing the post-install script in the rpm spec
> file.

What I can find is 'chkconfig --add', requested run levels are described
in optional '# Default-Start: ' tag. Something like

# Default-Start: none

will disable setup and 'chkconfig --add' will do nothing.

This could work as immediate solution and keep RH setup untouched.

> I believe the post install script is typical for redhat/fedora
> systems. On my RHEL server, a quick look shows most of the popular
> daemons automatically add the daemon to chkconfig.

With RH it is different, startup script must have tag:

# chkconfig:
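
As an illustration of the Red Hat tag mentioned here, the following is a minimal sketch that reads a `# chkconfig:` line and splits it into default runlevels, start priority and stop priority. It assumes the usual three-field layout documented for chkconfig, where `-` in the first field means the service is not started by default in any runlevel; the helper name `chkconfig_tag`, the sample file and its `# description:` line are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first "# chkconfig:" line
# of an init script (runlevels, start priority, stop priority).
chkconfig_tag() {
    sed -n 's/^# chkconfig:[[:space:]]*//p' "$1" | head -n 1
}

sample=$(mktemp)
# Tag value as shown by the grep of redhat-opensm.init.in in this thread;
# the description line is made up for the sample.
printf '%s\n' '#!/bin/bash' '# chkconfig: - 15 85' '# description: sample' > "$sample"

# Split the tag into its three whitespace-separated fields.
set -- $(chkconfig_tag "$sample")
runlevels=$1 start_prio=$2 stop_prio=$3
rm -f "$sample"

if [ "$runlevels" = "-" ]; then
    echo "not started by default (start=$start_prio stop=$stop_prio)"
else
    echo "started by default in runlevels $runlevels"
fi
```

With a `-` in the runlevel field, registering the script leaves the service disabled until the administrator turns it on explicitly.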

> # > rpm -q --scripts vixie-cron > postinstall scriptlet (using /bin/sh): > /sbin/chkconfig --add crond > > # > rpm -q --scripts openssh-server > postinstall scriptlet (using /bin/sh): > /sbin/chkconfig --add sshd > > # > rpm -q --scripts httpd > postinstall scriptlet (using /bin/sh): > # Register the httpd service > /sbin/chkconfig --add httpd > > # > rpm -q --scripts mysql-server > postinstall scriptlet (using /bin/sh): > if [ $1 = 1 ]; then > /sbin/chkconfig --add mysqld > fi > > Whether the daemon should be started up automatically on boot is > configured in the init.d script by specifying what run levels it should > be configured on/off automatically. > > I see in the git master opensm it > seems to be off by default: > > # > grep chkconfig redhat-opensm.init.in > # chkconfig: - 15 85 Correct, and in post-install time 'chkconfig --add' will do nothing (almost, it will register the service with no startup run-levels). When user will want to setup startup on boot she will need to edit startup script and re-run 'chkconfig --add'. If so what is a clear benefit in running 'chkconfig --add' at post-install time? I don't know. Maybe only convention. Sasha > So perhaps we need to bug some Suse knowledgeable people on how to do > this properly. Because I think this patch will break RHEL behavior. > > > So script itself should support 'condrestart' command? > > This is the way I've personally done it. I can't say what the most > common method is, but it seems fairly common. A grep in /etc/init.d on > my RHEL system shows it is all over the place. > > Al > > On Sat, 2008-09-13 at 05:12 +0300, Sasha Khapyorsky wrote: > > Hi Al, > > > > On 14:27 Thu 11 Sep , Al Chu wrote: > > > > > > Although the %post script below may not be 100% portable, I think it's > > > pretty typical for system daemon rpms. A quick "rpm -q --scripts > > > " shows its pretty common for system daemons on RHEL. It > > > should be tweaked for portability rather than being removed. 
> > > > The issue is that it starts opensm service on boot automatically after > > installation (without user requesting this with 'chkconfig' or so) - see > > bug #1181 - https:// bugs.openfabrics.org/show_bug.cgi?id=1181 > > > > > Personally, I've never done "/sbin/service FOO condrestart" in rpm > > > scripts. I do "%{initrddir}/FOO condrestart". Maybe that's more > > > portable?? > > > > So script itself should support 'condrestart' command? > > > > Sasha > > > > > > > > Al > > > > > > On Thu, 2008-09-11 at 23:11 +0300, Sasha Khapyorsky wrote: > > > > This addresses bug#1181. > > > > > > > > Comment out opensm service auto-startup setup at %post section. > > > > > > > > Signed-off-by: Sasha Khapyorsky > > > > --- > > > > > > > > I don't really know why it was done this way originally. So please send > > > > any comments and/or objections. > > > > > > > > opensm/opensm.spec.in | 10 +++++----- > > > > 1 files changed, 5 insertions(+), 5 deletions(-) > > > > > > > > diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in > > > > index 2e3abfc..fc7677d 100644 > > > > --- a/opensm/opensm.spec.in > > > > +++ b/opensm/opensm.spec.in > > > > @@ -104,11 +104,11 @@ install -m 755 scripts/sldd.sh $RPM_BUILD_ROOT%{_sbindir}/sldd.sh > > > > rm -rf $RPM_BUILD_ROOT > > > > > > > > %post > > > > -if [ $1 = 1 ]; then > > > > - /sbin/chkconfig --add opensmd > > > > -else > > > > - /sbin/service opensmd condrestart > > > > -fi > > > > +#if [ $1 = 1 ]; then > > > > +# /sbin/chkconfig --add opensmd > > > > +#else > > > > +# /sbin/service opensmd condrestart > > > > +#fi > > > > > > > > %preun > > > > if [ $1 = 0 ]; then > > > -- > > > Albert Chu > > > chu11 at llnl.gov > > > 925-422-5311 > > > Computer Scientist > > > High Performance Systems Division > > > Lawrence Livermore National Laboratory > > > > > > -- > Albert Chu > chu11 at llnl.gov > 925-422-5311 > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > From sashak at 
voltaire.com Sat Sep 13 09:39:36 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 13 Sep 2008 19:39:36 +0300
Subject: [ofa-general] [PATCH] opensm: do not start opensm on boot automatically
In-Reply-To: <20080913162529.GH17315@sashak.voltaire.com>
References: <20080911201126.GK25831@sashak.voltaire.com> <1221168479.19185.135.camel@cardanus.llnl.gov> <20080913021200.GE17315@sashak.voltaire.com> <1221277942.3059.46.camel@whatsup> <20080913162529.GH17315@sashak.voltaire.com>
Message-ID: <20080913163936.GI17315@sashak.voltaire.com>

Do not start opensm on boot automatically on non-RH systems.

Signed-off-by: Sasha Khapyorsky
---
 opensm/scripts/opensm.init.in |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/opensm/scripts/opensm.init.in b/opensm/scripts/opensm.init.in
index 7673bfa..550a0c1 100644
--- a/opensm/scripts/opensm.init.in
+++ b/opensm/scripts/opensm.init.in
@@ -8,8 +8,8 @@
 ### BEGIN INIT INFO
 # Provides: opensm
 # Required-Start: $syslog
-# Default-Start:
-# Default-Stop: 0 1 2 3 5 6
+# Default-Start: none
+# Default-Stop: 0 1 6
 # Description: Manage OpenSM
 ### END INIT INFO
 #
-- 
1.5.4.rc2.60.gb2e62

From sashak at voltaire.com Sat Sep 13 11:20:29 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 13 Sep 2008 21:20:29 +0300
Subject: [ofa-general] [PATCH] opensm/redhat-opensm.init.in: make config file optional
Message-ID: <20080913182029.GK17315@sashak.voltaire.com>

This file is not installed by default, so make it optional for the
script.

Signed-off-by: Sasha Khapyorsky
---
 opensm/scripts/redhat-opensm.init.in |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/opensm/scripts/redhat-opensm.init.in b/opensm/scripts/redhat-opensm.init.in
index 5526e44..d4cc580 100755
--- a/opensm/scripts/redhat-opensm.init.in
+++ b/opensm/scripts/redhat-opensm.init.in
@@ -47,12 +47,10 @@ exec_prefix=@exec_prefix@
 . /etc/rc.d/init.d/functions
 
 CONFIG=@sysconfdir@/sysconfig/opensm.conf
-if [ ! -f $CONFIG ]; then
-	exit 0
+if [ -f $CONFIG ]; then
+	. $CONFIG
 fi
 
-. $CONFIG
-
 prog=@sbindir@/opensm
 bin=${prog##*/}
 
-- 
1.5.4.rc2.60.gb2e62

From sashak at voltaire.com Sat Sep 13 11:29:02 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 13 Sep 2008 21:29:02 +0300
Subject: [ofa-general] [PATCH] opensm/opensm.spec.in: don't install old format conf file
Message-ID: <20080913182902.GL17315@sashak.voltaire.com>

OpenSM uses this name (